What is data modeling?
Data modeling is the process of creating a simplified diagram of a software system and the data elements it contains, using text and symbols to represent the data and how it flows. Data models provide a blueprint for designing a new database or reengineering a legacy application. Overall, data modeling helps an organization use its data effectively to meet business needs for information.
A data model can be thought of as a flowchart that illustrates data entities, their attributes and the relationships between entities. It enables data management and analytics teams to document data requirements for applications and identify errors in development plans before any code is written.
Alternatively, data models can be created through reverse-engineering efforts that extract them from existing systems. That's done to document the structure of relational databases that were built on an ad hoc basis without upfront data modeling. It's also done to define schemas for sets of raw data stored in data lakes or NoSQL databases to support specific analytics applications.
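As a rough illustration of the reverse-engineering idea, the following minimal sketch reads table and column definitions back out of an existing database. It assumes a SQLite database and uses Python's built-in sqlite3 module; the function name extract_schema and the sample customer and purchase tables are hypothetical and stand in for a real legacy system.

```python
import sqlite3

def extract_schema(conn):
    """Return {table: [(column, declared_type, is_primary_key), ...]}."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [
            (name, col_type, bool(pk))
            for _, name, col_type, _, _, pk in columns
        ]
    return schema

# Hypothetical stand-in for a database built without upfront modeling.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE purchase (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        amount REAL
    );
""")
legacy_schema = extract_schema(conn)
for table, columns in legacy_schema.items():
    print(table, columns)
```

The extracted dictionary is only a starting point. A modeler would still need to add business definitions and relationships that the database catalog doesn't record.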
Why is data modeling done?
Data modeling is a core data management discipline. By providing a visual representation of data sets and their business context, it helps pinpoint information needs for different business processes. It then specifies the characteristics of the data elements that will be included in applications and in the database or file system structures used to process, store and manage the data.
Data modeling can also help establish common data definitions and internal data standards, often in connection with data governance programs. In addition, it plays a big role in data architecture processes that document data assets, map how data moves through IT systems and create a conceptual data management framework. Data models are a key data architecture component, along with data flow diagrams, architectural blueprints, a unified data vocabulary and other artifacts.
Traditionally, data models have been built by data modelers, data architects and other data management professionals with input from business analysts, executives and users. But data modeling is also now an important skill for data scientists and analysts who develop business intelligence applications and more complex data science and advanced analytics applications.
What are the different types of data models?
Data modelers use three types of models to separately represent business concepts and workflows, relevant data entities and their attributes and relationships, and technical structures for managing the data. The models typically are created in a progression as organizations plan new applications and databases. These are the different types of data models and what they include:
- Conceptual data model. This is a high-level visualization of the business or analytics processes that a system will support. It maps out the kinds of data that are needed, how different business entities interrelate and associated business rules. Business executives are the main audience for conceptual data models, which help them see how a system will work and ensure that it meets business needs. Conceptual models aren't tied to specific database or application technologies.
- Logical data model. Once a conceptual data model is finished, it can be used to create a less-abstract logical one. Logical data models show how data entities are related and describe the data from a technical perspective. For example, they define data structures and provide details on attributes, keys, data types and other characteristics. The technical side of an organization uses logical models to help understand required application and database designs. But like conceptual models, they aren't connected to a particular technology platform.
- Physical data model. A logical model serves as the basis for the creation of a physical data model. Physical models are specific to the database management system (DBMS) or application software that will be implemented. They define the structures that the database or a file system will use to store and manage the data. That includes tables, columns, fields, indexes, constraints, triggers and other DBMS elements. Database designers use physical data models to create designs and generate database schemas.
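To make the step from a logical model to a physical one more concrete, here is a minimal sketch in Python. The Entity and Attribute classes, the SQLITE_TYPES mapping and the to_sqlite_ddl function are all hypothetical names invented for this example. The logical part uses technology-neutral types, while the physical part assumes SQLite as the target DBMS.

```python
from dataclasses import dataclass
import sqlite3

@dataclass
class Attribute:
    name: str
    data_type: str          # technology-neutral: "integer", "text", "decimal"
    is_key: bool = False
    required: bool = True

@dataclass
class Entity:
    name: str
    attributes: list

# Assumed mapping from logical types to SQLite column types.
SQLITE_TYPES = {"integer": "INTEGER", "text": "TEXT", "decimal": "NUMERIC"}

def to_sqlite_ddl(entity):
    """Translate a logical entity into a SQLite CREATE TABLE statement."""
    columns = []
    for attr in entity.attributes:
        parts = [attr.name, SQLITE_TYPES[attr.data_type]]
        if attr.is_key:
            parts.append("PRIMARY KEY")
        elif attr.required:
            parts.append("NOT NULL")
        columns.append(" ".join(parts))
    return f"CREATE TABLE {entity.name} ({', '.join(columns)})"

# Logical model: no DBMS-specific details.
customer = Entity("customer", [
    Attribute("customer_id", "integer", is_key=True),
    Attribute("name", "text"),
    Attribute("credit_limit", "decimal", required=False),
])

# Physical model: DBMS-specific table plus an index.
ddl = to_sqlite_ddl(customer)
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
conn.execute("CREATE INDEX idx_customer_name ON customer (name)")
print(ddl)
```

The same logical entity could be mapped to a different DBMS by swapping the type mapping and DDL generator, which is the reason logical models are kept independent of a technology platform.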
Data modeling techniques
Data modeling emerged in the 1960s as databases became more widely used on mainframes and then minicomputers. It enabled organizations to bring consistency, repeatability and disciplined development to data processing and management. That's still the case, but the techniques used to create data models have evolved along with the development of new types of databases and computer systems.
These are the data modeling approaches used most widely over the years, including several that have largely been supplanted by newer techniques.
1. Hierarchical data modeling
Hierarchical data models organize data in a treelike arrangement of parent and child records. A child record can have only one parent, making this a one-to-many modeling method. The hierarchical approach originated in mainframe databases -- IBM's Information Management System (IMS) is the best-known example. Although hierarchical data models were mostly superseded by relational ones beginning in the 1980s, IMS is still available and used by many organizations. A similar hierarchical method is also used today in XML, formally known as Extensible Markup Language.
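The one-parent rule can be sketched in a few lines of Python. This is an illustration of the treelike structure only, not of how IMS stores data; the Record class and the sample company records are hypothetical.

```python
class Record:
    """A record in a hierarchy: at most one parent, any number of children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Walk up through the single chain of parents to the root."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))

company = Record("company")
engineering = Record("engineering", parent=company)
sales = Record("sales", parent=company)
alice = Record("alice", parent=engineering)

print(alice.path())
print([child.name for child in company.children])
```

Because each record has exactly one route to the root, navigation is simple, but a record that logically belongs under two parents has to be duplicated.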
2. Network data modeling
Network data modeling was also a popular option in mainframe databases, though it isn't used as much now. Network data models expanded on hierarchical ones by allowing child records to be connected to multiple parent records. The Conference on Data Systems Languages, a now-defunct technical standards group commonly called CODASYL, adopted a network data model specification in 1969. Because of that, the network technique is often referred to as the CODASYL model.
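The difference from the hierarchical approach can be shown with a small Python sketch in which a record keeps a list of parents instead of a single one. The Record class, the connect function and the sample records are hypothetical, and the sketch doesn't attempt to follow the CODASYL specification itself.

```python
class Record:
    """A record that can be owned by several parents (owners)."""

    def __init__(self, name):
        self.name = name
        self.owners = []    # parent records
        self.members = []   # child records

def connect(owner, member):
    """Link a parent (owner) record to a child (member) record."""
    owner.members.append(member)
    member.owners.append(owner)

engineering = Record("engineering")
migration_project = Record("migration_project")
alice = Record("alice")

# The same child record is connected to two different parents.
connect(engineering, alice)
connect(migration_project, alice)

owner_names = [owner.name for owner in alice.owners]
print(owner_names)
```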
3. Relational data modeling
The relational data model was created as a more flexible alternative to hierarchical and network ones. First described in a 1970 technical paper by IBM researcher Edgar F. Codd, the relational model maps the relationships between data elements stored in different tables that contain sets of rows and columns. Relational modeling set the stage for the development of relational databases, and their widespread use made it the dominant data modeling technique by the mid-1990s.
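As a minimal sketch of the relational approach, the following example stores data in two tables and relates them through a shared key rather than a fixed parent-child link. It assumes SQLite through Python's built-in sqlite3 module; the department and employee tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless it is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE department (
        department_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        department_id INTEGER NOT NULL REFERENCES department(department_id)
    );
    INSERT INTO department VALUES (1, 'Engineering'), (2, 'Operations');
    INSERT INTO employee VALUES (1, 'Grace', 1), (2, 'Ada', 1), (3, 'Linus', 2);
""")

# The relationship is resolved at query time by joining on the key.
rows = conn.execute("""
    SELECT e.first_name, d.name
    FROM employee AS e
    JOIN department AS d ON d.department_id = e.department_id
    ORDER BY e.first_name
""").fetchall()
print(rows)

# A row that points to a nonexistent department violates the foreign key.
try:
    conn.execute("INSERT INTO employee VALUES (4, 'Edgar', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```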
4. Entity-relationship data modeling
A variation of the relational model that can also be used with other types of databases, entity-relationship (ER) models visually map entities, their attributes and the relationships between different entities. For example, the attributes of an employee data entity could include last name, first name, years employed and other relevant data. ER models provide an efficient approach for data capture and update processes, making them particularly suitable for transaction processing applications.
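An ER model is usually drawn as a diagram, but its contents can also be sketched as plain data. In the minimal Python example below, the Entity and Relationship classes, the describe function, the Department entity and the cardinality notation are assumptions made for illustration; only the employee attributes come from the example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str
    attributes: tuple

@dataclass(frozen=True)
class Relationship:
    name: str
    source: Entity
    target: Entity
    cardinality: str   # assumed notation: "1:1", "1:N", "N:1" or "N:M"

employee = Entity("Employee", ("last_name", "first_name", "years_employed"))
department = Entity("Department", ("department_name",))  # hypothetical entity
works_in = Relationship("works_in", employee, department, "N:1")

def describe(rel):
    """Render a relationship as one line of text."""
    return f"{rel.source.name} --{rel.name} ({rel.cardinality})--> {rel.target.name}"

print(describe(works_in))
print(employee.attributes)
```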
5. Dimensional data modeling
Dimensional data models are primarily used in data warehouses and data marts that support business intelligence applications. They consist of fact tables that contain data about transactions or other events and dimension tables that list attributes of the entities in the fact tables. For example, a fact table could detail product purchases by customers, while connected dimension tables hold data about the products and customers. Notable types of dimensional models are star schemas, which connect a fact table to different dimension tables, and snowflake schemas, which include multiple levels of dimension tables.
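The product purchase example can be sketched as a small star schema. The following code assumes SQLite through Python's sqlite3 module; the table names, column names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the entities involved in each purchase.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
    );
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY, name TEXT
    );
    -- The fact table records one row per purchase event.
    CREATE TABLE fact_purchase (
        product_id INTEGER REFERENCES dim_product(product_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        quantity INTEGER,
        amount REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'Laptop', 'Hardware'), (2, 'Mouse', 'Hardware'),
        (3, 'Antivirus', 'Software');
    INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO fact_purchase VALUES
        (1, 1, 2, 2000.0), (2, 1, 5, 100.0),
        (3, 2, 10, 500.0), (1, 2, 1, 1000.0);
""")

# A typical BI query: aggregate facts, grouped by a dimension attribute.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_purchase AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(sales_by_category)
```

In a snowflake schema, the category column would typically move into its own table that dim_product references, adding a second level of dimension tables.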
6. Object-oriented data modeling
As object-oriented programming advanced in the 1990s and software vendors developed object databases, object-oriented data modeling also emerged. The object-oriented approach is similar to the ER method in how it represents data, attributes and relationships, but it abstracts entities into objects. Different objects that have the same attributes and behaviors can be grouped into classes, and new classes can inherit the attributes and behaviors of existing ones. But object databases remain a niche technology for particular applications, which has limited the use of object-oriented modeling.
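The ideas of classes, attributes, behaviors and inheritance can be sketched in Python as follows. The Employee and Manager classes and the vacation rule are hypothetical and exist only to show a new class inheriting from an existing one.

```python
class Employee:
    """Objects with the same attributes and behaviors share a class."""

    def __init__(self, first_name, last_name, years_employed):
        self.first_name = first_name
        self.last_name = last_name
        self.years_employed = years_employed

    def full_name(self):
        return f"{self.first_name} {self.last_name}"

    def vacation_days(self):
        # Hypothetical business rule, used only for illustration.
        return 15 + self.years_employed

class Manager(Employee):
    """A new class inherits Employee's attributes and behaviors."""

    def __init__(self, first_name, last_name, years_employed, reports=None):
        super().__init__(first_name, last_name, years_employed)
        self.reports = reports or []

    def vacation_days(self):
        # Extends the inherited behavior instead of redefining it.
        return super().vacation_days() + 5

ada = Employee("Ada", "Lovelace", 3)
grace = Manager("Grace", "Hopper", 10, reports=[ada])

print(grace.full_name(), grace.vacation_days())
print(ada.full_name(), ada.vacation_days())
```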
7. Graph data modeling
The graph data model is a more modern offshoot of network and hierarchical models. Typically paired with graph databases, it's often used to describe data sets that contain complex relationships. For example, graph data modeling is a popular approach in social networks, recommendation engines and fraud detection applications. Property graph data models are a common type -- in them, nodes that represent data entities and document their properties are connected by relationships, also known as edges or links, that define how different nodes are related to one another.
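A property graph can be sketched with two small record types and a lookup method. The following Python example is a simplified in-memory illustration, not the API of any graph database; the Node, Edge and PropertyGraph names and the sample social network data are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    label: str
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    source: str
    target: str
    rel_type: str
    properties: dict = field(default_factory=dict)

class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, label, **properties):
        self.nodes[node_id] = Node(node_id, label, properties)

    def add_edge(self, source, target, rel_type, **properties):
        self.edges.append(Edge(source, target, rel_type, properties))

    def neighbors(self, node_id, rel_type):
        """Nodes reached from node_id over outgoing edges of one type."""
        return [
            self.nodes[edge.target]
            for edge in self.edges
            if edge.source == node_id and edge.rel_type == rel_type
        ]

graph = PropertyGraph()
graph.add_node("alice", "Person", city="Boston")
graph.add_node("bob", "Person", city="Austin")
graph.add_node("carol", "Person", city="Denver")
graph.add_edge("alice", "bob", "FOLLOWS", since=2021)
graph.add_edge("bob", "carol", "FOLLOWS", since=2023)

# Two-hop traversal: accounts followed by the accounts alice follows.
suggestions = [
    second.node_id
    for first in graph.neighbors("alice", "FOLLOWS")
    for second in graph.neighbors(first.node_id, "FOLLOWS")
]
print(suggestions)
```

Traversals like this two-hop lookup are the kind of relationship-heavy query that recommendation engines rely on.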
What is the data modeling process?
Ideally, conceptual, logical and physical data models are created in a sequential process that involves members of the data management team and business users. Input from business executives and workers is especially important during the conceptual and logical modeling phases. Otherwise, the data models may not fully capture the business context of data or meet an organization's information needs.
Typically, a data modeler or data architect initiates a modeling project by interviewing business stakeholders to gather requirements and details about business processes. Business analysts may also help design both the conceptual and logical models. At the end of the project, the physical data model is used to communicate specific technical requirements to database designers.
Peter Aiken, a data management consultant and associate professor of information systems at Virginia Commonwealth University, listed the following six steps for designing a data model during a 2019 Dataversity webinar:
1. Identify the business entities that are represented in the data set.
2. Identify key properties for each entity to differentiate between them.
3. Create a draft entity-relationship model to show how entities are connected.
4. Identify the data attributes that need to be incorporated into the model.
5. Map the attributes to entities to illustrate the data's business meaning.
6. Finalize the data model and validate its accuracy.
Even after that, the process typically isn't finished: Data models often must be updated and revised as an organization's data assets and business needs change.
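The listed steps can be loosely sketched as data plus a validation pass. The Python example below is an illustration, not Aiken's method as presented; the entity names, attribute names and the validate function are hypothetical, and one attribute is deliberately mapped to a missing entity to show what the final check might catch.

```python
# First and second steps: business entities and the key property of each.
entities = {"Customer": "customer_id", "Order": "order_id"}

# Third step: a draft of how entities are connected.
relationships = [("Customer", "places", "Order")]

# Fourth and fifth steps: attributes, each mapped to an entity.
attribute_map = {
    "customer_id": "Customer",
    "customer_name": "Customer",
    "order_id": "Order",
    "order_date": "Order",
    "ship_mode": "Shipment",   # deliberate error: no Shipment entity exists
}

def validate(entities, relationships, attribute_map):
    """Final step: return a list of consistency problems in the draft."""
    problems = []
    for entity, key in entities.items():
        if attribute_map.get(key) != entity:
            problems.append(f"{entity}: key '{key}' is not mapped to it")
    for source, _, target in relationships:
        for name in (source, target):
            if name not in entities:
                problems.append(f"relationship references unknown entity '{name}'")
    for attribute, entity in attribute_map.items():
        if entity not in entities:
            problems.append(
                f"attribute '{attribute}' maps to unknown entity '{entity}'"
            )
    return problems

problems = validate(entities, relationships, attribute_map)
for problem in problems:
    print(problem)
```

Rerunning a check like this whenever the model changes is one simple way to keep revisions consistent as data assets and business needs evolve.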
Benefits and challenges of data modeling
Well-designed data models help an organization develop and implement a data strategy that takes full advantage of its data. Effective data modeling also helps ensure that individual databases and applications include the right data and are designed to meet business requirements for data processing and management.
Other benefits that data modeling provides include the following:
- Internal agreement on data definitions and standards. Data modeling supports efforts to standardize data definitions, terminology, concepts and formats enterprise-wide.
- Increased involvement in data management by business users. Because data modeling requires business input, it encourages collaboration between data management teams and business stakeholders, which ideally results in better systems.
- More efficient database design at a lower cost. By giving database designers a detailed blueprint to work from, data modeling streamlines their work and reduces the risk of design missteps that require revisions later in the process.
- Better use of available data assets. Ultimately, good data modeling enables organizations to use their data more productively, which can lead to better business performance, new business opportunities and competitive advantages over rival companies.
However, data modeling is a complicated process that can be difficult to do successfully. These are some of the common challenges that can send data modeling projects off track:
- A lack of organizational commitment and business buy-in. If corporate and business executives aren't on board with the need for data modeling, it's hard to get the required level of business participation. That means data management teams must secure executive support upfront.
- A lack of understanding by business users. Even if business stakeholders are fully committed, data modeling is an abstract process that can be hard for people to grasp. To help avoid that, conceptual and logical data models should be based on business terminology and concepts.
- Modeling complexity and scope creep. Data models often are big and complex, and modeling projects can become unwieldy if teams continue to create new iterations without finalizing the designs. It's important to set priorities and stick to an achievable project scope.
- Undefined or unclear business requirements. Particularly with new applications, the business side may not have fully formed information needs. Data modelers often must ask a series of questions to gather or clarify requirements and identify the necessary data.