What is data validation?
Data validation is the practice of checking the integrity, accuracy and structure of data before it is used for a business operation. Validated data can then be used for data analytics, business intelligence or training a machine learning model. Validation can also be used to ensure the integrity of data for financial accounting or regulatory compliance.
Data can be examined as part of a validation process in a variety of ways, including data type, constraint, structured, consistency and code validation. Each type of data validation is designed to make sure the data meets the requirements to be useful.
Data validation is related to data quality. It can serve as one component of measuring data quality, the broader effort to ensure that a given data set draws on information sources that are high quality, authoritative and accurate.
Data validation is also used as part of application workflows, including spell checking and rules for strong password creation.
Why validate data?
For data scientists, data analysts and others working with data, validation is essential because the output of any given system can only be as good as the data it is based on. Such systems include machine learning or artificial intelligence models, data analytics reports and business intelligence dashboards. Validating the data helps ensure that it is accurate, which in turn makes the systems that rely on the validated data set more reliable.
Data validation is also important for data to be useful for an organization or for a specific application operation. For example, if data is not in the right format to be consumed by a system, then the data can't be used easily, if at all.
As data moves from one location to another, different needs for the data arise based on the context for how the data is being used. Data validation ensures that the data is correct for specific contexts. The right type of data validation makes the data useful.
What are the different types of data validation?
Multiple types of data validation are available to ensure that the right data is being used. The most common types of data validation include the following:
- Data type validation is common and confirms that the data in each field, column, list, range or file matches a specified data type and format.
- Constraint validation checks whether a given data field input meets a specified requirement, such as falling within a certain range. For example, it verifies that a data field has a minimum or maximum number of characters.
- Structured validation ensures that data is compliant with a specified data format, structure or schema.
- Consistency validation makes sure data styles are consistent. For example, it confirms that all values are listed to two decimal places.
- Code validation is similar to a consistency check and confirms that codes used for different data inputs are correct. For example, it checks country codes or North American Industry Classification System (NAICS) codes.
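The five types above can each be expressed as a small check. The following is a minimal Python sketch rather than a definitive implementation: the function names, the sample country codes and the choice of required fields are all assumptions made for illustration, and a real system would typically load a full reference list (such as ISO 3166-1) and use a schema library.

```python
import re

# Hypothetical reference set for illustration only; a real system would
# load the complete list of valid codes from an authoritative source.
KNOWN_COUNTRY_CODES = {"US", "CA", "MX", "GB", "DE"}


def is_valid_type(value, expected_type):
    """Data type validation: the value is an instance of the expected type."""
    # bool is a subclass of int in Python, so reject it for integer fields.
    if expected_type is int and isinstance(value, bool):
        return False
    return isinstance(value, expected_type)


def is_within_length(text, min_len, max_len):
    """Constraint validation: the string length is within an inclusive range."""
    return min_len <= len(text) <= max_len


def has_required_fields(record, required_fields):
    """Structured validation: the record contains every required field."""
    return all(field in record for field in required_fields)


def has_two_decimal_places(text):
    """Consistency validation: the value is written with exactly two decimals."""
    return re.fullmatch(r"-?\d+\.\d{2}", text) is not None


def is_known_country_code(code):
    """Code validation: the code appears in the reference set."""
    return code in KNOWN_COUNTRY_CODES
```

Each function returns a Boolean, so the checks can be combined freely, for example by running all of them against a record and collecting the names of the ones that fail.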
How to perform data validation
One of the most basic and common places data is used is a spreadsheet program such as Microsoft Excel or Google Sheets. In both Excel and Sheets, data validation is a straightforward, integrated feature, available under Data > Data Validation. After selecting Data Validation, a user can choose the specific data type or constraint validation required for a given file or data range.
Extract, transform and load (ETL) and data integration tools typically integrate data validation policies that run as data is extracted from one source and loaded into another. Popular open source tools, such as dbt, also include data validation options and are commonly used for data transformation.
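The idea of validating data between extraction and loading can be sketched in a few lines of Python. This is an illustrative outline only and does not represent the API of dbt or any specific ETL tool; the field names `order_id` and `amount` and both function names are hypothetical.

```python
def validate_row(row):
    """Return a list of problems found in one extracted row (empty if none)."""
    problems = []
    if not isinstance(row.get("order_id"), int):
        problems.append("order_id must be an integer")
    amount = row.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("amount must be a non-negative number")
    return problems


def run_pipeline(extracted_rows):
    """Split extracted rows into rows to load and rows to set aside."""
    to_load, rejected = [], []
    for row in extracted_rows:
        problems = validate_row(row)
        if problems:
            # Keep the row together with the reasons it failed validation.
            rejected.append((row, problems))
        else:
            to_load.append(row)
    return to_load, rejected
```

Setting aside rejected rows along with their failure reasons, rather than silently dropping them, makes it possible to review and correct bad records at the source.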
Data validation can also be done programmatically in an application context for an input value. For example, when an input value such as a password is submitted, a script can check that it satisfies a constraint on length.
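A password length check of this kind might look like the following Python sketch. The limits of 12 and 128 characters and the function name are assumptions chosen for illustration, not recommended values from any particular policy.

```python
# Assumed limits for illustration; real limits depend on the application's policy.
MIN_PASSWORD_LENGTH = 12
MAX_PASSWORD_LENGTH = 128


def validate_password_length(password):
    """Return a list of error messages; an empty list means the check passed."""
    errors = []
    if len(password) < MIN_PASSWORD_LENGTH:
        errors.append(
            f"Password must be at least {MIN_PASSWORD_LENGTH} characters long."
        )
    if len(password) > MAX_PASSWORD_LENGTH:
        errors.append(
            f"Password must be at most {MAX_PASSWORD_LENGTH} characters long."
        )
    return errors
```

Returning messages instead of a bare Boolean lets the application tell the user which constraint failed.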