How to preprocess different types of data for AI workloads
Data preparation for AI differs by data type. However, common themes include improving data quality, ensuring consistency, reducing computing demands and enhancing model performance.
The oldest axiom of computer science is "garbage in, garbage out." There's even an acronym for it: GIGO. The IT industry has seen countless examples of GIGO in action since the dawn of the computer age, but nowhere has this classic truism resonated more than in today's AI era.
The key to good AI isn't in algorithms. It's in the data. Bad data can ruin the best and most nuanced AI algorithms -- along with a business's reputation and revenue. AI needs good data, and lots of it. This puts the focus squarely on data quality to ensure data completeness, accuracy, fairness, relevance and timeliness.
However, even quality data doesn't work well in its raw state. Data teams must preprocess data before it can be used to train and tune AI models, and the preparation process can differ across major data types. Given the importance of AI data preprocessing, it's time to take a closer look at the preprocessing steps for four major data types: structured, unstructured, semistructured and sensor data.
Why does AI need preprocessed data?
Data needs preparation before it can be delivered to the machine learning (ML) models of an AI system. This is always true -- whether training and validating an AI model or using the platform for inference. Preprocessing is a central element of AI data quality efforts and brings several important benefits to the AI system:
Data quality assurance
Preprocessing is the last significant opportunity for a business to examine data and address errors, reduce noise, handle missing data and gauge the value of data elements. Examination can involve multiple tools, such as Anomalo, Ataccama, Informatica and Monte Carlo. Proper tools -- and some human intervention -- improve model accuracy and outcomes.
Data consistency
ML model algorithms require consistent data structures, such as numerical formats, to properly identify data patterns. Inconsistent data structures can significantly complicate pattern ID and recognition, reducing output accuracy. Preprocessing organizes data into an established, normalized format. For example, distance data might include both imperial and metric units. Normalization ensures that all distance data use the same units and precision.
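The distance example above can be sketched in a few lines of Python. This is a minimal illustration of unit normalization, not any particular tool's API; the function name, field layout and two-decimal precision are illustrative assumptions.

```python
# Normalize mixed imperial/metric distance readings to kilometers at a
# consistent precision. The conversion factor is the standard mile-to-km
# definition; field names and precision are illustrative choices.
MILES_TO_KM = 1.609344

def normalize_distance(value, unit, precision=2):
    """Return the distance in kilometers, rounded to a fixed precision."""
    if unit == "mi":
        value *= MILES_TO_KM
    elif unit != "km":
        raise ValueError(f"unknown unit: {unit}")
    return round(value, precision)

# Readings from different sources, some in miles and some in kilometers
readings = [(10.0, "mi"), (16.0934, "km"), (5.5, "mi")]
normalized = [normalize_distance(v, u) for v, u in readings]
```

After this step, every distance uses the same unit and precision, so the model never has to disentangle a 10 from a 16.09 that describe the same physical quantity.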
Noise reduction
Raw data inevitably contains noise: random errors, inaccuracies, outlying values or meaningless information. Noise can obscure or distort valuable data patterns, reducing model accuracy and impairing outcomes. Noise is common in live data, such as streams from manufacturing sensors and IoT devices in autonomous vehicles. Noisy data is often spotted and removed near the edge, but it should be checked again during preprocessing.
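One common noise reduction technique is a sliding median filter, which suppresses isolated spikes while leaving the underlying signal largely intact. The sketch below is a generic example using only the Python standard library; the sensor values and window size are made up for illustration.

```python
import statistics

def median_filter(values, window=3):
    """Smooth a signal by replacing each point with the median of its
    neighborhood -- a simple way to suppress isolated outlier spikes."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(statistics.median(values[lo:hi]))
    return smoothed

# A temperature stream with one spurious spike at index 2
stream = [20.1, 20.3, 95.0, 20.2, 20.4]
smoothed = median_filter(stream)
```

A median is preferred over a mean here because a single extreme value drags an average far off course but barely moves the median.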
Bias reduction
ML models and AI systems are facing increasing ethical scrutiny, including concerns about bias and fairness. Preprocessing is an opportunity for a business to evaluate data for content that might result in unfair, inaccurate or otherwise skewed outcomes. Numerous tools can assist in bias detection, such as Amazon SageMaker Clarify, Credo AI, Fiddler AI, Google What-If Tool and IBM AI Fairness 360.
Preprocessing structured data for AI
Structured data includes strict data types that are formatted into well-established row-and-column structures. For example, structured data typically involves numbers, fixed-length alphanumeric strings and dates formatted for use in CRM systems, databases and spreadsheets.
AI works well with structured data because its formalized structure can be processed, searched and analyzed quickly for efficient decision-making. This makes structured data important for predictive tasks such as forecasting and anomaly detection.
While structured data might be well-suited to AI, the data contained within the structure can be far from ideal. Preprocessing structured data for AI typically emphasizes the following practices:
Cleaning. Check for missing values, remove duplicate entries and outliers, and address other data inconsistencies.
Normalization. Adjust numerical data so that every entry uses the same scale and precision. This is essential when data is collected from different devices or sources.
Encoding. Convert non-numerical data into numerical data so ML models can process it.
Feature engineering. Add or manipulate data to better represent the intended problem or situation. For example, prices and area data can be preprocessed to calculate price per square foot, which can be added to the data set to augment cost information.
Data reduction. Data sets can be vast, and computing time is expensive. Use data reduction techniques, such as principal component analysis, to reduce the overall amount of data while maintaining the characteristics of the data set.
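Several of the steps above can be sketched together on a toy data set. This is a hedged, stdlib-only illustration of cleaning, feature engineering, encoding and normalization; the housing columns, values and min-max scaling choice are illustrative assumptions, not a prescribed pipeline.

```python
# Toy structured data set with one duplicate row and one missing value
rows = [
    {"price": 300000, "sqft": 1500, "type": "condo"},
    {"price": 450000, "sqft": 2000, "type": "house"},
    {"price": 450000, "sqft": 2000, "type": "house"},  # duplicate entry
    {"price": None,   "sqft": 1200, "type": "condo"},  # missing value
]

# Cleaning: drop rows with missing values and exact duplicates
seen, clean = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if None in r.values() or key in seen:
        continue
    seen.add(key)
    clean.append(dict(r))

# Feature engineering: derive price per square foot as a new column
for r in clean:
    r["price_per_sqft"] = r["price"] / r["sqft"]

# Encoding: map the categorical "type" column to integer codes
categories = sorted({r["type"] for r in clean})
for r in clean:
    r["type_code"] = categories.index(r.pop("type"))

# Normalization: min-max scale prices onto a [0, 1] range
prices = [r["price"] for r in clean]
lo, hi = min(prices), max(prices)
for r in clean:
    r["price_scaled"] = (r["price"] - lo) / (hi - lo)
```

In practice, libraries such as pandas and scikit-learn provide vetted implementations of each of these steps; the point of the sketch is only to show what each step does to the rows.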
Preprocessing unstructured data for AI
Unstructured data is any information that lacks the well-defined row-and-column formats of structured data. Unstructured data is far more common than structured data, and includes text, images, audio and video.
Unstructured data can be extremely difficult for AI to work with. Its chaotic nature makes it challenging to organize, search and analyze. However, unstructured data is crucial for advancing AI tasks, such as image recognition and generative AI.
The unstructured data needed for many AI applications comes in various forms.
The challenge with unstructured data is converting the raw information into data types and formats that an AI can learn and understand. Preprocessing unstructured data for AI can include the following steps:
Format standardization. To ensure consistency as raw data is ingested, clean unstructured data so that each element uses the same file format, resolution or sampling rate.
Deduplication. Unstructured data sets can be enormous, so deduplication serves as a data reduction step: it finds and removes duplicate entries, because multiple copies of the same, or nearly the same, data add nothing to AI training.
Noise reduction. An additional cleaning step separates useful data from useless data. Noise reduction identifies and removes unnecessary content, leaving only useful content for ingestion or future processing.
Ingestion. Unstructured data must be converted into a format that an ML model can process. Conversion involves the following: natural language processing for text; optical character recognition for extracting text from varied data sources; speech-to-text conversion to convert audio to text; and image conversion into numerical vectors interpreted using neural networks.
Feature engineering. Ingested data is often augmented. For example, sensitive or personally identifiable information can be tagged, as can other important elements of the data. Text data can also be tagged with various metadata to further support searches and categorization.
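The deduplication step above can be sketched with content hashing, which catches exact byte-for-byte copies. This is an assumption-laden simplification: near-duplicate detection (for example, perceptual hashing of images) needs more specialized techniques, and the sample documents are invented.

```python
import hashlib

def dedupe(blobs):
    """Drop exact-duplicate items by hashing their raw bytes.
    Only byte-identical copies are caught; near-duplicates need
    fuzzier methods such as perceptual or similarity hashing."""
    seen = set()
    unique = []
    for blob in blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(blob)
    return unique

docs = [b"quarterly report v1", b"quarterly report v1", b"quarterly report v2"]
unique_docs = dedupe(docs)
```

Hashing keeps memory use proportional to the number of unique items rather than their total size, which matters when the data set is terabytes of files.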
Preprocessing semistructured data for AI
While semistructured data occupies a middle ground between structured and unstructured data, it isn't a mix of the two. Semistructured data doesn't fit within a rigid relational database of traditional rows and columns, but it does contain labels and hierarchies that provide structure.
Examples of semistructured data include JSON, HTML, YAML and XML -- all important data types often used in programming, log files and sensor data that AI must ingest and understand.
The challenges with semistructured data for AI include translating flexible, hierarchical and inconsistent data into a standardized numerical or tabular format for ML models. Preprocessing semistructured data for AI can involve the following:
Unified schema. Identify the nuances of each semistructured data file and implement conversions or translations to operate on each file type, enabling a unified schema for all file types.
Flattened hierarchies. Parse and translate nested and hierarchical data structures within a file, such as nested JSON, into a standardized tabular format, such as columns and rows.
Cleaning. Examine translated data for missing and outlying values, as well as missing and malformed records. Due to potential variations in file types and conversions, consider generating and reviewing detailed cleaning logs to validate cleaning and maintain data governance.
Normalization. Normalize and scale converted data so that all data uses a consistent scale, range and other features. Apply other data transformations as needed.
Feature engineering. Manipulate or enhance translated data to better represent the intended problem or enhance model training. For example, create new data from timestamps included with the data. Carefully log all feature engineering for transparency and review. Tools for manipulating data include NumPy and Python's pandas.
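The hierarchy-flattening step described above can be sketched in plain Python. The recursive helper and the sample record are illustrative; real pipelines often use utilities such as pandas' json_normalize for the same purpose.

```python
import json

def flatten(obj, parent_key="", sep="."):
    """Recursively flatten a nested JSON object into a single-level
    dict whose keys encode the original hierarchy -- one row of a
    tabular representation."""
    items = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items

# An illustrative nested sensor record arriving as JSON
record = json.loads(
    '{"device": {"id": 7, "loc": {"lat": 41.9, "lon": -87.6}}, "temp": 21.5}'
)
row = flatten(record)
```

Each nested level becomes a dotted column name, so a collection of such records can be loaded directly into a rows-and-columns structure with a unified schema.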
Preprocessing sensor data for AI
Sensor data is raw information collected in real time by devices positioned throughout an environment, such as a manufacturing floor. This often includes real-world data such as light, location, temperature and vibration. Sensor data is particularly valuable for AI pattern recognition, automation, prediction and decision-making.
Sensor data is typically highly structured, so it benefits from the preprocessing approaches for structured data. But sensor data exhibits three noteworthy traits that need special attention:
Volume and time. Sensor data is time-sensitive and generated at high volumes, demanding particular care in network and storage design to manage latency and mitigate disruption. Real-time data also demands adequate computational resources -- there's no point in gathering time-sensitive data if it can't be used immediately. This is a key reason why sensor data is frequently preprocessed at the edge.
Data quality. Sensor data poses inherent data quality problems. Sensors can fail due to environmental hazards. Sensor data can be interrupted by signal loss due to obstructions or by entries being abandoned when the preprocessing system can't keep up with the incoming data flow. These types of issues result in sensor data containing missing and outlying values, as well as noise, which must be addressed in real time.
Time stamps. Sensor data never exists in isolation. Time is always a factor, and time stamps play a central role in sensor data quality. Time stamps are a major factor in data alignment, synchronizing data from all sensors and other data sources to create a complete picture of real-world conditions at any given moment. An ML model must make full use of time stamps during the learning process.
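Aligning streams from sensors that sample on different clocks is a common concrete form of the time stamp work described above. The sketch below uses linear interpolation to place two streams on a shared grid; the epoch-second timestamps, sensor names and five-second grid are all illustrative assumptions.

```python
from bisect import bisect_left

def value_at(times, values, t):
    """Linearly interpolate a sensor reading at timestamp t, so streams
    sampled at different rates can be aligned on a shared clock.
    Readings outside the recorded range clamp to the nearest endpoint."""
    i = bisect_left(times, t)
    if i == 0:
        return values[0]
    if i == len(times):
        return values[-1]
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# Two sensors sampled on different clocks (epoch seconds, illustrative)
temp_t, temp_v = [0, 10, 20], [20.0, 22.0, 24.0]
vib_t, vib_v = [0, 4, 8, 12, 16, 20], [1.0, 1.1, 0.9, 1.2, 1.0, 1.1]

# Align both streams onto a shared five-second grid
grid = [0, 5, 10, 15, 20]
aligned = [
    (t, value_at(temp_t, temp_v, t), value_at(vib_t, vib_v, t)) for t in grid
]
```

Once aligned, each grid point carries one reading from every sensor, giving the model the complete picture of conditions at that moment.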
Document AI data preprocessing practices
Documentation is often overlooked in AI data preprocessing. Comprehensive data documentation serves four vital functions:
1. Identifies the AI data workflow
Proper documentation outlines the entire AI data workflow: acquisition; storage and security; discrete preparation steps such as profiling, cleaning, reduction, transformation, enrichment and validation; model training and testing; and data retention guidance. These details are important for AI transparency, explainability and reproducibility.
2. Supports accountability and compliance
AI systems are increasingly subject to legal and regulatory pressure. Documenting AI data sources and processing workflows can help businesses provide clear, definitive reporting on where data comes from, how it's prepared and how it's used to train and test models. Documentation also supports more traditional concerns, such as data storage, security, privacy and retention.
3. Establishes consistency and reproducibility
Documenting a process creates a tangible template for consistent and reproducible data outcomes. This carries two benefits for AI data. First, documentation shows that all data is handled and processed consistently every time. Second, it ensures that the same data, processed in the same way, yields the same outcomes.
4. Enhances staff training and support
Documentation removes much of the opacity that can occur when automation and large tool sets perform the heavy lifting across a business. Documentation helps with training, ensuring the entire AI team can see, understand and work within the data preparation process.
Many data processing tools, such as OpenRefine, Scrub AI and Zoho DataPrep, include documentation features, while platforms such as Alteryx, Dataiku and Datameer provide highly automated end-to-end documentation. The AI team must review any resulting documentation and share it with business leaders and AI platform stakeholders.
Stephen J. Bigelow, senior technology editor at TechTarget, has more than 30 years of technical writing experience in the PC and technology industry.