What is raw data?
Raw data (sometimes called source data, atomic data or primary data) is data that has not been processed for use. A distinction is sometimes made between data and information to the effect that information is the end product of data processing. Raw data that has undergone processing is sometimes referred to as cooked data.
Although raw data has the potential to become "information," it requires selective extraction, organization and sometimes analysis and formatting for presentation. As a result of processing, raw data sometimes ends up in a database, which makes the data accessible for further processing and analysis in a number of different ways.
How raw data works
Tremendous amounts of raw data surround us and are produced every day. The human brain is incredibly good at taking in raw data, processing it and using it to make decisions.
For example, imagine you are trying to cross a busy road. The eyes capture raw data as flashes of light and dark. Then the brain takes these flashes and resolves them into objects such as street signs and cars. The working memory can tell you if that car is sitting still, getting bigger as it comes toward you, or getting smaller as it drives away. Meanwhile, the ears take in raw data in the form of vibrations in the air, which the brain translates into sounds that can be interpreted as the wind, voices or a car engine. Finally, all this processed data that came in through the eyes, ears and memory helps you make the informed decision to cross the street or not.
Computers, however, cannot intuitively process raw data the way a human mind can, and raw data is generally not useful on its own. Extra processing is required to turn it into useful information. Additionally, the final data from one system may be used as raw data in another.
For example, imagine a simple home thermostat. Its raw data source is a temperature probe -- usually read as an analog voltage level. The system takes this voltage level as raw data and turns it into a temperature reading. It can then compare this processed data against a predetermined desired temperature to decide when to turn a heater or air conditioner on or off.
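The thermostat example can be sketched in a few lines of Python. This is only an illustration: the sensor characteristics (0.5 V at 0 degrees Celsius, 10 mV per degree), the function names and the small hysteresis band that keeps the heater from toggling rapidly are all assumptions, not details of any particular thermostat.

```python
def voltage_to_celsius(voltage: float) -> float:
    """Turn a raw probe voltage into a temperature reading.

    Assumes a hypothetical linear sensor: 0.5 V at 0 degrees C, 10 mV per degree.
    """
    return (voltage - 0.5) / 0.01


def heater_should_run(temp_c: float, setpoint_c: float,
                      heater_on: bool, band_c: float = 0.5) -> bool:
    """Decide the heater state from the processed reading.

    The band around the setpoint (an assumption) avoids rapid on/off cycling:
    inside the band, the heater simply keeps its current state.
    """
    if temp_c < setpoint_c - band_c:
        return True   # too cold: turn the heater on
    if temp_c > setpoint_c + band_c:
        return False  # too warm: turn the heater off
    return heater_on  # within the band: leave the heater as it is


if __name__ == "__main__":
    reading = voltage_to_celsius(0.68)  # raw voltage in, temperature out
    print(round(reading, 1), heater_should_run(reading, 20.0, heater_on=False))
```

The voltage is the raw data, the temperature is the processed data, and the on/off decision is the information the system acts on.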
Furthermore, the system may feed this temperature reading and the current time into another climate control system as that system's raw data. Then the data is stored and analyzed over time to produce a predictive modeling algorithm to help make better heating and cooling decisions.
How to process raw data
Many sources can produce raw data. How it is processed and stored, though, depends on its source and intended use. Examples of raw data include financial transactions from a point of sale (POS) terminal, computer logs or even participant eye-tracking data in a research project. Applications and devices can save raw data in various formats, but a common format for interchanging raw data between systems is a comma-separated values (CSV) file.
In many instances, users must clean raw data before it can be used. Cleaning raw data may require parsing the data for easier ingestion into a computer, removing outliers or spurious results and, occasionally, reformatting or translating the data -- a process sometimes called massaging or crunching the data.
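As a rough sketch of what parsing and cleaning can look like, the following Python example reads a small CSV extract, skips rows that cannot be parsed and drops readings that sit far from the median. The column names, the sample values and the 10-degree cutoff are hypothetical, and real data sets usually call for a more careful definition of an outlier.

```python
import csv
import io
import statistics

# Hypothetical raw export: one blank value, one "n/a" and one spurious spike.
RAW_CSV = """timestamp,temperature_c
2024-01-01T00:00,20.1
2024-01-01T01:00,19.8
2024-01-01T02:00,
2024-01-01T03:00,n/a
2024-01-01T04:00,20.4
2024-01-01T05:00,85.0
2024-01-01T06:00,20.0
"""


def parse_readings(text):
    """Parse CSV text into (timestamp, value) pairs, skipping unparseable rows."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        try:
            rows.append((record["timestamp"], float(record["temperature_c"])))
        except (TypeError, ValueError):
            continue  # blank or non-numeric value: drop the row
    return rows


def drop_outliers(rows, max_deviation=10.0):
    """Remove readings farther than max_deviation from the median value."""
    if not rows:
        return []
    median = statistics.median(value for _, value in rows)
    return [(ts, value) for ts, value in rows if abs(value - median) <= max_deviation]


if __name__ == "__main__":
    for timestamp, value in drop_outliers(parse_readings(RAW_CSV)):
        print(timestamp, value)
```

Using the median rather than the mean as the reference point keeps a single extreme reading from shifting the cutoff.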
There are many ways to process raw data, ranging from simple to complex. A spreadsheet such as Microsoft Excel or Google Sheets allows users to format, organize and graph data to reveal simple trends and help summarize data. More complicated systems such as business intelligence (BI) programs may use raw data for financial trending or forecasting purposes. Advanced systems may use raw data for alerting purposes or with machine learning to build models of the data and its behavior.
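At the simple end of that range, summarizing and alerting can be illustrated with a short Python sketch. The transaction list, the function names and the rule of flagging any day below half the average daily total are invented for illustration and do not reflect how any particular BI or alerting product works.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical POS transactions: (date, amount in cents).
TRANSACTIONS = [
    ("2024-03-01", 1250), ("2024-03-01", 800),
    ("2024-03-02", 2300), ("2024-03-02", 450),
    ("2024-03-03", 300),
]


def daily_totals(transactions):
    """Summarize raw transactions into one total per day."""
    totals = defaultdict(int)
    for day, amount in transactions:
        totals[day] += amount
    return dict(totals)


def low_sales_days(totals, fraction=0.5):
    """Return the days whose total is below `fraction` of the average daily total."""
    average = mean(totals.values())
    return sorted(day for day, total in totals.items() if total < fraction * average)


if __name__ == "__main__":
    totals = daily_totals(TRANSACTIONS)
    print(totals)
    print("Low-sales days:", low_sales_days(totals))
```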
Value of raw data
Data's primary value emerges after it has been processed and interpreted. There is generally not much value in holding onto raw data without a way to use it, but as the cost of storage decreases, organizations are finding more and more value in collecting raw data for additional processing -- if not right away, then later.
Raw data may contain personally identifiable information (PII). This may make an organization liable for storing or transmitting it. Therefore, the organization may use data anonymization to remove PII from the raw data, apply data controls and implement data retention policies to limit the risk of data leaks.
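One way to reduce the PII in a raw record is sketched below in Python. The field names and the key are hypothetical. Note that replacing an identifier with a keyed hash is strictly pseudonymization rather than full anonymization, because anyone holding the key can recompute the mapping, so this should be read as one possible step and not as a complete privacy solution.

```python
import hashlib
import hmac

# Hypothetical names of fields that identify a person directly.
PII_FIELDS = {"name", "email"}


def pseudonymize(record, secret_key: bytes, id_field="user_id"):
    """Drop direct identifiers and replace the user ID with a keyed hash.

    The same ID always maps to the same token, so records can still be
    correlated, but the original ID is not stored in the output.
    """
    cleaned = {key: value for key, value in record.items() if key not in PII_FIELDS}
    if id_field in cleaned:
        message = str(cleaned[id_field]).encode("utf-8")
        cleaned[id_field] = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return cleaned


if __name__ == "__main__":
    raw = {"user_id": 42, "name": "Ada Example",
           "email": "ada@example.com", "amount_cents": 1250}
    print(pseudonymize(raw, secret_key=b"replace-with-a-real-secret"))
```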
Organizations can feed raw data into a database or a data warehouse (one of several kinds of data repositories -- see image above), which can collect raw data from many sources for automatic or manual correlating and processing. An analyst can then query the data using BI tools to produce useful information from the data.
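The idea of collecting raw data from several sources and then querying it can be shown in miniature with Python's built-in sqlite3 module. The table layout, the two sources and the sample rows are assumptions made for this sketch. A production data warehouse would be far larger and would normally be queried through dedicated BI tooling.

```python
import sqlite3


def build_store(pos_rows, web_rows):
    """Load raw (sale_date, amount_cents) rows from two sources into one table."""
    conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
    conn.execute(
        "CREATE TABLE sales (source TEXT, sale_date TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES ('pos', ?, ?)", pos_rows)
    conn.executemany("INSERT INTO sales VALUES ('web', ?, ?)", web_rows)
    conn.commit()
    return conn


def revenue_by_source(conn):
    """Query the combined data: total revenue per source."""
    query = (
        "SELECT source, SUM(amount_cents) FROM sales "
        "GROUP BY source ORDER BY source"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    store = build_store(
        pos_rows=[("2024-03-01", 1250), ("2024-03-02", 800)],
        web_rows=[("2024-03-01", 500)],
    )
    print(revenue_by_source(store))
    store.close()
```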
Many large businesses today recognize the value of raw data. Consumer data is a hot commodity that they can buy and sell to build profiles of users or target a specific audience, for example. Businesses can also store operational and logging data for use in performance metrics and to streamline business practices, while they can use access logs and the like to identify computer breaches and track what data may have been accessed by hackers.