What is garbage in, garbage out (GIGO)?
Garbage in, garbage out, or GIGO, refers to the idea that in any system, the quality of output is determined by the quality of the input. For example, if a mathematical equation is improperly stated, the answer is unlikely to be correct. Similarly, if incorrect data is used as input into a computer program, the output is unlikely to be correct or informative.
Put simply, GIGO means that a system's output can be no better than the input it receives. If you put garbage in, chances are high that you will get garbage out, even if the program's logic is accurate. Thus, while sound logic is important, correct input is equally, if not more, important for generating correct and useful output.
The idea of GIGO is commonly used in mathematics and computer science, particularly in software development. However, it can be extended to any decision-making system or process where precise, accurate data is essential to generate correct results that can be used to make the right decisions.
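As a minimal sketch of this point, the Python snippet below applies the same correct averaging logic to a clean input and to a flawed one. The sensor readings, the function name average_temperature and the -999.0 sentinel value for a failed reading are all hypothetical, chosen only for illustration.

```python
from statistics import mean


def average_temperature(readings):
    """Return the arithmetic mean of the readings. The logic itself is correct."""
    return mean(readings)


# Hypothetical sensor data in degrees Celsius.
clean = [21.0, 22.0, 23.0]
# The same data, except that a failed reading was recorded as the sentinel -999.0.
dirty = [21.0, 22.0, -999.0]

print(average_temperature(clean))  # 22.0
print(average_temperature(dirty))  # roughly -318.67: correct arithmetic, useless answer
```

The function is identical in both calls. Only the input changed, and that alone is enough to make the output meaningless.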
History of garbage in, garbage out
The first recorded use of the phrase "garbage in, garbage out" dates back to 1957. However, George Fuechsel, an IBM programmer and instructor, is generally credited with coining the term in the early 1960s. Fuechsel is said to have used the term to state concisely that a computer model or program just processes what it is given: If it is given bad information, it will produce bad information.
The term is now widely used in mathematics, computer science, IT, data science, artificial intelligence (AI), machine learning (ML) and the internet of things (IoT). In fact, GIGO is used to refer to a wide range of situations in the real world, such as a faulty decision made as a result of incomplete information. "Rubbish in, rubbish out" or RIRO is another way of expressing GIGO.
Real-world examples of garbage in, garbage out
There are many real-world examples of GIGO in action, including the following:
- If a text editor tries to read a binary file, it will display unreadable content (garbage output) because it is not set up to read the input (binary). For the editor, the binary input is garbage.
- If a computer program tries to access a section of memory for which access has not been set up, the kernel will deny access. Consequently, the program will terminate abnormally (also known as a program crash or abend).
- If a machine learning model is not given correct training data, the model will learn incorrectly and produce incorrect output wherever its knowledge is applied.
- If a psychologist doesn't have all the information about a patient that's required to diagnose a mental disorder, they may misdiagnose and cause unintended harm to the patient.
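The first example above, a text editor reading a binary file, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular editor works: the show_as_text helper is hypothetical, and the input is the eight-byte PNG file signature used here as a stand-in for arbitrary binary data.

```python
# Hypothetical binary input: the eight-byte signature that starts a PNG file.
binary_input = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])


def show_as_text(data: bytes) -> str:
    """Decode bytes as UTF-8 the way a naive text viewer might.

    Bytes that are not valid UTF-8 are replaced with U+FFFD, the
    replacement character, instead of raising an error.
    """
    return data.decode("utf-8", errors="replace")


text = show_as_text(binary_input)
# The first byte (0x89) is not valid UTF-8 here, so it becomes U+FFFD.
# The letters "PNG" survive, followed by control characters.
print(repr(text))
```

The decoder did exactly what it was asked to do. The output is unreadable because the input was never text in the first place.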
In recent years, the COVID-19 pandemic has provided examples of GIGO and its effects. In the early days of the pandemic, some countries produced time-series forecasts that projected very high numbers of hospitalizations and deaths. Dire predictions followed these forecasts; many of them were never realized, although some were. The better the quality of the data (i.e., not garbage) used in these predictions, the more accurate the forecasts.
Types of garbage input that result in garbage output
In GIGO situations, garbage input can be any of the following types of data:
- Incorrect (includes errors).
- Incorrectly obtained or recorded.
- Too different from other data (also known as outliers).
- Too similar to other data.
- Missing, or not applicable to the particular situation or application.
If data is incorrect due to errors, the errors may be the result of mistakes made during data collection or recording. Data that's not obtained in the right manner or from the right sources can also be garbage. Data points that are too different from the other data points in the set are known as outliers, which typically have values that are abnormally higher or lower than the average, affecting calculations and skewing the results.
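The effect of a single outlier on a calculation can be shown with a short Python sketch. The salary figures below are made up for illustration, and the outlier stands in for a data-entry mistake such as extra zeros typed by accident.

```python
from statistics import mean, median

# Hypothetical annual salaries.
salaries = [48_000, 50_000, 52_000, 51_000, 49_000]
# The same set plus one outlier, e.g., a value entered with extra zeros.
with_outlier = salaries + [5_000_000]

print(mean(salaries), median(salaries))          # 50000 50000
print(mean(with_outlier), median(with_outlier))  # 875000 50500.0
```

One bad value moves the mean from 50,000 to 875,000, while the median barely changes. Comparing the two is one simple way to notice that an outlier may be skewing a result.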
Data points that are too similar to the other data points are said to be highly correlated or collinear. They can make the model or system unstable, so that it draws incorrect inferences and produces garbage output. Missing data, or data that's not applicable to the situation or application, can prevent the system from generating significant results or cause it to produce biased insights.
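One common way to spot highly correlated inputs is to compute the Pearson correlation coefficient between pairs of columns before modeling. The sketch below does this in plain Python. The column names, the sample values and the 0.95 cutoff are assumptions made for illustration; a suitable threshold depends on the application.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical columns: the same height recorded in two units, plus a weight.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [h / 2.54 for h in height_cm]
weight_kg = [52.0, 70.0, 61.0, 90.0, 77.0]

THRESHOLD = 0.95  # assumed cutoff for "too similar"

for name, column in [("height_in", height_in), ("weight_kg", weight_kg)]:
    r = pearson(height_cm, column)
    flag = "redundant" if abs(r) > THRESHOLD else "ok"
    print(f"height_cm vs {name}: r = {r:.3f} ({flag})")
```

Here the two height columns carry the same information, so their correlation is essentially 1 and one of them could be dropped. The weight column is related to height but not redundant with it.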
In all of these cases, when the data is input into the system, it may result in misleading or incorrect results.
Other reasons for garbage output
Bad data is not the only reason for garbage output. GIGO also refers to incorrect thinking, incorrect assumptions and bias. All three are common problems in ML and AI applications. If the data scientist or modeler doesn't understand the issues being modeled, is biased in a certain direction or makes assumptions without proof, the resultant model may be inaccurate and produce garbage output.
Incorrect theoretical and conceptual models, of which conspiracy theories are one example, are also a cause of garbage output, as is incorrectly labeled data. Other sources of garbage include the following:
- Poor understanding of causality.
- Incomplete, missing or inaccurate documentation.
- Wrong hypotheses.
- Inadequate research.
- Use of incorrect methods or statistical tests.
- Misunderstanding of goals.
- Erroneous judgments and reliance on human intuition.
Going too fast and taking shortcuts can also be a source of garbage.
How master data management can eliminate GIGO
Master data management (MDM) is about creating a single master record that consolidates data from all data sources and applications. Effective MDM programs combine multiple technologies and processes to prepare the master record, including the following:
- Data integration.
- Data reconciliation.
MDM activities clean and enrich data and remove duplicate, redundant and erroneous entries. They also keep track of data sources and create audit trails of changes. In this way, they provide consistent, reliable data that can then be used for a wide range of applications without causing the GIGO problem. MDM also enables businesses to make more informed, data-driven decisions.
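The deduplication and audit-trail ideas described above can be sketched in Python. This is a toy illustration, not how any MDM product works: the record fields, the source names, the choice of email as the matching key and the "first non-empty value wins" merge rule are all assumptions.

```python
def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase so equivalent emails compare equal."""
    return email.strip().lower()


def build_master(records):
    """Merge records that share a normalized email into one master record.

    Returns the master records keyed by email and an audit trail listing
    which source supplied each stored field value.
    """
    master = {}
    audit = []
    for rec in records:
        key = normalize_email(rec["email"])
        entry = master.setdefault(key, {"email": key})
        for field, value in rec.items():
            if field in ("email", "source") or value in (None, ""):
                continue  # skip key/metadata fields and empty values
            if field not in entry:  # assumed rule: first non-empty value wins
                entry[field] = value
                audit.append((key, field, value, rec["source"]))
    return master, audit


# Hypothetical records from two source systems.
records = [
    {"source": "crm", "email": "Ana@Example.com ", "name": "Ana Silva", "phone": ""},
    {"source": "billing", "email": "ana@example.com", "name": "Ana Silva", "phone": "555-0100"},
    {"source": "crm", "email": "bo@example.com", "name": "Bo Chen", "phone": None},
]

master, audit = build_master(records)
for key, entry in master.items():
    print(key, entry)
for change in audit:
    print(change)
```

The two records for the same person collapse into one entry, the empty phone value is filled from the second source, and the audit list records where each value came from.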
Apart from MDM, some other ways to improve the quality of input data and avoid GIGO include the following:
- Clean input data by correcting or removing erroneous values.
- Combine data from multiple sources.
- Reformat data, if necessary.
- Divide the data into training, test and validation sets before building the model.
- Set success criteria and assess the model's performance based on those criteria.
- Regularly review data sets to correct inaccuracies.
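Two of the steps above, cleaning the input and dividing it into training, validation and test sets, might look like the following Python sketch. The age field, the 0 to 120 validity range, the 70/15/15 split and the fixed random seed are assumptions chosen for illustration.

```python
import random


def clean(rows):
    """Remove rows whose age is missing or outside an assumed valid range."""
    return [r for r in rows if r["age"] is not None and 0 <= r["age"] <= 120]


def split(rows, train_pct=70, validation_pct=15, seed=0):
    """Shuffle a copy of the rows and split it into train, validation and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


# Hypothetical raw data containing a missing value and two impossible ages.
ages = [34, 29, None, 41, -5, 57, 23, 300, 45, 38, 62, 19,
        51, 27, 33, 48, 36, 44, 25, 59, 31, 40, 22]
rows = [{"age": a} for a in ages]

cleaned = clean(rows)
train_set, validation_set, test_set = split(cleaned)
print(len(rows), len(cleaned))                             # 23 20
print(len(train_set), len(validation_set), len(test_set))  # 14 3 3
```

Cleaning before splitting keeps the erroneous rows out of all three sets, so neither training nor evaluation is distorted by them.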