Unstructured data is a misnomer

Nowadays, the term unstructured data pops up everywhere. It owes its popularity in large part to the success of big data, to technologies such as NoSQL and Hadoop, and to formats such as JSON and XML. Unfortunately, different definitions of unstructured data exist. These definitions confuse many people, and they blur and obscure many discussions on unstructured data. The reason that so many definitions exist is that the term unstructured data is a misnomer, and maybe we should ban it from our discussions.

For some, unstructured data is textual data; for others, it’s data that doesn’t fit rigid relational data structures; and there are those who say that unstructured data refers to tables or files in which each record can have a different structure. For example, Webopedia defines unstructured data as follows: “Unstructured data usually refers to information that doesn’t reside in a traditional row-column database.” By that definition, data stored in XML and JSON documents, CSV files, and Excel files is all unstructured. Definitions can also be very vague. Take, for example, this one: “Unstructured data is data that does not follow a specified format for big data.”

One reason why so many different definitions exist is that the adjective “unstructured” in combination with the word data makes no sense: if we take the word unstructured literally, then unstructured data doesn’t exist. According to the Merriam-Webster online dictionary, the adjective unstructured means lacking structure or organization; not formally organized in a set or conventional pattern; and not having a system or hierarchy. Many other dictionaries use comparable definitions. The Free Dictionary adds that in psychology the word unstructured is used to refer to something that has no intrinsic or objective meaning. And Microsoft Word proposes the words formless and shapeless as synonyms for unstructured. For example, a development approach can be unstructured, and so can art.

So, taken literally, unstructured data is data without a shape or form, not formally organized, and without a system. Why would we want to store that type of data? If it has all those characteristics, storing it is useless: it would only fill up the disks, and we would not be able to process it in any way. No organization would store that type of data. In conclusion, if we take the term unstructured data literally, no one would store it and, therefore, it would not exist.

In fact, most data that is currently qualified as unstructured data is quite structured. For example, XML and JSON documents are highly structured. The same applies to text. A linguist would never agree with calling text unstructured data, because text has structure. If it didn’t, we would not be able to understand what is written and said, and no audio-to-text transcription software would exist, but it does.
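To make the point about JSON concrete, here is a minimal Python sketch. The helper name `shape` and the sample document are made up for illustration; they are not taken from any particular tool. The function simply walks a parsed JSON document and reports the structure it finds.

```python
import json


def shape(value):
    """Describe the structure of a parsed JSON value.

    Objects become dicts of field name -> shape, arrays become lists of
    shapes, and scalars are replaced by the name of their Python type.
    """
    if isinstance(value, dict):
        return {key: shape(item) for key, item in value.items()}
    if isinstance(value, list):
        return [shape(item) for item in value]
    return type(value).__name__


# Hypothetical sample document, invented for this illustration.
doc = json.loads(
    '{"name": "Ada", "tags": ["x", "y"], "address": {"city": "London", "zip": null}}'
)

print(shape(doc))
```

Even a so-called unstructured document turns out to have named fields, nesting, and typed values that a program can discover without any prior schema.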

Calling audio and video unstructured makes no sense either. For example, if you open up an MP3 file, you will see that it contains an indication of the version of MP3 used. It contains tags, such as Artist, Composer, Title, and Track number. Agreed, those tags are not always stored at the same spot in the file: sometimes they’re placed at the beginning, sometimes at the end, and sometimes somewhere in the middle. But everyone can read and understand them. MP3 files and all the other audio and video files are highly structured. Otherwise, no tools would be able to recognize and play them.
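As an illustration of how structured those tags are, the following Python sketch reads them from raw bytes. It assumes the tags in question are ID3 tags, which is the common convention for MP3 files: an ID3v2 tag starts with a 10-byte header at the beginning of the file, and an ID3v1 tag occupies the last 128 bytes. The function names and the synthetic sample bytes are invented for this example, and the sketch only reads the ID3v2 header, not its individual frames.

```python
import struct


def read_id3v2_header(data: bytes):
    """Return version and size from an ID3v2 header at the start, or None."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    size = 0
    for byte in data[6:10]:  # 4 "synchsafe" bytes, 7 significant bits each
        size = (size << 7) | (byte & 0x7F)
    return {"version": f"2.{major}.{revision}", "flags": flags, "tag_size": size}


def read_id3v1(data: bytes):
    """Return the fields of an ID3v1 tag in the last 128 bytes, or None."""
    if len(data) < 128 or data[-128:-125] != b"TAG":
        return None
    _, title, artist, album, year, comment, genre = struct.unpack(
        "<3s30s30s30s4s30sB", data[-128:]
    )

    def text(raw: bytes) -> str:
        # Fields are fixed width and padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(title),
        "artist": text(artist),
        "album": text(album),
        "year": text(year),
        "comment": text(comment),
        "genre": genre,
    }


def pad(value: str, width: int) -> bytes:
    return value.encode("latin-1").ljust(width, b"\x00")


# Synthetic bytes, not a playable MP3: an empty ID3v2.3 header, some filler
# standing in for audio frames, and an ID3v1 tag at the end.
sample = (
    b"ID3" + bytes([3, 0, 0]) + bytes([0, 0, 0, 0])
    + b"\xff\xfb" + b"\x00" * 16
    + b"TAG" + pad("Example Title", 30) + pad("Example Artist", 30)
    + pad("Example Album", 30) + b"1999" + pad("", 30) + bytes([12])
)

print(read_id3v2_header(sample))
print(read_id3v1(sample))
```

The fixed offsets and field widths are exactly what lets any player find the title and artist; with a real file, you would pass the bytes read from disk instead of `sample`.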

So, the term unstructured data is a major misnomer. Confucius once said, “The beginning of wisdom is to call things by their proper name.” So, let’s follow his advice and call things by their proper names. Call it data with a fixed or variable data structure, data with repetitive and hierarchical data structures, call it textual data or audio data. But stop calling it unstructured data. Let’s ban this term from now on; it’s a misnomer.

P.S.: And if we stop using the term unstructured data, we can also stop using the term structured data, because it then becomes a pleonasm. It’s like the terms wet rain and burning fire. And now that we are on this topic, what does semi-structured data mean? Is that data that is 50% structured? If so, then semi-structured data equals semi-unstructured data. Not useful either.
