CIO on the importance of holistic data collection and analysis methods
CIO Tammy Bilitzky explains what she means by holistic data collection and analysis methods, and how they ensure accuracy of results and build trust into automated data systems.
Tammy Bilitzky, the CIO of Data Conversion Laboratory, is focused on helping her company's customers organize and mine their data for insights that inform both tactical and strategic decisions. To do that, Bilitzky said she advises organizations to take a holistic view of their data. She said organizations must understand data in context, particularly as they use artificial intelligence to gain new insights, otherwise they risk using skewed results that could lead to faulty decisions.
Here, in part two of this two-part series, Bilitzky shared her ideas on the data collection and analysis methods that ensure accuracy of results and build trust into automated data systems.
Editor's note: The following has been edited for clarity and length.
What are the common challenges that organizations have when it comes to data collection and analysis methods?
Tammy Bilitzky: One of the biggest challenges [for organizations] is understanding what holistic data they have. Too much content is as much of a challenge as too little content. There are different aspects of what holistic data means, which is especially true when you get into the world of artificial intelligence. You have to semantically understand what each word means. Take the word 'pen': Am I talking about a writing instrument, an animal enclosure or a peninsula? Then you have to understand it syntactically, depending on the sentence structure. It's not always clear.
And then you have to understand data contextually. Like when I say something's 'bad,' does that mean it's good or bad? You have to look at the context.
You have to take a holistic view of your data to make it meaningful. To me, that's the biggest challenge companies have.
What's the consequence of not taking that holistic view to data collection and analysis?
Bilitzky: People are using data to make both strategic and tactical decisions. If you get the meaning wrong, all your results are potentially flawed, and all the decisions you make against those results are potentially flawed.
Your data is your foundation, and if you get the foundation wrong, your house crumbles.
That's why we continually validate results, and we only use the algorithms when we know we can trust the results.
How do you build a holistic view of data?
Bilitzky: It's about building in checks and balances. There's not a lot of transparency in the underlying foundational work in the algorithms; the average developer doesn't have the expertise to write these algorithms. So we have to build in checks and balances to make sure the technology won't work against us, because by far the greatest danger of artificial intelligence is that people conclude too early that they understand it. But for most people, the AI algorithms are black boxes, and they don't know what data and logic are used to power them. So while you would think the results of these algorithms would always be consistent, that's not true. There's a dark side to AI. I'm still a proponent of AI, but it's the responsibility of the people who use it to understand what they're using and to validate the results. In my mind, there has to be a policy of trust and verify.
Is trust and verify a business task or a job for the technology?
Bilitzky: The answer is both. There are definitely technology solutions you can use. There is technology that you can use to do quality assurance on your results, and you can use technology to cross-check them. But you also can have people look at them. Neither side can abdicate responsibility; it should be the technology and the business together verifying that the information makes sense.
What I worry about when you give it to just technology [teams] is that they don't have the subject matter expertise to catch anomalies and potential repercussions.
Is this system of checks and balances all about automation?
Bilitzky: There's still a place for a person when you're training the model, and we encourage having people to do ongoing checks. The manual piece is the person who is reviewing the results and identifying where things are right and where things are wrong. So there's a person doing the ongoing checks, but the results of that are getting programmatically incorporated into the algorithms, [which] is called continuous improvement.
You promote the idea of making content 'smarter and more discoverable.' What do you mean by that?
Bilitzky: Your holistic data has to be smart, and you have to understand what it means so it won't mislead you. You have to make the content smart, meaning that the content is going to give you the information you're looking for. Something is smart when it is precise, understandable and meaningful.
So first of all, to use content, you have to be able to find it. Even if you have text, a straightforward keyword search won't necessarily give you the context you want, [in which case] your results are going to be skewed. You need to classify your documents so you understand the purpose of each document, and, by understanding that purpose, you can start to understand the information that's in that document; you'll have a general context.
Then you can go on to entity extraction -- the people, places, things -- the meaningful content buried in the freeform text. That's where we start talking about the idea of triples, marrying freeform content to relational data. You're building relationships out of that data. We didn't have that 10 years ago. What we have now with this concept of triples is that you can infinitely evolve your data and their relationships. A triple consists of a subject; a predicate, or attribute; and an object. The predicate connects the subject and the object. You can keep linking these triples, and that evolves into your ontology.
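To make the subject-predicate-object idea concrete, here is a minimal sketch using plain Python tuples. The entities and predicate names are illustrative assumptions, not taken from the interview; real systems would typically use an RDF store rather than in-memory tuples.

```python
# A triple links a subject to an object through a predicate (attribute).
triples = [
    ("Data Conversion Laboratory", "employs", "Tammy Bilitzky"),
    ("Tammy Bilitzky", "holdsRole", "CIO"),
    ("CIO", "reportsOn", "data strategy"),
]

def related_to(entity, triples):
    """Return every triple in which the entity appears as subject or object."""
    return [t for t in triples if t[0] == entity or t[2] == entity]

# Because the same entity can appear in many triples, the graph keeps
# growing -- linking triples together is what evolves into an ontology.
print(related_to("Tammy Bilitzky", triples))
```

Each new triple extends the graph without restructuring what came before, which is the "infinitely evolve your data" property described above.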
Do these data collection and analysis methods only work using AI?
Bilitzky: I wouldn't say that, but when you deal with high volumes of data, [it can assist with the process]. There are other ways to do it; you can do a lot of this -- create triples -- using regular expressions and other programming techniques. But it would take more time and resources. It might be cost prohibitive if you don't leverage these technologies.
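As a rough illustration of creating triples with regular expressions rather than AI, here is a hedged sketch that pulls subject-predicate-object triples out of one simple sentence shape. The pattern is an assumption for demonstration only; covering real-world prose this way would require many such patterns, which is the time-and-resources cost Bilitzky describes.

```python
import re

# A crude pattern for sentences of the form "X is the Y of Z".
# Real pipelines would need far richer grammars, which is why a pure
# regex approach can become cost-prohibitive at scale.
PATTERN = re.compile(
    r"(?P<subject>[A-Z][\w ]+?) is the (?P<predicate>\w+) of (?P<object>[A-Z][\w ]+)"
)

def extract_triples(text):
    """Return (subject, predicate, object) triples found in the text."""
    return [(m["subject"], m["predicate"], m["object"])
            for m in PATTERN.finditer(text)]

print(extract_triples("Tammy Bilitzky is the CIO of Data Conversion Laboratory."))
```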
Some experts have expressed concern that today's data won't be accessible in the future. What do you see on this front?
Bilitzky: We convert to all formats, but we're big proponents of XML, because XML is flexible.
You can't always interpret your future data needs, and there are data that are not necessary today but can become critical in the future as business needs evolve. So I believe you have to prepare for the future, but you don't want to overbuild now. So build the foundation. Get the data into a format like XML, but focus on delivering value now. What holistic data do I need to power my business now? Get that data to the most granular level you need to power your business.
But then leave the other data in XML for the future. At least you have the data in a format that can be extended. Once it's in XML or a similar format, getting it into the next generation is a lesser effort.
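The extensibility point can be sketched with Python's standard-library ElementTree: an XML record built for today's needs can gain new elements later without disturbing existing fields. The record structure and element names here are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record captured today with only the fields the
# business needs now.
doc = ET.fromstring("""
<record id="r1">
  <title>Annual report</title>
  <body>Full text kept for future mining.</body>
</record>
""")

# Later, a new element can be added as business needs evolve; consumers
# that only read <title> keep working unchanged.
topic = ET.SubElement(doc, "topic")
topic.text = "finance"

print(doc.find("title").text)   # existing data still accessible
print(doc.find("topic").text)   # newly added, extensible data
```

This is the sense in which getting data into XML now makes "getting it into the next generation" a lesser effort: the format tolerates additions without breaking what already exists.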
When you do that, you can preserve your context and your lineage and your security. You put it into a repository and then you continually look at how you want to mine your data and grow it. You can constantly iterate on it.