AI and machine learning success hinges on the reliability of the underlying data. As these tools and systems become increasingly popular across enterprises, securing trustworthy data becomes all the more necessary.
Data sets are the oil that keeps the AI machine running. Without quality data -- and enough of it -- AI models struggle to properly learn the functions they are expected to perform. But problems such as primary key inconsistency and data duplication mean that data quality is never a given. By devoting time and resources to techniques that ensure trusted data, organizations can build more trustworthy AI tools and systems.
Untrustworthy data in the modern enterprise
Blair Kjenner, founder of information management firm Method1 Enterprise Software, spoke about the pitfalls that can lead to untrusted software data at the November 2023 Estes Park Group meeting hosted by Semantic Arts.
Kjenner explained that inconsistencies in primary key methods, core data models, and row and column headers lead to an inability to fully integrate data. In turn, the inability to fully integrate data leads to systems proliferation, with system-level functions being duplicated repeatedly rather than reused.
Systems proliferation -- or sprawl -- leads to data siloing, Kjenner said. This siloing creates issues such as duplicated records across different systems, which makes it difficult to identify the most current or accurate version. Haphazardly loading data into a data lake, warehouse or large language model can exacerbate the problem, leading to overwhelming levels of complexity and a lack of confidence in shared data.
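The failure mode Kjenner describes can be illustrated with a minimal sketch (all names and records here are hypothetical): two systems hold the same customer under incompatible primary key schemes, so a naive key-based join finds no match and the record is effectively duplicated across silos.

```python
# Hypothetical example: the same customer stored under two
# incompatible primary key schemes in two siloed systems.
crm = {"CUST-00042": {"name": "Jane Doe", "email": "jane@example.com"}}
billing = {9000117: {"name": "Jane Doe", "email": "jane@example.com"}}

# A join on primary keys finds nothing -- the key schemes don't align.
key_matches = set(crm) & set(billing)
print(key_matches)  # set()

# Matching on a shared attribute (here, email) reveals the duplicate.
dupes = [(k1, k2)
         for k1, r1 in crm.items()
         for k2, r2 in billing.items()
         if r1["email"] == r2["email"]]
print(dupes)  # [('CUST-00042', 9000117)]
```

Attribute-based matching like this is a stopgap; the techniques below aim to prevent the divergence in the first place.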
Dave McComb, president of information systems consulting firm Semantic Arts, added that manufacturers of physical goods can achieve economies of scale by doubling production and reducing cost per unit. But in a software system, every line of code must coexist with all the other code in the system, he said. Therefore, adding more code leads to more complexity.
Whenever enterprises add a new software subscription or build a new application, they must accommodate more new code, another primary key method and a different data model. Dan DeMers, CEO and co-founder of data collaboration platform Cinchy, calls this tradeoff the "integration tax." With more complexity, overwhelming amounts of data and inconsistent methods of storing it, untrusted data can run rampant in the enterprise.
3 trusted data techniques
Across enterprises -- and even countries -- organizations are finding ways to make data more trustworthy. Whether attempting to eliminate data duplication, mitigate primary key inconsistencies or implement disambiguation, these case studies highlight some of the methods that can make for more trusted data.
1. Zero-copy integration
In February 2023, Canada's Data Collaboration Alliance, led by DeMers, announced Zero-Copy Integration, a national standard ratified by the Standards Council of Canada. The standard advocates access-based data collaboration, rather than copy-based integration, to eliminate data duplication.
Those who adhere to the principles of zero-copy integration agree to share access to, rather than duplicate, data resources. By design, reusable data resources must be shareable and secure. It's a data-centric and application-agnostic approach that demands scalable and secure access control.
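As a rough illustration of the access-based idea -- not the Zero-Copy Integration standard itself, and with all class and field names invented for the sketch -- consumers can be handed access-controlled views of a single authoritative record set instead of copying it into their own stores.

```python
# Illustrative sketch of access-based (zero-copy) data sharing:
# consumers get filtered views of one authoritative data set
# rather than duplicating it.
class SharedDataset:
    """Single source of truth with per-consumer field-level access control."""

    def __init__(self, records):
        self._records = records  # the only copy of the data
        self._grants = {}        # consumer name -> set of readable fields

    def grant(self, consumer, fields):
        self._grants[consumer] = set(fields)

    def view(self, consumer):
        # Return a filtered snapshot -- the source is never copied wholesale.
        allowed = self._grants.get(consumer, set())
        return [{k: v for k, v in rec.items() if k in allowed}
                for rec in self._records]

customers = SharedDataset([
    {"id": 1, "name": "Acme Corp", "credit_limit": 50_000},
])
customers.grant("billing_app", ["id", "credit_limit"])
customers.grant("crm_app", ["id", "name"])

print(customers.view("billing_app"))  # [{'id': 1, 'credit_limit': 50000}]
print(customers.view("crm_app"))      # [{'id': 1, 'name': 'Acme Corp'}]
```

Because each application reads from the same source, there is no second copy to drift out of date.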
2. Integration scaling with native web techniques
Web Uniform Resource Identifiers (URIs) and Internationalized Resource Identifiers (IRIs) can mitigate the problem of primary key inconsistency, McComb said.
Creating globally unique URIs is a best practice for data consistency. But URLs -- URIs that also specify where content is located -- can suffer over time from link rot: when content moves or changes, its URL might change as well, leaving a dead link.
By contrast, on the decentralized web, the InterPlanetary File System (IPFS) automatically generates a new content identifier (CID) whenever content changes. Decentralized storage service Filebase described the IPFS CID approach as associating data with the content itself, as opposed to where it is located. IPFS' content-addressing methodology thereby sidesteps identifier problems such as link rot.
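The core of content addressing can be sketched in a few lines. Real IPFS CIDs are multihash- and multibase-encoded rather than bare SHA-256 hex digests, so treat this as a simplified illustration of the principle: the identifier is derived from the bytes themselves, so any change to the content yields a new identifier.

```python
import hashlib

def content_id(data: bytes) -> str:
    # Simplified content identifier: derived from the bytes themselves,
    # so identical content always maps to the same ID and any edit
    # produces a new one. (Actual IPFS CIDs add multihash/multibase
    # encoding on top of the raw hash.)
    return hashlib.sha256(data).hexdigest()

v1 = content_id(b"Quarterly report, revision 1")
v2 = content_id(b"Quarterly report, revision 2")

print(v1 == content_id(b"Quarterly report, revision 1"))  # True: same bytes, same ID
print(v1 == v2)  # False: edited content gets a new identifier
```

Unlike a URL, such an identifier never "rots": as long as any node holds the bytes, the ID resolves to exactly that content.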
IPFS relies on peer-to-peer (P2P) architecture with no central web server. Instead, each file resides on multiple peer nodes, implying inherently better network resiliency -- if you assume a critical mass of available peer nodes, gateways and overall usage.
Because IPFS webpages are immutable, however, phishing pages can be harder to take down. Security software provider Trend Micro's report on 2022 IPFS usage underscored that criminals are finding other advantages to P2P networks as well. Like other systems, IPFS has its tradeoffs.
Currently, IPFS traffic is a small fraction of overall web traffic. Most enterprises haven't explored such P2P approaches, but some research universities use IPFS for low-cost scientific data storage.
3. Disambiguation with knowledge graphs
Bob Metcalfe -- a Xerox PARC veteran, Ethernet pioneer and 3Com co-founder -- created Metcalfe's law, which holds that a network's value is proportional to the square of the number of users or devices on it: double the users and the network's value roughly quadruples. Metcalfe sees IPFS as complementary to established networks with their own network effects -- one reason he joined the board at OriginTrail, a decentralized supply network platform provider.
OriginTrail combines web semantics, cryptography and decentralized graph networking into one platform. Knowledge graphs blend the power of symbolic logic with richly explicit relationship data. These graphs disambiguate and complement the probabilistic-only approaches of statistical machine learning with deterministic facts, rules, and the powers of deduction and induction.
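A knowledge graph's deterministic side can be shown with a toy sketch (production systems use RDF triple stores and OWL reasoners rather than Python sets, and all entity names here are invented): facts are subject-predicate-object triples, and a rule -- here, transitivity of "subsidiaryOf" -- deduces new facts from them.

```python
# Toy knowledge graph: facts as subject-predicate-object triples,
# plus one deterministic deduction rule (transitivity).
facts = {
    ("AcmeCorp", "subsidiaryOf", "GlobalHoldings"),
    ("GlobalHoldings", "subsidiaryOf", "MegaGroup"),
    ("AcmeCorp", "headquarteredIn", "Toronto"),
}

def deduce_transitive(triples, predicate):
    """Apply the transitivity rule for one predicate until no new facts emerge."""
    derived = set(triples)
    changed = True
    while changed:
        changed = False
        for s1, p1, o1 in list(derived):
            for s2, p2, o2 in list(derived):
                if p1 == p2 == predicate and o1 == s2:
                    new_fact = (s1, predicate, o2)
                    if new_fact not in derived:
                        derived.add(new_fact)
                        changed = True
    return derived

graph = deduce_transitive(facts, "subsidiaryOf")
print(("AcmeCorp", "subsidiaryOf", "MegaGroup") in graph)  # True: deduced, not stated
```

Unlike a statistical model's probabilistic guess, the deduced fact follows with certainty from the stated facts and the rule -- the deterministic complement the article describes.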
Crawling and scraping webpages and loading data via tensors or vectorizing it are lossy ways to collect enterprise data for AI. Semantic knowledge graph approaches, by contrast, ensure the preservation of logical context and levels of abstraction for explicit, articulated native data.
The best way to preserve context and levels of abstraction in data is to ensure that data sets are easy to find, access, integrate with other systems and reuse -- a set of qualities known as the FAIR (findable, accessible, interoperable, reusable) principles. Accessing data directly in its original form rather than duplicating it -- the access-only, zero-copy approach -- is also critical. Trustworthy AI requires trusted data, and disambiguation is a fundamental step for those managing and working with data records, particularly at the most granular levels.
Alan Morrison is an independent consultant and freelance writer covering data tech and enterprise transformation.