Major consumer-facing platforms including Amazon, LinkedIn and Netflix run large parts of their data architecture on graph databases. But the technology -- which excels at storing the relationships between users, their behaviors and products -- has not caught on in more traditional enterprises.
Indeed, only about 2% to 3% of current data processing workloads run on graph databases today, according to Michael Moore, executive director in the advisory services practice at EY. As enterprises take on more analytics projects that need to make sense of the connections between people and products, however, he predicts graph database use cases in the enterprise will rise sharply, accounting for 50% of data processing workloads over the next 10 years.
"I believe all CIOs should be following this graph conversation since it is here to stay," Moore said in an interview at the recent GraphTour San Francisco conference.
The conversation already includes a comprehensive vendor market. Leading graph database platforms include offerings from Neo4j, DataStax and TigerGraph. AWS, Google and Microsoft also offer native graph database tooling on their cloud platforms.
Elements of a graph database
A type of NoSQL database, graph databases are organized to highlight the connections between entities. A classic application of the technology is a social media network, where the database stores not just the users but also information about who is connected to whom.
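The social network example above can be sketched in a few lines of plain Python. This is a toy illustration of the node-and-relationship model, not any vendor's actual storage format; the names and data are invented.

```python
# Toy property-graph-style model of a social network: nodes carry
# properties, and relationships are stored explicitly as adjacency sets.
from collections import defaultdict

# Nodes (people) with their properties.
people = {
    "alice": {"name": "Alice"},
    "bob": {"name": "Bob"},
    "carol": {"name": "Carol"},
}

# One relationship type, "follows", kept as precomputed adjacency.
follows = defaultdict(set)
follows["alice"].add("bob")
follows["bob"].add("carol")

def friends_of_friends(user):
    """People reachable in exactly two 'follows' hops, excluding
    the user and their direct connections."""
    direct = follows[user]
    return {fof for f in direct for fof in follows[f]} - direct - {user}

print(friends_of_friends("alice"))  # {'carol'}
```

Because the relationships are stored directly rather than reconstructed at query time, multi-hop traversals like `friends_of_friends` stay cheap even as the data grows, which is the property graph databases are built around.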
The technology improves the speed and precision of analytic models by letting enterprises take the entirety of their business data and draw logical connections within it that map to business functions.
Graph databases make it possible to condense much larger data sets to run on an in-memory data fabric. This makes it easier to run sophisticated queries that root out indirect relationships between business functions, connections that might improve or harm profits, efficiency or performance.
For example, two individuals who have never made the exact same purchase may nonetheless have buying habits similar enough to improve the product recommendations made to each. Insurance fraud rings, for their part, will make the insured party look different from claim to claim while reusing the same doctors, lawyers or body shops. Fraud analysis engines use graph database technology to uncover such patterns.
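The fraud-ring pattern described above amounts to spotting shared nodes across otherwise unrelated claims. A hedged sketch of that idea, with invented claim data and an arbitrary threshold:

```python
# Sketch: flag service providers that recur across claims filed by
# different, apparently unrelated claimants. Data and the threshold
# of 3 are illustrative, not drawn from any real fraud engine.
from collections import defaultdict

claims = [
    {"claimant": "C1", "providers": {"Dr. Adams", "Body Shop A"}},
    {"claimant": "C2", "providers": {"Dr. Adams", "Lawyer L"}},
    {"claimant": "C3", "providers": {"Dr. Adams", "Body Shop A"}},
    {"claimant": "C4", "providers": {"Dr. Baker"}},
]

# Invert the claim graph: for each provider, which claimants touch it?
provider_claimants = defaultdict(set)
for claim in claims:
    for provider in claim["providers"]:
        provider_claimants[provider].add(claim["claimant"])

# Providers linked to 3+ distinct claimants merit a closer look.
suspicious = {p for p, cs in provider_claimants.items() if len(cs) >= 3}
print(suspicious)  # {'Dr. Adams'}
```

In a real graph database this inversion is unnecessary: the provider node already links to every claim that references it, so the "shared doctor" pattern falls out of a short traversal query.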
Improving business modeling
EY has been working with enterprises on graph database use cases that include data discovery, data validation, fraud detection, supply chain analytics, recommendation engines, anti-money laundering and 360-degree views of the customer.
In addition to the features described, graph tools can tie together data that isn't connected anywhere else in the system, creating a system of record, Moore said.
"Typically, enterprises deploy graphs on large data lakes and use them as a unifying data layer," he said.
Because the data can be structured to mirror business processes, graph database technology can also streamline the conversations between developers and business owners, since the data is organized around metrics the business finds familiar. Graph tools also make it easier to refactor data for new applications that call for new business metrics. In traditional databases, business metrics often need to be computed by joining rows from database tables that were organized to optimize the speed with which data is written into the database (not the connections).
As useful as graph databases are for certain types of queries and analysis, graph tools present several challenges to CIOs, Moore warned. Data engineers and business experts need to learn new skill sets and create new workflows for defining and refining the graph data models used for these applications.
Classical SQL databases were optimized to conserve memory and CPU. They are still the best technology for many kinds of applications, such as ERP, that involve a lot of columnar addition. But joining database tables together to answer new kinds of queries can add considerable overhead to SQL databases. As a result, new types of queries can be limited by memory capacity.
In contrast, graph databases, as noted, precompute these relationships in a way that speeds analytics and shrinks the size of the data store. In one project, Moore said he managed to shrink a 5 TB SQL database into a 2 TB graph database.
A big challenge that must be factored into graph database use cases is their slower performance when writing to the database. This is because the database must compute the relationship between new and existing data when committing a transaction.
But Moore said the benefit outweighs the limitation: "This is a small price to pay compared to the massive speed up on the query side."
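The trade-off Moore describes, slower writes in exchange for faster reads, can be sketched in miniature. This is an illustrative toy, not how any particular graph engine is implemented:

```python
# Contrast join-style queries (work at read time) with precomputed
# adjacency (work at write time). Data is invented for illustration.
orders = [("alice", "o1"), ("bob", "o2"), ("alice", "o3")]

# Relational style: every query re-scans the table, join-like work
# that repeats on each read.
def orders_for_join(user):
    return [o for u, o in orders if u == user]

# Graph style: relationships are materialized once, when data is
# written -- the extra write-time cost the article mentions...
adjacency = {}
for u, o in orders:
    adjacency.setdefault(u, []).append(o)

def orders_for_graph(user):
    # ...so reads become a direct lookup instead of a scan.
    return adjacency.get(user, [])

assert orders_for_join("alice") == orders_for_graph("alice") == ["o1", "o3"]
```

Both functions return the same answer; the difference is where the work happens. At scale, paying the relationship cost once per write rather than once per query is what produces the query-side speedup, and the smaller materialized structure, that Moore cites.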
Start small to build expertise
As with most cutting-edge technology, CIOs should start small to build the culture and processes for using graphs effectively. Moore has worked with many enterprises on introducing the technology to more traditional data management teams. This typically involves starting with a proof of concept, implementing a pilot and then rolling a live use case into production.
He said a good starter project should include three to four data sources. It's also important to keep the naming convention simple so that business managers can easily understand what the data represents.
In the pilot phase, it may help to start with a preconfigured graph database running in the cloud to keep things simple. It's useful to focus on creating a minimum viable product, and then later the team can work on expanding this to more data domains and applications.
The team also needs to spend time cleaning the data used in the pilot. Data cleansing is always about 80% of the actual work for any large data environment, Moore said. Data engineers must figure out the quality of the data and how the data maps to a specific business problem.
Emerging graph database use case: Improving AI
Enterprises are starting to explore using graph databases to improve AI models.
Amy Hodler, graph analytics and AI program manager at Neo4j, said early use cases involve improving the way data is ingested into the AI training tools in a process called feature engineering. For example, researchers at the University of California, San Francisco, have developed Het.io, a tool that structures biomedical information to highlight connections. The approach is being used to better correlate genes with disease and predict new uses for existing drugs.
Other researchers are looking at using graph databases to make AI models more transparent and explainable. For example, eBay has been experimenting with this technique to improve its recommendation engine.
Down the road, Hodler expects to see data scientists running machine learning workloads directly on graph data sources.
"The idea of adding context for helping AI to generalize and make it more broadly applicable is important for getting machine learning and AI solutions to the next step," Hodler said.