Can tokenization free up more data for AI model training?
Research from Capital One Software and PwC suggests enterprises can tap sensitive data to train AI models while balancing predictive power and privacy.
Enterprises expanding AI adoption face a dilemma: How to tap more of their high-value internal data for model training without compromising sensitive information.
Recent research from Capital One Software and consultancy PwC suggests business leaders can have their data science cake and eat it too. The study, released March 23, points to tokenization as an approach that preserves data's potency as well as its privacy and security. Capital One Software is a B2B software business operating within the $53.4 billion financial services company.
Data tokenization replaces sensitive data with a token that retains the original data's format. This process is one of several options for protecting sensitive information, alongside data masking and redaction, which obscure or remove parts of the data. But those concealment techniques often prevent businesses from using their most valuable data to train AI models. The data is safe, but the models take a hit to their predictive power.
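The distinction can be illustrated with a toy sketch. The masking function below destroys most of the information, while the tokenization function produces a deterministic stand-in of the same length and character class. This is only a conceptual illustration using an HMAC-based digit mapping; it is not how Databolt or any production tokenization engine works, and the hard-coded key is purely for demonstration.

```python
import hmac
import hashlib

SECRET = b"demo-key"  # illustrative only; real systems use managed keys or vaults


def mask(card: str) -> str:
    """Masking: obscures all but the last four digits, destroying information."""
    return "X" * (len(card) - 4) + card[-4:]


def tokenize(card: str) -> str:
    """Toy format-preserving token: a deterministic, all-digit value of the
    same length as the input. Real engines (vault-based or format-preserving
    encryption) differ; this only illustrates the format-preservation idea."""
    digest = hmac.new(SECRET, card.encode(), hashlib.sha256).hexdigest()
    # Map hex digest characters to decimal digits, truncated to input length.
    return "".join(str(int(ch, 16) % 10) for ch in digest)[: len(card)]


card = "4111111111111111"
print(mask(card))      # XXXXXXXXXXXX1111
print(tokenize(card))  # same length, all digits, stable for a given input
```

Because the token keeps the original length and digit format, downstream systems and model features that depend on that structure keep working, which is the property the study credits for the retained predictive performance.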
Against this backdrop, Capital One Software and PwC compared a baseline consisting of the original plaintext data with a masked data set and a tokenized data set. The study found that models trained on tokenized data retained 99.7% of their predictive performance compared with the baseline. In addition, the models trained on tokenized data had nearly double the accuracy of those trained on the masked data set.
Vince Goveas, director of product management at Capital One Software, said the researchers had an inkling that tokenization would provide better performance, but the results were still surprising.
"We were expecting improvement, but we were not expecting this level of improvement," he said, citing tokenization's advantage over masked data.
Data protection vs. AI value
The assessment used Capital One Software's Databolt offering to tokenize data. Databolt uses cryptographic algorithms to generate tokens on the fly, according to Capital One Software. Databolt, which launched in 2025, is adapted from Capital One's internal tokenization engine.
The research shows the tradeoff between protecting sensitive data and getting the most value out of AI "doesn't have to exist anymore," said Mir Kashifuddin, PwC's data risk and privacy practice leader.
"By using tokenization to protect sensitive information, while preserving the structure of the data, organizations can train highly effective AI models without exposing [personally identifiable information] or [protected health information]," he explained.
The research has implications for enterprises in regulated industries. Kashifuddin said the combination of data protection and AI performance lets businesses "innovate with confidence" while meeting customers' privacy, security and regulatory expectations.
The study cited healthcare, insurance and financial services as sectors where sensitive data is holding back AI.
Reengineering data pipelines
Tokenization lets data teams reengineer data pipelines for "speed and efficiency," according to the Capital One Software-PwC study.
This benefit stems from data engineers' ability to tokenize earlier in the pipeline, the report noted. That means sensitive data is protected as it enters the workflow, and its structure, length and format are maintained. Because it preserves structure, tokenized data fits into an organization's existing pipelines without major revisions to data flows.
The result is a reduction in "governance overhead" and deployment time, the study noted.
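A minimal sketch of that pipeline benefit: if an existing stage validates field formats, records tokenized upstream still pass the same checks, so no downstream schema or flow changes are needed. The schema, field names, and the stand-in tokenization step below are all hypothetical, not taken from the study.

```python
import re

# Hypothetical format checks an existing downstream pipeline stage enforces.
SCHEMA = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "zip": re.compile(r"^\d{5}$"),
}


def validate(record: dict) -> bool:
    """Pre-existing validation stage: every field must match its format."""
    return all(SCHEMA[field].match(record[field]) for field in SCHEMA)


def tokenize_record(record: dict) -> dict:
    """Stand-in for a real tokenization call (e.g., a vault or FPE service):
    each sensitive value is replaced by a token with the same format."""
    fake_tokens = {"ssn": "802-41-7735", "zip": "93014"}  # illustrative values
    return {**record, **fake_tokens}


raw = {"ssn": "123-45-6789", "zip": "10001"}
tokenized = tokenize_record(raw)

# Format is preserved, so the unchanged validation stage accepts both.
assert validate(raw) and validate(tokenized)
```

Tokenizing at ingestion this way means sensitive values never travel through the rest of the workflow in plaintext, while every later stage runs unmodified.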
Capital One's AI plans and use of tokenization
Data sensitivity is a top consideration for Capital One's own AI plans, Goveas noted. "With the advent of AI, we wanted to be able to train our models using a lot of this in-house data, because this is valuable for [making] data-driven decisions," Goveas said. "The biggest barrier to entry was we wanted to make sure we don't compromise on privacy and security."
Capital One has been tokenizing data for several years. That history, coupled with the company's model training goals, served as the impetus for the research project, Goveas said. The company wanted to understand how tokenized data would perform during model training.
Goveas framed the key inquiry regarding tokenized data: "If I chunk it and if I put it in my model for training, will this provide … meaningful output for my data scientists and analyst community?"
For Capital One, the research validated that models trained on tokenized in-house data would perform well without compromising security and privacy. Goveas cited another benefit: Using tokenized data speeds up the data sourcing and preparation process, which previously included numerous checks and approvals.
"Your time to value is much faster now," he said.
Data tokenization market trends
While the study focused on AI model training, tokenization has much wider applicability. Prominent examples include payment processing use cases such as e-commerce and mobile wallets.
The tokenization market, overall, is forecast to reach $5.19 billion in 2026 compared with $4.1 billion last year, according to the Business Research Company. The market researcher, which published its forecast March 10, pegs the compound annual growth rate at 26.4%. The market's growth drivers include extensive adoption of digital payment platforms, data breach incidents and increasing regulatory compliance demands, among other factors, the company reported.
The Business Research Company predicted the tokenization market will continue to grow steadily at 26.3% annually and hit $13.2 billion in 2030. Expected contributors to this growth phase include an uptick in zero trust security deployments, increased interest in privacy-enhancing technologies and "broader application of tokenization beyond payment data," the company said.
As a cybersecurity technology, tokenization fits within the data security category. But it's also adjacent to attack surface management in that replacing sensitive data with tokens seeks to shrink an organization's exposure.
Adoption issues: Infrastructure and change management
Enterprise technology adoption often hinges on infrastructure requirements and an organization's ability to deal with change. Businesses already training models for generative AI applications most likely have the necessary infrastructure up and running, Goveas noted.
That said, tokenization adopters must attend to organizational considerations. "There is an aspect of change management that organizations have to go through, and that starts from the very top," Goveas explained. "It has to be a leadership-driven priority." That top-down approach, he said, is critical to making privacy and security an enterprise priority rather than an afterthought.
Creating a change management process starts with identifying sensitive data and determining how to desensitize it, Goveas said. The task also requires pinpointing which specific data elements the data science and analysis teams will need, he added.
"The barrier to entry is identifying your data, prepping it, labeling it [and] classifying it," he said. "Then you protect it."
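The identify-and-classify step Goveas describes can be sketched in miniature. The pattern heuristics, column names, and labels below are invented for illustration; real programs rely on data catalogs and trained classifiers rather than regexes alone.

```python
import re

# Illustrative heuristics for flagging sensitive columns (assumption, not a
# real classification rulebook).
PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+$"),
}


def classify(column_values):
    """Return a sensitive-data label for a column, or None if it looks safe."""
    for label, pattern in PATTERNS.items():
        if all(pattern.match(v) for v in column_values):
            return label
    return None


table = {
    "customer_ssn": ["123-45-6789", "987-65-4321"],
    "contact": ["a@example.com", "b@example.com"],
    "region": ["NE", "SW"],
}

labels = {col: classify(vals) for col, vals in table.items()}
# Only the labeled columns would then be routed to tokenization.
sensitive = [col for col, lbl in labels.items() if lbl]
print(sensitive)  # ['customer_ssn', 'contact']
```

In Goveas's framing, this labeling pass is the barrier to entry; once columns are classified, the protection step itself is mechanical.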
John Moore is a freelance writer who has covered business and technology topics for 40 years. He focuses on enterprise IT strategy, AI adoption, data management and partner ecosystems.