Not every company has the scale and skills of Intuit's Credit Karma, but the company's data science head has some advice on where others can begin devising their own AI governance framework.
Credit Karma has access to Intuit's GenOS AI operating system, with its catalog of AI models, agents and software development tools. With help from GenOS, teams at Credit Karma recently created a multi-agent system that automatically reviews AI outputs before they reach production. These tools form the technical basis of the AI compliance initiative led by Madelaine Daianu, senior director of data science and engineering at Credit Karma. But the effort began with hands-on human collaboration that other companies can and must emulate, because every company and industry must devise its own tailored approach.
"Finding a balancing act between innovation and safety, compliance or whatever is relevant to them is extremely important and taking the step to slow down a little bit before they run and move fast," Daianu said. "Have your internal red team go and break an LLM-generated response and learn from it, and develop a thorough, custom evaluation framework for your use case."
At Credit Karma, red teams broke large language model (LLM)-driven workflows, identified their weaknesses and used what they learned to devise a five-step evaluation framework for AI governance.
The framework's stages include the following:
Response quality and accuracy
AI safety, including detecting bias
Compliance, primarily with the contractual expectations of Credit Karma partners when it presents credit card and loan information to customers on its platform
Data provenance and accuracy
System metrics such as cost and latency
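Credit Karma hasn't published how these stages are implemented, but conceptually they act as an ordered gate: a response must clear every check before it ships. The sketch below is a minimal, hypothetical Python illustration of that idea; the stage stubs, names and evaluate_output helper are assumptions for illustration, not Credit Karma's or GenOS's actual code.

```python
# Hypothetical sketch only: stage names mirror the framework above, but
# every function and data structure here is illustrative, not Credit
# Karma's or GenOS's actual implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class StageResult:
    stage: str
    passed: bool
    detail: str = ""

def evaluate_output(summary: str,
                    stages: list[tuple[str, Callable[[str], StageResult]]]) -> list[StageResult]:
    """Run each evaluation stage in order, stopping at the first failure
    so a response never reaches production with an unresolved finding."""
    results = []
    for _, check in stages:
        result = check(summary)
        results.append(result)
        if not result.passed:
            break  # block promotion to production
    return results

# Illustrative stubs; real checks would call models, rules engines or
# partner-data lookups.
def check_quality(s):    return StageResult("quality", bool(s.strip()))
def check_safety(s):     return StageResult("safety", "guaranteed approval" not in s.lower())
def check_compliance(s): return StageResult("compliance", True)  # see the field-level sketch below
def check_provenance(s): return StageResult("provenance", True)
def check_system(s):     return StageResult("system", True)  # e.g., cost and latency budgets

PIPELINE = [("quality", check_quality), ("safety", check_safety),
            ("compliance", check_compliance), ("provenance", check_provenance),
            ("system", check_system)]

verdicts = evaluate_output("This card offers 1.5% cash back.", PIPELINE)
print(all(v.passed for v in verdicts))  # True means eligible for production
```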
"Within this framework, compliance is where we had to get super innovative, because it would take us a very long time to [manually] check summaries from an LLM," Daianu said. "For instance, in the case of a credit card, we need to make sure that we represent the benefits of that card as mapped to the partner brand with the utmost accuracy. But to be able to do that, we had to extract the fields from the summary that are pertinent to, say, rates or fees."
That's where the multi-agent system came in. Specialized AI agents check each data field within LLM-generated summaries to ensure that its presentation to users matches the partner brand's terms. In this and other stages of the evaluation framework, LLMs are also used to judge the overall quality of responses from groups of agents.
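The underlying code isn't public either, but the flow Daianu describes (extract the rate- and fee-related fields from a summary, then have a specialized checker verify each one against the partner's actual terms) might be sketched along these lines. Everything here, including the PARTNER_TERMS table, the regex extractor standing in for an LLM extraction step and the compliance_agent function, is a hypothetical illustration.

```python
# Hypothetical illustration of the field-level compliance check described
# above; none of these names are Intuit or GenOS APIs.
import re

# Illustrative partner source data the summary must match.
PARTNER_TERMS = {"apr": "19.99%", "annual_fee": "$0"}

def extract_fields(summary: str) -> dict[str, str]:
    """Pull the fields pertinent to rates and fees out of a summary.
    A production system would likely use an LLM or structured extraction;
    a regex stands in here to keep the sketch self-contained."""
    fields = {}
    apr = re.search(r"(\d+\.\d+%)\s*APR", summary)
    if apr:
        fields["apr"] = apr.group(1)
    fee = re.search(r"(\$\d+)\s*annual fee", summary)
    if fee:
        fields["annual_fee"] = fee.group(1)
    return fields

def compliance_agent(field: str, claimed: str) -> bool:
    """One specialized checker per field: does the summary's claim match
    the value the partner actually offers?"""
    return PARTNER_TERMS.get(field) == claimed

summary = "This card offers 19.99% APR and a $0 annual fee."
mismatches = [f for f, v in extract_fields(summary).items()
              if not compliance_agent(f, v)]
print("compliant" if not mismatches else f"mismatched fields: {mismatches}")
```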
However, those LLM judges were trained with human feedback from Credit Karma's customer success team, which still performs spot checks. According to Daianu, AI agents simply reapply that evaluation process to new summaries, up to 50 times faster.
Still, when using AI to evaluate AI, it's important not to overuse it, Daianu said.
"We are using GenAI as a judge in some elements of our framework, especially for compliance, but not everywhere," she said. "For AI safety, we can use traditional machine learning. Not overfitting GenAI … is important, because that can oftentimes give you better accuracy, better explainability and is not as much of a black box."
Beth Pariseau, a senior news writer for Informa TechTarget, is an award-winning veteran of IT journalism covering DevOps. Have a tip? Email her or reach out @PariseauTT.