Working with messy data and software engineering are two of the biggest data science problems that come into play when building more robust AI systems, said experts at the Association for Computing Machinery - Institute of Mathematical Statistics Interdisciplinary Summit on the Foundations of Data Science in San Francisco.
Data science is evolving to keep pace with rapid advances in AI and new tools. Shirley Ho, group leader of Cosmology X Data Science at the Flatiron Institute, said: "Data science involves the study of generalized extraction of knowledge from data but also producing the data required and the infrastructure, software and hardware to make this possible."
Enterprises need to keep in mind the data science problems and solutions that arise from this evolving paradigm.
Engineering builds resilient apps
One data science problem is that software developers are designing new tools and applications without concern for fundamental engineering principles, said Suchi Saria, assistant professor at Johns Hopkins University, where she directs the Machine Learning and Healthcare Lab. Although AI developers are demonstrating interesting results, no one is sure how, when and where these applications break, which is a big concern. In other fields, like civil engineering and nuclear engineering, engineers apply considerable effort to understand the fundamentals of how things work and where they break down.
"As data scientists, we give people tools and the freedom to build applications, but the engineering principles for being able to guarantee we understand what we built, can stand behind it and are not going to make catastrophic decisions [are] missing," Saria said.
In the data science world, engineering has become something of a dirty word, she added. But the flip side, she argued, is that there is no way to build impactful systems without bringing engineering principles back; it is important to understand how things can break down. She said she watches her students get excited about playing with new architectures and adding tweaks to improve accuracy in a particular realm. But, to her, this looks like design without engineering principles.
This kind of engineering is a little different from the field of data engineering. Traditional data engineering focuses on building out the infrastructure for extract, transform and load processes and for cleaning data at scale for data science apps. Saria, in contrast, is suggesting that the same engineering rigor needs to be brought to bear on AI algorithms and data science themselves.
Different kinds of mess
Another big problem facing data science lies in figuring out how to work with messy data. This is not just inaccurate data; there is a whole range of ways data can be messy relative to a particular data science or AI application. Manuela Veloso, head of AI research at JPMorgan and professor at Carnegie Mellon University, said data science must deal with data generated from diverse sources, spanning a wide variety of frequencies and ranges.
An important goal of AI is to make machines that can take data inputs, make decisions and take action as part of a loop of perception, cognition and learning. Data science can help provide the substrate to close this loop. But data scientists face challenges in how to bound the data or organize it in a way that can be interpreted by AI or statistical tools. Ideally, data scientists would like to have descriptions, labels and clean data that can make it easier to use in new applications. "I'm amazed by how hard this is," Veloso said.
At JPMorgan, for example, she has not come across a single case where managers can describe how transactions are supposed to be recorded all the time. There are always specific exceptions; roughly 1% of transactions deviate from the stated rules. It's challenging for data science to figure out what to do with these exceptions and, at the same time, to distinguish them from outliers or noise.
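To make the challenge concrete, here is a minimal, illustrative sketch, not JPMorgan's actual method, of flagging the small fraction of records that deviate from the typical pattern. It uses a robust modified z-score (median plus median absolute deviation), so the exceptional records don't skew the baseline they are judged against; the threshold value is an assumption for illustration.

```python
import statistics

def flag_exceptions(amounts, threshold=3.5):
    """Flag records that deviate strongly from the typical pattern,
    using a robust modified z-score so rare exceptions don't distort
    the median-based baseline the way they would a mean."""
    median = statistics.median(amounts)
    mad = statistics.median(abs(a - median) for a in amounts)
    if mad == 0:  # all values identical: nothing can be flagged
        return [False] * len(amounts)
    # 0.6745 rescales MAD so the score is comparable to a z-score
    return [abs(0.6745 * (a - median) / mad) > threshold for a in amounts]

# Eight routine transaction amounts and one exceptional record
amounts = [100, 102, 98, 101, 99, 103, 97, 100, 5000]
flags = flag_exceptions(amounts)  # only the last record is flagged
```

Whether a flagged record is noise to discard or an informative exception to keep is exactly the judgment call Veloso describes; the statistics can only surface the candidates.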
"From a data science point of view, sometimes, the buts are things that have more information and are things that you don't want to miss," Veloso said.
JPMorgan has some of the world's leading business experts, who manage processes for capturing trillions of different kinds of records. It's easy to imagine that these records could be analyzed with AI algorithms to create models of how something works. But it's much harder to do in practice.
"It is interesting to realize that, somehow, even these enormous amounts of data do not capture everything that humans know," Veloso said. She expects humans will play a key role in filling in the data that machines can't understand.
Lots of ways to fail
Another form of dirty data could be data from different distributions, said Sham Kakade, professor at the University of Washington. For example, in computer vision research, one of the challenges arises from trying to figure out what to do with noisy cameras. Even if the developers use high-quality cameras, they still generate data from different angles and with different kinds of lighting artifacts, like glare from the sun. Kakade said one way of thinking about this problem is to think about creating algorithms that can use transfer learning with a small amount of corrupted data that can learn to adapt more quickly on other problems.
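The idea Kakade describes can be sketched in miniature. The example below is a toy illustration, not his method: a linear model is fit on plentiful clean "source" data, and then only one parameter (the intercept) is re-estimated from a handful of noisy "target" samples, so far fewer parameters have to be learned from the corrupted data. The data and shift values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Source domain: plenty of clean data following y = 2x + 1
X_src = rng.uniform(-1, 1, (500, 1))
y_src = 2 * X_src[:, 0] + 1

# Ordinary least squares fit of slope and intercept on the source data
A_src = np.hstack([X_src, np.ones((len(X_src), 1))])
w_src, *_ = np.linalg.lstsq(A_src, y_src, rcond=None)

# Target domain: only five noisy samples; same slope, shifted offset
X_tgt = rng.uniform(-1, 1, (5, 1))
y_tgt = 2 * X_tgt[:, 0] + 3 + rng.normal(0, 0.1, 5)

# Transfer step: keep the learned slope, re-estimate only the intercept
# from the small corrupted sample
slope = w_src[0]
intercept_tgt = np.mean(y_tgt - slope * X_tgt[:, 0])
```

Fitting both parameters from five noisy points would be far less stable; reusing the source slope is the "adapt more quickly" part of the idea, scaled down to one dimension.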
Saria said one of the reasons she moved to Johns Hopkins was to do more work bringing comprehensive AI tools to healthcare. Her team is now deploying an AI application to more than 4,500 physicians at over 90 clinics. When she started, she did not realize how hard it would be. Often, AI researchers start with a single objective function to determine success. "But, when we are deploying something in practice, we need to track reliability and accuracy from a variety of standpoints," Saria said.
She realized it is important to keep a variety of tools available to identify and address different types of noisy data. Some of the problems she identified include bias and whether the data is fit for a particular purpose. It's also important to put systems in place to monitor the results and to plan for maintenance when the models drift from reality. Model drift is one of the most common data science problems, and ongoing monitoring is a key part of the solution.
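One simple form such monitoring can take is comparing a live input feature against what was seen at training time. The sketch below is a minimal, hypothetical illustration of that pattern, not Saria's system; the window size and threshold are placeholder values, and real deployments typically track many signals, not one rolling mean.

```python
from collections import deque

class DriftMonitor:
    """Minimal sketch: compare a live feature's rolling mean against
    the mean observed at training time and flag when the gap grows."""

    def __init__(self, training_mean, window=100, threshold=0.5):
        self.training_mean = training_mean
        self.window = deque(maxlen=window)  # most recent observations
        self.threshold = threshold

    def observe(self, value):
        """Record one observation; return True if drift is flagged."""
        self.window.append(value)
        rolling_mean = sum(self.window) / len(self.window)
        return abs(rolling_mean - self.training_mean) > self.threshold

monitor = DriftMonitor(training_mean=0.0)
# Early observations resemble the training data: no drift flagged
ok = [monitor.observe(v) for v in [0.1, -0.1, 0.05, -0.05]]
# Later the input distribution shifts upward: the flag eventually fires
drifted = [monitor.observe(2.0) for _ in range(100)]
```

When the flag fires, the maintenance plan Saria describes kicks in: investigate whether the data changed, the world changed, or the model needs retraining.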
Veloso suggested that one of the biggest problems lies in presenting outliers to AI algorithms to help them make sense of unlikely, but important scenarios. She said the development of better simulations could help train AI to better detect anomalous conditions. JPMorgan is building such simulations for operations across the whole bank.
Veloso believes that researchers need to invest in simulations that can stretch the reality of the world so that AI tools can begin to adapt to rare events. This could be particularly useful for improving reinforcement learning techniques that combine data and feedback from the real world to improve algorithms over time.
Only humans have eyeballs
Veloso recommended that every data scientist and AI developer see the movie Sully to get a real-world perspective on the limits of data science and AI in making sense of outliers. In the movie, a plane gets into trouble shortly after takeoff, even though the sensor readings said everything was OK. The captain makes a snap decision to land the plane on the Hudson River, saving the lives of everyone on board. Asked why he made this decision, he said: "I eyeballed the situation." Veloso said there's a lesson here for how to identify data science problems and solutions.
"There's something in our DNA that lets us eyeball the situation and make decisions that are not supported by the data. We will have jobs for the rest of our lives," Veloso said.