It's easy for IT managers to conflate data science and data engineering. After all, both roles play a part in helping organizations extract value from big data.
However, it's hard to get the most value from data scientists tasked with building analytics and machine learning models unless they can get the data in the right format for their needs, said Thomas Goolsby, director of decision science analytics at the United Services Automobile Association (USAA), at the recent Strata Data Conference in San Francisco.
It took Goolsby several years to identify data engineering as the weak link in USAA's data science pipeline. He also identified a data engineering skills gap in his discussions with peers across many other Fortune 100 companies. As it turns out, it's a hard problem to identify because data scientists can do many of the tasks data engineers excel at -- they just do them more slowly. In some cases, projects got perpetually pushed to the back burner because the data scientists did not know how to prep the data.
"If you are a manager of a data scientist team and it feels like you are not having data science deliveries on a regular basis, it may be a data engineering gap," Goolsby said.
Once he realized the root cause of these bottlenecks, he embarked on a plan to implement a data science culture that dramatically sped up the USAA analytics pipeline -- but the journey required a bit more than filling out an HR request. Goolsby had to spend considerable time working with HR, IT management, executives and the local university to explain how data engineering was a different skill required to complement the association's existing big data program.
Refocusing on data engineering skills
Building a better analytics and machine learning model tends to attract more executive interest than the boring, back-end data engineering aspects. But the time spent developing these models pales in comparison to the time required for data engineering tasks, said Jesse Anderson, managing director of the Big Data Institute, a data engineering consultancy.
One way to think about it is that data scientists are the consumers of data products, while data engineers are the creators of data products.
"Companies doing this work without a data engineering culture are going to get stuck, and that is something you may not realize," Anderson said.
A big problem is that companies often assume they have a data engineering culture in place, without really evaluating what they have. When experts from Uber and Airbnb give talks about their latest machine learning projects, there is an assumption that the audience already has a data engineering culture.
"They don't tip their hat to the data engineering culture that made this happen," Anderson said.
Assessing the need for data engineering skills
Goolsby started at USAA about 10 years ago as part of the data warehousing and data mart team. His team did a lot of work building various data models and analytics engines.
After about five years, he was given a team of eight data scientists to focus on predicting life events using data. After two years of work, they had many findings, but they realized that much of the data was not structured in a way that would help them in the future.
One problem was that the team had to make a request to the IT department every time it needed a new data set or needed the data structured differently. Finally, after some long negotiations, the team was able to take control of data management from the IT department.
A deciding factor in this transition was that other departments would call Goolsby's team every time they ran into problems with analytics. Most of the time, the issue was a problem with a data set used by the analytics tool. After a long period of troubleshooting these requests and submitting new IT service tickets, his team negotiated control of the data sets for these projects.
The light comes on
Once Goolsby's team took responsibility for managing the data set, it was much easier to appreciate the need for a team with data engineering skills, so he began looking for help creating one.
USAA had an ongoing relationship with a local university to help attract talent. However, he said he soon realized the university did not have a data engineering curriculum, per se. So Goolsby encouraged the university to reach out to other schools and leading tech companies to help build a curriculum focused on the kinds of programming and data management skills required for students to become junior data engineers.
Anderson, of the Big Data Institute, said two fundamental skills for data engineers are advanced programming abilities and familiarity with distributed systems such as Spark and Hadoop. Although database administrators (DBAs) can support these teams, DBA skills are typically associated with SQL and implementing extract, transform and load capabilities. Other things to look for include some knowledge of analytics, good communication skills, an understanding of data schemas and domain knowledge.
Build a data science culture
USAA also worked with the Big Data Institute to help train more than 100 people at USAA in data engineering skills. As a result of these efforts, Goolsby has created a data engineering culture at USAA. Data engineers work with the IT team, business executives and data scientists to help identify and troubleshoot problems in their data management workflows.
Although there was a cost involved in building a data engineering team, Goolsby said that, ultimately, the ROI over several years exceeded his budget, because the cost of not completing some projects was quite high. Now, USAA is in a better position to identify data engineering needs as priorities and hand off those projects to folks who can deliver data in a few days in the way data scientists need it.
A side benefit is that the data engineering culture also improved the data scientists' ability to explore data. Goolsby said the data scientists tended to use the exploration tools they knew -- like Spark -- when SQL would have been a much more efficient tool for the job. While Spark works well for large data sets, it is less efficient on smaller sets. A focused person with advanced programming skills can go in to get exactly what the data scientist needs, Goolsby said.
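To make the point about small data sets concrete, here is a minimal sketch of exploring a handful of rows with plain SQL instead of a distributed engine. It uses Python's built-in sqlite3 module as a stand-in; the table, the column names and the sample rows are hypothetical illustrations and do not come from USAA.

```python
import sqlite3

# Hypothetical sample rows: (member_id, life_event, age). Not real USAA data.
ROWS = [
    (1, "marriage", 29),
    (2, "new_home", 34),
    (3, "marriage", 31),
    (4, "retirement", 65),
    (5, "new_home", 41),
]


def summarize_events(rows):
    """Load rows into an in-memory SQLite table and aggregate by life event."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE events (member_id INTEGER, life_event TEXT, age INTEGER)"
        )
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
        # One GROUP BY answers "how many members per event, and how old on average?"
        cursor = conn.execute(
            "SELECT life_event, COUNT(*) AS n, AVG(age) AS avg_age "
            "FROM events GROUP BY life_event ORDER BY life_event"
        )
        return cursor.fetchall()
    finally:
        conn.close()


summary = summarize_events(ROWS)
for event, n, avg_age in summary:
    print(f"{event}: {n} members, average age {avg_age:.1f}")
```

For a data set that fits comfortably on one machine, a query like this runs without any cluster setup, which is the kind of trade-off Goolsby describes.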