What is a data engineer?
A data engineer is an IT worker whose primary job is to prepare data for analytical or operational uses. These software engineers are typically responsible for building data pipelines to bring together information from different source systems. They integrate, consolidate and cleanse data and structure it for use in analytics applications. They aim to make data easily accessible and to optimize their organization's big data ecosystem.
The amount of data an engineer works with varies with the organization, particularly with respect to its size. The bigger the company, the more complex the analytics architecture, and the more data the engineer will be responsible for. Certain industries are more data-intensive, including healthcare, retail and financial services.
Data engineers work in conjunction with data science teams, improving data transparency and enabling organizations to make more trustworthy business decisions.
The data engineer role
Data engineers focus on collecting and preparing data for use by data scientists and analysts. They typically fall into one of three main roles:
- Generalists. Data engineers with a general focus typically work on small teams, doing end-to-end data collection, intake and processing. They may have a broader range of skills than most data engineers, but less knowledge of systems architecture. A data scientist looking to become a data engineer would fit well into the generalist role.
A project a generalist data engineer might undertake for a small, metro-area food delivery service would be to create a dashboard that displays the number of deliveries made each day for the past month and forecasts the delivery volume for the following month.
- Pipeline-centric engineers. These data engineers typically work on a midsize data analytics team and handle more complicated data science projects across distributed systems. Midsize and large companies are more likely to need this role.
A regional food delivery company might undertake a pipeline-centric project to create a tool for data scientists and analysts to search metadata for information about deliveries. They might look at distance driven and drive time required for deliveries in the past month, then use that data in a predictive algorithm to see what it means for the company's future business.
- Database-centric engineers. These data engineers are tasked with implementing, maintaining and populating analytics databases. This role typically exists at larger companies where data is distributed across several databases. The engineers work with pipelines, tune databases for efficient analysis, create table schemas and populate them using extract, transform, load (ETL) methods. ETL is a process in which data is copied from several sources, converted into a consistent form and written into a single destination system.
A database-centric project at a large, multistate or national food delivery service would be to design an analytics database. In addition to creating the database, the data engineer would write the code to get data from where it's collected in the main application database into the analytics database.
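The food delivery examples above can be sketched as a small ETL job. This is a minimal illustration, not a production design: it assumes a hypothetical `deliveries` table in the application database and a hypothetical `daily_deliveries` table in the analytics database, and it uses in-memory SQLite as a stand-in for both systems.

```python
import sqlite3


def run_etl(app_db: sqlite3.Connection, analytics_db: sqlite3.Connection) -> int:
    """Copy completed deliveries into a per-day summary table; return days written."""
    # Extract: read raw rows from the (hypothetical) application table.
    rows = app_db.execute("SELECT delivered_at, status FROM deliveries").fetchall()

    # Transform: keep completed deliveries and count them per calendar day.
    daily_counts = {}
    for delivered_at, status in rows:
        if status != "completed":
            continue
        day = delivered_at[:10]  # 'YYYY-MM-DD' prefix of an ISO 8601 timestamp
        daily_counts[day] = daily_counts.get(day, 0) + 1

    # Load: write the aggregated rows into the analytics table.
    analytics_db.execute(
        "CREATE TABLE IF NOT EXISTS daily_deliveries ("
        "day TEXT PRIMARY KEY, delivery_count INTEGER NOT NULL)"
    )
    analytics_db.executemany(
        "INSERT OR REPLACE INTO daily_deliveries (day, delivery_count) VALUES (?, ?)",
        sorted(daily_counts.items()),
    )
    analytics_db.commit()
    return len(daily_counts)


def build_sample_app_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the application database (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE deliveries ("
        "id INTEGER PRIMARY KEY, delivered_at TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO deliveries (delivered_at, status) VALUES (?, ?)",
        [
            ("2024-03-01T12:05:00", "completed"),
            ("2024-03-01T18:40:00", "completed"),
            ("2024-03-01T19:10:00", "cancelled"),
            ("2024-03-02T11:30:00", "completed"),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    app_db = build_sample_app_db()
    analytics_db = sqlite3.connect(":memory:")
    run_etl(app_db, analytics_db)
    query = "SELECT day, delivery_count FROM daily_deliveries ORDER BY day"
    for row in analytics_db.execute(query):
        print(row)
```

Because the load step uses `INSERT OR REPLACE` keyed on the day, rerunning the job overwrites existing daily totals rather than duplicating them. A dashboard like the one in the generalist example could then read directly from the summary table.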
Data engineer responsibilities
Data engineers often work as part of an analytics team alongside data scientists. The engineers provide data in usable formats to the data scientists who run queries and algorithms against the information for predictive analytics, machine learning and data mining applications. Data engineers also deliver aggregated data to business executives and analysts and other end users so they can analyze it and apply the results to improving business operations.
Data engineers deal with both structured and unstructured data. Structured data is information that can be organized into a formatted repository like a database. Unstructured data -- such as text, images, audio and video files -- doesn't conform to conventional data models. Data engineers must understand different approaches to data architecture and applications to handle both data types. A variety of big data technologies, such as open source data ingestion and processing frameworks, are also part of the data engineer's toolkit.
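One common way to bridge the two data types is to pull structured fields out of unstructured text. The sketch below is illustrative only: the `DeliveryNote` record, the regular expressions and the sample note format are all assumptions made for this example, and real free-text data usually needs far more robust parsing.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeliveryNote:
    """Structured record extracted from a free-text note (hypothetical schema)."""
    order_id: Optional[str]
    minutes_late: Optional[int]
    raw_text: str


# Hypothetical patterns for notes such as "Order #1042 arrived 15 minutes late".
ORDER_RE = re.compile(r"order\s+#?(\d+)", re.IGNORECASE)
LATE_RE = re.compile(r"(\d+)\s+min(?:ute)?s?\s+late", re.IGNORECASE)


def parse_note(text: str) -> DeliveryNote:
    """Extract the fields we can find; leave the rest as None and keep the raw text."""
    order = ORDER_RE.search(text)
    late = LATE_RE.search(text)
    return DeliveryNote(
        order_id=order.group(1) if order else None,
        minutes_late=int(late.group(1)) if late else None,
        raw_text=text,
    )


if __name__ == "__main__":
    print(parse_note("Order #1042 arrived 15 minutes late, customer was upset"))
    print(parse_note("Great service, thanks!"))
```

Keeping the original text alongside the extracted fields means the record can be reparsed later if the extraction rules improve.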
Data engineer skill set
Data engineers are skilled in programming languages such as C#, Java, Python, R, Ruby, Scala and SQL. Python, R and SQL are the three most important languages data engineers use.
Engineers need a good understanding of ETL tools and REST-oriented APIs for creating and managing data integration jobs. These skills also help in providing data analysts and business users with simplified access to prepared data sets.
Data engineers must understand data warehouses and data lakes and how they work. For instance, Hadoop data lakes can take over some of the processing and storage work of established enterprise data warehouses, which supports the big data analytics efforts data engineers work on.
Data engineers must also understand NoSQL databases and Apache Spark systems, which are becoming common components of data workflows. They should know relational database systems as well, such as MySQL and PostgreSQL. Another focus is Lambda architecture, which supports unified data pipelines for batch and real-time processing.
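The idea behind Lambda architecture can be sketched in a few lines. This is a toy, in-memory illustration under simplifying assumptions: the class name `LambdaCounts` and its methods are hypothetical, events are just region names, and the caller is assumed to pass the full historical dataset (including events the speed layer has already seen) to each batch run. Real systems would use distributed storage and processing frameworks for each layer.

```python
from collections import Counter
from typing import Iterable


class LambdaCounts:
    """Toy Lambda-style counter: batch view + real-time view, merged at query time."""

    def __init__(self) -> None:
        self.batch_view = Counter()     # precomputed from the full dataset
        self.realtime_view = Counter()  # incremental counts since the last batch run

    def run_batch(self, master_dataset: Iterable[str]) -> None:
        """Batch layer: recompute the view from the full historical dataset."""
        self.batch_view = Counter(master_dataset)
        # Assumes the dataset now includes every event the speed layer saw,
        # so the real-time counts can be discarded.
        self.realtime_view.clear()

    def ingest_event(self, region: str) -> None:
        """Speed layer: update counts immediately for events not yet batched."""
        self.realtime_view[region] += 1

    def query(self, region: str) -> int:
        """Serving layer: answer queries by merging both views."""
        return self.batch_view[region] + self.realtime_view[region]


if __name__ == "__main__":
    counts = LambdaCounts()
    counts.run_batch(["north", "north", "south"])
    counts.ingest_event("north")  # arrives after the batch run
    print(counts.query("north"))
```

The batch view is accurate but stale between runs, while the real-time view is fresh but covers only recent events. Merging them at query time is what lets one pipeline serve both batch and real-time needs.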
Business intelligence (BI) platforms and the ability to configure them are another important focus for data engineers. With BI platforms, they can establish connections among data warehouses, data lakes and other data sources. Engineers must know how to work with the interactive dashboards BI platforms use.
Although machine learning is more in the data scientist's or the machine learning engineer's skill set, data engineers must understand it, as well, to be able to prepare data for machine learning platforms. They should know how to deploy machine learning algorithms and gain insights from them.
Lastly, knowledge of Unix-based operating systems (OSes) is important. Linux and other Unix-like systems, such as Solaris, run much of the server infrastructure that data platforms depend on. Their command-line tools and root access give the user fine-grained control over the OS, which is useful for data engineers.
As the data engineer job has gained more traction, companies such as IBM and Hadoop vendor Cloudera Inc. have begun offering certifications for data engineering professionals. Some popular data engineer certifications include the following:
- Certified Data Professional is offered by the Institute for Certification of Computing Professionals, or ICCP, as part of its general database professional program. Several tracks are offered. Candidates must be members of the ICCP and pay an annual membership fee to take the exam.
- Cloudera Certified Professional Data Engineer verifies a candidate's ability to ingest, transform, store and analyze data in Cloudera's data tool environment. Cloudera charges a fee for its four-hour test. It consists of five to 10 hands-on tasks, and candidates must get a minimum score of 70% to pass. There are no prerequisites, but candidates should have extensive experience.
- Google Cloud Professional Data Engineer tests an individual's ability to use machine learning models, ensure data quality, and design and build data processing systems. Google charges a fee for the two-hour, multiple-choice exam. There are no prerequisites, but Google recommends having some experience with Google Cloud Platform.
As with many IT certifications, those in data engineering are often based on a specific vendor's product, and the training and exams focus on teaching people to use that vendor's software.
How to become a data engineer
Certifications alone aren't enough to land a data engineering job. Experience is also necessary to be considered for a position. Other ways to break into data engineering include the following:
- University degrees. Useful degrees for aspiring data engineers include bachelor's degrees in applied mathematics, computer science, physics or engineering. Also, master's degrees in computer science or computer engineering can help candidates set themselves apart.
- Online courses. Inexpensive and free online courses are a good way to learn data engineering skills. There are many useful videos on YouTube, as well as free online courses and resources, such as the following options:
  - Codecademy's Learn Python. Knowledge of Python is essential for data engineers. This course requires no prior knowledge.
  - Coursera's guide to Linux server management and security. This four-week course covers the Linux basics.
  - GitHub SQL Cheatsheet. This GitHub repository is consistently updated with SQL query examples.
  - O'Reilly data engineering e-books. Titles in the big data architecture section cover data engineering topics.
  - Udacity Data Engineering Nanodegree. Udacity's online learning offerings include a data engineering track.
- Project-based learning. With this more practical approach to learning data engineering skills, the first step is to set a project goal and then determine which skills are necessary to reach it. The project-based approach is a good way to maintain motivation and structure learning.
Data engineer vs. data scientist
Data engineers and data scientists work together. The data engineers prepare and organize the data that companies have in databases and other formats. They also build data pipelines that make data available to the data scientists. The data scientists use all that data for analytics and other projects that improve business operations and outcomes.
Data scientists and data engineers differ in their skill sets and focus. Data engineers don't necessarily have a specific focus; they tend to be competent in several areas and well-rounded in their knowledge and skills. By contrast, data scientists often have specialized areas of focus and are more concerned with exploratory data analysis. Data scientists tackle new, big-picture problems, while data engineers put the pieces in place to make that possible.
In addition to data engineers and data scientists, data management and analytics teams contain a variety of roles and specialties.