data scientist

A data scientist is a professional responsible for collecting, analyzing and interpreting extremely large amounts of data. The data scientist role is an offshoot of several traditional technical roles, including mathematician, scientist, statistician and computer professional. This job requires the use of advanced analytics technologies, including machine learning and predictive modeling.

A data scientist requires large amounts of data to develop hypotheses, make inferences, and analyze customer and market trends. Basic responsibilities include gathering and analyzing data, using various types of analytics and reporting tools to detect patterns, trends and relationships in data sets.

In business, data scientists typically work in teams to mine big data for information that can be used to predict customer behavior and identify new revenue opportunities. In many organizations, data scientists are also responsible for setting best practices for collecting data, using analysis tools and interpreting data. 

The demand for data science skills has grown significantly over the years, as companies look to glean useful information from big data, the voluminous amounts of structured, unstructured and semi-structured data that a large enterprise or the internet of things produces and collects.

Why is data science important?

Data science is a highly interdisciplinary practice involving a large scope of information and one that usually takes into account the big picture more than other analytical fields. In business, the goal of data science is to provide intelligence about consumers and campaigns and help companies create strong plans to engage their audience and sell their products.

Data scientists must draw creative insights from big data, the large amounts of information gathered through various collection processes and examined with techniques like data mining.

On an even more fundamental level, big data analytics can help brands understand the customers who ultimately help determine the long-term success of a business or initiative. In addition to targeting the right audience, data science can be used to help companies control the stories of their brands.

Because big data is a rapidly growing field, there are constantly new tools available, and those tools need experts who can quickly learn their applications. Data scientists can help companies create a business plan to achieve goals based on research and not just intuition.

Data science plays a very important role in security and fraud detection, because the massive amounts of information allow for drilling down to find slight irregularities in data that can expose weaknesses in security systems.
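
As a rough illustration of drilling down to slight irregularities, the sketch below flags transaction amounts that sit far from the typical value. It uses a modified z-score based on the median, which is one common choice and not necessarily what any given fraud system uses. The function name `flag_irregular`, the threshold of 3.5 and the sample amounts are all assumptions made for this example.

```python
from statistics import median

def flag_irregular(amounts, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold."""
    med = median(amounts)
    # Median absolute deviation: a spread measure that one extreme value cannot inflate.
    mad = median(abs(x - med) for x in amounts)
    if mad == 0:
        return []  # all values are (nearly) identical, so nothing stands out
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, x in enumerate(amounts)
            if 0.6745 * abs(x - med) / mad > threshold]

# Hypothetical card transactions in dollars; one is far outside the usual range.
transactions = [42.10, 39.95, 41.30, 40.75, 43.00, 38.60, 412.00, 40.20]
suspicious = flag_irregular(transactions)
print(suspicious)
```

A flagged index is only a prompt for closer review, not proof of fraud. Real systems typically combine many signals beyond a single amount.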

Data science is a driving force behind highly specialized user experiences created through personalization and customization. The analysis can be used to make customers feel seen and understood by a company.

Roles and responsibilities

The data scientist role draws on several major modern technical fields, including science, math, statistics, chemometrics and computer science. The mix of personality traits, experience and analytics skills required for this role is rare, so the demand for qualified data scientists is on an upward swing.

Data scientist topped the list of "50 Best Jobs in America" by Glassdoor in 2016, 2017, 2018 and 2019, based on metrics such as job satisfaction, number of job openings and median base salary.

Basic responsibilities include analyzing large data sets of quantitative and qualitative data. These professionals are tasked with developing statistical learning models for data analysis and must have experience using statistical tools. They must also have the required knowledge to create complex predictive models. 

Some professionals who might engage in data science work or become full-time data scientists include computer scientists, database and software programmers, disciplinary experts, curators, and expert annotators and librarians. Job postings for data scientists may also advertise the opening as "machine learning architect" or "data strategy architect."

[Figure: The personal and professional attributes of a data scientist.]

Soft skills required for this role include intellectual curiosity, combined with skepticism and intuition, along with creativity. Interpersonal skills are a critical part of the role, as it involves working across many teams on a regular basis. Many employers expect their data scientists to be strong storytellers who know how to present data insights to people at all levels of an organization. They also need leadership skills to steer data-driven decision-making processes in an organization. Business savvy and the ability to predict risks are also important characteristics for handling the massive amount of data required for predictive analytics.

Qualifications and required skills

Data scientists generally need enough educational or experiential background to complete a wide range of extremely complex planning and analytical tasks in real time. While a specific job might call for specific qualifications, most, if not all, data science roles require at a bare minimum a bachelor's degree in a technical field.

Data science requires knowledge of a number of big data platforms and tools, including Hadoop, Pig, Hive, Spark and MapReduce; programming languages, including SQL, Python, Scala and Perl; and statistical computing languages, such as R.

Hard skills required for the job include data mining, machine learning, deep learning, and the ability to integrate structured and unstructured data. Experience with statistical research techniques, such as modeling, clustering, data visualization and segmentation, and predictive analysis, is also a big part of the role.

In job postings, necessary skills typically include the following:

  • expertise in all phases of data science, from initial discovery through cleaning, model selection, validation and deployment;
  • knowledge and understanding of common data warehouse structures;
  • experience with using statistical approaches to solve analytical problems;
  • proficiency in common machine learning frameworks;
  • experience with public cloud platforms and services;
  • familiarity with a wide variety of data sources, including databases, public or private APIs and standard data formats, like JSON, YAML and XML;
  • ability to identify new opportunities to apply machine learning to business processes to improve their efficiency and effectiveness;
  • ability to design and implement reporting dashboards that can track key business metrics and provide actionable insights;
  • experience with techniques for both qualitative and quantitative analysis;
  • ability to share qualitative and quantitative analysis in a way the audience will understand;
  • familiarity with machine learning techniques, such as K-nearest neighbors, Naive Bayes, random forests and support vector machines;
  • ability to design and implement validation tests;
  • advanced degree, with a specialization in statistics, computer science, data science, economics, mathematics, operations research or another quantitative field;
  • experience in visualization tools, such as Tableau and Power BI;
  • coding skills, such as R, Python or Scala;
  • ability to aggregate data from disparate sources; and
  • ability to conduct ad hoc analysis and present results in a clear manner.
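
To make one of the techniques named in the list more concrete, here is a minimal sketch of K-nearest neighbors classification written with only the Python standard library. In practice, a machine learning framework would normally be used instead. The function name `knn_predict`, the customer segments and every number in the data set are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the labeled points by Euclidean distance to the query and keep the closest k.
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (visits per month, average basket in dollars) -> segment
train = [
    ((2, 15.0), "occasional"),
    ((3, 22.0), "occasional"),
    ((1, 18.0), "occasional"),
    ((12, 80.0), "loyal"),
    ((15, 95.0), "loyal"),
    ((10, 70.0), "loyal"),
]

print(knn_predict(train, (11, 75.0)))
print(knn_predict(train, (2, 20.0)))
```

The features here are left unscaled, so the dollar amount dominates the distance. A real analysis would usually normalize the features first and choose k through validation tests.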

Education, training and certifications

The education requirements for data scientists typically include an advanced degree in statistics, data science, computer science or mathematics. There are a number of certification opportunities for this role, including Dell EMC DECA-DS, MCSA: Various SQL/Data Engineering Options, Microsoft MCSE Data Management and Analytics, and Certified Analytics Professional.

Data scientist salary

The additional responsibilities and expectations of working abstractly at a massive scale translate to a salary considerably higher than that of a data analyst. According to Glassdoor, the U.S. average data scientist salary was $117,345 as of October 2019.

Data scientist vs. citizen data scientist

The differences between data scientists and citizen data scientists include the following:

Education. Data scientists usually have at least a bachelor's degree in mathematics, data analytics, computer science or statistics. On the other hand, citizen data scientists might have a wide variety of educational backgrounds, but have experience with analytical tools and software that makes them better able to create models and perform complex analyses without a formal education in the aforementioned fields. 

Code. Citizen data scientists generally rely on software tools that include prebuilt modeling tools, drag-and-drop features and user-friendly algorithms to perform standard analyses. These tools do not prevent a citizen data scientist from discovering important patterns or data points. Professional data scientists are able to create complex custom algorithms and approach data analysis in creative new ways.

Salary. Data scientist is one of the highest-paying job titles, and there is a high demand for professionals who are able to complete the various responsibilities of the role. On the other hand, citizen data scientists may be hobbyists or volunteers, or may receive a small amount of compensation for the work they do for major companies.

What are the six major areas of data science?

The six major areas of data science include the following:

  • Multidisciplinary investigations. Considering large, complex systems with interconnected pieces, data scientists use varying methods to collect large amounts of data.
  • Models and methods for data. Data scientists need to rely on experience and intuition to decide which methods will work best for modeling their data, and they need to adjust those methods continuously to home in on the insights they seek.
  • Pedagogy. It is up to data scientists to work with companies and clients to determine the best ideologies to apply while collecting and analyzing information about their customers and products.
  • Computing with data. What all data science projects have in common is the need for tools and software to run the algorithms and statistics involved, because the pool of information they work with is so massive.
  • Theory. Data science theory is an evolving and sophisticated professional arena with countless applications.
  • Tool evaluation. There are many tools available for data scientists to use to manipulate and study huge quantities of data, and it's important to always evaluate their effectiveness and keep trying new ones as they become available.

Industries that rely on data science

Industries and sectors that are heavily affected by data scientist professionals include, but are not limited to, the following:

  • agriculture
  • big data
  • digital economy
  • economics
  • fraud detection
  • healthcare
  • human resources
  • IT
  • marketing analytics
  • marketing optimization
  • public policy
  • risk management
  • robotics
  • machine translation
  • manufacturing
  • medical informatics
  • social science
  • speech recognition
  • travel

History of data science

Data science is largely a branch of computer science. The term was first used in 1960 by Peter Naur, who was a pioneer in computer science. He described the foundational aspects of the techniques and approaches used in data science in his 1974 book, Concise Survey of Computer Methods.

In 1996, the International Federation of Classification Societies used the term data science in its conference. The statistician William S. Cleveland introduced data science as a discipline in his article, "Data Science: An Action Plan for Expanding the Technical Areas of Statistics," which was published in 2001 in the International Statistical Review. Over the years, it morphed and grew into one of the most sought-after, fast-moving disciplines in modern technology.

Recently, the U.S. Office of Personnel Management (OPM) authorized federal agencies to use a parenthetical of (data scientist) along with the occupational title for positions that perform data science work as a major portion of the job. OPM has determined that data science work may be found in various occupational series, including but not limited to jobs in epidemiology, actuarial science, operations research, statistics and information technology. The Center for Optimization & Data Science supports data scientists at the Census Bureau and promotes their leadership in adaptive design, data analytics and machine learning for other government agencies.

Although data scientist is consistently ranked as one of the best jobs in yearly polls, data scientists still face some of the same setbacks as statisticians and those in similar roles. While they are often hired to make sense of large systems of information, they are not always given specific questions to ask or directions in which to take their research. Many companies ask employees to complete data science work without investing the money in a full data science team. Data scientists also sometimes encounter incorrect or disorganized data, known as dirty data, which can improperly skew the results of their models.

Data scientist vs. data analyst

The role of data scientist is often confused with that of data analyst. But while there is overlap in many of the skills, there are also some significant differences.

Though the role of a data analyst varies depending on the company, in general, these professionals collect data, process that data and perform statistical analysis using standard statistical tools and techniques. Analysts also identify patterns and make correlations in data sets to identify new opportunities for improvements in business processes, products or services. In some cases, data analysts also design, build, and maintain big data and relational database systems. The average U.S. data analyst salary as of October 2019 was $67,377, according to Glassdoor. 

Data scientists are responsible for those tasks and many more. These professionals are equipped to analyze big data using advanced analytics tools and are expected to have the research background to develop new algorithms for specific problems. They may also be tasked with exploring data without a specific problem to solve. In that scenario, they need to understand the data and the business well enough to formulate questions and deliver insights back to business executives, with the goal of improving business operations, products, services or customer relations.

Difference between structured and unstructured data

One of the major factors that separates data scientists from traditional statisticians and mathematicians is their ability to analyze unstructured data. Structured data is information that can be analyzed, mapped out and loaded into databases, spreadsheets and organized systems. Unstructured data, on the other hand, is more organic and takes some creative approaches, such as coding, to load into analytics models.

For example, if a weather channel releases 45 weather-related videos on its website in one month, the structured data might include the number of times they were uploaded, the length of each video and the keywords included with each one. Unstructured data, which is often qualitative in nature, could range from sentiment analysis -- whether the presenter's tone was upbeat -- to how well the video supported the weather channel's brand.

That information might be mappable in a graph database, but it could also be assigned codes and treated like quantifiable data. Similarly, it might be easy to get quantifiable results based on how people reacted to each video if it included some kind of positivity metric, like a favorites button. But to collect data about public reactions to it beyond those who gave feedback, a data scientist would need to delve deeper into some qualitative research.

Semi-structured data lies somewhere between structured and unstructured data. Semi-structured refers to data that can fall into very specific categories and subcategories, but is not already organized into easy-to-manipulate compartments.
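
The three kinds of data can be sketched using the weather video example. The snippet below shows a structured row, a semi-structured JSON record that must be flattened, and an unstructured viewer comment that is assigned a numeric code. All field names, values, keyword lists and the helper functions `code_sentiment` and `flatten` are assumptions invented for this illustration. The keyword count is a deliberately crude stand-in for real sentiment analysis.

```python
import json

# Structured: fixed fields that drop straight into a table (values are made up).
structured_row = {"video_id": 17, "length_seconds": 95, "keywords": "storm,coast"}

# Semi-structured: clearly categorized and nested, but not yet in flat columns.
semi_structured = json.loads(
    '{"video_id": 17, "reactions": {"favorites": 240, "shares": 31},'
    ' "tags": ["storm", "coast"]}'
)

# Unstructured: free text with no fields at all.
viewer_comment = "Great forecast, really clear and upbeat presenter!"

# Hypothetical code book used to turn free text into a quantifiable value.
POSITIVE = {"great", "clear", "upbeat", "helpful"}
NEGATIVE = {"confusing", "boring", "wrong"}

def code_sentiment(text):
    """Return +1, 0 or -1 based on a simple keyword count."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return (score > 0) - (score < 0)

def flatten(record):
    """Pull the nested fields of the semi-structured record into one flat row."""
    return {
        "video_id": record["video_id"],
        "favorites": record["reactions"]["favorites"],
        "shares": record["reactions"]["shares"],
        "tag_count": len(record["tags"]),
    }

# One analysis-ready row that combines all three sources.
row = {**structured_row, **flatten(semi_structured),
       "sentiment_code": code_sentiment(viewer_comment)}
print(row)
```

Once the comment has a code and the nested record is flattened, all three sources can be analyzed together like ordinary quantitative data.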

Common methods used in data science

  • Machine learning or statistical learning. Machine learning and statistical learning are forms of artificial intelligence in which systems like computers use algorithms and statistical models to become more accurate and efficient at tasks over time, without being explicitly programmed for each task.
  • Signal processing. Signal processing is any method used to analyze and improve digital signals.
  • Data mining. Data mining is the process of sorting through large amounts of information about websites, users, software or other stakeholders in a digital process to find patterns and relationships, often for the purpose of learning about customers or product users to improve business practices and sales.
  • Databases. Databases are large collections of information created for the purpose of organizing and analyzing data.
  • Data engineering. Data engineering, similar to data science, is the practice of manipulating data in various ways, with the intention of discovering insights or improving operations.
  • Visualization. Large amounts of data can be organized into charts or models to be quickly understood by viewers without needing to involve them in the granular aspects of an analysis.
  • Data preparation. Data preparation is any process used to gather, combine, organize and structure data into an easily digestible format that is ready for analysis.
  • Predictive modeling. Predictive modeling is the process of creating charts and models to test different scenarios and, by applying statistics and mathematics, try to make the most educated guess about the likeliest outcome.
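
As a small illustration of predictive modeling, the sketch below fits a straight line to past observations with ordinary least squares and uses it to estimate the likeliest outcome for a new scenario. This is only one simple modeling choice among many. The function name `fit_line` and the advertising and sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: ad spend and resulting sales, both in thousands of dollars.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.1, 13.8, 16.2, 17.9, 20.0]

slope, intercept = fit_line(ad_spend, sales)

# Test a scenario the data has not seen: what if spend rises to 6.0?
forecast = slope * 6.0 + intercept
print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```

The forecast is an educated guess that assumes the past relationship continues to hold. A real project would validate the model on held-out data before relying on it.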

This was last updated in October 2019.