The increasing volume and complexity of enterprise data, and its central role in decision-making and strategic planning, are driving organizations to invest in the people, processes and technologies they need to gain valuable business insights from their data assets. That includes a variety of tools commonly used in data science applications.
In a survey conducted by consultancy NewVantage Partners in late 2021, 91.7% of IT and business executives from 94 large companies said they're increasing their investments in data and AI initiatives such as data science programs. Meanwhile, market research firm IDC predicted in an August 2021 report that overall spending on big data and analytics systems will grow at a compound annual growth rate of 12.8% worldwide through 2025.
As data science teams build their portfolios of enabling technologies, they can choose from a wide selection of tools and platforms. Here's a rundown of 18 top data science tools that may be able to aid you in the analytics process, listed in alphabetical order with details on their features and capabilities -- and some potential limitations.
1. Apache Spark
Apache Spark is an open source data processing and analytics engine that can handle large amounts of data -- upward of several petabytes, according to proponents. Spark's ability to rapidly process data has fueled significant growth in the use of the platform since it was created in 2009, helping to make the Spark project one of the largest open source communities among big data technologies.
Due to its speed, Spark is well suited for continuous intelligence applications powered by near-real-time processing of streaming data. However, as a general-purpose distributed processing engine, Spark is equally suited for extract, transform and load uses and other SQL batch jobs. In fact, Spark initially was touted as a faster alternative to the MapReduce engine for batch processing in Hadoop clusters.
Spark is still often used with Hadoop but can also run standalone against other file systems and data stores. It features an extensive set of developer libraries and APIs, including a machine learning library and support for key programming languages, making it easier for data scientists to quickly put the platform to work.
2. D3.js
D3.js lets visualization designers bind data to documents via the Document Object Model and then use DOM manipulation methods to make data-driven transformations to the documents. First released in 2011, it can be used to design various types of data visualizations and supports features such as interaction, animation, annotation and quantitative analysis.
3. IBM SPSS
IBM SPSS is a family of software for managing and analyzing complex statistical data. It includes two primary products: SPSS Statistics, a statistical analysis, data visualization and reporting tool, and SPSS Modeler, a data science and predictive analytics platform with a drag-and-drop UI and machine learning capabilities.
SPSS Statistics covers every step of the analytics process, from planning to model deployment, and enables users to clarify relationships between variables, create clusters of data points, identify trends and make predictions, among other capabilities. It can access common structured data types and offers a combination of a menu-driven UI, its own command syntax and the ability to integrate R and Python extensions, plus features for automating procedures and import-export ties to SPSS Modeler.
Created by SPSS Inc. in 1968, initially with the name Statistical Package for the Social Sciences, the statistical analysis software was acquired by IBM in 2009, along with the predictive modeling platform, which SPSS had previously bought. While the product family is officially called IBM SPSS, the software is still usually known simply as SPSS.
4. Julia
Julia is an open source programming language used for numerical computing, as well as machine learning and other kinds of data science applications. In a 2012 blog post announcing Julia, its four creators said they set out to design one language that addressed all of their needs. A big goal was to avoid having to write programs in one language and convert them to another for execution.
To that end, Julia combines the convenience of a high-level dynamic language with performance that's comparable to statically typed languages, such as C and Java. Users don't have to define data types in programs, but an option allows them to do so. The use of a multiple dispatch approach at runtime also helps to boost execution speed.
Julia 1.0 became available in 2018, nine years after work began on the language; the latest version is 1.7, released in November 2021. The documentation for Julia notes that, because its compiler differs from the interpreters in data science languages like Python and R, new users "may find that Julia's performance is unintuitive at first." But, it claims, "once you understand how Julia works, it's easy to write code that's nearly as fast as C."
5. Jupyter Notebook
An open source web application, Jupyter Notebook enables interactive collaboration among data scientists, data engineers, mathematicians, researchers and other users. It's a computational notebook tool that can be used to create, edit and share code, as well as explanatory text, images and other information. For example, Jupyter users can add software code, computations, comments, data visualizations and rich media representations of computation results to a single document, known as a notebook, which can then be shared with and revised by colleagues.
As a result, notebooks "can serve as a complete computational record" of interactive sessions among the members of data science teams, according to Jupyter Notebook's documentation. The notebook documents are JSON files that can be placed under version control. In addition, a Notebook Viewer service enables them to be rendered as static webpages for viewing by users who don't have Jupyter installed on their systems.
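To make the JSON point concrete, here is a minimal sketch in Python of what a notebook document can look like when serialized. It assumes version 4 of the notebook format, and the cell contents are invented for illustration; a notebook saved by Jupyter itself typically carries more metadata than shown here.

```python
import json

# A hand-built notebook document, assuming the version 4 notebook format.
# The cell contents below are hypothetical examples.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Sales analysis\n", "Notes for the team."],
        },
        {
            "cell_type": "code",
            "execution_count": None,  # not yet executed
            "metadata": {},
            "outputs": [],
            "source": ["total = 2 + 3\n", "total"],
        },
    ],
}

# Serialize to the text that would be stored in an .ipynb file, then read it back.
serialized = json.dumps(notebook, indent=1)
restored = json.loads(serialized)
cell_types = [cell["cell_type"] for cell in restored["cells"]]
```

Because the document is plain JSON text, it can be diffed and committed like any other text file.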
Jupyter Notebook's roots are in the programming language Python -- it originally was part of the IPython interactive toolkit open source project before being split off in 2014. The loose combination of Julia, Python and R gave Jupyter its name; along with supporting those three languages, Jupyter has modular kernels for dozens of others.
6. Keras
Keras is a programming interface that enables data scientists to more easily access and use the TensorFlow machine learning platform. It's an open source deep learning API and framework written in Python that runs on top of TensorFlow and is now integrated into that platform. Keras previously supported multiple back ends but was tied exclusively to TensorFlow starting with its 2.4.0 release in June 2020.
As a high-level API, Keras was designed to drive easy and fast experimentation that requires less coding than other deep learning options. The goal is to accelerate the implementation of machine learning models -- in particular, deep learning neural networks -- through a development process with "high iteration velocity," as the Keras documentation puts it.
The Keras framework includes a sequential interface for creating relatively simple linear stacks of layers with inputs and outputs, as well as a functional API for building more complex graphs of layers or writing deep learning models from scratch. Keras models can run on CPUs or GPUs and be deployed across multiple platforms, including web browsers and Android and iOS mobile devices.
7. Matlab
Developed and sold by software vendor MathWorks since 1984, Matlab is a high-level programming language and analytics environment for numerical computing, mathematical modeling and data visualization. It's primarily used by conventional engineers and scientists to analyze data, design algorithms and develop embedded systems for wireless communications, industrial control, signal processing and other applications, often in concert with a companion Simulink tool that offers model-based design and simulation capabilities.
While Matlab isn't as widely used in data science applications as languages like Python, R and Julia, it does support machine learning and deep learning, predictive modeling, big data analytics, computer vision and other work done by data scientists. Data types and high-level functions built into the platform are designed to speed up exploratory data analysis and data preparation in analytics applications.
Considered relatively easy to learn and use, Matlab -- which is short for matrix laboratory -- includes prebuilt applications but also enables users to build their own. It also has a library of add-on toolboxes with discipline-specific software and hundreds of built-in functions, including the ability to visualize data in 2D and 3D plots.
8. Matplotlib
Matplotlib is an open source Python plotting library that's used to read, import and visualize data in analytics applications. Data scientists and other users can create static, animated and interactive data visualizations with Matplotlib, using it in Python scripts, the Python and IPython shells, Jupyter Notebook, web application servers and various GUI toolkits.
The library's large code base can be challenging to master, but it's organized in a hierarchical structure that's designed to enable users to build visualizations mostly with high-level commands. The top component in the hierarchy is pyplot, a module that provides a "state-machine environment" and a set of simple plotting functions similar to the ones in Matlab.
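As a rough illustration of the pyplot "state-machine" style, the sketch below builds a simple line chart with a handful of high-level calls. The data values are made up, and the choice of the non-interactive Agg backend is an assumption made here so the script can run without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless environment, no GUI window needed
import matplotlib.pyplot as plt

# Hypothetical sample data.
x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

# pyplot tracks a "current" figure and axes, so each call below
# implicitly acts on the plot created by plt.figure().
plt.figure()
plt.plot(x, y, marker="o")
plt.title("Squares")
plt.xlabel("x")
plt.ylabel("x squared")

# Inspect the current axes to confirm what the calls above produced.
current_axes = plt.gca()
line_count = len(current_axes.get_lines())
title_text = current_axes.get_title()

plt.close("all")  # release the figure
```

In an interactive session or a notebook, the same calls would typically end with plt.show() or plt.savefig() rather than plt.close().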
First released in 2003, Matplotlib also includes an object-oriented interface that can be used together with pyplot or on its own; it supports low-level commands for more complex data plotting. The library is primarily focused on creating 2D visualizations but offers an add-on toolkit with 3D plotting features.
9. NumPy
Short for Numerical Python, NumPy is an open source Python library that's used widely in scientific computing, engineering, and data science and machine learning applications. The library consists of multidimensional array objects and routines for processing those arrays to enable various mathematical and logic functions. It also supports linear algebra, random number generation and other operations.
One of NumPy's core components is the N-dimensional array, or ndarray, which represents a collection of items that are the same type and size. An associated data-type object describes the format of the data elements in an array. The same data can be shared by multiple ndarrays, and data changes made in one can be viewed in another.
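The following short sketch shows those ideas in practice: an ndarray with a single data type, and a second ndarray that shares the same underlying data. The values are invented for illustration.

```python
import numpy as np

# A 2 x 3 ndarray; every element has the same type (64-bit integers here).
grid = np.arange(6, dtype=np.int64).reshape(2, 3)

# Basic slicing returns a view: a second ndarray over the same data.
first_row = grid[0]
first_row[1] = 99  # the change is visible through `grid` as well

shares = np.shares_memory(grid, first_row)
element_type = grid.dtype      # data-type object describing each element
bytes_per_item = grid.itemsize
```

Calling grid[0].copy() instead would produce an independent array, so later changes would no longer propagate.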
NumPy was created in 2006 by combining and modifying elements of two earlier libraries. The NumPy website touts it as "the universal standard for working with numerical data in Python," and it is generally considered one of the most useful libraries for Python because of its numerous built-in functions. It's also known for its speed, partly resulting from the use of optimized C code at its core. In addition, various other Python libraries are built on top of NumPy.
10. pandas
Another popular open source Python library, pandas typically is used for data analysis and manipulation. Built on top of NumPy, it features two primary data structures: the Series one-dimensional array and the DataFrame, a two-dimensional structure for data manipulation with integrated indexing. Both can accept data from NumPy ndarrays and other inputs; a DataFrame can also incorporate multiple Series objects.
Created in 2008, pandas has built-in data visualization capabilities, exploratory data analysis functions and support for file formats and languages that include CSV, SQL, HTML and JSON. Additionally, it provides features such as intelligent data alignment, integrated handling of missing data, flexible reshaping and pivoting of data sets, data aggregation and transformation, and the ability to quickly merge and join data sets, according to the pandas website.
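A small sketch can show how these pieces fit together: two Series with different labels are combined into a DataFrame, and pandas aligns them by index and marks the gap as missing data. The product names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# A Series built from a NumPy ndarray, labeled by product name.
prices = pd.Series(np.array([1.5, 2.0, 3.25]), index=["apple", "bread", "cheese"])

# A second Series in a different order and with no entry for "cheese".
units = pd.Series([10, 4], index=["bread", "apple"])

# Combining the Series aligns rows on their labels; "cheese" gets NaN for units.
orders = pd.DataFrame({"price": prices, "units": units})
orders["revenue"] = orders["price"] * orders["units"]

missing_units = int(orders["units"].isna().sum())
total_revenue = orders["revenue"].sum()  # NaN values are skipped by default
```

The alignment happens by label rather than by position, which is why the differing order of the two Series doesn't matter.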
The developers of pandas say their goal is to make it "the fundamental high-level building block for doing practical, real-world data analysis in Python." Key code paths in pandas are written in C or the Cython superset of Python to optimize its performance, and the library can be used with various kinds of analytical and statistical data, including tabular, time series and labeled matrix data sets.
11. Python
Python is the most widely used programming language for data science and machine learning and one of the most popular languages overall. The Python open source project's website describes it as "an interpreted, object-oriented, high-level programming language with dynamic semantics," with built-in data structures and dynamic typing and binding capabilities. The site also touts Python's simple syntax, saying that it's easy to learn and that its emphasis on readability reduces the cost of program maintenance.
The multipurpose language can be used for a wide range of tasks, including data analysis, data visualization, AI, natural language processing and robotic process automation. Developers can create web, mobile and desktop applications in Python, too. In addition to object-oriented programming, it supports procedural, functional and other programming styles, plus extensions written in C or C++.
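To illustrate the different programming styles, here is a minimal sketch that computes the same total three ways using only the standard library. The function names, class name and sample readings are all hypothetical.

```python
from functools import reduce

readings = [3.5, 4.0, 2.5]  # hypothetical sample data


# Procedural style: an explicit loop with a mutable accumulator.
def total_procedural(values):
    total = 0.0
    for value in values:
        total += value
    return total


# Functional style: fold the list with a function and no mutable state.
def total_functional(values):
    return reduce(lambda acc, value: acc + value, values, 0.0)


# Object-oriented style: state and behavior bundled together in a class.
class Accumulator:
    def __init__(self):
        self.total = 0.0

    def add(self, value):
        self.total += value
        return self


accumulator = Accumulator()
for reading in readings:
    accumulator.add(reading)

results = (
    total_procedural(readings),
    total_functional(readings),
    accumulator.total,
)
```

All three approaches give the same answer; which one reads best usually depends on the surrounding code.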
Python is used not only by data scientists, programmers and network engineers, but also by workers outside of computing disciplines, from accountants to mathematicians and scientists, who often are drawn to its user-friendly nature. Python 2.x and 3.x are both production-ready versions of the language, although support for the 2.x line ended in 2020.
12. PyTorch
An open source framework used to build and train deep learning models based on neural networks, PyTorch is touted by its proponents for supporting fast and flexible experimentation and a seamless transition to production deployment. The Python library was designed to be easier to use than Torch, a precursor machine learning framework that's based on the Lua programming language. PyTorch also provides more flexibility and speed than Torch, according to its creators.
First released publicly in 2017, PyTorch uses arraylike tensors to encode model inputs, outputs and parameters. Its tensors are similar to the multidimensional arrays supported by NumPy, but PyTorch adds built-in support for running models on GPUs. NumPy arrays can be converted into tensors for processing in PyTorch, and vice versa.
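The conversion between NumPy arrays and tensors can be sketched as follows. The array values are invented, and the device selection simply falls back to the CPU when no GPU is available, which is an assumption made so the example runs on any machine.

```python
import numpy as np
import torch

# Hypothetical input data held in a NumPy array.
array = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)

# NumPy array -> tensor (the tensor shares memory with the array).
tensor = torch.from_numpy(array)
doubled = tensor * 2

# Tensor -> NumPy array (works for tensors that live on the CPU).
back_to_numpy = doubled.numpy()

# Use a GPU if one is available; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
on_device = doubled.to(device)
total = on_device.sum().item()
```

The same tensor code runs on either device; only the call to .to(device) changes where the computation happens.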
The library provides various functions and techniques, including an automatic differentiation package called torch.autograd and a module for building neural networks, plus a TorchServe tool for deploying PyTorch models and deployment support for iOS and Android devices. In addition to the primary Python API, PyTorch offers a C++ API that can be used as a separate front-end interface or to create extensions to Python applications.
13. R
The R programming language is an open source environment designed for statistical computing and graphics applications, as well as data manipulation, analysis and visualization. Many data scientists, academic researchers and statisticians use R to retrieve, cleanse, analyze and present data, making it one of the most popular languages for data science and advanced analytics.
The open source project is supported by The R Foundation, and thousands of user-created packages with libraries of code that enhance R's functionality are available -- for example, ggplot2, a well-known package for creating graphics that's part of a collection of R-based data science tools called tidyverse. In addition, multiple vendors offer integrated development environments and commercial code libraries for R.
R is an interpreted language, like Python, and has a reputation for being relatively intuitive. It was created in the 1990s as an alternative version of S, a statistical programming language that was developed in the 1970s; R's name is both a play on S and a reference to the first letter of the names of its two creators.
14. SAS
SAS is an integrated software suite for statistical analysis, advanced analytics, BI and data management. Developed and sold by software vendor SAS Institute Inc., the platform enables users to integrate, cleanse, prepare and manipulate data; then they can analyze it using different statistical and data science techniques. SAS can be used for various tasks, from basic BI and data visualization to risk management, operational analytics, data mining, predictive analytics and machine learning.
The development of SAS started in 1966 at North Carolina State University; use of the technology began to grow in the early 1970s, and SAS Institute was founded in 1976 as an independent company. The software was initially built for use by statisticians -- SAS was short for Statistical Analysis System. But, over time, it was expanded to include a broad set of functionality and became one of the most widely used analytics suites in both commercial enterprises and academia.
Development and marketing are now focused primarily on SAS Viya, a cloud-based version of the platform that was launched in 2016 and redesigned to be cloud-native in 2020.
15. scikit-learn
Scikit-learn is an open source machine learning library for Python that's built on the SciPy and NumPy scientific computing libraries, plus Matplotlib for plotting data. It supports both supervised and unsupervised machine learning and includes numerous algorithms and models, called estimators in scikit-learn parlance. Additionally, it provides functionality for model fitting, selection and evaluation, and data preprocessing and transformation.
Initially called scikits.learn, the library started as a Google Summer of Code project in 2007, and the first public release became available in 2010. The first part of its name is short for SciPy toolkit and is also used by other SciPy add-on packages. Scikit-learn primarily works on numeric data that's stored in NumPy arrays or SciPy sparse matrices.
The library's suite of tools also enables various other tasks, such as data set loading and the creation of workflow pipelines that combine data transformer objects and estimators. But scikit-learn has some limits due to design constraints. For example, it doesn't support deep learning, reinforcement learning or GPUs, and the library's website says its developers "only consider well-established algorithms for inclusion."
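As an example of such a workflow pipeline, the sketch below chains a data transformer and an estimator and evaluates the result on held-out data. The choice of the bundled iris data set, the scaler, the logistic regression model and the split settings are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data set loading: a small numeric data set that ships with the library.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pipeline: a transformer (scaling) followed by an estimator (classifier).
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on data the model has not seen.
accuracy = model.score(X_test, y_test)
step_names = list(model.named_steps)
```

Wrapping both steps in one pipeline means the scaling learned from the training data is applied automatically at prediction time.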
16. SciPy
SciPy is another open source Python library that supports scientific computing uses. Short for Scientific Python, it features a set of mathematical algorithms and high-level commands and classes for data manipulation and visualization. It includes more than a dozen subpackages that contain algorithms and utilities for functions such as data optimization, integration and interpolation, as well as algebraic equations, differential equations, image processing and statistics.
The SciPy library is built on top of NumPy and can operate on NumPy arrays. But SciPy delivers additional array computing tools and provides specialized data structures, including sparse matrices and k-dimensional trees, to extend beyond NumPy's capabilities.
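The two specialized data structures mentioned above can be sketched briefly. The matrix values and point coordinates are invented, and the compressed sparse row format is just one of several sparse formats that could have been chosen here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial import KDTree

# Sparse matrix: only the nonzero entries of a NumPy array are stored.
dense = np.array([[0, 0, 3], [4, 0, 0], [0, 0, 0]])
sparse = csr_matrix(dense)
stored_values = sparse.nnz  # number of explicitly stored entries

# k-dimensional tree: fast nearest-neighbor lookup over a set of points.
points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
tree = KDTree(points)
distance, index = tree.query([0.9, 1.2])  # nearest stored point to the query
```

Both structures accept ordinary NumPy arrays as input, which reflects how SciPy builds on NumPy rather than replacing it.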
SciPy actually predated NumPy: It was created in 2001 by combining different add-on modules built for the Numeric library that was one of NumPy's predecessors. Like NumPy, SciPy uses compiled code to optimize performance; in its case, most of the performance-critical parts of the library are written in C, C++ or Fortran.
17. TensorFlow
TensorFlow is an open source machine learning platform developed by Google that's particularly popular for implementing deep learning neural networks. The platform takes inputs in the form of tensors that are akin to NumPy multidimensional arrays and then uses a graph structure to flow the data through a list of computational operations specified by developers. It also offers an eager execution programming environment that runs operations individually without graphs, which provides more flexibility for research and debugging machine learning models.
The platform also includes a TensorFlow Extended module for end-to-end deployment of production machine learning pipelines, plus a TensorFlow Lite one for mobile and IoT devices. TensorFlow models can be trained and run on CPUs, GPUs and Google's special-purpose Tensor Processing Units.
18. Weka
Weka is an open source workbench that provides a collection of machine learning algorithms for use in data mining tasks. Weka's algorithms, called classifiers, can be applied directly to data sets without any programming via a GUI or a command-line interface that offers additional functionality; they can also be implemented through a Java API.
The workbench can be used for classification, clustering, regression and association rule mining applications and also includes a set of data preprocessing and visualization tools. In addition, Weka supports integration with R, Python, Spark and libraries such as scikit-learn. For deep learning uses, an add-on package combines it with the Eclipse Deeplearning4j library.
Weka is free software licensed under the GNU General Public License. It was developed at the University of Waikato in New Zealand starting in 1992; an initial version was rewritten in Java to create the current workbench, which was first released in 1999. Weka stands for the Waikato Environment for Knowledge Analysis and is also the name of a flightless bird native to New Zealand that the technology's developers say has "an inquisitive nature."
Data science and machine learning platforms
Commercially licensed platforms that provide integrated functionality for machine learning, AI and other data science applications are also available from numerous software vendors. The product offerings are diverse -- they include machine learning operations hubs, automated machine learning platforms and full-function analytics suites, with some combining MLOps, AutoML and analytics capabilities. Many platforms incorporate some of the data science tools listed above.
Matlab and SAS can also be counted among the data science platforms. Other prominent platform options for data science teams include the following technologies:
- Alteryx Analytics Automation Platform
- Amazon SageMaker
- Azure Machine Learning
- Databricks Lakehouse Platform
- DataRobot AI Cloud Platform
- Domino Enterprise MLOps Platform
- Google Cloud Vertex AI
- H2O AI Cloud
- IBM Watson Studio
- Saturn Cloud
- Tibco Data Science
Some platforms are also available in free open source or community editions -- examples include Dataiku and H2O. Knime combines an open source analytics platform with a commercial Knime Server software package that supports team-based collaboration and workflow automation, deployment and management.