Browse Definitions :

Sergey Nivens - Fotolia

Top 25 big data glossary terms you should know

Big data knowledge and awareness can help businesses and individuals get ahead. Understanding the most important terms in big data is an excellent place to start that journey.

The significance of big data doesn't lie in the volume of data a business has accumulated. Its true value lays in how that data is used. Forward-looking organizations understand they need to capitalize on the potential of big data through its practical and thoughtful use to inform, steer and refine business decision-making.

Businesses and individuals in all industries have woken up to the potential benefits of big data and analytics. If you're interested in the field of big data -- whether you're a student contemplating it as a career path, or a business or technology professional looking to brush up on your knowledge -- getting up to speed on the most important big data terminology is a must.

To help get you started, we have collected key big data terms in this convenient glossary.

important characteristics and terms of big data
A collection of data from various sources, ranging from well-defined to loosely defined, big data is derived from human or machine sources. Big data is characterized by several terms starting with the letter V.

The top 25 big data terms you should know

  1. Algorithm: A procedure or formula for solving a problem based on conducting a sequence of specified actions. In the context of big data, algorithm refers to a mathematical formula embedded in software to perform an analysis on a set of data.
  2. Artificial intelligence: The simulation of human intelligence processes by machines, especially computer systems. These machines can perceive the environment and take corresponding required actions and even learn from those actions.
  3. Cloud computing: A general term for anything that involves delivering hosted services over the internet. For big data practitioners, cloud computing is important because their roles involve accessing and interfacing with software and/or data hosted and running on remote servers.
  4. Data lake: A storage repository that holds a vast amount of raw data in its native format until it's required. Every data element within a data lake is assigned a unique identifier and set of extended metadata tags. When a business question arises, users can access the data lake to retrieve any relevant, supporting data.
  5. Data science: The field of applying advanced analytics techniques and scientific principles to extract valuable information from data. Data science typically involves the use of statistics, data visualization and mining, computer programming, machine learning and database engineering to solve complex problems.
difference between a data scientist and a data engineer
Data scientists and data engineers play a big role in the world of big data, but their responsibilities and necessary skills differ.
  1. Database management system (DBMS): System software that serves as an interface between databases and end users or application programs, ensuring that data is consistently organized and remains easily accessible. DBMSes make it possible for end users to create, protect, read, update and delete data in a database.
  2. Data set: A collection of related, discrete items of data that may be accessed individually or collectively, or managed as a single, holistic entity. Data sets are generally organized into some formal structure, often in a tabular format.
  3. Hadoop: An open-source distributed processing framework that manages data processing and storage for big data applications. It provides a reliable means for managing pools of big data and supporting related analytics applications.
  4. Hadoop distributed file system (HDFS): The primary data storage system used by Hadoop HDFS employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
  5. Machine learning: A type of artificial intelligence that improves software applications' ability to predict accurate outcomes without being explicitly programmed to do so. Common use cases for machine learning include recommendation engines, fraud and malware threat detection, business process automation and predictive maintenance.
  6. MapReduce: Specific tools that support distributed computing on large data sets. These form core components of the Apache Hadoopsoftware framework.
  1. Natural language processing (NLP): A computer program's ability to understand both written and spoken human language. A component of artificial intelligence, NLP has existed for over five decades and has roots in the field of linguistics.
  2. NoSQL: An approach to database design that can accommodate a wide variety of data models, including key-value, document, columnar and graph formats. NoSQL, which stands for "not only SQL," is an alternative to traditional relational databases in which data is placed in tables and data schemais carefully designed before the database is built. NoSQL databases are especially useful for working with large sets of distributed data.
  3. Object-based image analysis: The analysis of digital images using data from individual pixels. It combines spectral, textural and contextual information to identify thematic classes in an image.
  1. Programming language: Used by developers and data scientists to perform big data manipulation and analysis. R, Python, and Scala are the three major languages for data science and data mining.
  2. Pattern recognition: The ability to detect arrangements of characteristics or data that provide information about a given system or data set. Patterns could manifest as recurring sequences of data that can be used to predict trends, specific configurations of features in images that identify objects, frequent combinations of words and phrases or clusters of activities on a network that could indicate a cyber attack.
  3. Python: An interpreted, object-oriented programming language that's gained popularity for big data professionals due to its readability and clarity of syntax. Python is relatively easy to learn and highly portable, as its statements can be interpreted in several operating systems.
  4. Qualifications and learning resources for big data careers: Students exploring a career in big data, and even established professionals seeking to augment their existing experience, have a host of options and resources at their disposal to advance their ambitions and grow their big data skills. These include university degrees at both undergraduate and graduate levels as well as online certifications and learning modules.
  5. R programming language: An open source scripting programming language used for predictive analytics and data visualization. R includes functions that support both linear and nonlinear modeling, classical statistics, classifications and clustering.
  6. Relational databases: A collection of information that organizes data points with defined relationships for easy access. Data structures (including data tables, indexes and views) remain separate from the physical storage. This enables administrators to edit the physical data storage without affecting the logical data structure.
  7. Scala: A software programming language that blends object-oriented methods with functional programming capabilities. This allows it to support a more concise programming style which reduces the amount of code that developers need to write. Another benefit is that Scala features, which operate well in smaller programs, also scale up effectively when introduced into more complex environments.
  8. Soft skills: Today's most successful big data professionals are those who can harmonize their academic qualifications, innate intellectual abilities and real-world experience with a diverse range of other softer skills. These soft skills include tenacity, problem-solving abilities, curiosity, effective communication, presentation and interpersonal skills, and well-rounded business understanding and acumen.
  1. Statistical computing: The collection and interpretation of data aimed at uncovering patterns and trends. It may be used in scenarios such as gathering research interpretations, statistical modeling or designing surveys and studies, and advanced business intelligence. R is a programming language that's highly compatible with statistical computing.
  2. Structured data: Structured data is data that has been organized into a formatted repository, typically a database, so that its elements can be made addressable for more effective processing and analysis.
  3. Unstructured data: Everything that can't be organized in the manner of structured data. Unstructured data includes emails and social media posts, blogs, and messages, transcripts of audio recordings of people's speech, images and video files, and machine data, such as log files from websites, servers, networks and applications.
unstructured data types

Learn more about big data, its terminology and what it means to succeed in this field by reading our guide to big data best practices, tools and trends.

Dig Deeper on Data and data management

  • SD-WAN security

    SD-WAN security refers to the practices, protocols and technologies protecting data and resources transmitted across ...

  • net neutrality

    Net neutrality is the concept of an open, equal internet for everyone, regardless of content consumed or the device, application ...

  • network scanning

    Network scanning is a procedure for identifying active devices on a network by employing a feature or features in the network ...

  • cloud penetration testing

    Cloud penetration testing is a tactic an organization uses to assess its cloud security effectiveness by attempting to evade its ...

  • cloud workload protection platform (CWPP)

    A cloud workload protection platform (CWPP) is a security tool designed to protect workloads that run on premises, in the cloud ...

  • out-of-band authentication

    Out-of-band authentication is a type of two-factor authentication (2FA) that requires a secondary verification method through a ...

  • strategic management

    Strategic management is the ongoing planning, monitoring, analysis and assessment of all necessities an organization needs to ...

  • IT budget

    IT budget is the amount of money spent on an organization's information technology systems and services. It includes compensation...

  • project scope

    Project scope is the part of project planning that involves determining and documenting a list of specific project goals, ...

  • director of employee engagement

    Director of employee engagement is one of the job titles for a human resources (HR) manager who is responsible for an ...

  • digital HR

    Digital HR is the digital transformation of HR services and processes through the use of social, mobile, analytics and cloud (...

  • employee onboarding and offboarding

    Employee onboarding involves all the steps needed to get a new employee successfully deployed and productive, while offboarding ...

Customer Experience
  • chatbot

    A chatbot is a software or computer program that simulates human conversation or "chatter" through text or voice interactions.

  • martech (marketing technology)

    Martech (marketing technology) refers to the integration of software tools, platforms, and applications designed to streamline ...

  • transactional marketing

    Transactional marketing is a business strategy that focuses on single, point-of-sale transactions.