data labeling

Data labeling, in the context of machine learning, is the process of detecting and tagging data samples. The process can be manual but is usually performed or assisted by software.

What is data labeling used for?

Data labeling is an important part of data preprocessing for machine learning (ML), particularly for supervised learning, in which both input and output data are labeled for classification to provide a learning basis for future data processing.

A system being trained to identify animals in images, for example, might be provided with many labeled images of various types of animals, from which it would learn the common features of each. This enables it to correctly identify the animals in unlabeled images.
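The idea of learning from labeled examples can be sketched in a few lines of Python. This is a minimal illustration, not a production image classifier: it assumes each image has already been reduced to two made-up numeric features (the values and the function names `train_centroids` and `predict` are hypothetical), and it uses a simple nearest-centroid rule in place of a real ML model.

```python
from collections import defaultdict
from math import dist


def train_centroids(samples):
    """Average the feature vectors of each label to get one centroid per label."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical labeled data: (feature vector, label assigned by a labeler).
labeled = [
    ((4.0, 30.0), "cat"),
    ((5.0, 25.0), "cat"),
    ((30.0, 60.0), "dog"),
    ((35.0, 70.0), "dog"),
]

centroids = train_centroids(labeled)
print(predict(centroids, (6.0, 28.0)))  # an unlabeled sample; prints "cat"
```

The labels are what make the averaging meaningful: without them, the system would have no way of knowing which samples belong together.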

Data labeling is also used when constructing ML algorithms for autonomous vehicles. Autonomous vehicles such as self-driving cars need to be able to tell the difference between objects in their path so that they can process the external world and drive safely. Data labeling is used to enable the car's artificial intelligence (AI) to tell the difference between a person, the street, another car and the sky by labeling the key features of those objects or data points and looking for similarities between them.

How does data labeling work?

ML and deep learning systems often require massive amounts of data to establish a foundation for reliable learning patterns. The data they use to inform learning must be labeled or annotated based on data features that help the model organize the data into patterns that produce a desired answer.

The labels used to identify data features must be informative, discriminating and independent to produce a quality algorithm. A properly labeled dataset provides a ground truth that the ML model uses to check its predictions for accuracy and to continue refining its algorithm.

A quality algorithm depends on labels that are high in both accuracy and quality. Accuracy refers to how closely individual labels in the dataset match ground truth. Quality refers to how consistently accurate the labels are across the entire dataset.
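One way to make the two terms concrete is to measure them on a small audited sample. The sketch below is an assumption about how a team might do this, not a standard metric: it treats accuracy as the share of submitted labels that match ground truth, and approximates consistency by comparing accuracy across batches. The batch data and the function name `label_accuracy` are hypothetical.

```python
def label_accuracy(labels, ground_truth):
    """Fraction of labels that match the ground-truth labels."""
    if len(labels) != len(ground_truth) or not labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels, ground_truth))
    return matches / len(labels)


# Hypothetical audit: two batches of submitted labels and their ground truth.
gold_batches = [
    ["cat", "dog", "cat", "dog"],
    ["dog", "dog", "cat", "cat"],
]
submitted_batches = [
    ["cat", "dog", "cat", "dog"],
    ["dog", "cat", "cat", "dog"],
]

# Accuracy of each batch on its own.
per_batch = [
    label_accuracy(submitted, gold)
    for submitted, gold in zip(submitted_batches, gold_batches)
]

# Accuracy over all labels pooled together.
overall = label_accuracy(
    [label for batch in submitted_batches for label in batch],
    [label for batch in gold_batches for label in batch],
)

# A large gap between the best and worst batch signals inconsistent labeling.
spread = max(per_batch) - min(per_batch)

print(per_batch, overall, spread)  # [1.0, 0.5] 0.75 0.5
```

Here the overall accuracy of 0.75 hides the fact that one batch is perfect and the other is only half right, which is why the dataset-wide view matters alongside the accuracy of individual labels.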

Errors in data labeling impair the quality of the training dataset and the performance of any predictive models it’s used for. To mitigate this, many organizations take a Human-in-the-Loop (HITL) approach, maintaining human involvement in training and testing data models throughout their iterative growth.
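A common way to keep a human in the loop is to let the model label only the items it is confident about and send the rest to a person. The sketch below illustrates that routing step under stated assumptions: the `Prediction` class, the `route_for_review` function and the 0.9 threshold are all hypothetical choices for illustration, not part of any particular HITL product.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # model's confidence in the label, from 0.0 to 1.0


def route_for_review(predictions, threshold=0.9):
    """Split predictions into auto-accepted labels and ones needing human review."""
    auto_accepted, needs_review = [], []
    for prediction in predictions:
        if prediction.confidence >= threshold:
            auto_accepted.append(prediction)
        else:
            needs_review.append(prediction)
    return auto_accepted, needs_review


# Hypothetical model output for three unlabeled images.
predictions = [
    Prediction("img-001", "person", 0.98),
    Prediction("img-002", "car", 0.62),
    Prediction("img-003", "street", 0.91),
]

accepted, review_queue = route_for_review(predictions)
print([p.item_id for p in accepted])      # ['img-001', 'img-003']
print([p.item_id for p in review_queue])  # ['img-002']
```

Labels corrected by reviewers can then be added back to the training set, which is what makes the process iterative.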

Methods of data labeling

An enterprise can use several methods to structure and label its data. The options range from using in-house staff to crowdsourcing and data labeling services. These options include the following:

  • Crowdsourcing. A third-party platform gives an enterprise access to many workers at once.
  • Contractors. An enterprise can hire temporary freelance workers to process and label data.
  • Managed teams. An enterprise can enlist a managed team to process data. Managed teams are trained, evaluated and managed by a third-party organization.
  • In-house staff. An enterprise can use its existing employees to process data.

There is no one optimal method for labeling data. Enterprises should use the method or combination of methods that best suits their needs. Some criteria to consider when choosing a data labeling method are as follows:

  • the size of the enterprise;
  • the size of the dataset that requires labeling;
  • the skill level of employees on staff;
  • the financial constraints of the enterprise; and
  • the purpose of the ML model being supplemented with labeled data.

A good data labeling team should ideally have domain knowledge of the industry an enterprise serves. Data labelers who have outside context guiding them are more accurate. They should also be flexible and agile, because data labeling and ML are iterative processes, always changing and evolving as more information is taken in.

Importance of data labeling

A recent report from AI research and advisory firm Cognilytica found that over 80% of the time enterprises spend on AI projects goes toward preparing, cleaning and labeling data. Manual data labeling is the most time-consuming and expensive method, but it may be warranted for important applications.

Critics of AI speculate that automation will put low-skill jobs, such as call center work and truck and Uber driving, at risk, because rote tasks are becoming easier for machines to perform. However, some experts believe that data labeling may present a new low-skill job opportunity to replace the ones that are eliminated by automation, because there is an ever-growing volume of data that machines must process to perform the tasks necessary for advanced ML and AI.

This was last updated in August 2019
