Google Cloud Dataproc

Google Cloud Dataproc is a managed service for processing large datasets, such as those used in big data initiatives. Dataproc is part of Google Cloud Platform, Google's public cloud offering.

Dataproc helps users process, transform and understand vast quantities of data. For example, organizations could use the service to process data from millions of internet of things (IoT) devices, to predict manufacturing or sales opportunities from business data, or to analyze log files to spot potential security flaws.

The Dataproc service allows users to create managed clusters that can scale from three to hundreds of nodes. Users can create clusters on demand, use them for the duration of the processing task and then turn them off when the task is complete. Users can also size clusters based on the type of workload, budget limitations, performance requirements and existing resources. Clusters can be scaled up or down dynamically, even while jobs are running. Users pay only for the compute resources consumed while the cluster is in use.
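Resizing a running cluster can be sketched as a single request to the Cloud Dataproc REST API. The minimal example below only assembles a PATCH request that changes the worker count; it sends nothing and leaves out authentication. The function name and the project, region and cluster values are placeholders chosen for illustration.

```python
import json
from urllib.parse import quote, urlencode

DATAPROC_API = "https://dataproc.googleapis.com/v1"


def build_resize_request(project_id, region, cluster_name, num_workers):
    """Return (method, url, body) for changing a cluster's worker count.

    Illustrative helper: it builds the request but does not send it.
    """
    path = "/projects/{}/regions/{}/clusters/{}".format(
        quote(project_id), quote(region), quote(cluster_name)
    )
    # The update mask tells the API which single field is being changed.
    query = urlencode({"updateMask": "config.worker_config.num_instances"})
    body = json.dumps({"config": {"workerConfig": {"numInstances": num_workers}}})
    return "PATCH", DATAPROC_API + path + "?" + query, body


# Placeholder values; substitute a real project, region and cluster.
method, url, body = build_resize_request("my-project", "global", "demo-cluster", 5)
print(method, url)
print(body)
```

An authenticated HTTP client would then send this request; the cluster keeps running while workers are added or removed.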

Dataproc is built on open source platforms, including:

  • Apache Hadoop -- supports the distributed processing of large data sets across clusters
  • Apache Spark -- serves as the engine for fast, large-scale data processing
  • Apache Pig -- analyzes large data sets
  • Apache Hive -- provides data warehousing and SQL-based management of data held in distributed storage

Dataproc supports native versions of Hadoop, Spark, Pig and Hive, allowing users to employ the latest versions of each platform, as well as the entire ecosystem of related open source tools and libraries. Users can develop Dataproc jobs in languages that are popular within the Spark and Hadoop ecosystem, such as Java, Scala, Python and R.
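As an illustration of a job written in one of these languages, here is a minimal PySpark word count of the kind that could be submitted to a cluster. It is a sketch, not a recommended layout: the script structure, the `tokenize` helper and the two command-line arguments (input path and output path) are assumptions made for the example.

```python
import re
import sys
from operator import add


def tokenize(line):
    """Split a line of text into lowercase words."""
    return re.findall(r"[a-z0-9']+", line.lower())


def main(input_path, output_path):
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
    counts = (
        spark.sparkContext.textFile(input_path)  # one record per line
        .flatMap(tokenize)                       # one record per word
        .map(lambda word: (word, 1))
        .reduceByKey(add)                        # sum the counts per word
    )
    counts.saveAsTextFile(output_path)
    spark.stop()


# Runs only when both paths are supplied on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Keeping the per-line logic in a plain function makes it possible to check that logic locally before running the job on a cluster.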

Google Cloud Dataproc is integrated with other Google Cloud Platform services, such as BigQuery and Bigtable.

Users can create and manage clusters and run Spark or Hadoop jobs using the Google Cloud Platform console, the Google Cloud software development kit (SDK) or the Cloud Dataproc representational state transfer (REST) application programming interface (API).
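To give a sense of the REST route, the sketch below assembles a cluster-creation request for a small cluster with one master and two workers. It only builds the URL and JSON body and does not send them; authentication is omitted, and the function name, project, region, cluster name and machine type are placeholder assumptions.

```python
import json
from urllib.parse import quote

DATAPROC_API = "https://dataproc.googleapis.com/v1"


def build_create_cluster_request(project_id, region, cluster_name,
                                 num_workers=2, machine_type="n1-standard-4"):
    """Return (method, url, body) for creating a cluster.

    Illustrative helper: it builds the request but does not send it.
    """
    url = "{}/projects/{}/regions/{}/clusters".format(
        DATAPROC_API, quote(project_id), quote(region)
    )
    cluster = {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            # One master node coordinates the cluster.
            "masterConfig": {"numInstances": 1, "machineTypeUri": machine_type},
            # Worker nodes run the Spark or Hadoop tasks.
            "workerConfig": {"numInstances": num_workers,
                             "machineTypeUri": machine_type},
        },
    }
    return "POST", url, json.dumps(cluster)


# Placeholder values; substitute a real project and region.
method, url, body = build_create_cluster_request("my-project", "global", "demo-cluster")
print(method, url)
print(body)
```

The SDK covers the same operation from the command line through `gcloud dataproc clusters create`, with `--num-workers` setting the worker count.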

Billing for Google Cloud Dataproc is currently an incremental fee of $0.01 per hour for each virtual CPU (vCPU) in the virtual machines (VMs) that make up the Dataproc cluster. Other services involved in Dataproc projects, such as BigQuery and Bigtable, carry additional costs.
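The incremental fee is simple arithmetic: the hourly rate multiplied by the number of billed units and by the hours the cluster runs. The sketch below works that out, taking the rate from the figure quoted above and treating the billed unit as whatever the fee is charged per. The function name and the example cluster size are illustrative assumptions, and the result excludes the underlying compute cost and any other services.

```python
from decimal import Decimal

# Rate quoted in the text; treat it as a snapshot rather than current pricing.
DATAPROC_RATE = Decimal("0.01")  # US dollars per billed unit per hour


def dataproc_fee(billed_units, hours, rate=DATAPROC_RATE):
    """Incremental Dataproc fee in US dollars for one cluster run."""
    # Decimal avoids the rounding noise of binary floating point.
    return rate * Decimal(billed_units) * Decimal(str(hours))


# Hypothetical run: 10 billed units for 2 hours.
print(dataproc_fee(10, 2))
```

Because clusters can be turned off when a task completes, the `hours` term is the part of this cost that users control most directly.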

Dataproc is primarily used by data scientists, business decision-makers, researchers and IT professionals.

This was last updated in July 2016
