Dataproc helps users process, transform and understand vast quantities of data. For example, organizations could use the service to process data from millions of internet of things (IoT) devices, to predict manufacturing or sales opportunities from business data, or to analyze log files to spot potential security flaws.
The Dataproc service allows users to create managed clusters that can scale from three to hundreds of nodes. Users can create clusters on demand, use them for the duration of a processing task and then turn them off when the task is complete. Clusters can also be sized to match the type of workload, budget limitations, performance requirements and existing resources, and they can be scaled up or down dynamically, even while jobs are running. Users pay only for the compute resources consumed during processing.
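The on-demand lifecycle described above can be sketched with the `google-cloud-dataproc` Python client library. This is a minimal sketch, not a production setup: the project ID, region, cluster name and machine types are placeholders, and it assumes application credentials are already configured.

```python
def create_on_demand_cluster(project_id, region, cluster_name="ephemeral-job-cluster"):
    """Sketch: create a small on-demand Dataproc cluster sized for a task.

    All names and sizes here are placeholder assumptions; a real workload
    would pick machine types and worker counts to match its requirements.
    """
    # Imported inside the function so the sketch can be read and linted
    # without the client library installed.
    from google.cloud import dataproc_v1

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            # One master plus two workers: the three-node minimum noted above.
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }
    # create_cluster returns a long-running operation; result() blocks
    # until the cluster is actually up.
    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    return operation.result()
```

When the task finishes, the same client's `delete_cluster` call turns the cluster off so the per-VM charges stop, and `update_cluster` with a new worker count resizes a running cluster.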
Dataproc is built on open source platforms, including:
- Apache Hadoop -- supports the distributed processing of large data sets across clusters
- Apache Spark -- serves as the engine for fast, large-scale data processing
- Apache Pig -- analyzes large data sets
- Apache Hive -- provides data warehousing and an SQL-like query language (HiveQL)
Dataproc supports native versions of Hadoop, Spark, Pig and Hive, allowing users to employ the latest versions of each platform, as well as the entire ecosystem of related open source tools and libraries. Users can develop Dataproc jobs in languages that are popular within the Spark and Hadoop ecosystem, such as Java, Scala, Python and R.
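A typical Dataproc job in Python is a small PySpark script. The sketch below is the classic word count, with the per-line mapping logic pulled into a plain function; the Cloud Storage bucket paths are placeholders.

```python
def to_pairs(line):
    """Map step: one (word, 1) pair per whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def main():
    # PySpark is preinstalled on Dataproc cluster images.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count").getOrCreate()

    # Placeholder input and output locations in Cloud Storage.
    lines = spark.read.text("gs://your-bucket/input.txt").rdd.map(lambda row: row[0])
    counts = (
        lines.flatMap(to_pairs)
        .reduceByKey(lambda a, b: a + b)
    )
    counts.saveAsTextFile("gs://your-bucket/word-count-output")
    spark.stop()
```

In the script actually submitted to a cluster, `main()` is called at the bottom of the file; the job is then run with `gcloud dataproc jobs submit pyspark word_count.py --cluster=<cluster-name> --region=<region>`.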
Google Cloud Dataproc is fully integrated with other Google Cloud Platform services. These services include:
- BigQuery -- a managed, petabyte-scale data analytics warehouse
- Bigtable -- a NoSQL big data database service
- Google Cloud Storage -- a durable and highly available object storage service
- Stackdriver Monitoring -- a tool for tracking cloud performance and availability
- Stackdriver Logging -- a tool to store, search, monitor and produce alerts based on log data and events
Users can create and manage clusters and run Spark or Hadoop jobs through the Google Cloud Platform console, the Cloud software development kit (SDK) or the Dataproc representational state transfer (REST) application programming interface (API).
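The REST route needs nothing beyond standard HTTP. A minimal sketch, assuming a valid OAuth 2.0 access token is obtained separately (for example, via the Cloud SDK):

```python
import json
import urllib.request

DATAPROC_API = "https://dataproc.googleapis.com/v1"


def clusters_url(project_id, region):
    """URL of the clusters collection in the Dataproc REST API."""
    return f"{DATAPROC_API}/projects/{project_id}/regions/{region}/clusters"


def list_clusters(project_id, region, access_token):
    """Sketch: list a project's Dataproc clusters over plain REST.

    The access token is assumed to come from elsewhere; this function
    only builds and sends the authorized GET request.
    """
    request = urllib.request.Request(
        clusters_url(project_id, region),
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The console and the SDK's `gcloud dataproc` commands are front ends over this same API, so anything shown in the console can be scripted this way.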
Billing for Google Cloud Dataproc is currently an incremental fee of $0.01 per hour per virtual machine (VM) used in the Dataproc cluster. Other services involved in Dataproc projects, such as BigQuery and Bigtable, carry additional costs.
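The incremental fee is easy to estimate. A sketch of the arithmetic, covering only the Dataproc fee itself and not the underlying Compute Engine VM charges or other services:

```python
# $0.01 per hour per VM, per the pricing above.
DATAPROC_FEE_PER_VM_HOUR = 0.01


def dataproc_fee(num_vms, hours):
    """Incremental Dataproc fee for num_vms VMs running for a number of hours.

    Excludes Compute Engine and other service charges, which are billed
    separately.
    """
    return num_vms * hours * DATAPROC_FEE_PER_VM_HOUR
```

For example, a minimum three-node cluster used for ten hours incurs a Dataproc fee of roughly $0.30, on top of the cost of the VMs themselves.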
Dataproc is primarily used by data scientists, researchers, business decision-makers and IT professionals.