BigQuery was one of the first drivers of Google Cloud Platform adoption, so it's no surprise that Google continues to invest in the querying service.
Notable recent BigQuery updates include a move to GCP Console, wider geographic availability, tools for machine learning and in-memory analysis, a transfer service and a storage API. Google has also improved cost controls and upgraded the serverless capabilities for easier management.
However, compared to other cloud providers' services, Google needs to improve BigQuery's regional support and its ability to execute smaller queries.
"BigQuery has been on a tear, adding new features, both small and large," said Matt Baird, founder and CTO of AtScale, a data warehouse virtualization provider. "[However,] most of the features Google introduces are catching up to the more established technology of Amazon Redshift and Azure SQL Data Warehouse (DW), which are based on decades-old IP."
BigQuery updates lean on machine learning
These services from AWS and Microsoft integrate with business intelligence (BI) and analytics tools through standard SQL syntax, essentially derived from PostgreSQL, which has been around for decades. BigQuery's SQL-like grammar differs, so existing BI, analytics and data processing tools must be updated to work with it.
Because AWS and Microsoft based their services on established technologies, they offered more standard analytics functionality out of the gate. But because BigQuery is based on more modern technology, Google is better equipped to incorporate new features, such as AI and machine learning, Baird said.
Non-data scientists can do light machine learning with BigQuery because tools -- such as AutoML, which automates aspects of the machine learning process -- are baked into the service. BigQuery can also help scale machine learning applications by bringing the code to the data rather than the other way around.
Enterprises can use BigQuery ML to create machine learning models using SQL statements. BigQuery ML started out with logistic and linear regression models and recently added k-means clustering for segmentation.
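As a sketch of what that looks like in practice, a model is trained with a single CREATE MODEL statement run as an ordinary query. The dataset, table and column names below are hypothetical:

```sql
-- Hypothetical dataset/table/column names; adjust to your own schema.
-- Trains a logistic regression model to predict customer churn.
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['churned']      -- the column to predict
) AS
SELECT
  tenure_months,
  monthly_spend,
  support_tickets,
  churned                             -- BOOL or 0/1 label column
FROM
  `my_dataset.customers`;
```

The SELECT clause defines the training set, so any transformation expressible in SQL can feed the model directly.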
Tying machine learning directly into extract, transform, load (ETL) pipelines is a game changer in terms of ease of implementation, since it can be one of the most time-consuming aspects of developing and deploying AI models into production, said Jeff Lewis, principal consultant for Maven Wave, a Google Premier Partner and consulting firm in Chicago.
To get the most out of BigQuery ML, it's important to understand the teams that use the data and which use cases best fit their current ETL pipeline and processes. For example, segmentation breaks down data based on common attributes to create more accurate machine learning models. This is usually done outside of the ETL pipeline and then brought back in. But executing segmentation in line with the data load simplifies the overall process and reduces the break points within a system.
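In BigQuery ML, that in-line segmentation step can be expressed as a k-means model built directly over the loaded table. A minimal sketch, with illustrative table and column names:

```sql
-- Illustrative names; k-means in BigQuery ML is unsupervised,
-- so no label column is required.
CREATE OR REPLACE MODEL `my_dataset.customer_segments`
OPTIONS (
  model_type = 'kmeans',
  num_clusters = 4                    -- number of segments to produce
) AS
SELECT
  total_orders,
  avg_order_value,
  days_since_last_order
FROM
  `my_dataset.customer_features`;
```

Because the model is created in the same SQL job that reads the loaded data, there is no round trip out of the pipeline and back in.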
BigQuery provides a variety of capabilities directly in SQL, thus making model exploration more approachable to business users and analysts. This enables users to develop applications with SQL scripting they're familiar with rather than having to learn R or Python. These apps can be deployed into production the same way the rest of a BigQuery implementation is executed, and models can be directly tied into any BI tool that supports BigQuery.
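Scoring a previously trained model is likewise plain SQL via the ML.PREDICT function, so results can flow straight into any BI tool that reads BigQuery. Model and table names here are hypothetical:

```sql
-- Applies a previously trained logistic regression model to new rows.
-- ML.PREDICT appends prediction columns (predicted_<label>, plus
-- probability columns for classification) to the selected input.
SELECT
  customer_id,
  predicted_churned
FROM
  ML.PREDICT(
    MODEL `my_dataset.churn_model`,
    (SELECT customer_id, tenure_months, monthly_spend, support_tickets
     FROM `my_dataset.new_customers`)
  );
```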
Faster, better managed big queries
For large queries, BigQuery is still faster than its competitors. But it can be slower when it comes to small queries because of the way the service is implemented.
"Any query in BigQuery currently will take a few seconds to execute, even if it is going against a small amount of data," Lewis said. In comparison, Redshift can return a subsecond response against a small table.
Google is starting to close this gap with another BigQuery update: BI Engine, an in-memory cache. These types of caches are typically used to aggregate data for dashboards and reports built on smaller tables. BI Engine provides an inexpensive in-memory cache for BigQuery that can produce those subsecond results. However, it's currently only available for Google Data Studio.
A flat rate pricing plan is another recent BigQuery update that should expand the service's functionality. This plan enables managers to balance fixed price and on-demand options on a project-level basis to better optimize costs. For example, ETL batch workflows on massive tables can be processed at a flat cost, while the smaller projects can work with on-demand usage.
Google still catching up
BigQuery excels at lights-out database operations, Baird said. Redshift and Azure SQL DW are ACID-compliant databases, while BigQuery focuses purely on elastic analytical workloads, which makes it easier to scale up. Redshift also requires data engineering work -- such as table compression, sort keys and distribution keys -- to get the best performance.
One downside is that BigQuery does not have indexes. As a result, analytical queries can be costly as they scan the entire data set, Baird said.
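Without indexes, cost control in BigQuery typically comes from limiting what each query scans: selecting only the columns needed and filtering on a partition column so whole partitions are pruned. A hedged sketch, with hypothetical table and column names:

```sql
-- Columnar storage means only the referenced columns are read,
-- and a filter on the partitioning column (assumed here to be
-- event_date) prunes partitions instead of scanning the full table.
SELECT
  user_id,
  event_type
FROM
  `my_dataset.events`                 -- assumed partitioned by event_date
WHERE
  event_date BETWEEN '2019-06-01' AND '2019-06-07';
```

Avoiding `SELECT *` and unfiltered scans directly reduces the bytes billed per query.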
Other areas Baird said Google still needs to improve include the following:
- Region support: BigQuery is considerably behind its competitors in terms of global reach, even with availability in 14 regions. This can create latency issues, especially for I/O-intensive applications.
- Data imports: BigQuery Data Transfer Service is not yet on par with Azure Data Factory and AWS cloud data migration services. BigQuery is more straightforward to use, but Redshift has more format support, including delimiter and compression formats.
- Parquet/ORC formats and Nearline storage: Translating between formats used in BigQuery and Nearline -- Google's cheaper storage tier -- can create added complexity.
- Numeric support: Redshift's numeric types are useful for exact numeric operations, such as tracking money in a ledger. BigQuery instead has a record data type that can be used for representing nested structures, making it easier to compute on semistructured data.
- Ecosystem: BigQuery is closing the gap on this but still has work to do. Redshift is based on PostgreSQL and has a much larger community of tools. Azure SQL DW is SQL Server-derived with a large and vibrant set of supporting tools.