The ability to connect and make large volumes of disparate sources of information available for analysis is a hallmark of data lake architectures. Making sense of many disparate data sets is also critical for researchers to find ways to battle the COVID-19 pandemic.
Amazon Web Services is throwing some of its data lake capabilities into the fray to help researchers. The AWS COVID-19 data lake became generally available on April 8, providing a repository of curated data sets full of information about the coronavirus. The information includes case tracking data, hospital bed availability and research articles.
Beyond serving as a repository for data, the data lake is connected to AWS analysis and querying tools, including Amazon Athena for queries, Amazon QuickSight for visualization, AWS Data Exchange for subscribing to data sets and Amazon Kendra for exploring research articles.
The AWS COVID-19 data lake could be a good showcase for data lakes, as long as people are inputting relevant, accurate, unstructured and structured data on the coronavirus-spawned disease, said Patrick Moorhead, president and principal analyst at Moor Insights & Strategy.
"What is most interesting to me is how users will leverage AWS' massive compute instances to work on the data," Moorhead said. "I believe AWS has the widest variety of compute and I believe we will see some interesting results coming from the different ways the data is processed."
AWS' data lake efforts have been successful in the market for some straightforward reasons, Moorhead said: AWS has more security certifications than any other vendor, and it can ingest, store and release many different data types, from structured and columnar data to unstructured data such as photos, videos, text and audio.
"It also helps that AWS has many different kinds of databases that can pull on that data lake, as well as federated data sources that can feed into the data lake," he said.
How the AWS COVID-19 data lake is put together
"You can think of the S3 bucket as the storage for the data lake contents, and then there is the data lake itself, which includes additional components like data pipelines for data movement and transformation, and a data catalog," said Herain Oberoi, general manager of databases, analytics and blockchain marketing at AWS. "AWS Lake Formation is typically used by customers when, in addition to building data pipelines and a catalog, you also need to secure your data, which is not needed in a public data lake."
Oberoi noted that for the COVID-19 data lake, AWS automatically curates the data and keeps it up to date so that it is ready for analysis through a number of analytics and machine learning engines.
"We have AWS Glue data pipelines that continuously prepare the data from AWS Data Exchange on every update and load it into the lake," Oberoi said. "In addition, we register the data set into the AWS Glue Data Catalog so you can analyze it through engines like Amazon Athena, Amazon Redshift, Amazon EMR Spark, EMR Presto, Amazon SageMaker and more."
COVID-19 data lake is free
All access to the data in the public data lake bucket is free, Oberoi said.
AWS normally charges for Athena queries and other services used alongside the data, but researchers can offset those costs through the AWS Diagnostic Development Initiative (DDI). With that effort, AWS is providing credits for services and technical support for diagnostic research.
Looking ahead, Oberoi said AWS is working with scientists and researchers to meet their evolving needs.
"So far, they have asked us to source more data sets, and we will be expanding our portfolio accordingly," he said. "As we learn more about their critical needs, we will fill the gaps to enable experts to contain and neutralize the virus."