COVID-19 data resources for volunteering data scientists
Many data scientists are looking to research the novel coronavirus and ways to fight the continued outbreak. Read on for the top COVID-19 data resources for volunteer researchers.
As the world comes to grips with the COVID-19 pandemic, various efforts are emerging to harness the talents of analysts, AI developers and data engineers. These initiatives can provide individuals and teams the opportunity to do something meaningful, work with others and learn new skills.
"No doubt statistical and predictive models need to be built to serve the science community, which is working hard to understand the virus, treatment efficacies and develop vaccinations," said Joshua Swartz, a partner in digital transformation at Kearney, a global strategy and management consulting firm. "What developers can do is help the science community conduct these analyses and make sense of the results."
Pivoting for the cause
Various vendors are jumping to offer their own unique talents as well.
David Leichner, CMO at SQream, an SQL GPU data warehouse vendor, said the company is building on its prior volunteer efforts around cancer research and DNA analytics. The aim is to find correlations among various indicators and build a risk model that takes into account demographics, density of urban areas, smoking habits and more.
Sean Knight, head of marketing at Knowi, a unified analytics platform, said his firm has shifted most of its developer talent to building out dashboards to help track the spread of coronavirus. The goal is to offer a trusted, free place where people can track the spread of COVID-19. Analysts can also incorporate these dashboards into their own apps.
The largest coordinated project of the many COVID-19 data resources is probably the Kaggle COVID-19 Open Research Dataset Challenge, also called CORD-19. It was prompted by a White House call to action and brings together prizes, data sources and collaboration opportunities for data analysts who want to help.
CORD-19 asks participants to solve current questions by developing text and data mining tools for various data sets and thousands of scientific papers and reports.
"The call was issued in the hope that AI and other tools can be used to help find answers to a key set of questions about COVID-19," said Phil Gurbacki, senior vice president of product and customer support at DataRobot.
The raw data set includes over 44,000 scholarly articles about various coronaviruses and the full RNA sequencing of the virus. There are a variety of high-level tasks and subtasks that researchers are encouraged to help solve.
High-level tasks include questions such as "What is known about transmission, incubation, and environmental stability?" Subtasks require answering questions such as "What are the range of incubation periods?" "What is the seasonality of transmissions?" and "What is the prevalence of asymptomatic transmission?"
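To give a sense of what a first-pass text mining tool for this kind of task might look like, here is a minimal Python sketch that ranks abstracts by how often they mention the words in a question. It is an illustration only, not the method used by any CORD-19 participant. The records in `PAPERS`, their `title` and `abstract` field names, and the `rank_papers` helper are all hypothetical placeholders; the actual CORD-19 file layout is not described here.

```python
import re
from collections import Counter

# Hypothetical placeholder records, invented for this sketch.
# They are not taken from CORD-19 and report no real findings.
PAPERS = [
    {"title": "Placeholder paper A",
     "abstract": "We estimate the incubation period from reported cases."},
    {"title": "Placeholder paper B",
     "abstract": "Environmental stability of the virus on surfaces is examined."},
    {"title": "Placeholder paper C",
     "abstract": "Asymptomatic transmission and incubation period are discussed."},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_papers(question, papers):
    """Return titles of papers whose abstracts mention the question's words.

    Each paper is scored by the number of times any query word occurs in
    its abstract. Papers with a score of zero are dropped.
    """
    query_terms = set(tokenize(question))
    scored = []
    for paper in papers:
        counts = Counter(tokenize(paper["abstract"]))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, paper["title"]))
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored]


print(rank_papers("incubation period", PAPERS))
```

Simple word counting like this ignores synonyms, phrasing and context, so a real entry would likely need stronger retrieval or language models on top of it. It does show the basic shape of the problem: turn a subtask question into a query, then surface the papers most likely to contain an answer.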
The EndCoronavirus effort was built and is maintained by the New England Complex Systems Institute (NECSI) and collaborators with the goal of minimizing the impact of COVID-19 by providing useful data and guidelines for action. It has quickly grown to over 2,100 volunteers and is looking for more.
Participants get to hone their skills in analyzing networks, agent-based modeling, multi-scale analysis and complexity.
Stephanie So, founder and CDO of Geeq, a blockchain startup, said the volunteers all work on Slack. The team employs a range of mathematical tools designed for systems with many interacting components in which traditional statistical assumptions break down.
COVID-19 data sets
Other groups are working on collating COVID-19 data resources that may be useful for various types of analysis and for the development of new applications.
"The COVID-19 crisis has highlighted how different groups with unique skills can work together in a distributed way very quickly," said Andrew Eye, CEO and co-founder of ClosedLoop.ai, a healthcare-focused data science platform.
He said some of the leading groups capturing and aggregating data sets include Worldometer COVID-19 Statistics and Johns Hopkins, which has a data set and visual dashboard on GitHub. This data has also been collated with free access on AWS.
Free supercomputer access
Several government labs and private cloud providers -- including IBM and AWS -- are making their high-performance computers available for data analysts with novel ideas for analyzing COVID-19-related data through the COVID-19 HPC Consortium. Organizers are making over 330 petaflops, 775,000 CPU cores and 34,000 GPUs (and counting) available to data analysts.
This program also provides technical support and promotional credits for the cloud services required to run these workloads. For example, Amazon has offered researchers working on time-critical projects the use of AWS for instant access to virtually unlimited infrastructure capacity.
Several other organizations are making their tools and resources available for COVID-19 projects as well.
"Developers, data scientists and others in the tech community can begin to get involved in the response to the coronavirus by taking advantage of the myriad platforms and tools available to the public," DataRobot's Gurbacki said.
DataRobot is making its automated machine learning and Paxata data preparation products available to researchers free of charge.
Topcoder, a developer crowdsourcing company, launched the Topcoder Anti-Coronavirus Hackathon challenge.
"The goal is to find a new app, algorithm or website to help people during this novel time," said Michael Morris, CEO of Topcoder.
The Deep Learning Coronavirus Cure project is using deep learning to generate novel molecules as candidates for a cure for the virus. And OpenCovid19 is working on various data analytics and real-world tools to safely test for COVID-19 using common tools.