Project Summary

In this critical moment, COVID-19 data is being collected, released, analyzed, interpreted, and used to inform recovery and response efforts. D4BL has worked to consolidate state-level data to explore the disproportionate impact of COVID-19 on Black people in the US. The D4BL COVID-19 Dataset captures state-level COVID-19 cases and deaths for Black people in the United States. D4BL established a team of volunteer data scientists to develop a codebase that automates data extraction from state websites and stores the results in this dataset. Click the link below to download the dataset or explore our codebase on GitHub: https://github.com/d4bl/COVID19_tracker_data_extraction

COVID19 Data Extraction Script on GitHub

Download Latest Combined Data Output (CSV)

How D4BL Built This Dataset, and Why:

We, at Data for Black Lives, assert that public health data must always be interpreted in the proper historical context considering the various elements of structural racism that shape the American public health ecosystem. We emphasize the importance of considering potential impact and power surrounding whose voices are driving the interpretations of these disparities and informing proposed solutions. This is a critical moment where people are largely considering the question: why are Black people particularly vulnerable and over-represented among COVID-19 cases and deaths? The conditions that make Black communities vulnerable to the virus are the same conditions that make Black communities vulnerable to the daily harms of structural racism.

In April 2020, we convened a Movement Pulsecheck to explore this in-depth. Additionally, we began tracking which states were releasing COVID-19 data on cases and deaths disaggregated by race. We released a public demand for all states to report this data in support of monitoring the ways structural racism exacerbates the pandemic. By the end of April 2020, we began establishing the COVID-19 Data Tracker that captures the number of Black folks reported as having tested positive for or died from COVID-19.

The first iteration of this dataset involved manually visiting each state website, recording the number of cases and deaths for Black people, and entering that data into a Google Sheet. Over time, as more states began reporting and we considered how the data might change from day to day, we set out to build a codebase that scrapes COVID-19 race/ethnicity data from official state websites in an automated fashion and combines these daily snapshots into a publicly available dataset.

The Director of Research, Jamelle Watson-Daniels, wrote the initial code for the codebase. Once it became clear that the scope of the project would be quite involved, Jamelle brought on two lead D4BL volunteers: Sydeaka Watson and Natarajan Krishnaswami. The team encountered several challenges in their efforts to extract data from state websites.

  1. Variation in data extraction methods: There is no consistent reporting strategy or mechanism across state websites. State websites report COVID-19 data in various ways, including CSV files and Excel worksheets, text embedded in HTML code, text in static PDF documents, image snapshots of tables (stored as image files), and Tableau dashboards and graphs (bar charts, pie charts, etc.).
  2. Website changes and/or updates: Often, after the team developed code to successfully extract data from a given state website, states would change their reporting mechanism. This led to the adoption of a code management strategy in which the team would monitor the state websites and edit and at times completely rewrite the code for that state.
  3. Specialized extraction techniques: The state websites that report data via image snapshots of tables and/or graphs seemed to require the use of special software such as Optical Character Recognition (OCR) in order to enable the computer to ‘read’ the data. This non-standard technique required specialized skills.
  4. Discrepancies in reporting standards: At times, states changed the way they count cases and deaths attributed to COVID-19 or the way these counts are reported. Thus, any dataset based on state website data faces the challenge of how to address or account for these sudden shifts.
  5. Discrepancies in dealing with racial categories: When demographic breakdowns are given, the team had to understand how cases with unknown race are accounted for in demographic percentages and whether race or race/ethnicity categories are used for the breakdowns, e.g., “Black Non-Hispanic” or “Black or African American.”
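
As an illustration of challenges 1 and 5, the following is a minimal sketch, not the project's actual code, of pulling the Black/African American row out of a CSV export whose category label varies from state to state. The label set, the column names (`Race`, `Cases`, `Deaths`), and the function names are assumptions made for this example; real state files use many other layouts.

```python
import csv
import io

# Hypothetical label variants; real state sites use many more spellings.
BLACK_LABELS = {
    "black",
    "african american",
    "black or african american",
    "black non-hispanic",
    "non-hispanic black",
}


def normalize_label(label):
    """Lowercase the label, treat '/' as 'or', and collapse whitespace."""
    cleaned = label.strip().lower().replace("/", " or ")
    return " ".join(cleaned.split())


def is_black_category(label):
    """Return True if the label matches one of the known variants."""
    return normalize_label(label) in BLACK_LABELS


def extract_black_counts(csv_text, label_col="Race",
                         cases_col="Cases", deaths_col="Deaths"):
    """Return (cases, deaths) for the first matching row, or None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if is_black_category(row[label_col]):
            # Some sites publish counts with thousands separators.
            cases = int(row[cases_col].replace(",", ""))
            deaths = int(row[deaths_col].replace(",", ""))
            return cases, deaths
    return None
```

A sketch like this only covers the CSV case; HTML, PDF, image, and dashboard sources each need their own extraction step before the label matching can be applied.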

In response to these challenges, during the first week of June 2020, D4BL established a team of 6 volunteer data scientists to take on the task of developing the codebase for automating the data extraction procedure from state websites. On April 8, 2020, when D4BL compiled and released a list of all states reporting COVID-19 infections and deaths by race, there were only 12 states publicly reporting this data. D4BL included contact information for the appropriate state offices and urged the public to demand the public release of this data by all states. To date, all states are reporting disaggregated COVID-19 data.

Dataset usage and best practices:

AVOID WEAPONIZING COVID-19 DATA -

COVID-19 data should not be used to determine risk. It should not be used to surveil, criminalize, cage, and/or deny critical benefits.

COVID-19 data should not be used to inform automated decision-making systems, including any of the following:

  • Predictive policing and enforcement of social distancing orders (i.e., COVID-19 hot spots should not be assigned greater police presence and prioritized enforcement of social distancing measures)
  • Public safety assessments to determine whether a person can be released from jail or prison
  • Forced testing (general and antibodies testing) that would disproportionately target Black, Latinx, and/or poor communities
  • Denying a person credit
  • Reinforcing historical practices of redlining in the form of denying loans, lowering property values, and reducing public and private investments
  • Denying a person a job
  • Denying a person housing
  • Denying a person access to health care, treatment, or services (e.g., ventilators)
  • Denying a person access to public services and benefits (e.g., public transportation)

INTENDED PURPOSE OF DATA -

COVID-19 Data should inform the implementation of the following immediate actions:

  • Release of individuals in jails, prisons, and ICE detention facilities
  • Transparent, accountable, and community-informed protocols on automated decision systems used for contact tracing and other public health concerns
  • Consistent testing protocols and workflows in Black communities
  • Available and accessible testing sites and tests that meet the health needs of Black communities in light of the social determinants that cause racial health disparities
  • Moratoriums on negative credit reporting, late payment fees, rental evictions, foreclosures, debt collection, and wage garnishments
  • Suspension of rent payments in federally-subsidized housing programs and in low-income neighborhoods for one year
  • Suspension of consumer and business credit payments (including mortgages, car, student, personal loans, and credit cards)

COVID-19 Data should be used to establish a reparative stimulus plan and efforts for long-term structural change.

  1. COVID-19 data should be used to issue reparations tracing back to slavery. COVID-19 data offers more evidence that reparations are needed for harms tracing back to slavery. Evidence supporting the cumulative impact of structural racism on Black communities includes: Racial disparities in COVID-19 deaths, limited access to tests and health care, unemployment rates, and unequal loss of income. Reparations are the most viable option to rectify the deeply entrenched inequities that have only exacerbated the impact of the pandemic for Black communities.
  2. COVID-19 stimulus plans, government loans, and other forms of government aid must account for COVID-19 racial disparities in deaths and loss of employment. Stimulus plans must account for the unique health challenges many Black folks face due to historical structural racism in the American public health system. Black communities need a tailored fix to recover from this pandemic in distinct ways. Otherwise, the pandemic will cause irreparable harm to Black communities that will make the promise of equity and equality a more distant reality. Additional COVID-19 stimulus plans should account for this in the allocation of dollars and relief aid.
  3. COVID-19 data should be committed to a public data trust that would entrust the public with full agency over their data, as opposed to private or government actors. Designating the public as owners of this data would provide the highest level of transparency and accountability, and would give individuals greater negotiating power to use the data to achieve better outcomes.
  4. Data on the economic impact of COVID-19 should inform alternatives to current discriminatory financial systems (e.g., credit scores). We anticipate that this data will reveal traditionally hidden biased decision-making that continues to disproportionately inhibit Black communities' access to wealth in American society.
  5. State and local health officials must reject unethical clinical studies that perform antibody and vaccine testing exclusively on poor Black communities. State and local health officials must co-develop general testing protocols with impacted communities to protect Black communities. This would include open and public announcements about where and when COVID-19 clinical trials will take place, ensuring Black scientists and researchers contribute to experimental planning, and providing accessible science communication surrounding experimental results.

About the Team:

Here is a list of contributors to the development of this dataset: Jamelle Watson-Daniels (D4BL Project Lead), Sydeaka Watson (Volunteer Project Manager and Co-Technical Lead), Natarajan Krishnaswami (Volunteer Co-Technical Lead), Fred Sun (Volunteer Technical Strategy Lead), Vinay Padmanabhi (Volunteer), Andrés Crucetta Nieto (Volunteer), Jamie Prezioso (Volunteer), and Atilio Barreda II (Web Integration Lead).

Dataset maintenance and updates:

Jamelle, Natarajan, Sydeaka, and Fred are supporting the updates to the dataset. Errors in the codebase and updates to the codebase will continue to be documented in the GitHub repository. If you are interested in extending, building on, or contributing to the dataset, please contact Data for Black Lives.

| Feature Name | Description |
| --- | --- |
| Location | The geographic entity for which this row provides data. These can be states, counties, or cities. |
| Date published | The date as of which the underlying data was published by the reporting entity. |
| Date/time of data pull | The date/time at which the D4BL team ran the code to retrieve the data. |
| Total Cases | The number of confirmed COVID-19 cases reported for the location. |
| Total Deaths | The number of deaths attributed to COVID-19 reported for the location. |
| Count Cases Black/AA | The number of confirmed COVID-19 cases corresponding to “Black or African American” or “Non-Hispanic Black” reported for the location. |
| Count Deaths Black/AA | The number of confirmed COVID-19 deaths corresponding to “Black or African American” or “Non-Hispanic Black” reported for the location. |
| Percentage of Cases Black/AA | The percentage of COVID-19 cases (of those with race reported) corresponding to “Black or African American” or “Non-Hispanic Black”. |
| Percentage of Deaths Black/AA | The percentage of COVID-19 deaths (of those with race reported) corresponding to “Black or African American” or “Non-Hispanic Black”. |
| Percentage includes unknown race? | Logical (True/False) indicator of whether the `Percentage of Cases Black/AA` and `Percentage of Deaths Black/AA` fields include COVID-19 cases and deaths with race/ethnicity unknown. |
| Percentage includes Hispanic Black? | Logical (True/False) indicator of whether the `Percentage of Cases Black/AA` and `Percentage of Deaths Black/AA` fields include COVID-19 cases and deaths among Hispanic Black individuals. |
| Count Cases Known Race | The number of cases in which race was reported and, hence, “known”. |
| Count Deaths Known Race | The number of deaths in which race was reported and, hence, “known”. |
| Percentage of Black/AA population (Census data) | The percentage of “Black or African American alone” individuals for the region, computed using 2013-2018 American Community Survey fields B02001_003E and B02001_001E. |
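
The percentage fields can be read as simple ratios. The sketch below shows one plausible way to compute them, assuming the denominator is either all reported cases or only those with known race, and that the Census percentage is B02001_003E divided by B02001_001E. The function names, the two-decimal rounding, and the numbers in the usage lines are illustrative assumptions, not values or code from the dataset.

```python
def pct_black(count_black, total, count_unknown=0, include_unknown=False):
    """Percentage of cases (or deaths) recorded as Black/AA.

    When include_unknown is False, records with unknown race are
    removed from the denominator, so the result is a share of the
    records with race reported.
    """
    denominator = total if include_unknown else total - count_unknown
    if denominator <= 0:
        return None  # Nothing to divide by; leave the field empty.
    return round(100.0 * count_black / denominator, 2)


def pct_black_population(b02001_003e, b02001_001e):
    """Share of 'Black or African American alone' in the total population."""
    return round(100.0 * b02001_003e / b02001_001e, 2)


# Illustrative numbers only: 300 Black/AA cases out of 1,000 total,
# 200 of which have unknown race.
known_only = pct_black(300, 1000, count_unknown=200)
with_unknown = pct_black(300, 1000, count_unknown=200, include_unknown=True)
```

The two results differ (37.5 versus 30.0 in this made-up case), which is why the dataset records whether a state's published percentage includes unknown race.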

Description of how we handled missing data: A number of state websites were not reporting data by race. For these, we extract the available data and leave the missing fields blank.
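
A minimal sketch of that convention, assuming rows are held as dictionaries and written with Python's standard `csv` module: any field a state does not report is simply written as an empty cell. The field list is a shortened, assumed subset of the features above, and `rows_to_csv` is a hypothetical helper, not a function from the D4BL codebase.

```python
import csv
import io

# Assumed subset of the dataset's columns, for illustration only.
FIELDS = [
    "Location",
    "Total Cases",
    "Total Deaths",
    "Count Cases Black/AA",
    "Count Deaths Black/AA",
]


def rows_to_csv(rows):
    """Serialize rows to CSV text, leaving unreported fields blank."""
    buffer = io.StringIO()
    # restval="" fills in any key that is absent from a row.
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Treat explicit None values the same as absent keys.
        writer.writerow({k: ("" if v is None else v) for k, v in row.items()})
    return buffer.getvalue()


# "State A" is a placeholder location with no race breakdown reported.
example = rows_to_csv([
    {"Location": "State A", "Total Cases": 100, "Total Deaths": 5},
])
```

Leaving the cells empty, rather than writing zero, keeps "not reported" distinguishable from "reported as zero" for anyone analyzing the file.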
