Green River LibGuides: ENGL 127 Research Writing: Social Sciences (Schaefer): Data Literacy

Data Needs Context

Data Literacy

What is Data Literacy?

Data literacy is the ability to understand data and data practices sufficiently to meaningfully interpret data and effectively communicate that meaning. As such, it involves understanding where data came from, how to draw meaning or conclusions from it, how to read charts appropriately and make inferences from visualizations, and how to recognize when data are being used to mislead. Data literacy is inclusive of a broad range of data skills including data management, cleaning, analysis, and visualization. Most importantly it requires understanding the meaning of data, how it fits into a broader context, and what conclusions can and can’t be derived from that data.

quoted from NIH: National Library of Medicine

Data Literacy: NIH National Library of Medicine

Why does Data Literacy Matter?

We live in a world surrounded by data.
Data is present in our daily news in every graph and table.
The conclusions drawn by journalists are drawn from data that someone else has collected.
There is valid data and invalid data.
Data can be misinterpreted, intentionally and unintentionally.
Data is used to make decisions, policies, and arguments that impact us all.

It is essential to understand:

"what good data and data analysis is so that you can make stronger arguments and better evaluate the arguments of others. It’s important to realize that everyone has an agenda of some sort, and being more data literate helps you understand if others are making a fair argument. Part of being able to take a more informed (some might say skeptical) view of data is being literate in how data are manipulated and subsequently presented: how they are collected, made into tables, and shown in pictures or graphs. Once you know how to do this the right way .... you can start asking if someone else is doing it in a way that is fair, or if they are distorting the data for their own purposes."

Adapted bullet points and quote are from: Bowen, M., & Bartley, A. (2013). The Basics of Data Literacy : Helping Your Students (and You!) Make Sense of Data.

What is Data?

Information represented numerically via raw numbers, percentages, percentiles, averages (mean, median, mode), etc.
Information that can be used algorithmically to determine compatibility (OKCupid), fitness levels (Fitbit), personality (BuzzFeed quizzes), etc.
Numerical information rendered visually (charts, graphs, coded maps, tables, etc.) to aid in pattern-finding and comprehension. (Fontichiaro and Abilock 2015)

(credit to Fontichiaro & Oehrli. "Why Data Literacy Matters")

Note that data and statistics are not the same thing. Data refers to information that has not yet been interpreted or processed in any way, while statistics are data that have been numerically analyzed.

Data Literacy Concepts

The following concepts are excerpted from Data Cadre, Nebraska Department of Education and Educational Service Unit

baseline data: the level of performance at the start of a data collection or process that is used as a reference point for which future levels of performance are compared
data analysis: a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making
demographic data: statistical characteristics of human populations (e.g. ethnicity)
disaggregation: the process of breaking down data into smaller units or sets of observations
longitudinal data: describes data that are collected on specific groups of people over a period of time. The data are useful for understanding change over time. It is recommended that data be collected over a period of at least 3 years (or events) in order to identify a trend.
mean: the numerical average of a set of scores (or data set), computed as the sum of all scores divided by the number of scores
median: the midpoint in a data set, that is, the score or value that divides it into two equal-sized halves
ordinal data: information that consists of counts or observations in specific categories rather than measurements
perceptions data: observations, opinions, beliefs, or convictions individuals have about a system or organization (e.g. school, district)
percentile: the location of a score in a distribution expressed as the percentage of cases in the data set with scores equal to or below the score in question
qualitative data: information that is not expressed numerically
quantitative data: information expressed numerically and therefore can be measured
reliability: the degree to which a test or other measurement instrument produces consistent results
research-based strategies: refers to any concept or strategy that is derived from or informed by objective evidence consisting largely or entirely of data, academic research, or scientific findings
snapshot data: data collected at one specific time
standard deviation: variation or dispersion from the mean (average)
trend data: factual information, numerical or narrative, that conveys patterns or directions about student learning, instruction and/or organizational conditions over time
validity: the degree to which an assessment measures what the assessment was designed to measure, such as specific content or skills
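
Several of the quantitative terms above (mean, median, standard deviation, percentile) can be illustrated with a short calculation. The following is a minimal sketch using Python's standard library. The scores are invented for illustration, and `percentile_rank` is a hypothetical helper written to follow the glossary's "equal to or below" wording; other percentile conventions exist.

```python
import statistics

# Hypothetical test scores, invented purely for illustration.
scores = [55, 60, 70, 70, 75, 80, 85, 90, 95, 100]

mean = statistics.mean(scores)      # sum of all scores / number of scores
median = statistics.median(scores)  # midpoint of the sorted scores
spread = statistics.pstdev(scores)  # population standard deviation


def percentile_rank(data, score):
    """Percentage of cases with scores equal to or below `score`."""
    at_or_below = sum(1 for value in data if value <= score)
    return 100 * at_or_below / len(data)


print("mean:", mean)
print("median:", median)
print("standard deviation:", round(spread, 2))
print("percentile rank of 80:", percentile_rank(scores, 80))
```

With these made-up scores the mean (78) and the median (77.5) are close, which is what one would expect when the values are spread fairly evenly. A few extreme scores would pull the mean away from the median, which is one reason to ask which "average" a source is reporting.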

Additional Terminology:

Data Glossary: NIH National Library of Medicine

How Do I know if Data is "Good" Data?

Try asking and answering the following questions:

Who collected the data?
How were the data collected?
What types of questions were used?
Who were the participants in this sample?
Is the sample a good representation of the entire population? If not, how are they different?
Who is missing from the data?
What categories are used to group the data?
Do the variables and definitions match? (Avoid comparing apples and oranges.)
What are the sampling strategies and response rates?
Are the data quantitative, qualitative, or both?
How timely are the data?
How precise are the data?
Are there any missing data?
How relevant are the data to your own research?
Is there data to contextualize the data?
Are the data of good quality?

Questions are adapted from and added to: Emma Smith, Using Secondary Data in Educational and Social Research (New York: McGraw Hill/Open University Press, 2008) and Tulane University Library

From the introduction to Joel Best's Damned Lies and Statistics:

There are three basic questions that deserve to be asked whenever we encounter a new statistic.

Who created this statistic? Every statistic has its authors, its creators. Sometimes a number comes from a particular individual. On other occasions, large organizations (such as the Bureau of the Census) claim authorship (although each statistic undoubtedly reflects the work of particular people within the organization). In asking who the creators are, we ought to be less concerned with the names of the particular individuals who produced a number than with their part in the public drama about statistics. Does a particular statistic come from activists, who are striving to draw attention to and arouse concern about a social problem? Is the number being reported by the media in an effort to prove that this problem is newsworthy? Or does the figure come from officials, bureaucrats who routinely keep track of some social phenomenon, and who may not have much stake in what the numbers show?
Why was this statistic created? The identities of the people who create statistics are often clues to their motives. In general, activists seek to promote their causes, to draw attention to social problems. Therefore, we can suspect that they will favor large numbers, be more likely to produce them and less likely to view them critically...We need to be aware that the people who produce statistics often care what the numbers show, they use numbers as tools of persuasion.
How was this statistic created? We should not discount a statistic simply because its creators have a point of view, because they view a social problem as more or less serious. Rather, we need to ask how they arrived at the statistic. All statistics are imperfect, but some are far less perfect than others. There is a big difference between a number produced by a wild guess, and one generated through carefully designed research. This is the key question. Once we understand that all social statistics are created by someone, and that everyone who creates social statistics wants to prove something (even if that is only that they are careful, reliable, and unbiased), it becomes clear that the methods of creating statistics are key.

Real World Examples

Case 1: Data on Inflation

NY Times April 6, 2022: "Caviar and Canned Tuna: Top Fed Official Points out Income-based Inflation Gaps"

This news article raises two important points about the necessity of context to understand data.

The first point is that inflation numbers alone do not tell us how inflation impacts people at different income levels. To understand that, we need the context provided by additional data on the percentage of overall income an individual needs to devote to essentials, such as housing, food, health, etc., and the shift in that percentage with inflation. Lower income individuals will have less left over after paying for basic needs and, as the article also notes, may not have room to switch to a less expensive option.
The second point the article makes is that we also need to see inflation numbers contextualized with information on what is even being measured. In this case, the article notes that the inflation measurement itself is flawed.
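
The first point above can be made concrete with a small calculation. This is only a sketch: the incomes, essentials costs, and the 8% price increase are hypothetical figures that do not come from the article, and the sketch assumes incomes stay fixed while prices rise.

```python
# All figures are hypothetical, chosen only to illustrate the point.
ESSENTIALS_INFLATION = 0.08  # assumed 8% price rise on essentials


def essentials_share(income, essentials_cost, inflation=0.0):
    """Share of income spent on essentials after prices rise by `inflation`."""
    return essentials_cost * (1 + inflation) / income


households = {
    "lower income": {"income": 30_000, "essentials": 24_000},
    "higher income": {"income": 150_000, "essentials": 45_000},
}

for label, h in households.items():
    before = essentials_share(h["income"], h["essentials"])
    after = essentials_share(h["income"], h["essentials"], ESSENTIALS_INFLATION)
    print(f"{label}: {before:.0%} -> {after:.1%} of income on essentials")
```

Under these assumptions the same 8% rise moves the lower income household from 80% to 86.4% of income spent on essentials, and the higher income household from 30% to 32.4%. A single inflation number hides that difference.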

Caviar and Canned Tuna: Top Fed Official Points out Income-based Inflation Gaps
NY Times, April 6, 2022 in ProQuest

Case 2: Data on Police Use of Force against Civilians

Following the death of Michael Brown at the hands of Ferguson, MO police in 2014, there was a demand from many sectors of society for information on police use of force and killings of civilians. This data was not available from any governmental office.

"How are people supposed to fix something that hadn't been properly counted?" The director of the FBI, James Comey, acknowledged as much in this 2016 speech... 'We simply don't know. As a country, we have not bothered to collect the data, to collect the information. And in the absence of information, we have anecdotes. We have videos.'" (NPR March 2, 2022)

In 2015, the FBI committed to gathering information from police departments on use of force against civilians. The FBI began collecting data from law enforcement agencies on January 1, 2019. Providing the data is voluntary, and in the first two years of the program the FBI did not meet the minimum threshold of 60% participation by police departments that would enable it to publish the data. It will release some general information on trends.

"In 2021, 44 out of 302 agencies in Washington participated and provided use-of-force data. The officers employed by these agencies represent 51% of sworn law enforcement officers in the state." (from Crime Explorer)
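
The figures quoted above show why it matters to ask what a percentage is a percentage of. A minimal sketch of the arithmetic, using only the numbers in the quote (the variable names are illustrative):

```python
# Figures quoted above from Crime Explorer (Washington, 2021).
participating_agencies = 44
total_agencies = 302
officer_coverage = 0.51  # share of sworn officers employed by those agencies

agency_participation = participating_agencies / total_agencies

print(f"Agencies participating: {agency_participation:.1%}")
print(f"Officers covered:       {officer_coverage:.0%}")
```

About 14.6% of agencies account for 51% of sworn officers, so the participating agencies are, on average, larger than the ones that did not report. Quoting only one of the two percentages would give a very different impression of how complete the data are.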

In 2015, the Washington Post launched its own research specifically on police shootings and killings of civilians. It has shared data with the public since 2015.

Data collected by the Washington Post includes only fatal shootings by police. The data is disaggregated, or broken down, by the race of the victim, the circumstances of the killing, whether the person was armed, if they were experiencing a mental health crisis, and other factors.

Fatal Force: Americans Killed by Police
Washington Post data on police killings in the US: "After Michael Brown, an unarmed black man, was killed in 2014 by police in Ferguson, Mo., a Post investigation found that the FBI undercounted fatal police shootings by more than half. This is because reporting by police departments is voluntary and many departments fail to do so. In 2015, The Washington Post began to log every fatal shooting by an on-duty police officer in the United States. In that time there have been more than 5,000 such shootings recorded by The Post."
FBI may Shut Down Police Use-of-force Database Due to Lack of Police Participation
WA Post, Dec. 21
The FBI Wants Data on Police Use of Force. Police Departments aren't Cooperating
NPR report, March 2, 2022

Select Resources on Data Literacy

Articles (online & in Holman Library Databases)

Data Journalism and the School to Prison Pipeline
Miriam Pierce, 2017, The Medium
Interesting article by a journalism graduate student. Part 1 discusses pitfalls and benefits of data-driven journalism. Part 2 is a discussion of data and infographics representing data on racial inequities in the school to prison pipeline. The author contrasts an existing infographic that lacks data with a series she created that includes data and context.
“Beyond Data Literacy: Reinventing Community Engagement and Empowerment in the Age of Data”
This White Paper was written by Rahul Bhargava, Erica Deahl, Emmanuel Letouzé, Amanda Noonan, David Sangokoya, and Natalie Shoup, in collaboration with Internews Center for Innovation and Learning and the MIT Media Lab Center for Civic Media.

Videos/ Video Tutorials

Tutorial: What is Data and Data Literacy
Eastern Michigan University

Books

More Damned Lies and Statistics: How Numbers Confuse Public Issues by Joel Best
Call Number: eBook

ISBN: 0-520-23830-3

Publication Date: 2004

Available online in Ebook Central
Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists by Joel Best
Call Number: 303.38 B561d 2012

ISBN: 9780520274709

Publication Date: 2012-08-07

In print in Holman Library Main Collection or online in ProQuest or Ebsco. Links to Ebsco
Everybody Lies: Big Data, New Data, and What the Internet can Tell Us about Who We really are by Seth Stephens-Davidowitz
Call Number: 302.231 S835e 2017

ISBN: 9780062390851

Publication Date: 2017-05-09

Main Collection

Other Useful Research Guides on Data:

Data Literacy Guide from Eastern Michigan U.
Data Module 1: What is Research Data
From Macalester College Library

Organizations Devoted to Data Education & Literacy

Case Study

Starting Research Study

This case study centers on a research study published in Science Advances in April 2021. The abstract for the study states:

Racial-ethnic minorities in the United States are exposed to disproportionately high levels of ambient fine particulate air pollution (PM2.5), the largest environmental cause of human mortality. However, it is unknown which emission sources drive this disparity and whether differences exist by emission sector, geography, or demographics. Quantifying the PM2.5 exposure caused by each emitter type, we show that nearly all major emission categories—consistently across states, urban and rural areas, income levels, and exposure levels—contribute to the systemic PM2.5 exposure disparity experienced by people of color. We identify the most inequitable emission source types by state and city, thereby highlighting potential opportunities for addressing this persistent environmental inequity.

PM2.5 Polluters Disproportionately and Systemically Affect People of Color in the United States
April 2021, Science Advances

Article 1 about the Research Study

"People of Color Hardest Hit by Air Pollution from Nearly All Sources," published in UW News April 28. The news article notes that one of the study co-authors is a UW professor.

"People of Color Hardest Hit by Air Pollution from Nearly All Sources"
UW News, April 28 2021

The UW article contextualizes the report with quotes on the subject from several of the study's co-authors and links out to their web pages.
For example, co-author Julian Marshall situates this one study within the larger social context of systemic racism.

“We find that nearly all emission sectors cause disproportionate exposures for people of color on average.... The inequities we report are a result of systemic racism: Over time, people of color and pollution have been pushed together, not just in a few cases but for nearly all types of emissions.”

Article 2 about the Research Study

Deadly Air Pollutant ‘Disproportionately and Systematically’ Harms Americans of Color, Study Finds
Washington Post, April 28, 2021

The WA Post article contextualizes and interprets the research study in a number of ways:

It synthesizes its findings with two other analyses released the same day also looking at the disparate impacts of air pollution on communities of color.
It refers and links to Trump administration documents that it says minimize the impact of fine-particle matter on communities of color.
It brings in examples and quotes from activists and residents from impacted communities.
It refers to another recent study in order to contextualize air quality within a larger picture of the inequitable impacts of climate change on low-income communities and communities of color.
It compares the 2014 data set used in the original study to the findings in the more current, non-peer-reviewed American Lung Association 2021 State of the Air Report and finds that their conclusions are the same.
The article does not explain the significance of data showing that while "Black, Hispanic and Asian Americans face a higher level of exposure than average to PM 2.5 from industry, light-duty vehicles, diesel-powered heavy trucks and construction," and "Black Americans are exposed to greater-than-average concentrations from all categories in the Environmental Protection Agency National Emissions Inventory," "White Americans have slightly higher-than-average exposure from agriculture and coal-fired power plants, the analysis found, because of where both are located."

Article 3 about the research study

Study Finds Exposure to Air Pollution Higher for People of Color Regardless of Region or Income
EPA website, September 20, 2021

This news article from the EPA (the federal government's Environmental Protection Agency) contextualizes the research study with some different facts.

It shares the conclusions reached by the research study.
It notes possible implications of the study for the regulatory responsibility of the EPA to protect disproportionately impacted communities.
It points to possible weaknesses in the study: "The study results also comes with caveats including uncertainty in the models and in inputs to the models and notes the potential benefit of additional analysis using local data and expertise. In addition, the study focuses on outdoor concentrations at locations of residence; disparities in associated health impacts would also reflect racial-ethnic variability in mobility, microenvironment, outdoor-to-indoor concentration relationships, dose-response, access to health care, and baseline mortality and morbidity rates."

Additional Articles and Context to the Original Research Study

Referenced in WA Post story:

Particulate Matter (PM) Basics
EPA on what particulate matter is
Shingle Mountain and the Weight of Environmental Injustice
WA Post Nov. 16 2020
Biden to Place Environmental Justice at Center of Sweeping Climate Plan
WA Post Jan. 27, 2021
The Tree Cover and Temperature Disparity in US Urbanized Areas: Quantifying the Association with Income across 5,723 Communities
US Urban Tree Cover Inequality Atlas 2021
The Nature Conservancy
State of the Air Report
Updated annually, from the American Lung Association
Environmental Justice and Refinery Pollution
Report from Environmental Integrity Project on Benzene Monitoring around Oil Refineries

Other Sources:

People of Color Breathe More Unhealthy Air from Nearly All Polluting Sources
Scientific American, April 2021

Sample Sources:

Pew Research Center: Topics
The Pew Research Center is a nonpartisan "fact tank" that provides reports and statistics on the issues, attitudes and social trends shaping America and the world.
Gapminder: The Joy of Stats
Gapminder identifies systematic misconceptions about important global trends and proportions and uses reliable data to develop easy to understand teaching materials to rid people of their misconceptions.
Gapminder is an independent Swedish foundation with no political, religious, or economic affiliations.
Social Explorer (free edition)
Use the Maps function to track demographic and other data visually on the United States map.

Sample Sources

U.S. Data & Statistics (FedStats)
Statistics collected by over 100 separate U.S. Government agencies are gathered together. Subjects are searchable by keywords or by a list of topics. Mapstats allows searches for statistics from U.S. states, counties, and cities.
Data.gov
The home of the U.S. Government’s open data. Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more.
Dataset Search - Google Open Data
Dataset Search is a search engine for datasets.
Pew Research Center: Tools & Datasets
Find Pew Research Center datasets and more