ENGL 127 Research Writing: Social Sciences (Schaefer)

This research guide is for students in Amanda Schaefer's English 127: Focus on Family & Home

Data Needs Context

Source: "This is How Easy It Is to Lie With Statistics" by Zach Star is licensed under a Standard YouTube License.

Data Literacy

What is Data Literacy?

Here are four definitions of data literacy:

  • Data literacy is "the ability to read, work with, analyze, and argue with data." (from The Data Literacy Project)
  • "Data literacy is the ability to read and interpret data, to think critically about statistics, and to use statistics as evidence." (from Hogenboom et al. "Show Me the Data! Partnering with Instructors to Teach Data Literacy")
  • "Data literacy involves understanding what data mean, including how to read graphs and charts appropriately, draw correct conclusions from data, and recognize when data are being used in misleading or inappropriate ways." (from Carlson et al. "Determining Data Information Literacy Needs: A Study of Students and Research Faculty")
  • Data literacy is “the desire and ability to constructively engage in society through and about data.” (from Bhargava et al. “Beyond Data Literacy: Reinventing Community Engagement and Empowerment in the Age of Data”)

Why does Data Literacy Matter?

  • We live in a world surrounded by data.
  • Data is present in our daily news in every graph and table. 
  • The conclusions drawn by journalists are drawn from data that someone else has collected.
  • There is valid data and invalid data.
  • Data can be misinterpreted, intentionally and unintentionally.
  • Data is used to make decisions, policies, and arguments that impact us all.

It is essential to understand:

"what good data and data analysis is so that you can make stronger arguments and better evaluate the arguments of others. It’s important to realize that everyone has an agenda of some sort, and being more data literate helps you understand if others are making a fair argument. Part of being able to take a more informed (some might say skeptical) view of data is being literate in how data are manipulated and subsequently presented: how they are collected, made into tables, and shown in pictures or graphs. Once you know how to do this the right way .... you can start asking if someone else is doing it in a way that is fair, or if they are distorting the data for their own purposes."

Adapted bullet points and quote are from: Bowen, M., & Bartley, A. (2013). The Basics of Data Literacy: Helping Your Students (and You!) Make Sense of Data.

What is Data? 

  1. Information represented numerically via raw numbers, percentages, percentiles, averages (mean, median, mode), etc.
  2. Information that can be used algorithmically to determine compatibility (OKCupid), fitness levels (Fitbit), personality (BuzzFeed quizzes), etc.
  3. Numerical information rendered visually (charts, graphs, coded maps, tables, etc.) to aid in pattern-finding and comprehension. (Fontichiaro and Abilock 2015)

(credit to Fontichiaro & Oehrli. "Why Data Literacy Matters")

Note that data and statistics are not the same thing. Data refers to information that has not yet been interpreted or processed in any way, while statistics are data that have been numerically analyzed.

Data Literacy Concepts

The following concepts are excerpted from Data Cadre, Nebraska Department of Education and Educational Service Unit

  • baseline data: the level of performance at the start of a data collection or process that is used as a reference point for which future levels of performance are compared
  • data analysis: a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making
  • demographic data: statistical characteristics of human populations (e.g. ethnicity)
  • disaggregation: the process of breaking down data into smaller units or sets of observations
  • longitudinal data: describes data that are collected on specific groups of people over a period of time. The data are useful for understanding change over time. It is recommended that data be collected over a period of at least 3 years (or events) in order to identify a trend.
  • mean: the numerical average of a set of scores (or data set), computed as the sum of all scores divided by the number of scores
  • median: the midpoint in a data set, that is, the score or value that divides it into two equal-sized halves
  • ordinal data: information that consists of counts or observations in specific categories rather than measurements
  • perceptions data: observations, opinions, beliefs, or convictions individuals have about a system or organization (e.g. school, district)
  • percentile: the location of a score in a distribution expressed as the percentage of cases in the data set with scores equal to or below the score in question
  • qualitative data: information that is not expressed numerically
  • quantitative data: information expressed numerically and therefore can be measured
  • reliability: the degree to which a test or other measurement instrument produces consistent results
  • research-based strategies: refers to any concept or strategy that is derived from or informed by objective evidence consisting largely or entirely of data, academic research, or scientific findings
  • snapshot data: data collected at one specific time
  • standard deviation: variation or dispersion from the mean (average)
  • trend data: factual information, numerical or narrative, that conveys patterns or directions about student learning, instruction and/or organizational conditions over time
  • validity: the degree to which an assessment measures what the assessment was designed to measure, such as specific content or skills
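
As a rough illustration of the mean, median, percentile, and standard deviation definitions in the list above, here is a minimal Python sketch. The scores are hypothetical and invented only for this example, and percentile_rank is a helper name assumed here, written to follow the "equal to or below" wording of the percentile definition.

```python
import statistics

# Hypothetical quiz scores, invented for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# mean: the sum of all scores divided by the number of scores
mean_score = statistics.mean(scores)

# median: the midpoint that splits the sorted scores into two equal halves
median_score = statistics.median(scores)

# standard deviation: how far scores spread out from the mean
# (pstdev treats the list as the whole population, not a sample)
spread = statistics.pstdev(scores)


def percentile_rank(values, score):
    """Percentage of values that are equal to or below the given score."""
    at_or_below = sum(1 for v in values if v <= score)
    return 100 * at_or_below / len(values)


rank_of_5 = percentile_rank(scores, 5)

print(mean_score, median_score, spread, rank_of_5)
```

Note that the mean (5) and the median (4.5) differ even in this small set, which is one reason to ask which kind of "average" a source is reporting.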

For the full list of concepts:

How Do I Know if Data is "Good" Data?

Try asking and answering the following questions:

  • Who collected the data?
  • How were the data collected?
  • What types of questions were used?
  • Who were the participants in this sample?
  • Is the sample a good representation of the entire population? If not, how are they different?
  • Who is missing from the data?
  • What categories are used to group the data?
  • Do the variables and definitions match? (Avoid comparing apples and oranges.)
  • What are the sampling strategies and response rates?
  • Are the data quantitative, qualitative, or both?
  • How timely are the data?
  • How precise are the data?
  • Are there any missing data?
  • How relevant are the data to your own research?
  • Is there data to contextualize the data?
  • Are the data of good quality?

Questions are adapted from and added to: Emma Smith, Using Secondary Data in Educational and Social Research (New York: McGraw Hill/Open University Press, 2008) and Tulane University Library
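
One of the questions above asks whether the sample is a good representation of the entire population. A minimal Python sketch of one way to check this is to compare each group's share of the sample with its share of the population. All of the numbers and age groups below are hypothetical, and the function names are assumptions made for this example.

```python
# Hypothetical figures, invented for illustration only.
population_share = {"18-29": 0.20, "30-49": 0.35, "50-64": 0.25, "65+": 0.20}
sample_counts = {"18-29": 30, "30-49": 90, "50-64": 120, "65+": 160}


def sample_shares(counts):
    """Convert raw counts per group into each group's share of the sample."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representation_gaps(sample, population):
    """Sample share minus population share, in percentage points."""
    return {g: round(100 * (sample[g] - population[g]), 1) for g in population}


gaps = representation_gaps(sample_shares(sample_counts), population_share)

for group, gap in gaps.items():
    label = "over" if gap > 0 else "under"
    print(f"{group}: {label}-represented by {abs(gap)} percentage points")
```

A large negative gap for a group is one concrete way to answer "Who is missing from the data?"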

From the introduction to Joel Best's Damned Lies and Statistics:

There are three basic questions that deserve to be asked whenever we encounter a new statistic.  

  1. Who created this statistic? Every statistic has its authors, its creators. Sometimes a number comes from a particular individual. On other occasions, large organizations (such as the Bureau of the Census) claim authorship (although each statistic undoubtedly reflects the work of particular people within the organization). In asking who the creators are, we ought to be less concerned with the names of the particular individuals who produced a number than with their part in the public drama about statistics. Does a particular statistic come from activists, who are striving to draw attention to and arouse concern about a social problem? Is the number being reported by the media in an effort to prove that this problem is newsworthy? Or does the figure come from officials, bureaucrats who routinely keep track of some social phenomenon, and who may not have much stake in what the numbers show?  
  2. Why was this statistic created? The identities of the people who create statistics are often clues to their motives. In general, activists seek to promote their causes, to draw attention to social problems. Therefore, we can suspect that they will favor large numbers, be more likely to produce them and less likely to view them critically...We need to be aware that the people who produce statistics often care what the numbers show, they use numbers as tools of persuasion.  
  3. How was this statistic created? We should not discount a statistic simply because its creators have a point of view, because they view a social problem as more or less serious. Rather, we need to ask how they arrived at the statistic. All statistics are imperfect, but some are far less perfect than others. There is a big difference between a number produced by a wild guess, and one generated through carefully designed research. This is the key question. Once we understand that all social statistics are created by someone, and that everyone who creates social statistics wants to prove something (even if that is only that they are careful, reliable, and unbiased), it becomes clear that the methods of creating statistics are key.

Real World Examples

Case 1: Data on Inflation

NY Times April 6, 2022: "Caviar and Canned Tuna: Top Fed Official Points out Income-based Inflation Gaps"

This news article raises two important points about the necessity of context to understand data.

  • The first point is that inflation numbers alone do not tell us how inflation impacts people at different income levels. To understand that, we need the context provided by additional data on the percentage of overall income an individual needs to devote to essentials, such as housing, food, health, etc., and the shift in that percentage with inflation. Lower income individuals will have less left over after paying for basic needs. And, as the article also notes, they may not have room to switch to a less expensive option.
  • The second point the article makes is that we also need to see inflation numbers contextualized with information on what is even being measured. In this case, the article notes that the inflation measurement itself is flawed.
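
The first point can be made concrete with a little arithmetic. The Python sketch below applies the same price increase to two households and compares the share of income each must devote to essentials. Every figure is hypothetical (none comes from the NY Times article), and the assumption that essentials rise by one uniform rate is a simplification.

```python
# All figures are hypothetical, chosen only to illustrate the arithmetic.
INFLATION_RATE = 0.08  # assumed 8% price increase on essentials

households = {
    "lower income": {"monthly_income": 2500, "essentials": 2000},
    "higher income": {"monthly_income": 10000, "essentials": 4000},
}


def essentials_share(income, essentials, inflation=0.0):
    """Share of income spent on essentials after prices rise by `inflation`."""
    return essentials * (1 + inflation) / income


for label, h in households.items():
    income, essentials = h["monthly_income"], h["essentials"]
    before = essentials_share(income, essentials)
    after = essentials_share(income, essentials, INFLATION_RATE)
    left_over = income - essentials * (1 + INFLATION_RATE)
    print(f"{label}: {before:.1%} -> {after:.1%} of income on essentials, "
          f"${left_over:,.0f} left over")
```

The inflation rate is identical for both households, but the share of income it consumes is not, which is the context a single inflation number leaves out.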


Data Literacy- Numbers need context: Impact of inflation

Case 2: Data on Police Use of Force against Civilians

Following the death of Michael Brown at the hands of Ferguson, MO police in 2014, there was a demand from many sectors of society for information on police use of force and killings of civilians. This data was not available from any governmental office.

"How are people supposed to fix something that hadn't been properly counted? The director of the FBI, James Comey, acknowledged as much in this 2016 speech... 'We simply don't know. As a country, we have not bothered to collect the data, to collect the information. And in the absence of information, we have anecdotes. We have videos.'" (NPR March 2, 2022)

In 2015, the FBI committed to gathering information from police departments on use of force against civilians. The FBI began collecting data from law enforcement agencies on January 1, 2019. Providing the data is voluntary, and in the first two years of the program the FBI did not meet the minimum threshold of 60% participation by police departments that would enable it to publish the data. It will release some general information on trends.
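
Participation can be measured in more than one way, and the choice changes the picture. The Python sketch below uses the 2021 Washington State figures that this guide quotes from Crime Explorer (44 of 302 agencies, representing 51% of sworn officers). The variable names are assumptions made for this example.

```python
# Figures quoted in this guide from Crime Explorer (Washington State, 2021).
agencies_participating = 44
agencies_total = 302
officer_share = 0.51  # share of sworn officers employed by those agencies

# Share of agencies that reported use-of-force data
agency_share = agencies_participating / agencies_total

print(f"Agencies participating: {agency_share:.1%}")
print(f"Sworn officers represented: {officer_share:.0%}")
```

Roughly one in seven agencies took part, yet those agencies employ about half of the state's officers. Which number a source reports is part of the context a reader needs.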

"In 2021, 44 out of 302 agencies in Washington participated and provided use-of-force data. The officers employed by these agencies represent 51% of sworn law enforcement officers in the state." (from Crime Explorer)

In 2015, the Washington Post launched its own research specifically on police shootings and killings of civilians. It has shared data with the public since 2015.

Data collected by the Washington Post includes only fatal shootings by police. The data is disaggregated, or broken down, by the race of the victim, the circumstances of the killing, whether the person was armed, if they were experiencing a mental health crisis, and other factors. 
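
Disaggregation, as described above, means counting the same records separately for each value of a category. The Python sketch below shows the idea with a handful of hypothetical records. The records, the placeholder group labels, and the disaggregate helper are invented for illustration and are not Washington Post data or code.

```python
from collections import Counter

# Hypothetical records invented for illustration; not Washington Post data.
records = [
    {"group": "A", "armed": True},
    {"group": "A", "armed": False},
    {"group": "B", "armed": True},
    {"group": "B", "armed": True},
    {"group": "C", "armed": False},
]


def disaggregate(rows, field):
    """Count rows separately for each value of `field`."""
    return Counter(row[field] for row in rows)


total = len(records)                       # the aggregated figure
by_group = disaggregate(records, "group")  # broken down by group
by_armed = disaggregate(records, "armed")  # broken down by armed status

print(total, dict(by_group), dict(by_armed))
```

The single total hides the differences that appear once the same records are broken down by category.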

Select Resources on Data Literacy

Articles (online & in Holman Library Databases)
Videos/ Video Tutorials

Source: "What is Data Literacy?" by Data-Pop Alliance is licensed under a Standard YouTube License.

Source: "Own your Body's Data" by Talithia Williams is licensed under a Standard YouTube License.

Source: "The Human Insights Missing from Big Data" by Tricia Wang is licensed under a Standard YouTube License.

Source: "Stories vs. statistics: Professor John Allen Paulos at TEDxTempleU" by John Allen Paulos is licensed under a Standard YouTube License.

Other Useful Research Guides on Data:
Organizations Devoted to Data Education & Literacy

Case Study

Starting Research Study

This case study centers on a research study published in Science Advances in April 2021. The abstract for the study states: 

Racial-ethnic minorities in the United States are exposed to disproportionately high levels of ambient fine particulate air pollution (PM2.5), the largest environmental cause of human mortality. However, it is unknown which emission sources drive this disparity and whether differences exist by emission sector, geography, or demographics. Quantifying the PM2.5 exposure caused by each emitter type, we show that nearly all major emission categories—consistently across states, urban and rural areas, income levels, and exposure levels—contribute to the systemic PM2.5 exposure disparity experienced by people of color. We identify the most inequitable emission source types by state and city, thereby highlighting potential opportunities for addressing this persistent environmental inequity.

Article 1 about the Research Study

"People of Color Hardest Hit by Air Pollution from Nearly All Sources," published in UW News April 28. The news article notes that one of the study co-authors is a UW professor.

The UW article contextualizes the report with quotes on the subject from several of the study's co-authors and links out to their web pages. 
For example, co-author Julian Marshall situates this one study within the larger social context of systemic racism. 

“We find that nearly all emission sectors cause disproportionate exposures for people of color on average.... The inequities we report are a result of systemic racism: Over time, people of color and pollution have been pushed together, not just in a few cases but for nearly all types of emissions.”

Article 2 about the Research Study

The WA Post article contextualizes and interprets the research study in a number of ways:

  • It synthesizes its findings with two other analyses released the same day that also look at the disparate impacts of air pollution on communities of color.
  • It refers and links to Trump administration documents that it says minimize the impact of fine-particle matter on communities of color.
  • It brings in examples and quotes from activists and residents from impacted communities.
  • It refers to another recent study in order to contextualize air quality within a larger picture of the inequitable impacts of climate change on low-income communities and communities of color.
  • It compares the 2014 data set used in the original study to the findings in the more current, non-peer-reviewed American Lung Association 2021 State of the Air Report and finds that their conclusions are the same.
  • The article does not explain the significance of data showing that while "Black, Hispanic and Asian Americans face a higher level of exposure than average to PM 2.5 from industry, light-duty vehicles, diesel-powered heavy trucks and construction," and "Black Americans are exposed to greater-than-average concentrations from all categories in the Environmental Protection Agency National Emissions Inventory," "White Americans have slightly higher-than-average exposure from agriculture and coal-fired power plants, the analysis found, because of where both are located."

Article 3 about the Research Study

This news article from the EPA (the federal government's Environmental Protection Agency) contextualizes the research study with some different facts.  

  • It shares the conclusions reached by the research study.
  • It notes possible implications of the study for the regulatory responsibility of the EPA to protect disproportionately impacted communities.
  • It points to possible weaknesses in the study: "The study results also comes with caveats including uncertainty in the models and in inputs to the models and notes the potential benefit of additional analysis using local data and expertise. In addition, the study focuses on outdoor concentrations at locations of residence; disparities in associated health impacts would also reflect racial-ethnic variability in mobility, microenvironment, outdoor-to-indoor concentration relationships, dose-response, access to health care, and baseline mortality and morbidity rates."

Additional Articles and Context to the Original Research Study

Referenced in WA Post story: 

Other Sources:

Data Online

Sample Sources: