Work Description

Title: Quantifying News Media Bias through Crowdsourcing and Machine Learning Dataset (Open Access)

  • This dataset is generated through an analysis of news articles published online in 2013. We first gather all relevant news articles by examining Bing Toolbar data and using automated scripts to retrieve the content of the URLs observed in that data. We then use a combination of crowdsourcing and machine learning techniques to identify, first, all news articles in our sample and, second, the political news among them. Finally, we run a large crowdsourcing experiment to determine the ideological bias and topic of the political news articles.
  • Our primary analysis is based on articles published in 2013 by the top thirteen US news outlets and two popular political blogs. To compile the set of articles published by these outlets, we first examined the complete web-browsing records for US-located users who installed the Bing Toolbar, an optional add-on application for the Internet Explorer web browser. For each of the fifteen news sites, we recorded all unique URLs that were visited by at least ten toolbar users, and we then crawled the news sites to obtain the full article title and text. This process resulted in a corpus of 803,146 articles published on the fifteen news sites over the course of a year, with each article annotated with its relative popularity.

  • Next, we built two binary classifiers using large-scale logistic regression. The first, which we refer to as the news classifier, identifies “news” articles (i.e., articles that would typically appear in the front section of a traditional newspaper). The second, the politics classifier, identifies political news within the subset of articles that the first classifier labeled as news. Of the 803,146 articles in the corpus, 340,191 (42 percent) were classified as news; of those, 114,814 (34 percent) were classified as political.

  • Having identified approximately 115,000 political news articles, we next seek to categorize the articles by topic (e.g., gay rights, healthcare, etc.), and to quantify the political slant of the article. To do so, we turn to human judges recruited via Mechanical Turk to analyze the articles. For every day in 2013, we randomly selected two political articles, when available, from each of the 15 outlets we study, with sampling weights equal to the number of times the article was visited by our panel of toolbar users.

  • Amazon Mechanical Turk Labeling task: To detect and control for possible preconceptions of an outlet’s ideological slant, workers, upon first entering the experiment, were randomly assigned to either a blinded or unblinded condition. In the blinded condition, workers were presented with only the article’s title and text, whereas in the unblinded condition, they were additionally shown the name of the outlet in which the article was published. Each article was then analyzed by two workers, one each from the sets of workers in the two conditions. For each article, each worker completed the following three tasks. First, they provided primary and secondary article classifications from a list of fifteen topics: (1) civil rights; (2) Democrat scandals; (3) drugs; (4) economy; (5) education; (6) elections; (7) environment; (8) gay rights; (9) gun-related crimes; (10) gun rights/regulation; (11) healthcare; (12) international news; (13) national security; (14) Republican scandals; and (15) other.

  • Second, workers determined whether the article was descriptive news or opinion. Third, to measure ideological slant, workers were asked, “Is the article generally positive, neutral, or negative toward members of the Democratic Party?” and separately, “Is the article generally positive, neutral, or negative toward members of the Republican Party?” Choices for these last two questions were provided on a five-point scale: very positive, somewhat positive, neutral, somewhat negative, and very negative. To mitigate question-ordering effects, workers were initially randomly assigned to being asked either the Democratic or Republican party question first; the question order remained the same for any subsequent articles the worker rated. Finally, we assigned each article a partisanship score between –1 and 1, where a negative rating indicates that the article is net left-leaning and a positive rating indicates that it is net right-leaning. Specifically, for an article’s depiction of the Democratic Party, the five-point scale from very positive to very negative is encoded as –1, –0.5, 0, 0.5, 1. Analogously, for an article’s depiction of the Republican Party, the scale is encoded as 1, 0.5, 0, –0.5, –1. The score for each article is defined as the average over these two ratings. Thus, an average score of –1, for example, indicates that the article is very positive toward Democrats and very negative toward Republicans. The result of this procedure is a large, representative sample of political news articles, with direct human judgments on partisanship and article topic.
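
The partisanship score described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the numeric encodings come from the description, while the label strings and the names DEM_SCALE, REP_SCALE, and partisanship_score are assumptions made for the example.

```python
# Encoding of the five-point scale for the Democratic Party question,
# as stated in the description. The label strings are assumed.
DEM_SCALE = {
    "very positive": -1.0,
    "somewhat positive": -0.5,
    "neutral": 0.0,
    "somewhat negative": 0.5,
    "very negative": 1.0,
}

# The Republican Party question uses the mirrored encoding.
REP_SCALE = {
    "very positive": 1.0,
    "somewhat positive": 0.5,
    "neutral": 0.0,
    "somewhat negative": -0.5,
    "very negative": -1.0,
}


def partisanship_score(dem_rating, rep_rating):
    """Average the two encoded ratings; negative is net left-leaning."""
    return (DEM_SCALE[dem_rating] + REP_SCALE[rep_rating]) / 2


# Very positive toward Democrats and very negative toward Republicans
# gives the most left-leaning score.
left_most = partisanship_score("very positive", "very negative")
mild_left = partisanship_score("somewhat positive", "neutral")
```

Under these encodings, a rating that is positive toward one party and equally positive toward the other averages to 0, so the score measures net slant, not overall tone.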
Last modified
  • 11/18/2022
  • 10/08/2019
To Cite this Work:
Budak, C., Goel, S., Rao, J. M. (2019). Quantifying News Media Bias through Crowdsourcing and Machine Learning Dataset [Data set], University of Michigan - Deep Blue Data.



Files (Count: 2; Size: 2.93 MB)

This file provides information about the data presented in newsArticlesWithLabels.tsv. There are 7 columns in this file. They are listed below:

1) Url
2) News Type: one of three values: other, News, or Opinion. The articles are sampled from those our classifiers predicted to be political, so the sample contains false positives; these are labeled "other", and we remove them based on this label.
3) Perceived: whether the worker saw the blinded or unblinded version of the article (perceived=1 means the unblinded version)
4) Primary topic identified by the worker (If "None", the primary topic is not captured by our list of 14 topics)
5) Secondary topic (If "None", there is no secondary topic or the secondary topic is not captured by our list)
6) Democratic party vote
7) Republican vote
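
As a rough sketch of how the seven columns listed above might be read, the Python below parses tab-separated rows and drops the "other" false positives. The field names in FIELDS, the has_header flag, and the sample rows are assumptions made for illustration. The actual header row, if any, and the encoding of the vote columns should be checked against the file itself.

```python
import csv
import io

# Hypothetical field names for the 7 columns, in the order listed above.
# The real file may use different names or include a header row.
FIELDS = ["url", "news_type", "perceived", "primary_topic",
          "secondary_topic", "dem_vote", "rep_vote"]


def read_labels(handle, has_header=False):
    """Read tab-separated label rows into a list of dicts keyed by FIELDS."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    # Rows without exactly 7 fields are skipped rather than guessed at.
    return [dict(zip(FIELDS, row)) for row in reader if len(row) == len(FIELDS)]


def drop_false_positives(rows):
    """Keep News and Opinion rows; 'other' marks classifier false positives."""
    return [r for r in rows if r["news_type"].strip().lower() != "other"]


# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE = (
    "http://example.com/a\tNews\t1\tEconomy\tNone\tNeutral\tNeutral\n"
    "http://example.com/b\tother\t0\tNone\tNone\tNeutral\tNeutral\n"
    "http://example.com/c\tOpinion\t0\tHealthcare\tEconomy\tPositive\tNegative\n"
)

rows = read_labels(io.StringIO(SAMPLE))
kept = drop_false_positives(rows)
unblinded = [r for r in kept if r["perceived"] == "1"]
```

To read the real file, the same function could be called on a handle opened with open("newsArticlesWithLabels.tsv", newline=""), setting has_header once the first line has been inspected.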

For more information about the dataset, please check out the relevant paper:
