CoVaxxy: 3.5 Million Downloads of the Dataset Mapping COVID-19 Vaccine Discourse on Twitter

Name: CoVaxxy Tweet IDs data set
Creator: DeVerna, M., Pierri, F., Truong, B., Bollenbacher, J., Axelrod, D., Loynes, N., Torres-Lugo, C., Yang, K.-C., Menczer, F., Bryden, J.
Published: 2021
License: CC BY 4.0
Keywords: Twitter, COVID-19, vaccine, misinformation, social media

A living dataset of tweet IDs that became the backbone of global research into how COVID-19 vaccine narratives spread — and how misinformation hijacked them.

DOI: 10.5281/zenodo.7752586

43K researcher views

24/7 continuous collection since Jan 2021

570M+ estimated tweet IDs in the dataset (+12M weekly at peak)

60+ countries with citing publications

The conversation mutated faster than the virus

When COVID-19 vaccines arrived in late 2020, social media became the front line of public health. Twitter wasn't just where people shared their opinions — it was where narratives were born, amplified, distorted, and weaponized. A team at Indiana University's Observatory on Social Media realized that to study this phenomenon, they needed to capture everything, in real time, indefinitely.

CoVaxxy began collecting tweet IDs on January 4, 2021 — the same week the U.S. vaccine rollout stumbled through its chaotic early days. The system tracked tweets containing vaccine-related keywords in multiple languages, running 24/7, pulling in millions of data points per week. By sharing only tweet IDs (not content), the team navigated Twitter's terms of service while enabling anyone to reconstruct the full dataset through the platform's API.
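Reconstructing content from shared IDs is often called hydration. The sketch below shows one way the preparation step might look. It makes several assumptions that are not stated in the source: that the ID files hold one numeric tweet ID per line, and that hydration goes through the Twitter API v2 tweet lookup endpoint, which accepts up to 100 IDs per request. The function names are hypothetical, and the code only builds request URLs; it does not send them, since real requests need API credentials.

```python
from typing import Iterable, Iterator, List

# Assumption: Twitter API v2 tweet lookup, limited to 100 IDs per request.
LOOKUP_URL = "https://api.twitter.com/2/tweets"
MAX_IDS_PER_REQUEST = 100


def read_tweet_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield tweet IDs from text lines, skipping blanks and non-numeric lines.

    Assumes one ID per line (a hypothetical file layout).
    """
    for line in lines:
        tweet_id = line.strip()
        if tweet_id.isdigit():
            yield tweet_id


def batch_ids(ids: Iterable[str], size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Group IDs into lists of at most `size` items."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def lookup_url(batch: List[str]) -> str:
    """Build the lookup URL for one batch of IDs (no request is sent)."""
    return f"{LOOKUP_URL}?ids={','.join(batch)}"
```

Tweets that have been deleted or made private since collection are not returned by a lookup, so a hydrated copy will generally be smaller than the ID list.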

The result is staggering in scale. With over 3.5 million downloads, CoVaxxy is one of the most-accessed datasets in the history of open research data. It has powered studies on everything from bot networks amplifying anti-vaccine content to how public health messaging failed — or succeeded — in different countries. Researchers on six continents have used it to trace the lifecycle of specific false claims, from fringe forums to mainstream media.

What makes CoVaxxy exceptional isn't just its size. It's that the dataset is alive — continuously updated, capturing the shifting landscape of vaccine discourse as new variants emerged, as mandates were imposed and lifted, as the conversation itself evolved from urgency to exhaustion to politicization. It is, in effect, a seismograph for the social media age.

Figure: Weekly tweet volume over time. Peaks align with major vaccine milestones and misinformation surges.
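A weekly volume series of this kind can be approximated from the tweet IDs alone, without hydrating them. The sketch below relies on the fact that tweet IDs are Snowflake IDs, which encode a millisecond timestamp in their upper bits. It is an illustration only: the source does not say how the chart was produced, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable

# Offset (in ms since the Unix epoch) used by Twitter's Snowflake ID scheme.
TWITTER_EPOCH_MS = 1288834974657


def tweet_time(tweet_id: int) -> datetime:
    """Recover the UTC creation time encoded in a Snowflake tweet ID."""
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def weekly_counts(tweet_ids: Iterable[int]) -> Counter:
    """Count tweets per (ISO year, ISO week) bucket."""
    counts: Counter = Counter()
    for tweet_id in tweet_ids:
        iso = tweet_time(tweet_id).isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts
```

Bucketing by ISO week keeps weeks aligned to Mondays, which happens to match the January 4, 2021 collection start date.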

Figure: Misinformation share by topic category (percentage of tweets flagged as containing misleading claims, by topic). Topics shown: Vaccine efficacy & safety; Anti-vaccine narratives; Government policy & mandates; Personal experiences; Scientific research updates; Conspiracy theories; Other / multilingual.

We didn't just collect tweets. We captured the nervous system of a global conversation — every spike, every tremor, every coordinated push.

Tweet volume peaked during Omicron and booster debates in late 2021 — outpacing even the initial vaccine rollout

Coordinated bot networks accounted for an estimated 15-20% of anti-vaccine amplification, according to studies using this dataset

The dataset revealed that misinformation narratives mutated like the virus itself — old claims repackaged around new variants and new policies

🏥

Public Health Strategy

Health agencies used CoVaxxy-derived research to identify misinformation surges in near-real time, enabling targeted counter-messaging during critical vaccination windows.

🔍

Platform Accountability

The dataset provided independent evidence of how algorithmic amplification spread vaccine misinformation — evidence that platforms themselves were unwilling or unable to share publicly.

🌍

Global Pandemic Preparedness

CoVaxxy established a blueprint for real-time social media surveillance during health emergencies. Future pandemics will have better tools because this infrastructure exists.