A living dataset of tweet IDs that became the backbone of global research into how COVID-19 vaccine narratives spread — and how misinformation hijacked them.
When COVID-19 vaccines arrived in late 2020, social media became the front line of public health. Twitter wasn't just where people shared their opinions — it was where narratives were born, amplified, distorted, and weaponized. A team at Indiana University's Observatory on Social Media realized that to study this phenomenon, they needed to capture everything, in real time, indefinitely.
CoVaxxy began collecting tweet IDs on January 4, 2021, as the U.S. vaccine rollout stumbled through its chaotic early weeks. The system tracked tweets containing vaccine-related keywords in multiple languages, running 24/7 and pulling in millions of data points per week. By sharing only tweet IDs (not content), the team stayed within Twitter's terms of service while letting anyone with API access rebuild the dataset, minus any tweets that had since been deleted or made private.
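For readers curious what rebuilding a dataset from IDs looks like in practice, here is a minimal sketch of the first step: grouping IDs into batches and forming lookup requests. It is not the CoVaxxy team's own tooling. The endpoint URL and the 100-ID batch size are assumptions based on Twitter's v2 tweet lookup API as it was documented, and the function names `batch_ids` and `lookup_url` are hypothetical. The sketch only builds the request URLs; sending them would also require an authenticated API token.

```python
from typing import Iterable, Iterator, List
from urllib.parse import urlencode

# Assumptions, not taken from the article: the v2 tweet lookup endpoint
# and its limit of 100 IDs per request.
LOOKUP_URL = "https://api.twitter.com/2/tweets"
MAX_IDS_PER_REQUEST = 100


def batch_ids(tweet_ids: Iterable[str], size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield lists of at most `size` tweet IDs, skipping blank entries."""
    batch: List[str] = []
    for tweet_id in tweet_ids:
        tweet_id = tweet_id.strip()
        if not tweet_id:
            continue  # ignore empty lines in an ID file
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


def lookup_url(batch: List[str]) -> str:
    """Build the lookup URL for one batch of IDs (comma-separated, URL-encoded)."""
    return f"{LOOKUP_URL}?{urlencode({'ids': ','.join(batch)})}"


if __name__ == "__main__":
    # Placeholder IDs for illustration only; real IDs would come from the dataset files.
    ids = [str(n) for n in range(250)]
    for batch in batch_ids(ids):
        print(len(batch), lookup_url(batch)[:60] + "...")
```

Tweets that have been deleted or made private are not returned by a lookup, so a rebuilt copy is typically smaller than the original ID list.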
The result is staggering in scale. With over 3.5 million downloads, CoVaxxy is one of the most-accessed datasets in the history of open research data. It has powered studies on everything from bot networks amplifying anti-vaccine content to how public health messaging failed — or succeeded — in different countries. Researchers on six continents have used it to trace the lifecycle of specific false claims, from fringe forums to mainstream media.
What makes CoVaxxy exceptional isn't just its size. It's that the dataset is alive — continuously updated, capturing the shifting landscape of vaccine discourse as new variants emerged, as mandates were imposed and lifted, as the conversation itself evolved from urgency to exhaustion to politicization. It is, in effect, a seismograph for the social media age.
Peaks align with major vaccine milestones and misinformation surges
Percentage of tweets flagged as containing misleading claims, by topic
We didn't just collect tweets. We captured the nervous system of a global conversation — every spike, every tremor, every coordinated push.
Health agencies used CoVaxxy-derived research to identify misinformation surges in near-real time, enabling targeted counter-messaging during critical vaccination windows.
The dataset provided independent evidence of how algorithmic amplification spread vaccine misinformation — evidence that platforms themselves were unwilling or unable to share publicly.
CoVaxxy established a blueprint for real-time social media surveillance during health emergencies. Future pandemics will have better tools because this infrastructure exists.