Why Econ Researchers Should Care About the Cloud

Tags: research, code, notes

Published: January 21, 2026

Researchers often merge, clean, transform, and analyze data locally. That works well when only one or two coauthors handle the data and the computational requirements are modest. As projects grow in scope, however, knowing how to use cloud infrastructure becomes a meaningful comparative advantage.

In this article, I describe typical use cases for cloud-based data pipelines in academic research.

Cloud services let you scale compute, memory (RAM), and storage on demand. They become especially useful when you need to:

1. Store very large datasets (e.g., terabytes)
2. Run heavy transformations (joins, feature engineering) or model estimation on data that does not fit comfortably in local RAM
3. Execute numerical algorithms that benefit from parallelization or high-performance hardware
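Point 2 often does not require new tools so much as a different access pattern: stream the data in chunks instead of loading it all at once, then scale the same script up on a bigger cloud machine when needed. A minimal sketch in Python, assuming a large CSV with a numeric column (the file and column names here, `trips.csv` and `fare`, are hypothetical):

```python
import pandas as pd

def chunked_mean(path: str, column: str, chunksize: int = 100_000) -> float:
    """Compute the mean of one column without loading the full file into RAM.

    Reads the CSV in blocks of `chunksize` rows and accumulates a running
    sum and count, so peak memory stays roughly constant in the file size.
    """
    total, count = 0.0, 0
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        total += chunk[column].sum()
        count += len(chunk)
    return total / count

# Usage (hypothetical file):
# avg_fare = chunked_mean("trips.csv", "fare")
```

The same streaming idea underlies out-of-core tools such as Dask and Polars' lazy API; starting with a plain chunked loop makes it clear what those libraries automate.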

Even if your current projects do not require that scale, the cloud can still help in collaborations by standardizing the research environment. Instead of debugging differences across operating systems or package versions, coauthors can run the same containerized setup and reproduce the same results, which is useful both for internal replication and for producing clean replication packages.

Another important use case is building new datasets from real-time sources. Some services do not provide historical data for free, so constructing a panel dataset requires scraping and storing observations at regular intervals for weeks or months, something that is much easier to run reliably in the cloud.

For example, suppose you want to study how the elasticity of substitution between commuting by bus and by train (with respect to travel time) changes on rainy days. Ideally, you would collect high-frequency panel data on usage and travel times for both modes. Apps such as Transit provide real-time vehicle locations and service information; scraping these data consistently over a couple of months could generate the dataset needed to answer the question.
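The core of such a collection job is small. The sketch below is generic: `fetch_snapshot` is a placeholder for whatever API call or scraper your data source requires (no real endpoint is assumed), and each observation is appended immediately to a JSON-lines file so that a crash never loses earlier rounds:

```python
import json
import time
from datetime import datetime, timezone

def fetch_snapshot() -> dict:
    """Placeholder: replace with a real API call (e.g., via `requests`)."""
    return {"vehicles": []}  # hypothetical payload shape

def collect(out_path: str, interval_seconds: float, n_rounds: int,
            fetch=fetch_snapshot) -> None:
    """Append one timestamped observation per round to a JSON-lines file."""
    with open(out_path, "a") as f:
        for _ in range(n_rounds):
            record = {
                "scraped_at": datetime.now(timezone.utc).isoformat(),
                "data": fetch(),
            }
            f.write(json.dumps(record) + "\n")
            f.flush()  # persist each round in case the job is interrupted
            time.sleep(interval_seconds)
```

On a cloud VM you would typically not keep a sleep loop alive for months; instead, a scheduler such as cron would invoke a single-round version of this script at the desired frequency, which is exactly the kind of long-running reliability that is hard to guarantee on a laptop.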

The main drawback is the learning curve. Fortunately, you can get started without learning everything at once. A practical first step is using Docker and GitHub Codespaces, which lets coauthors work in the same environment while you keep using VS Code. If your project later needs more storage or computing power, you can expand to full cloud platforms such as AWS or Google Cloud. If you would like, I am happy to suggest a minimal setup and learning path tailored to your use case.
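As a sense of how small that first step can be, a shared research environment can start from a Dockerfile like this sketch (the Python version and the `requirements.txt` contents are placeholders to adapt to your project):

```dockerfile
FROM python:3.12-slim

WORKDIR /project

# Pin package versions in requirements.txt so every coauthor
# builds an identical environment
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
```

Committing this file to the repository, together with a small `.devcontainer` configuration that points at it, is enough for GitHub Codespaces to build the same environment for every coauthor.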