This example demonstrates how to build a basic, modular data pipeline with Python scripts, running each step in its own Docker container. The pipeline consists of three steps:
Extract: Download data from a source.
Transform: Process/clean the data.
Load: Save the output to a database or a file.
Project Structure
data-pipeline/
│
├── docker-compose.yml
├── extract/
│   ├── Dockerfile
│   └── extract.py
├── transform/
│   ├── Dockerfile
│   └── transform.py
├── load/
│   ├── Dockerfile
│   └── load.py
├── data/              # Shared directory for passing files between steps
│
└── requirements.txt
extract/extract.py
import json
import requests

def fetch_data():
    # Pull sample records from the JSONPlaceholder demo API.
    url = 'https://jsonplaceholder.typicode.com/posts'
    response = requests.get(url)
    response.raise_for_status()
    # Write the raw payload to the shared volume for the next step.
    with open('/data/raw_data.json', 'w') as f:
        json.dump(response.json(), f)
    print("Extracted data saved to /data/raw_data.json")

if __name__ == '__main__':
    fetch_data()
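For robustness, you could add a request timeout and a simple retry loop around the download. A minimal sketch of that variation (the retry count and delay are arbitrary choices, not part of the original example):

import time
import requests

def fetch_data_with_retry(url='https://jsonplaceholder.typicode.com/posts', retries=3, delay=2):
    # Try the request a few times before giving up.
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            print(f"Attempt {attempt} failed: {exc}")
            if attempt == retries:
                raise
            time.sleep(delay)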
transform/transform.py
import json

def transform():
    # Read the raw extract from the shared volume.
    with open('/data/raw_data.json', 'r') as infile:
        records = json.load(infile)
    # Keep only the id and title fields from each record.
    transformed = [{'id': r['id'], 'title': r['title']} for r in records]
    with open('/data/clean_data.json', 'w') as outfile:
        json.dump(transformed, outfile)
    print("Transformed data saved to /data/clean_data.json")

if __name__ == '__main__':
    transform()
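If the source data ever contains records with missing fields, a slightly more defensive version of that list comprehension can skip them instead of raising a KeyError. A minimal sketch:

def transform_safe(records):
    # Keep only records that actually carry both fields.
    return [
        {'id': r['id'], 'title': r['title']}
        for r in records
        if 'id' in r and 'title' in r
    ]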
load/load.py
import json
import csv

def load():
    # Read the cleaned records produced by the transform step.
    with open('/data/clean_data.json', 'r') as infile:
        records = json.load(infile)
    # Write them out as a CSV file.
    with open('/data/result.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['id', 'title'])
        writer.writeheader()
        writer.writerows(records)
    print("Loaded data into /data/result.csv")

if __name__ == '__main__':
    load()
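A quick way to sanity-check the output is to read the CSV back with csv.DictReader, for example:

import csv

with open('/data/result.csv', newline='') as csvfile:
    rows = list(csv.DictReader(csvfile))
print(f"result.csv contains {len(rows)} rows")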
Docker Setup
requirements.txt
requests
Dockerfiles
Each step needs a simple Dockerfile (create one in each subdirectory). Because requirements.txt lives at the project root, the images are built with the project root as the build context (see docker-compose.yml below); the three Dockerfiles are otherwise identical apart from the directory they copy and the script they run.
extract/Dockerfile (for transform/Dockerfile and load/Dockerfile, replace extract with transform or load):
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY extract/ .
# For the other steps, run transform.py or load.py instead
CMD ["python", "extract.py"]
docker-compose.yml
version: '3.9'
services:
  extract:
    build:
      context: .
      dockerfile: extract/Dockerfile
    volumes:
      - ./data:/data
  transform:
    build:
      context: .
      dockerfile: transform/Dockerfile
    volumes:
      - ./data:/data
    depends_on:
      - extract
  load:
    build:
      context: .
      dockerfile: load/Dockerfile
    volumes:
      - ./data:/data
    depends_on:
      - transform
How to Run the Pipeline
Build all containers:
docker-compose build
Run the pipeline services in order (depends_on only controls start order, not completion, so trigger each step explicitly):
docker-compose run extract
docker-compose run transform
docker-compose run load
Or chain all three steps automatically with a small wrapper script, as sketched below.
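Since the project is Python-based anyway, one option is a small Python wrapper instead of a shell script. The sketch below (the file name run_pipeline.py is an assumption, not part of the project layout above) shells out to docker-compose and stops at the first failing step:

import subprocess
import sys

STEPS = ['extract', 'transform', 'load']

def run_pipeline():
    for step in STEPS:
        print(f"Running step: {step}")
        # --rm removes the one-off container after it exits.
        result = subprocess.run(['docker-compose', 'run', '--rm', step])
        if result.returncode != 0:
            print(f"Step '{step}' failed with exit code {result.returncode}")
            sys.exit(result.returncode)

if __name__ == '__main__':
    run_pipeline()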
Result
After running all steps, you’ll have:
data/raw_data.json (raw API data)
data/clean_data.json (transformed)
data/result.csv (CSV output ready for use)
Customization Ideas
Swap the extract step to read from a database or S3.
Change the load step to upload to a database instead of writing a CSV (see the SQLite sketch below).
Add logging, error handling, or orchestration (e.g., Airflow) for more advanced pipelines.
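As an illustration of the database idea, the load step could write to SQLite instead of CSV. A minimal sketch using only the standard library (the table name posts and the file pipeline.db are illustrative assumptions):

import json
import sqlite3

def load_to_sqlite():
    with open('/data/clean_data.json', 'r') as infile:
        records = json.load(infile)
    conn = sqlite3.connect('/data/pipeline.db')
    try:
        conn.execute('CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, title TEXT)')
        # Named placeholders map directly onto the dicts produced by the transform step.
        conn.executemany('INSERT OR REPLACE INTO posts (id, title) VALUES (:id, :title)', records)
        conn.commit()
        print(f"Loaded {len(records)} rows into /data/pipeline.db")
    finally:
        conn.close()

if __name__ == '__main__':
    load_to_sqlite()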
Running the Pipeline Without Docker
You can also run the ETL steps directly on your machine. Install the dependencies first, then run each script in order:
pip install -r requirements.txt
python extract/extract.py
python transform/transform.py
python load/load.py
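One caveat: the scripts write to the absolute path /data/, which the Docker volume mount provides inside the containers but which usually does not exist on your host. A simple way to keep the scripts unchanged in both environments is to read the directory from an environment variable, as in this sketch (the DATA_DIR variable is an assumed convention, not part of the original scripts):

import os

# Default to the container path; on the host, override with e.g. DATA_DIR=./data
DATA_DIR = os.environ.get('DATA_DIR', '/data')
os.makedirs(DATA_DIR, exist_ok=True)

RAW_DATA_PATH = os.path.join(DATA_DIR, 'raw_data.json')
CLEAN_DATA_PATH = os.path.join(DATA_DIR, 'clean_data.json')
RESULT_CSV_PATH = os.path.join(DATA_DIR, 'result.csv')

Each script would then open these paths instead of the hard-coded '/data/...' strings.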