Every data engineer knows the struggle: finding a project that's both technically impressive and genuinely useful. Today I'll walk you through AfriData Pipeline — a production-grade ETL system that extracts economic data for all 54 African countries, loads it into a DuckDB analytical warehouse, and serves an interactive dashboard.
No paid APIs. No cloud services required. Just Python, DuckDB, and free public data.
Why This Project?
Africa's economy is growing fast, but finding clean, consolidated economic data is surprisingly hard. The World Bank has an amazing free API with 16,000+ indicators — but raw API responses need serious engineering to become useful.
This project demonstrates:
- ETL pipeline design with proper error handling and retries
- Dimensional modeling (star schema) in DuckDB
- Data quality engineering — automated checks for completeness, validity, and freshness
- Full-stack delivery — from raw API to interactive dashboard
Architecture Overview
World Bank API v2 → Extract (httpx) → Transform (Python) → Load (DuckDB)
                                                                 ↓
                                                           Export JSON → Static Dashboard (Vercel)
The pipeline processes 13,500 data points (54 countries × 10 indicators × 25 years) in under 50 seconds.
The Data: 10 Key Indicators
I selected indicators that tell a comprehensive economic story:
| Indicator | Category | Why It Matters |
|---|---|---|
| GDP (US$) | Economy | Total economic output |
| GDP Growth (%) | Economy | Economic momentum |
| Population | Demographics | Scale context |
| Inflation (CPI, %) | Economy | Cost of living pressure |
| Unemployment (%) | Labor | Job market health |
| Life Expectancy (years) | Health | Quality of life proxy |
| Internet Users (%) | Technology | Digital readiness |
| Electricity Access (%) | Infrastructure | Development foundation |
| Literacy Rate (%) | Education | Human capital |
| FDI Inflows (% of GDP) | Investment | External confidence |
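For reference, the table can be captured as a small config mapping. The World Bank series codes below are my best guess at the closest matching series for each row; the post does not list the codes the project actually uses, so treat them as illustrative. `indicators_by_category` is a hypothetical helper.

```python
# Assumed World Bank series codes for the ten indicators in the table.
# These are real series, but whether the pipeline uses exactly these is a guess.
INDICATORS = {
    "NY.GDP.MKTP.CD": ("GDP", "Economy"),
    "NY.GDP.MKTP.KD.ZG": ("GDP growth", "Economy"),
    "SP.POP.TOTL": ("Population", "Demographics"),
    "FP.CPI.TOTL.ZG": ("Inflation", "Economy"),
    "SL.UEM.TOTL.ZS": ("Unemployment", "Labor"),
    "SP.DYN.LE00.IN": ("Life expectancy", "Health"),
    "IT.NET.USER.ZS": ("Internet users", "Technology"),
    "EG.ELC.ACCS.ZS": ("Electricity access", "Infrastructure"),
    "SE.ADT.LITR.ZS": ("Literacy rate", "Education"),
    "BX.KLT.DINV.WD.GD.ZS": ("FDI inflows", "Investment"),
}


def indicators_by_category(indicators):
    """Group indicator labels by category, preserving table order."""
    grouped = {}
    for label, category in indicators.values():
        grouped.setdefault(category, []).append(label)
    return grouped


print(indicators_by_category(INDICATORS)["Economy"])
# ['GDP', 'GDP growth', 'Inflation']
```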
Building the Extract Layer
The World Bank API v2 is beautifully simple — no auth required, JSON responses, and you can batch multiple countries in one request:
import httpx
import time
WB_BASE = "https://api.worldbank.org/v2"
MAX_RETRIES = 3
def extract_indicator(client: httpx.Client, indicator_code: str,
                      country_codes: str) -> list[dict]:
    """Fetch one indicator for all countries; return [] if every attempt fails."""
    url = (f"{WB_BASE}/country/{country_codes}/indicator/{indicator_code}"
           f"?format=json&date=2000:2024&per_page=10000")
    for attempt in range(MAX_RETRIES):
        try:
            resp = client.get(url, timeout=60)
            resp.raise_for_status()
            data = resp.json()
            # World Bank returns [metadata, records]
            if isinstance(data, list) and len(data) == 2:
                return data[1] or []
        except (httpx.HTTPStatusError, httpx.TransportError, ValueError):
            # HTTP error status, timeout or connection failure, or a non-JSON body
            pass
        # Back off before the next attempt, including after an unexpected payload
        time.sleep(2 * (2 ** attempt))
    return []
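The [metadata, records] envelope is worth unpacking explicitly, because the metadata also says whether a single page held everything. Here is a minimal sketch, assuming the metadata object carries `page` and `pages` fields and that error responses arrive as a one-element list. `split_payload` and `needs_more_pages` are hypothetical helpers, not part of the project, and the sample payloads are hand-written.

```python
def split_payload(data):
    """Split a World Bank v2 JSON payload into (metadata, records).

    Returns (None, []) for anything that is not a two-element list,
    which is the shape assumed here for error responses.
    """
    if not (isinstance(data, list) and len(data) == 2):
        return None, []
    metadata, records = data
    return metadata, records or []  # records can be null when nothing matched


def needs_more_pages(metadata):
    """True when the response says there are further pages to fetch."""
    if not metadata:
        return False
    return int(metadata.get("pages", 1)) > int(metadata.get("page", 1))


# Hand-written sample payloads for illustration only (not real API output).
ok = [{"page": 1, "pages": 1, "per_page": 10000, "total": 2},
      [{"countryiso3code": "NGA", "date": "2020", "value": 1.0},
       {"countryiso3code": "NGA", "date": "2019", "value": None}]]
error = [{"message": [{"id": "120", "key": "Invalid value"}]}]

meta, rows = split_payload(ok)
print(len(rows), needs_more_pages(meta))  # 2 False
print(split_payload(error))               # (None, [])
```

With 54 countries × 25 years = 1,350 records per indicator, `per_page=10000` should keep everything on one page, so the page check is a guard rather than a code path you expect to hit.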
Key design decisions:
- Exponential backoff on failures (2s, 4s, 8s)
- Single request per indicator — semicolon-separated country codes let us fetch all 54 countries at once
- 60-second timeout — some indicators return large payloads
- 0.5s delay between indicators — respect the free API
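The last two decisions can be sketched as a small driver loop. This is an illustration under assumptions, not the project's code: `extract_all` is a hypothetical name, `fetch` stands in for the real extraction function, and the country and indicator codes are placeholders.

```python
import time


def extract_all(indicator_codes, country_codes, fetch, pause=0.5, sleep=time.sleep):
    """Fetch every indicator with one request each, pausing between requests.

    `fetch(indicator_code, joined_codes)` is a stand-in for the real
    extraction function; `sleep` is injectable so the loop can be tested.
    """
    joined = ";".join(country_codes)  # e.g. "NGA;KEN;ZAF"
    results = {}
    for i, code in enumerate(indicator_codes):
        if i:  # no pause before the first request
            sleep(pause)
        results[code] = fetch(code, joined)
    return results


# Dry run with a fake fetcher, so no network access is needed.
calls = []


def fake_fetch(code, joined):
    calls.append((code, joined))
    return []


extract_all(["SP.POP.TOTL", "NY.GDP.MKTP.CD"], ["NGA", "KEN", "ZAF"],
            fake_fetch, sleep=lambda seconds: None)
print(calls[0])  # ('SP.POP.TOTL', 'NGA;KEN;ZAF')
```

With ten indicators, the pauses add about 4.5 seconds in total, which fits inside the run time quoted earlier.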
The Star Schema
DuckDB is perfect for this: blazing fast analytics, zero configuration, and a single portable file.
dim_country ◄──── fact_indicators ────► dim_indicator
     │                                        │
     └─────────────── dim_date ───────────────┘
import duckdb
def create_schema(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS fact_indicators (
            country_key INTEGER,
            indicator_key INTEGER,
            date_key INTEGER,
            value DOUBLE,
            yoy_change DOUBLE,
            extracted_at TIMESTAMP DEFAULT current_timestamp,
            PRIMARY KEY (country_key, indicator_key, date_key)
        )
    """)
    # Plus dim_country (54 rows), dim_indicator (10 rows), dim_date (25 rows)
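Loading into this schema means turning each API record into a row keyed by surrogate keys. Here is a rough sketch, assuming records carry `countryiso3code`, `indicator.id`, `date`, and `value` fields, that `date_key` is simply the year, and that the key lookups come from the dimension tables. `to_fact_row` is a hypothetical helper and the sample values are made up.

```python
def to_fact_row(record, country_keys, indicator_keys):
    """Map one World Bank record to (country_key, indicator_key, date_key, value).

    Returns None when the country or indicator is not in the dimensions
    (for example, regional aggregates) or the year is not numeric.
    """
    country_key = country_keys.get(record.get("countryiso3code"))
    indicator_key = indicator_keys.get((record.get("indicator") or {}).get("id"))
    year = str(record.get("date", ""))
    if country_key is None or indicator_key is None or not year.isdigit():
        return None
    value = record.get("value")
    return (country_key, indicator_key, int(year),
            float(value) if value is not None else None)


# Illustrative lookups; real keys would be read from dim_country / dim_indicator.
country_keys = {"NGA": 1, "KEN": 2}
indicator_keys = {"SP.POP.TOTL": 1}
record = {"countryiso3code": "NGA", "indicator": {"id": "SP.POP.TOTL"},
          "date": "2020", "value": 1000}
print(to_fact_row(record, country_keys, indicator_keys))
# (1, 1, 2020, 1000.0)
```

Null values are kept as rows with a `None` value in this sketch, so the completeness check can still count them as expected-but-missing. Whether the project stores or skips them is not stated.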
The transform layer also computes year-over-year change for every data point:
def calculate_yoy(current, previous):
    if current is not None and previous is not None and previous != 0:
        return round(((current - previous) / abs(previous)) * 100, 2)
    return None
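Dividing by `abs(previous)` keeps the sign of the result tied to the direction of the move even when the previous value is negative, as it can be for GDP growth. To see how the function behaves over a whole series, here is a small sketch. `yoy_series` is a hypothetical wrapper that compares each year strictly with the year before it; the project may instead compare against the last available observation. The numbers are made up.

```python
def calculate_yoy(current, previous):
    # Same rule as above: percentage change relative to |previous|
    if current is not None and previous is not None and previous != 0:
        return round(((current - previous) / abs(previous)) * 100, 2)
    return None


def yoy_series(values_by_year):
    """Return {year: yoy_change}, comparing each year with year - 1 only."""
    return {
        year: calculate_yoy(value, values_by_year.get(year - 1))
        for year, value in sorted(values_by_year.items())
    }


# Made-up numbers to show the edge cases: first year, a gap, and a zero.
series = {2020: 100.0, 2021: 110.0, 2022: None, 2023: 121.0, 2024: 0.0}
print(yoy_series(series))
# {2020: None, 2021: 10.0, 2022: None, 2023: None, 2024: -100.0}
```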
Data Quality Framework
This is what separates a toy project from a production one. The quality framework scores three dimensions:
1. Completeness: what percentage of expected data points are non-null?
   - Literacy Rate: only 18% complete (data is sparse)
   - Population: 100% complete (every country, every year)
2. Validity: are values within expected ranges?
   - Life expectancy: 25-95 years ✅
   - GDP: $1M - $10T ✅
   - Inflation: -30% to 10,000% (yes, hyperinflation happens) ✅
3. Freshness: how recent is the latest data?
   - GDP: 2024 ✅
   - Literacy: 2021 ⚠️ (surveys are infrequent)
The final score is 95.8/100, with completeness pulling it down slightly because literacy data is sparse (expected for survey-based indicators).
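The three dimensions can each be sketched as a small function. This is an illustration under assumptions: the function names are hypothetical, the sample values are made up, and the post does not say how the project weights the dimensions into the overall score, so the sketch stops at per-dimension metrics.

```python
def completeness(values):
    """Share of expected data points that are non-null, as a percentage."""
    if not values:
        return 0.0
    return round(100 * sum(v is not None for v in values) / len(values), 1)


def validity(values, low, high):
    """Share of non-null values inside [low, high], as a percentage."""
    present = [v for v in values if v is not None]
    if not present:
        return 100.0  # nothing present, so nothing violates the range
    return round(100 * sum(low <= v <= high for v in present) / len(present), 1)


def latest_year(values_by_year):
    """Most recent year with a non-null value, or None if there is none."""
    years = [year for year, v in values_by_year.items() if v is not None]
    return max(years) if years else None


# Made-up life expectancy values for one country.
life_exp = {2021: 61.2, 2022: 61.9, 2023: None, 2024: None}
vals = list(life_exp.values())
print(completeness(vals))      # 50.0
print(validity(vals, 25, 95))  # 100.0
print(latest_year(life_exp))   # 2022
```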
Interactive Dashboard
The dashboard is a static site (HTML + Tailwind CSS + Chart.js + Leaflet.js) that loads pre-exported JSON files.
Features:
- 🗺️ Choropleth map — click any African country, toggle between indicators
- 📈 Country comparison — compare up to 6 countries over 25 years
- 🏆 Rankings table — sortable by any indicator
- 🌙 Dark mode — full theme support
- 📱 Responsive — works on mobile
The dashboard reads four JSON files exported by the pipeline:
- country_profiles.json: all data per country (897KB)
- rankings.json: pre-sorted rankings per indicator
- summary_stats.json: aggregate statistics
- quality_report.json: transparency on data quality
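As an example of what "pre-sorted" can mean at export time, here is a minimal sketch of building the rankings structure. `build_rankings`, the input shape, and the output fields (`rank`, `country`, `value`) are all assumptions for illustration; the real file layout may differ.

```python
import json


def build_rankings(latest_values):
    """Sort countries per indicator, highest value first, skipping nulls.

    `latest_values` maps indicator -> {country_code: value}.
    """
    rankings = {}
    for indicator, by_country in latest_values.items():
        ranked = sorted(
            ((country, v) for country, v in by_country.items() if v is not None),
            key=lambda item: item[1],
            reverse=True,
        )
        rankings[indicator] = [
            {"rank": i, "country": country, "value": v}
            for i, (country, v) in enumerate(ranked, start=1)
        ]
    return rankings


# Made-up values; the real export would read them from the warehouse.
sample = {"life_expectancy": {"AAA": 60.5, "BBB": 72.1, "CCC": None}}
print(json.dumps(build_rankings(sample), indent=2))
```

Because the sorting happens once in the pipeline, the browser only has to render the list. Writing the result to disk is a single `json.dump` call into whatever directory the dashboard serves.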
Automated Daily Refresh
A GitHub Actions workflow runs the pipeline daily at 6 AM UTC:
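name: Daily ETL Pipeline
on:
  schedule:
    - cron: '0 6 * * *'
  workflow_dispatch:
# Added as an assumption: needed for `git push` when the default token is read-only
permissions:
  contents: write
jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt
      - run: python -m pipeline.main all
      - run: |
          git config user.name "github-actions[bot]"
          # git needs an email as well as a name before it will commit
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git add dashboard/data/
          git diff --cached --quiet || git commit -m "chore: update data"
          git push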
Fresh data → committed JSON → Vercel auto-deploys. Zero manual intervention.
Key Takeaways
Free APIs are underrated — The World Bank API has incredible depth. No auth, no rate limits worth worrying about, and 25+ years of history.
DuckDB is a game-changer for small-to-medium analytical workloads. Zero setup, single file, and it handles 13K+ rows with analytical queries in milliseconds.
Data quality isn't optional — Even with a trusted source like the World Bank, you'll find missing data, sparse indicators, and surprises. Build quality checks into the pipeline, not as an afterthought.
Static dashboards scale — By pre-computing JSON at ETL time, the dashboard is just a static site. No backend, no database connection, no server costs. Deploy to Vercel for free.
Star schemas still matter — Even in a world of data lakes and denormalized tables, dimensional modeling makes your data queryable and understandable.
Try It Yourself
The entire project is open source:
- GitHub: hajirufai/afridata-pipeline
- Stack: Python 3.12, httpx, DuckDB, Chart.js, Leaflet.js, Tailwind CSS
git clone https://github.com/hajirufai/afridata-pipeline.git
cd afridata-pipeline
pip install -r requirements.txt
python -m pipeline.main all
cd dashboard && python -m http.server 8080
Data engineering doesn't have to be about massive Spark clusters and cloud bills. Sometimes the best projects start with a free API and a clear question.
What economic indicators would you add? Drop a comment below!