Building an African Economic Data Pipeline with Python, DuckDB & World Bank API
Haji Rufai · 2026-05-23 · via DEV Community

Every data engineer knows the struggle: finding a project that's both technically impressive and genuinely useful. Today I'll walk you through AfriData Pipeline — a production-grade ETL system that extracts economic data for all 54 African countries, loads it into a DuckDB analytical warehouse, and serves an interactive dashboard.

No paid APIs. No cloud services required. Just Python, DuckDB, and free public data.

Why This Project?

Africa's economy is growing fast, but finding clean, consolidated economic data is surprisingly hard. The World Bank has an amazing free API with 16,000+ indicators — but raw API responses need serious engineering to become useful.

This project demonstrates:

  • ETL pipeline design with proper error handling and retries
  • Dimensional modeling (star schema) in DuckDB
  • Data quality engineering — automated checks for completeness, validity, and freshness
  • Full-stack delivery — from raw API to interactive dashboard

Architecture Overview

World Bank API v2 → Extract (httpx) → Transform (Python) → Load (DuckDB)
                                                               ↓
                                            Export JSON → Static Dashboard (Vercel)

The pipeline processes 13,500 data points (54 countries × 10 indicators × 25 years) in under 50 seconds.

The Data: 10 Key Indicators

I selected indicators that tell a comprehensive economic story:

Indicator               Category        Why It Matters
GDP (US$)               Economy         Total economic output
GDP Growth (%)          Economy         Economic momentum
Population              Demographics    Scale context
Inflation (CPI)         Economy         Cost of living pressure
Unemployment            Labor           Job market health
Life Expectancy         Health          Quality of life proxy
Internet Users (%)      Technology      Digital readiness
Electricity Access (%)  Infrastructure  Development foundation
Literacy Rate (%)       Education       Human capital
FDI Inflows (% GDP)     Investment      External confidence
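
To make the table concrete, here is a minimal sketch of how the ten indicators could be declared as pipeline configuration. The series codes are real World Bank identifiers, but whether the project uses exactly these series is an assumption, and `INDICATORS` and `indicators_by_category` are hypothetical names.

```python
# Hypothetical configuration: World Bank series code -> (display name, category).
# Assumption: the project may use different series for some of these indicators.
INDICATORS = {
    "NY.GDP.MKTP.CD": ("GDP (US$)", "Economy"),
    "NY.GDP.MKTP.KD.ZG": ("GDP Growth (%)", "Economy"),
    "SP.POP.TOTL": ("Population", "Demographics"),
    "FP.CPI.TOTL.ZG": ("Inflation (CPI)", "Economy"),
    "SL.UEM.TOTL.ZS": ("Unemployment", "Labor"),
    "SP.DYN.LE00.IN": ("Life Expectancy", "Health"),
    "IT.NET.USER.ZS": ("Internet Users (%)", "Technology"),
    "EG.ELC.ACCS.ZS": ("Electricity Access (%)", "Infrastructure"),
    "SE.ADT.LITR.ZS": ("Literacy Rate (%)", "Education"),
    "BX.KLT.DINV.WD.GD.ZS": ("FDI Inflows (% GDP)", "Investment"),
}


def indicators_by_category(indicators):
    """Group indicator display names by category, preserving insertion order."""
    grouped = {}
    for _code, (name, category) in indicators.items():
        grouped.setdefault(category, []).append(name)
    return grouped
```

Keeping the codes in one dictionary means the extract step can loop over its keys while the dashboard reuses the display names and categories.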

Building the Extract Layer

The World Bank API v2 is beautifully simple — no auth required, JSON responses, and you can batch multiple countries in one request:

import httpx
import time

WB_BASE = "https://api.worldbank.org/v2"
MAX_RETRIES = 3

def extract_indicator(client: httpx.Client, indicator_code: str, 
                      country_codes: str) -> list[dict]:
    url = (f"{WB_BASE}/country/{country_codes}/indicator/{indicator_code}"
           f"?format=json&date=2000:2024&per_page=10000")

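    for attempt in range(MAX_RETRIES):
        try:
            resp = client.get(url, timeout=60)
            resp.raise_for_status()
            data = resp.json()
            # World Bank returns [metadata, records]; records is null when empty
            if isinstance(data, list) and len(data) == 2:
                return data[1] or []
        except (httpx.HTTPStatusError, httpx.ReadTimeout):
            pass  # fall through to the backoff below
        # Back off before the next attempt (2s, 4s, 8s), including after an
        # unexpected payload shape, so we never retry immediately
        time.sleep(2 * (2 ** attempt))
    return []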

Key design decisions:

  • Exponential backoff on failures (2s, 4s, 8s)
  • Single request per indicator — semicolon-separated country codes let us fetch all 54 countries at once
  • 60-second timeout — some indicators return large payloads
  • 0.5s delay between indicators — respect the free API
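
The last two decisions happen in the code that calls the extractor, which isn't shown. A minimal sketch of that loop might look like this. `join_country_codes`, `extract_all`, and the injected `fetch` callable are hypothetical names, not the project's actual API; passing the fetcher in keeps the sketch runnable without network access.

```python
import time
from typing import Callable


def join_country_codes(codes: list[str]) -> str:
    """Build the semicolon-separated country list used in the request URL."""
    return ";".join(code.strip() for code in codes)


def extract_all(
    indicator_codes: list[str],
    country_codes: list[str],
    fetch: Callable[[str, str], list[dict]],
    pause_seconds: float = 0.5,
) -> dict[str, list[dict]]:
    """Fetch each indicator once for all countries, pausing between requests."""
    countries = join_country_codes(country_codes)
    results: dict[str, list[dict]] = {}
    for i, code in enumerate(indicator_codes):
        if i > 0:
            time.sleep(pause_seconds)  # be polite to the free API
        results[code] = fetch(code, countries)
    return results
```

In the real pipeline, `fetch` could be something like `lambda code, countries: extract_indicator(client, code, countries)`, so ten indicators cost ten requests rather than 540.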

The Star Schema

DuckDB is perfect for this: blazing fast analytics, zero configuration, and a single portable file.

dim_country ◄──── fact_indicators ────► dim_indicator
     │                  │
     └────────── dim_date ──────────────┘

import duckdb

def create_schema(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS fact_indicators (
            country_key  INTEGER,
            indicator_key INTEGER,
            date_key     INTEGER,
            value        DOUBLE,
            yoy_change   DOUBLE,
            extracted_at TIMESTAMP DEFAULT current_timestamp,
            PRIMARY KEY (country_key, indicator_key, date_key)
        )
    """)
    # Plus dim_country (54 rows), dim_indicator (10 rows), dim_date (25 rows)

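The dimension tables only appear as a comment in the schema code, so here is a rough sketch of what they and a typical star-schema query could look like. The dimension column names are assumptions, the inserted numbers are placeholders rather than real statistics, and the example uses Python's built-in sqlite3 module as a stand-in purely so it runs without installing anything. The SQL sticks to standard syntax, so it should carry over to a DuckDB connection with the statements executed one at a time.

```python
import sqlite3

# Stand-in database: sqlite3 instead of DuckDB so the sketch has no dependencies.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_country (
        country_key  INTEGER PRIMARY KEY,
        iso3_code    TEXT NOT NULL,
        country_name TEXT NOT NULL
    );
    CREATE TABLE dim_indicator (
        indicator_key  INTEGER PRIMARY KEY,
        indicator_code TEXT NOT NULL,
        indicator_name TEXT NOT NULL
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        year     INTEGER NOT NULL
    );
    CREATE TABLE fact_indicators (
        country_key   INTEGER,
        indicator_key INTEGER,
        date_key      INTEGER,
        value         DOUBLE,
        yoy_change    DOUBLE,
        PRIMARY KEY (country_key, indicator_key, date_key)
    );
""")

conn.execute("INSERT INTO dim_country VALUES (1, 'NGA', 'Nigeria'), (2, 'KEN', 'Kenya')")
conn.execute("INSERT INTO dim_indicator VALUES (1, 'SP.POP.TOTL', 'Population')")
conn.execute("INSERT INTO dim_date VALUES (2023, 2023), (2024, 2024)")
# Placeholder values for illustration only, not real statistics.
conn.executemany(
    "INSERT INTO fact_indicators VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 2023, 100.0, None),
        (1, 1, 2024, 110.0, 10.0),
        (2, 1, 2023, 50.0, None),
        (2, 1, 2024, 51.0, 2.0),
    ],
)


def latest_values(conn, indicator_code, year):
    """Join the fact table to its dimensions and order countries by value."""
    return conn.execute(
        """
        SELECT c.country_name, f.value, f.yoy_change
        FROM fact_indicators AS f
        JOIN dim_country   AS c ON c.country_key = f.country_key
        JOIN dim_indicator AS i ON i.indicator_key = f.indicator_key
        JOIN dim_date      AS d ON d.date_key = f.date_key
        WHERE i.indicator_code = ? AND d.year = ?
        ORDER BY f.value DESC
        """,
        (indicator_code, year),
    ).fetchall()


rows = latest_values(conn, "SP.POP.TOTL", 2024)
```

The composite primary key on the fact table is what allows one row per country, indicator, and year, which keeps repeated daily loads from creating duplicates.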

The transform layer also computes year-over-year change for every data point:

def calculate_yoy(current, previous):
    if current is not None and previous is not None and previous != 0:
        return round(((current - previous) / abs(previous)) * 100, 2)
    return None

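To show how that function behaves across a whole series, here is a small sketch that applies it year by year. `yoy_series` is a hypothetical helper, and the choice to compare only against the immediately preceding year (returning None across gaps) is an assumption about how the pipeline treats missing data.

```python
from typing import Optional


def calculate_yoy(current, previous):
    # Same rule as the function above, repeated so this sketch runs on its own.
    if current is not None and previous is not None and previous != 0:
        return round(((current - previous) / abs(previous)) * 100, 2)
    return None


def yoy_series(values_by_year: dict[int, Optional[float]]) -> dict[int, Optional[float]]:
    """Compute YoY change per year, comparing only against the year before."""
    return {
        year: calculate_yoy(value, values_by_year.get(year - 1))
        for year, value in sorted(values_by_year.items())
    }
```

Dividing by `abs(previous)` matters for indicators that can go negative, such as GDP growth or inflation: moving from -100 to -50 is reported as +50%, an improvement, rather than -50%.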

Data Quality Framework

This is what separates a toy project from a production one. The quality framework scores three dimensions:

1. Completeness — What percentage of expected data points are non-null?

Literacy Rate: only 18% complete (data is sparse)
Population: 100% complete (every country, every year)

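A completeness check along these lines can be sketched in a few lines. The function name and signature are hypothetical; the expected count per indicator would be 54 countries × 25 years = 1,350.

```python
def completeness(values, expected_count):
    """Percentage of expected data points that are present and non-null."""
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    present = sum(1 for value in values if value is not None)
    return round(100 * present / expected_count, 1)
```

Passing the expected count explicitly, rather than using `len(values)`, means rows the API never returned still count as missing.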

2. Validity — Are values within expected ranges?

Life expectancy: 25-95 years ✅
GDP: $1M - $10T ✅
Inflation: -30% to 10,000% (yes, hyperinflation happens) ✅

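One way to express these range checks is a lookup table of bounds plus a filter. The bounds below come from the ranges listed above; the dictionary keys and function names are hypothetical, and treating null values as "nothing to validate" is an assumption.

```python
# Bounds taken from the ranges above; the keys are hypothetical names.
VALID_RANGES = {
    "life_expectancy": (25.0, 95.0),     # years
    "gdp_usd": (1e6, 1e13),              # $1M to $10T
    "inflation_pct": (-30.0, 10_000.0),  # percent
}


def out_of_range(indicator, values):
    """Return the non-null values that fall outside the expected range."""
    low, high = VALID_RANGES[indicator]
    return [v for v in values if v is not None and not (low <= v <= high)]


def validity(indicator, values):
    """Percentage of non-null values inside the range (100.0 if none to check)."""
    checked = [v for v in values if v is not None]
    if not checked:
        return 100.0
    bad = len(out_of_range(indicator, checked))
    return round(100 * (len(checked) - bad) / len(checked), 1)
```

Returning the offending values, not just a score, makes it easier to tell a genuine outlier from a unit or parsing mistake.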

3. Freshness — How recent is the latest data?

GDP: 2024 ✅
Literacy: 2021 ⚠️ (surveys are infrequent)

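A freshness check can be sketched as "find the latest year with a value, then compare the lag to a threshold". The function names and the two-year threshold are assumptions chosen so that the examples above come out the same way (2024 passes, 2021 gets a warning when the reference year is 2024).

```python
def latest_year(values_by_year):
    """Most recent year with a non-null value, or None if there is no data."""
    years = [year for year, value in values_by_year.items() if value is not None]
    return max(years) if years else None


def freshness_status(values_by_year, reference_year, max_lag_years=2):
    """Classify an indicator as 'fresh', 'stale', or 'missing' by its lag."""
    latest = latest_year(values_by_year)
    if latest is None:
        return "missing"
    return "fresh" if reference_year - latest <= max_lag_years else "stale"
```

Survey-based indicators may deserve a more generous threshold than annual national accounts data, so `max_lag_years` could reasonably vary per indicator.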

The final score: 95.8/100, with completeness pulling it down slightly because literacy data is sparse (expected for survey-based indicators).

Interactive Dashboard

The dashboard is a static site (HTML + Tailwind CSS + Chart.js + Leaflet.js) that loads pre-exported JSON files:

Features:

  • 🗺️ Choropleth map — click any African country, toggle between indicators
  • 📈 Country comparison — compare up to 6 countries over 25 years
  • 🏆 Rankings table — sortable by any indicator
  • 🌙 Dark mode — full theme support
  • 📱 Responsive — works on mobile

The dashboard reads four JSON files exported by the pipeline:

  • country_profiles.json — all data per country (897KB)
  • rankings.json — pre-sorted rankings per indicator
  • summary_stats.json — aggregate statistics
  • quality_report.json — transparency on data quality
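
As an illustration of the export step, here is a minimal sketch of how a pre-sorted rankings file could be produced. The input shape, the output record fields, and both function names are assumptions; the actual files may be structured differently.

```python
import json
from pathlib import Path


def build_rankings(latest_values):
    """Sort countries by value (descending) per indicator, skipping nulls."""
    rankings = {}
    for indicator, by_country in latest_values.items():
        ranked = sorted(
            ((c, v) for c, v in by_country.items() if v is not None),
            key=lambda pair: pair[1],
            reverse=True,
        )
        rankings[indicator] = [
            {"rank": rank, "country": country, "value": value}
            for rank, (country, value) in enumerate(ranked, start=1)
        ]
    return rankings


def export_json(payload, path):
    """Write compact JSON so the static dashboard downloads less data."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, separators=(",", ":")), encoding="utf-8")
    return path
```

Sorting at ETL time means the browser only has to render the list, which fits the idea of a dashboard with no backend.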

Automated Daily Refresh

A GitHub Actions workflow runs the pipeline daily at 6 AM UTC:

name: Daily ETL Pipeline
on:
  schedule:
    - cron: '0 6 * * *'
  workflow_dispatch:

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt
      - run: python -m pipeline.main all
      - run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add dashboard/data/
          git diff --cached --quiet || git commit -m "chore: update data"
          git push

Fresh data → committed JSON → Vercel auto-deploys. Zero manual intervention.

Key Takeaways

  1. Free APIs are underrated — The World Bank API has incredible depth. No auth, no rate limits worth worrying about, and 25+ years of history.

  2. DuckDB is a game-changer for small-to-medium analytical workloads. Zero setup, a single file, and analytical queries over 13K+ rows finish in milliseconds.

  3. Data quality isn't optional — Even with a trusted source like the World Bank, you'll find missing data, sparse indicators, and surprises. Build quality checks into the pipeline, not as an afterthought.

  4. Static dashboards scale — By pre-computing JSON at ETL time, the dashboard is just a static site. No backend, no database connection, no server costs. Deploy to Vercel for free.

  5. Star schemas still matter — Even in a world of data lakes and denormalized tables, dimensional modeling makes your data queryable and understandable.

Try It Yourself

The entire project is open source:

git clone https://github.com/hajirufai/afridata-pipeline.git
cd afridata-pipeline
pip install -r requirements.txt
python -m pipeline.main all
cd dashboard && python -m http.server 8080

Data engineering doesn't have to be about massive Spark clusters and cloud bills. Sometimes the best projects start with a free API and a clear question.


What economic indicators would you add? Drop a comment below!