Recently, I completed my first full Data Engineering project: building an end-to-end ETL pipeline using real-world Australian weather data spanning 10 years.
The dataset contained over 145,000 rows, and the goal of the project was to understand how modern data systems ingest, process, and validate data, and how they orchestrate the workflows around it.
Rather than focusing only on completing the project quickly, I wanted to understand the engineering decisions happening at each stage of the pipeline.
Project Overview
The pipeline was divided into four major stages:
Extract
Transform
Load
Orchestration
The pipeline reads raw weather data from CSV files and prepares it for downstream analytics in Google BigQuery.
Extract Phase
The extraction layer focused on:
reading raw CSV files,
validating ingestion,
handling inconsistent records,
and detecting missing values early in the pipeline.
This stage helped me understand why ingestion reliability is important in real-world data workflows.
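As an illustration of the extraction checks listed above, here is a minimal sketch in Pandas. It is not the project's actual code: the inline sample data, the column names (Date, Location, MinTemp, MaxTemp, Rainfall), and the helper names extract and missing_report are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the raw weather CSV; the real file
# and its column names are not shown in this post.
RAW_CSV = """Date,Location,MinTemp,MaxTemp,Rainfall
2015-01-01,Sydney,18.4,27.9,0.0
2015-01-02,Sydney,NA,31.2,
2015-01-03,Perth,16.1,NA,2.4
"""

# Assumed schema for the example.
EXPECTED_COLUMNS = ["Date", "Location", "MinTemp", "MaxTemp", "Rainfall"]


def extract(source) -> pd.DataFrame:
    """Read the CSV and fail fast if columns are missing or no rows arrive."""
    df = pd.read_csv(source, na_values=["NA"])
    missing_cols = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing columns: {sorted(missing_cols)}")
    if df.empty:
        raise ValueError("No rows were ingested")
    return df


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have some."""
    counts = df.isna().sum()
    return counts[counts > 0]


df = extract(io.StringIO(RAW_CSV))
report = missing_report(df)
print(f"Rows ingested: {len(df)}")
print(report)
```

Raising an error at this point stops the run before any bad data reaches the transform step, which is the main reason to check early.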
Transform Phase
The transformation stage introduced much more engineering complexity than I initially expected.
I worked on:
handling null values,
converting inconsistent data types,
restructuring records,
and performing feature engineering.
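One common way to handle nulls and inconsistent types is sketched below. This post does not say which strategy the project used, so the choices here are assumptions: numeric text is coerced with pd.to_numeric, numeric gaps are filled with the column median, and categorical gaps with the most frequent value. The column names Rainfall and WindDir are hypothetical.

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce types and fill missing values (median / most frequent value)."""
    out = df.copy()
    # Coerce text to numbers; unparseable entries become NaN instead of raising.
    out["Rainfall"] = pd.to_numeric(out["Rainfall"], errors="coerce")
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric columns: fill gaps with the column median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Other columns: fill gaps with the most frequent value.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


raw = pd.DataFrame({
    "Rainfall": ["0.0", "2.4", "n/a", "1.2"],
    "WindDir": ["N", None, "N", "SE"],
})
cleaned = clean(raw)
print(cleaned)
print("Missing values left:", int(cleaned.isna().sum().sum()))
```

Median and mode imputation are simple defaults. Depending on the analysis, dropping rows or filling per location may be a better fit.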
Some engineered features included:
temp_range
is_hot_day
season classification
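The three features above could be derived roughly as follows. This is a sketch under stated assumptions, not the project's code: the input columns Date, MinTemp, and MaxTemp, the 30 °C hot-day cutoff, and the output column name season are all assumed. The season mapping uses Southern Hemisphere meteorological seasons, since the data is Australian.

```python
import pandas as pd

HOT_DAY_THRESHOLD_C = 30.0  # assumed cutoff, not taken from the project


def season_for_month(month: int) -> str:
    """Map a calendar month to a Southern Hemisphere meteorological season."""
    if month in (12, 1, 2):
        return "Summer"
    if month in (3, 4, 5):
        return "Autumn"
    if month in (6, 7, 8):
        return "Winter"
    return "Spring"


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with temp_range, is_hot_day, and season added."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out["temp_range"] = out["MaxTemp"] - out["MinTemp"]
    out["is_hot_day"] = out["MaxTemp"] >= HOT_DAY_THRESHOLD_C
    out["season"] = out["Date"].dt.month.map(season_for_month)
    return out


sample = pd.DataFrame({
    "Date": ["2015-01-15", "2015-07-15"],
    "MinTemp": [20.0, 8.0],
    "MaxTemp": [34.5, 16.0],
})
features = add_features(sample)
print(features[["temp_range", "is_hot_day", "season"]])
```

Note that January maps to Summer and July to Winter, the reverse of a Northern Hemisphere dataset.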
The transformed dataset was then converted from CSV to Parquet format.
Result:
13.44 MB → 2.35 MB
(82.5% storage reduction)
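A conversion like this can be sketched with Pandas and PyArrow. The function names and file paths are hypothetical, and the snappy codec is an assumption (it is the Pandas default for Parquet). The last line checks the reduction figure reported above.

```python
import os

import pandas as pd


def reduction_pct(size_before: float, size_after: float) -> float:
    """Percentage by which size_after is smaller than size_before."""
    return (1 - size_after / size_before) * 100


def csv_to_parquet(csv_path: str, parquet_path: str) -> float:
    """Convert a CSV file to Parquet and return the size reduction in percent."""
    df = pd.read_csv(csv_path)
    # engine="pyarrow" matches the PyArrow entry in the tech stack;
    # the snappy codec is an assumption for this sketch.
    df.to_parquet(parquet_path, engine="pyarrow", compression="snappy", index=False)
    return reduction_pct(os.path.getsize(csv_path), os.path.getsize(parquet_path))


# Checking the reported figures (sizes in MB).
print(round(reduction_pct(13.44, 2.35), 1))  # 82.5
```

Parquet stores data by column and compresses each column separately, which is why repetitive weather columns shrink so much compared with row-oriented CSV text.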
This phase made me appreciate how important schema consistency and data quality are in ETL systems.
Load Phase
After transformation, the processed data was loaded into Google BigQuery.
I also implemented:
row-count validation,
null-value checks,
and integrity verification after loading.
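The post-load checks can be sketched as a comparison between what was sent and what arrived. In the real pipeline the loaded side would come from a query against the BigQuery table. Here a second DataFrame stands in for it so the example runs without credentials. The function name validate_load and the sample columns are assumptions.

```python
import pandas as pd


def validate_load(source: pd.DataFrame, loaded: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means the load passed."""
    problems = []
    if len(source) != len(loaded):
        problems.append(
            f"row count mismatch: {len(source)} sent, {len(loaded)} loaded"
        )
    if list(source.columns) != list(loaded.columns):
        problems.append("column mismatch between source and loaded table")
    null_total = int(loaded.isna().sum().sum())
    if null_total:
        problems.append(f"{null_total} null values found after load")
    return problems


source = pd.DataFrame({
    "Location": ["Sydney", "Perth", "Darwin"],
    "MaxTemp": [27.9, 31.2, 33.0],
})

# Simulate a faulty load: one row lost and one value nulled out.
bad = source.iloc[:2].copy()
bad.loc[0, "MaxTemp"] = float("nan")

print(validate_load(source, source.copy()))  # []
print(validate_load(source, bad))
```

Returning a list of problems instead of raising on the first one makes it easier to log everything that went wrong in a single run.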
This stage introduced me to the importance of downstream reliability and validation in Data Engineering systems.
Orchestration with Apache Airflow
The entire workflow was orchestrated using Apache Airflow running inside Docker containers.
The DAG included:
scheduled execution,
retry logic,
logging,
and task dependency management.
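A DAG with these four properties might look like the sketch below. It assumes Airflow 2.4 or later in the 2.x line (where the schedule argument and the airflow.operators.python import path apply). The dag_id, task names, daily schedule, start date, and retry settings are all hypothetical, and the task bodies are placeholders, not the project's code.

```python
import logging
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

log = logging.getLogger(__name__)


# Placeholder task bodies; messages logged here appear in the Airflow task logs.
def extract():
    log.info("Reading raw CSV")


def transform():
    log.info("Cleaning data and engineering features")


def load():
    log.info("Loading processed data into BigQuery")


# Retry logic applied to every task in the DAG (assumed values).
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="weather_etl",             # hypothetical name
    start_date=datetime(2024, 1, 1),  # assumed start date
    schedule="@daily",                # scheduled execution (assumed interval)
    catchup=False,                    # do not backfill past runs
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Task dependencies: each step runs only after the previous one succeeds.
    extract_task >> transform_task >> load_task
```

With this layout, a failed transform is retried up to twice before the run is marked as failed, and load never starts on bad input.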
This was one of the most interesting parts of the project because it made the pipeline feel much closer to a production-style workflow.
Project Statistics
✅ 145,460 rows processed
✅ 343,248 missing values handled
✅ 0 missing values after transformation
✅ All Airflow tasks completed successfully
Tech Stack
Python
Pandas
PyArrow
Google BigQuery
Apache Airflow
Docker
GitHub Codespaces
Key Learnings
This project taught me that Data Engineering is not just about moving data from one system to another.
It also involves:
reliability,
validation,
orchestration,
scalability,
and ensuring downstream systems can trust the data they receive.
To document the learning journey in more depth, I published the project on multiple platforms, each covering a different perspective on the ETL pipeline:
🔹 Hashnode: a technical deep dive into the ETL architecture, orchestration flow, and system design decisions
🔹 Medium: reflections on approaching Data Engineering projects through smaller engineering exercises and incremental learning
Building the project end-to-end gave me a much deeper understanding of how ETL workflows evolve in real-world systems.
GitHub Repository: ETL Pipeline