Training GPT-2 From Scratch on a GTX1050 | Towards AI
malrsapps · 2026-06-13 · via Towards AI

Author(s): malrsapps

Originally published on Towards AI.

This project started with curiosity rather than a clear plan.

I wanted to understand how language models were actually trained and whether it was possible to build and train one myself on local hardware.

At first, I considered building a small coding assistant. Later, I shifted toward a Turkish tax-focused model. Along the way, I found myself digging into datasets, tokenizers, training loops, GPU limitations, and model architecture.

What began as a simple learning project eventually turned into training GPT-2 from scratch on a GTX1050.

One of the local training runs running at full utilization on a GTX1050-class GPU with only 4GB VRAM.

At the time, I only had access to a GTX1050-class GPU, limited VRAM, and a local Linux setup. I knew the hardware was far from ideal, but I wanted to understand how much experimentation was still possible under constrained conditions.

Long-running experiments also introduced practical concerns such as heat, system stability, and power interruption risks during multi-day training runs.

What started as a small experiment eventually evolved into tokenizer experiments, chained training workflows, runtime recovery systems, schedule-driven orchestration, and a full low-resource experimentation toolkit.

Most importantly, it changed the way I think about training workflows on limited hardware.

The Original Goal Was Completely Different

My first idea was not related to Turkish language models at all.

Initially, I wanted to train a coding-oriented model locally. I started downloading large open-source code datasets — some of them tens of gigabytes in size — and attempted to run training experiments directly on my local machine.

That failed almost immediately.

The datasets were too large, memory usage became unstable, and the overall workflow simply did not fit the hardware constraints I was working with.

At that stage, I realized the first bottleneck was not the model itself.

The real bottleneck was building a workflow that could survive on limited hardware.

After those failed coding-model experiments, I changed direction.

Instead of a coding model, I decided to experiment with a Turkish tax-oriented assistant. Since I work in accounting and tax-related fields, I already had structured domain knowledge and legal text sources available.

I prepared an initial dataset using Turkish tax law documents and attempted a small fine-tuning experiment on GPT-2-style models.

The results were terrible.

The problem quickly became obvious:

The base model did not actually understand Turkish well enough.

At first, I assumed I simply needed a better pretrained Turkish model. I searched Hugging Face for Turkish GPT-style models, downloaded several alternatives, and tried multiple fine-tuning runs.

The outputs were still weak.

Some models generated broken Turkish, some produced unstable outputs, and some simply failed to generalize in useful ways.

That period taught me an important lesson:

Not every published pretrained model is necessarily well-trained or practically usable.

At that point, the project direction changed again.

Instead of trying to fine-tune weak Turkish models, I decided to focus on a different problem entirely:

First, I needed to build a GPT-style model that could actually learn Turkish reasonably well on constrained hardware.

Discovering Wikipedia Training

Once I realized the main issue was not fine-tuning but the lack of a strong Turkish base model, my focus shifted completely.

The new goal became much simpler in theory:

Train a small GPT-style model that could at least learn Turkish properly.

Since GPT-2-sized architectures were the only realistic option for my hardware, I started exploring how small-scale Turkish pretraining could work locally.

That search eventually led me to Wikipedia datasets.

At the time, Wikipedia looked manageable compared to the massive coding datasets I had attempted earlier. But even then, almost every stage introduced new problems.

Tokenizer training became one of the first major bottlenecks.

I experimented with training tokenizers directly on Turkish Wikipedia text, but repeatedly ran into issues related to vocabulary size, preprocessing stability, and memory constraints. I also discovered that many failures happened before actual model training even started.

In practice, preparing the data pipeline turned out to be harder than launching the training itself.

At various stages, I experimented with streaming-based approaches and alternative loading strategies to avoid memory problems. Some workflows partially worked, others failed unpredictably, and some became too slow to manage comfortably on my hardware.

The deeper I went, the more I realized that low-resource experimentation was fundamentally a workflow engineering problem.

The model architecture itself was only one part of the system.

Why the Workflow Became Chained

One of the biggest turning points came when I started splitting the datasets into smaller parts.

Initially, I experimented with a small number of dataset chunks — around six parts — hoping sequential training could reduce runtime pressure and memory instability.

The idea worked better than expected.

Instead of treating training as one uninterrupted process, I could continue training progressively across multiple dataset parts.

That realization slowly evolved into a chained training workflow.

However, new problems appeared immediately.

Some training parts took an extremely long time to finish. In several experiments, a single dataset part could run for dozens of hours. On my hardware, that created practical problems beyond machine learning itself.

A training session running continuously for two or three days introduced risks such as:

  • unexpected crashes
  • overheating
  • power interruptions
  • unstable runtime behavior
  • inability to use the machine normally during training

At that point, the workflow design started changing again.

I increased the number of dataset parts repeatedly:

  • first 6
  • then 12
  • eventually around 120 smaller parts

Counterintuitively, increasing the number of parts actually made experimentation more manageable.

Smaller parts reduced runtime risk, improved recovery speed, and made debugging significantly easier.
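The splitting step can be sketched roughly as follows. This is a minimal illustration rather than the toolkit's actual code: the function name `split_into_parts` and the `part_001.txt` naming scheme are assumptions, and the corpus is assumed to be a line-oriented UTF-8 text file.

```python
from pathlib import Path


def split_into_parts(corpus_path, out_dir, num_parts):
    """Split a line-oriented text corpus into roughly equal parts.

    Hypothetical helper: counts lines in one pass, then streams them
    into part files so the whole corpus never sits in memory.
    """
    corpus_path, out_dir = Path(corpus_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # First pass: count lines without loading the file.
    with corpus_path.open(encoding="utf-8") as f:
        total = sum(1 for _ in f)
    lines_per_part = max(1, -(-total // num_parts))  # ceiling division

    paths = []
    out = None
    with corpus_path.open(encoding="utf-8") as f:
        for i, line in enumerate(f):
            # Start a new part file every lines_per_part lines.
            if i % lines_per_part == 0:
                if out is not None:
                    out.close()
                path = out_dir / f"part_{len(paths) + 1:03d}.txt"
                paths.append(path)
                out = path.open("w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return paths
```

Raising `num_parts` shortens each run, so a crash or power cut costs one small part instead of days of training.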

The project gradually shifted from:

“Train a model.”

to:

“Build a system capable of surviving long-running local experiments.”

That distinction ended up shaping the entire toolkit architecture.

The Birth of the Training Loop

An early chained training session showing tokenizer loading, dataset preparation, and the beginning of a local GPT-style training run.

As the number of dataset parts increased, manually managing training sessions became increasingly impractical.

At first, I was launching training runs individually and monitoring them manually. But once the workflow expanded toward dozens — and eventually more than a hundred dataset parts — that approach completely stopped scaling.

I needed a system capable of continuing training automatically while still allowing me to inspect intermediate results when necessary.

That requirement eventually led to the first version of the loop-based training system.

The original idea was relatively simple:

  • specify how many parts to train
  • continue sequentially
  • stop automatically
  • evaluate later

This became especially important because many training sessions were running overnight. Smaller dataset parts reduced risk, but they also created a new operational problem:

The system now needed to coordinate large numbers of sequential experiments reliably.

Over time, the loop system evolved far beyond a simple automation script.

I started adding:

  • runtime logging
  • state tracking
  • summary files
  • continuation logic
  • retry handling

The goal was no longer just training.

The goal became maintaining long-running experimental continuity.

Several runtime files gradually emerged from that process:

  • schedule.json
  • process.json
  • summary.json
  • state.json

Each file solved a different operational problem.

Between them, these files tracked:

  • completed parts
  • retries
  • runtime state
  • checkpoint progression
  • training summaries
  • loop continuation
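A resumable loop of this kind might look like the sketch below. It is an illustration under assumptions, not the toolkit's implementation: the layout of `state.json` (the `completed_parts` and `retries` keys), the `run_loop` signature, and the `train_part` callable are all hypothetical.

```python
import json
from pathlib import Path


def load_state(state_path):
    """Return saved loop state, or a fresh one if no file exists yet."""
    path = Path(state_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"completed_parts": [], "retries": {}}


def save_state(state_path, state):
    """Write state to a temp file first, then swap it into place."""
    path = Path(state_path)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)


def run_loop(parts, train_part, state_path, max_parts, max_retries=2):
    """Train up to max_parts not-yet-completed parts, then stop."""
    state = load_state(state_path)
    done_this_run = 0
    for part in parts:
        if done_this_run >= max_parts:
            break
        if part in state["completed_parts"]:
            continue  # finished in an earlier session
        while True:
            try:
                train_part(part)
            except Exception:
                # Record the failure, then retry or give up.
                count = state["retries"].get(part, 0) + 1
                state["retries"][part] = count
                save_state(state_path, state)
                if count > max_retries:
                    raise
                continue
            break
        state["completed_parts"].append(part)
        save_state(state_path, state)
        done_this_run += 1
    return state
```

Because the state is saved after every part, restarting the script after a crash skips whatever had already finished.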

One particularly important improvement was introducing temporary runtime checkpoints.

Instead of writing directly into permanent checkpoints during training, the workflow first used temporary working directories. Once training completed successfully, the results could then be promoted into stable checkpoints.

This reduced the risk of corrupted progression after crashes, interruptions, or unstable runs.
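The promote-on-success pattern can be sketched as below. The function name, the `.previous` backup suffix, and the `train_fn` callable are assumptions made for illustration; the real toolkit may organize its directories differently.

```python
import shutil
from pathlib import Path


def train_with_promotion(train_fn, stable_dir, work_dir):
    """Run train_fn against a scratch directory, promote only on success.

    train_fn(work_dir) is a hypothetical callable that writes checkpoint
    files into work_dir and raises on failure.
    """
    stable_dir, work_dir = Path(stable_dir), Path(work_dir)
    if work_dir.exists():
        shutil.rmtree(work_dir)  # discard leftovers from a crashed run
    work_dir.mkdir(parents=True)

    train_fn(work_dir)  # if this raises, stable_dir is left untouched

    backup = stable_dir.with_name(stable_dir.name + ".previous")
    if backup.exists():
        shutil.rmtree(backup)
    if stable_dir.exists():
        stable_dir.rename(backup)  # keep the last good checkpoint
    work_dir.rename(stable_dir)
    return stable_dir
```

A run that dies halfway leaves only a stale scratch directory behind, which the next run deletes before starting.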

At that point, the project had clearly evolved beyond a simple training script.

It was slowly becoming an orchestration system designed specifically for constrained long-running local experiments.

One completed chained training part with runtime summaries, checkpoint validation, and loop-state reporting enabled.

Dataset Ordering Experiments

Once the chained workflow became more stable, I started focusing more heavily on dataset structure itself.

One of the most interesting questions became:

Does dataset order affect learning quality?

At the time, I was experimenting with three primary dataset categories:

  • dialogue-style text
  • dictionary-style structured language
  • Wikipedia articles

My initial intuition was heavily influenced by human learning patterns.

I wondered whether a GPT-style model might learn more naturally if the datasets followed a progression similar to language acquisition:

  • first conversational dialogue
  • then structured vocabulary
  • then broader encyclopedic information

That idea sounded intuitively correct.

So I reorganized the workflow accordingly.

The chained system eventually evolved into a multi-stage schedule-driven structure where different dataset groups could be trained sequentially using predefined progression rules.

This experimentation phase was one of the reasons the `schedule.json` system became increasingly important.

The workflow was no longer managing only dataset parts.

It now needed to coordinate:

  • dataset categories
  • training stages
  • checkpoint transitions
  • runtime continuity
  • evaluation flow
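To make the idea concrete, here is a hypothetical schedule layout and a helper that expands it into an ordered job list. The field names (`stages`, `name`, `parts`) and the part counts are invented for illustration; the actual `schedule.json` format may differ.

```python
import json

# Hypothetical schedule layout; the real schedule.json may differ.
SCHEDULE_JSON = """
{
  "stages": [
    {"name": "dialogue",   "parts": 1},
    {"name": "dictionary", "parts": 2},
    {"name": "wikipedia",  "parts": 3}
  ]
}
"""


def expand_schedule(schedule):
    """Flatten stages into the ordered list of (stage, part_index) jobs."""
    jobs = []
    for stage in schedule["stages"]:
        for idx in range(1, stage["parts"] + 1):
            jobs.append((stage["name"], idx))
    return jobs


jobs = expand_schedule(json.loads(SCHEDULE_JSON))
```

With a layout like this, trying a different dataset order only means reordering the `stages` list; the training loop itself stays the same.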

However, the results were not what I expected.

Despite the pedagogically appealing structure, some of the generated outputs became noticeably worse. In several experiments, the models struggled to learn Turkish fluently during the early stages.

That forced another major rethink.

After further experimentation and research, I gradually reached a different conclusion:

The earliest training stages appeared disproportionately important.

As a result, I reorganized the dataset ordering again.

Wikipedia moved back to the beginning of the pipeline, while smaller specialized datasets became later-stage refinement layers instead.

That restructuring improved the overall language quality significantly compared to several earlier experiments.

Realizing Data Structure Matters

As training quality gradually improved, I became increasingly interested in understanding why certain experiments performed better than others.

At first, most of my attention had been focused on:

  • model settings
  • learning rates
  • schedulers
  • gradient accumulation
  • checkpoint continuity

But over time, I realized the structure of the dataset itself was just as important as the training configuration.

One of the first questions I started asking was surprisingly simple:

Was the model actually receiving enough contextual structure during training?

Initially, many preprocessing stages only extracted raw Wikipedia article text. Titles were often discarded during earlier dataset preparation experiments.

Later, I realized this may have unintentionally weakened contextual learning.

Without article titles, the model was effectively seeing disconnected blocks of text without strong topic anchors. I started experimenting with preserving titles alongside article content so the model could learn stronger contextual associations.

Another major realization involved token boundaries.

At first, many dataset lines had inconsistent lengths and incomplete structures. Some samples ended mid-sentence, others cut paragraphs unpredictably, and many training segments were not aligned cleanly with the tokenizer structure.

This raised an important concern:

The model might be learning fragmented language patterns instead of coherent textual structure.

That observation eventually led to tokenizer-aware chunking experiments.

Instead of splitting data arbitrarily, I began restructuring datasets around token-safe windows designed to preserve cleaner contextual continuity.

This significantly changed the preprocessing pipeline.

The workflow evolved toward:

  • tokenizer-aware splitting
  • structured title + content formatting
  • cleaner paragraph continuity
  • more stable contextual segmentation
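One way to approximate tokenizer-aware splitting is to pack whole sentences into chunks that fit a token budget and to repeat the article title at the top of each chunk. The sketch below assumes a `count_tokens` callable that wraps whatever tokenizer is in use; the function name and the simple punctuation-based sentence splitter are illustrative choices, not the pipeline described above.

```python
import re


def chunk_article(title, text, count_tokens, max_tokens):
    """Pack whole sentences into chunks that fit a token budget.

    count_tokens is any callable returning the token count of a string;
    in practice it would wrap the trained tokenizer. A single sentence
    longer than the budget still becomes its own (oversized) chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    header = f"{title}\n"  # keep the title as a topic anchor
    chunks, current = [], []
    for sentence in sentences:
        candidate = header + " ".join(current + [sentence])
        if current and count_tokens(candidate) > max_tokens:
            # Budget exceeded: close the chunk at a sentence boundary.
            chunks.append(header + " ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        chunks.append(header + " ".join(current))
    return chunks
```

Since chunks only ever end at sentence boundaries, no training sample stops mid-sentence, and every sample carries its title.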

At the same time, I also started investigating the linguistic quality of the datasets themselves.

Questions emerged such as:

  • how much non-Turkish content remained?
  • how many foreign character patterns existed?
  • how consistent was the normalization process?
  • how should Turkish-specific ambiguities be handled?

Some normalization decisions became surprisingly nuanced.

For example, Turkish contains distinctions such as:

kar

kâr

which are often typed identically in everyday keyboard usage despite having different meanings depending on context.

At that stage, I intentionally avoided over-engineering certain linguistic corrections. My assumption was that some contextual distinctions might be handled later during fine-tuning rather than fully enforced during preprocessing.

The deeper I went into preprocessing and restructuring, the clearer one thing became:

Training quality was not only about the model architecture.

Data organization, contextual continuity, and preprocessing strategy were shaping the behavior of the system just as heavily as the training loop itself.

The First Truly Promising Outputs

One of the later checkpoints producing fluent Turkish text with strong structural consistency, but unreliable factual reasoning.

After multiple redesigns of the preprocessing pipeline, tokenizer workflow, dataset ordering, and training settings, the outputs finally started improving in noticeable ways.

Even relatively early into some experiments, the model began producing recognizable Turkish sentence structures.

Over time:

  • paragraph structure improved
  • contextual continuation improved
  • grammatical consistency improved
  • topic association became more stable

For the first time, the outputs started feeling less random and more language-like.

The models were no longer generating completely broken text.

They were beginning to form coherent Turkish paragraphs.

That moment was extremely important psychologically because many earlier experiments had failed before reaching that stage.

At the same time, major limitations were still obvious.

The models could imitate style and structure surprisingly well, but factual reliability remained weak.

For example:

  • historical questions often produced historically themed answers
  • poetry-related prompts generated poetic structures
  • topic alignment improved significantly

But the factual correctness was often unreliable.

The models sounded plausible far more often than they were actually correct.

This became especially visible during larger evaluation runs.

As training continued through dozens of chained dataset parts, I started building batch-style inference and evaluation workflows to compare checkpoints against one another.

Different checkpoints often exhibited noticeably different strengths:

  • some generated cleaner Turkish
  • some handled paragraph continuation better
  • some retained more structured formatting
  • some overfit more aggressively toward recent dataset stages
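A side-by-side comparison of this kind can be sketched in a few lines. The `generators` mapping is a hypothetical interface: each value stands in for a loaded checkpoint wrapped in a function that takes a prompt and returns text.

```python
def compare_checkpoints(generators, prompts):
    """Run every prompt through every checkpoint and collect the outputs.

    generators maps a checkpoint name to a callable prompt -> text
    (hypothetical; in practice each would wrap a loaded model).
    """
    rows = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for name, generate in generators.items():
            row[name] = generate(prompt)
        rows.append(row)
    return rows
```

Each row holds one prompt and the answer from every checkpoint, which makes it easy to dump the result to a file and read the outputs next to each other.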

At one stage, I also realized that later dataset parts sometimes disproportionately influenced final outputs.

That observation further reinforced the importance of:

  • dataset structure
  • ordering
  • continuity
  • evaluation methodology

The training process was gradually becoming less about “running epochs” and more about understanding how workflow decisions shaped emergent behavior across long experimental chains.

Realizing the Limits of Small Models

As the experiments progressed, one conclusion became increasingly difficult to ignore:

Language fluency and factual reliability were not the same problem.

By the later training stages, several checkpoints could already generate reasonably structured Turkish paragraphs. In some cases, the outputs even appeared surprisingly convincing at first glance.

But deeper evaluation revealed major weaknesses.

The models frequently produced:

  • hallucinated facts
  • incorrect historical information
  • unstable numerical reasoning
  • fabricated relationships between concepts

The system could imitate language patterns far more easily than it could preserve factual accuracy.

At first, I assumed additional training might solve most of these problems.

However, after comparing many checkpoints across long chained runs, another pattern became obvious:

The issue was not simply insufficient training duration.

The issue was also dataset scale.

Compared to modern large-scale pretraining systems, my datasets were extremely small. Even after extensive preprocessing and restructuring, the corpus remained tiny relative to the scale used in major language model training efforts.

That realization changed the direction of the project once again.

Instead of trying to force a single small GPT-style model to be all of the following at once:

  • linguistically fluent
  • factually reliable
  • mathematically accurate

I started thinking more about layered systems.

This gradually led me toward ideas involving:

  • retrieval systems
  • classifier models
  • orchestration layers
  • deterministic calculation components
  • specialized fine-tuned submodels

The project was slowly evolving from:

“Train one good GPT model.”

into something closer to:

“Build cooperating systems around constrained local models.”

Tax Fine-Tuning Experiments

The model could sometimes follow the correct tax-related reasoning pattern while still producing incorrect arithmetic results. This became one of the motivations behind exploring deterministic calculation layers and orchestration systems.

Once the Turkish language quality reached a more stable level, I returned to the original motivation behind the project:

Tax-oriented language experimentation.

Instead of attempting to prepare every legal document immediately, I started with a smaller and more controlled subset focused on corporate tax law.

The dataset preparation process became significantly more structured during this stage.

The legal documents were transformed into multiple dataset formats containing combinations of:

  • article numbers
  • section titles
  • chapter information
  • structured legal text
  • contextual metadata

At the same time, I realized another practical limitation:

The legal corpus itself was far too small to support meaningful behavior on its own.

To compensate, I began generating additional structured datasets.

This eventually expanded into:

  • approximately 2,000 tax-oriented question-answer pairs
  • arithmetic-oriented synthetic examples
  • structured accounting-style reasoning samples
  • manually prepared dialogue-style examples

I also experimented heavily with different formatting structures.

The same legal content was reorganized into multiple dataset variants using different:

  • input/output structures
  • contextual arrangements
  • instructional phrasing patterns

Another important issue involved context window limitations.

Some legal articles exceeded the practical token limits of the training setup. To preserve continuity while avoiding abrupt truncation, I started experimenting with sliding-window style segmentation approaches.

The goal was to maintain contextual overlap while still respecting tokenizer boundaries and sequence constraints.
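A minimal sliding-window sketch over already tokenized text is shown below. The function name and the `window` and `overlap` parameters are illustrative; the values used in the actual experiments are not stated here.

```python
def sliding_windows(token_ids, window, overlap):
    """Split a token sequence into overlapping fixed-size windows."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    if not token_ids:
        return []
    stride = window - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window reached the end of the sequence
        start += stride
    return windows
```

Each window repeats the last `overlap` tokens of the previous one, so a legal article that exceeds the sequence length is never cut without shared context on both sides of the cut.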

At this point, the project had clearly evolved far beyond simple Wikipedia pretraining.

The workflow now included:

  • layered datasets
  • tokenizer-aware preprocessing
  • chained orchestration
  • structured fine-tuning
  • evaluation pipelines
  • runtime recovery systems
  • domain-specific experimentation

What originally started as a local curiosity project was gradually becoming a much broader low-resource experimentation framework.

Fine-Tuning Results and New Architectural Questions

The fine-tuning experiments produced mixed but extremely informative results.

Some outputs became noticeably more relevant to tax-related prompts, especially when the questions closely resembled the structured examples used during training.

In certain cases, the models could generate partially correct explanations or produce answers that contained fragments of accurate legal reasoning.

However, the limitations remained obvious.

The models often generated responses that sounded convincing while still containing factual inaccuracies or fabricated details.

At times, the outputs were approximately correct in structure but unreliable in substance.

This became especially concerning because tax and legal domains are highly sensitive to hallucinations.

A confidently incorrect answer in those areas can easily become more dangerous than a clearly failed response.

During evaluation, another pattern also became increasingly visible:

The most successful outputs often came from:

  • heavily structured question-answer pairs
  • arithmetic-style examples
  • highly formatted datasets

In contrast, raw legal text alone produced much weaker behavior than I initially expected.

That observation reinforced an important lesson:

Small models benefit enormously from structure.

The experiments also revealed another major weakness:

Mathematics and deterministic reasoning remained unstable.

Even when the language generation improved significantly, arithmetic consistency was still unreliable.

After researching this further, I realized this limitation was not unique to my experiments. Even much larger language models often struggle with exact mathematical consistency without external systems or tool-assisted reasoning layers.

That realization shifted my thinking once again.

Instead of trying to force one model to solve every problem alone, I began exploring the idea of cooperative architectures.

I started imagining systems where:

  • one model classifies intent
  • another model handles generation
  • deterministic calculators handle arithmetic
  • retrieval systems provide factual grounding
  • orchestration layers coordinate the workflow
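The routing idea can be illustrated with a toy orchestrator. Everything here is a stand-in: a regular expression plays the role of the intent classifier, `try_calculate` only understands "X% of Y" style questions, and `generate` and `retrieve` are hypothetical callables that would wrap a generator model and a retrieval layer.

```python
import re
from decimal import Decimal

_NUMBER = r"\d+(?:\.\d+)?"


def try_calculate(question):
    """Return rate% of base as a Decimal, or None if not parseable."""
    rate_match = re.search(rf"({_NUMBER})\s*%", question)
    if rate_match is None:
        return None
    # Look for the base amount in the text outside the percentage.
    rest = question[:rate_match.start()] + question[rate_match.end():]
    base_match = re.search(_NUMBER, rest)
    if base_match is None:
        return None
    rate = Decimal(rate_match.group(1))
    base = Decimal(base_match.group(0))
    return base * rate / Decimal(100)


def answer(question, generate, retrieve):
    """Use the deterministic calculator when possible, else the model."""
    result = try_calculate(question)
    if result is not None:
        return str(result)  # arithmetic never touches the language model
    passages = retrieve(question)  # grounding text for the generator
    prompt = "\n".join(list(passages) + [question])
    return generate(prompt)
```

The design choice is that exact arithmetic goes through `Decimal` instead of the model, while open questions reach the generator with retrieved context placed in front of them.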

At one stage, I even considered combining:

  • a Turkish BERT-style classifier

with:

  • a GPT-style generator model

The idea was that a classifier-oriented model could first analyze and structure the request before routing it into a generative model more effectively.

At the same time, retrieval-augmented approaches became increasingly appealing.

Instead of expecting a tiny GPT-style model to memorize large legal corpora perfectly, a retrieval layer could supply grounded contextual information dynamically during inference.

This represented another major conceptual transition in the project:

The goal was no longer building a single intelligent model.

The goal was designing systems capable of combining:

  • language generation
  • structured retrieval
  • deterministic logic
  • orchestration
  • specialized submodels

within constrained local environments.

The First Public Release

Eventually, I decided to prepare a smaller public-facing version of the project.

At that stage, my objective was no longer maximizing general intelligence or building a perfectly accurate legal assistant.

Instead, I wanted to demonstrate:

  • the workflow
  • the experimentation process
  • the engineering evolution behind the system

One of the final experiments involved intentionally overfitting a smaller fine-tuned model in a controlled way.

Rather than forcing the model to answer everything, the idea was to encourage behavior where:

  • known patterns could be answered more reliably
  • unsupported questions would ideally fail more conservatively

Surprisingly, this produced some of the most stable behavior observed during the project.

The resulting models were still highly limited, but they demonstrated an important point:

Even constrained local systems could become useful within carefully designed boundaries.

That realization eventually led to the first public repository release.

The public version focused on:

  • workflow visibility
  • tokenizer experimentation
  • chained training concepts
  • local experimentation
  • small-scale GPT-style demonstrations

rather than presenting itself as a production-grade AI system.

In many ways, the most valuable outcome of the project was not the final model itself.

It was understanding how hardware constraints, preprocessing decisions, orchestration design, and dataset structure collectively reshape the entire engineering process around local AI experimentation.

This project was never intended to compete with large-scale commercial systems.

The primary goal was understanding what constrained local experimentation could realistically achieve.

Public demo repository:

GitHub: malrsapps/home-gpu-gpt-training-toolkit (github.com). Tiny GPT-style training and local inference demo for consumer hardware.

Published via Towards AI