


























Originally published on Towards AI.
This project started with curiosity rather than a clear plan.
I wanted to understand how language models were actually trained and whether it was possible to build and train one myself on local hardware.
At first, I considered building a small coding assistant. Later, I shifted toward a Turkish tax-focused model. Along the way, I found myself digging into datasets, tokenizers, training loops, GPU limitations, and model architecture.
What began as a simple learning project eventually turned into training GPT-2 from scratch on a GTX 1050.
At the time, I only had access to a GTX 1050-class GPU, limited VRAM, and a local Linux setup. I knew the hardware was far from ideal, but I wanted to understand how much experimentation was still possible under constrained conditions.
Long-running experiments also introduced practical concerns such as heat, system stability, and power interruption risks during multi-day training runs.
What started as a small experiment eventually evolved into tokenizer experiments, chained training workflows, runtime recovery systems, schedule-driven orchestration, and a full low-resource experimentation toolkit.
Most importantly, it changed the way I think about training workflows on limited hardware.
My first idea was not related to Turkish language models at all.
Initially, I wanted to train a coding-oriented model locally. I started downloading large open-source code datasets, some of them tens of gigabytes in size, and attempted to run training experiments directly on my local machine.
That failed almost immediately.
The datasets were too large, memory usage became unstable, and the overall workflow simply did not fit the hardware constraints I was working with.
At that stage, I realized the first bottleneck was not the model itself.
The real bottleneck was building a workflow that could survive on limited hardware.
After those failed coding-model experiments, I changed direction.
Instead of a coding model, I decided to experiment with a Turkish tax-oriented assistant. Since I work in accounting and tax-related fields, I already had structured domain knowledge and legal text sources available.
I prepared an initial dataset using Turkish tax law documents and attempted a small fine-tuning experiment on GPT-2-style models.
The results were terrible.
The problem quickly became obvious:
The base model did not actually understand Turkish well enough.
At first, I assumed I simply needed a better pretrained Turkish model. I searched Hugging Face for Turkish GPT-style models, downloaded several alternatives, and tried multiple fine-tuning runs.
The outputs were still weak.
Some models generated broken Turkish, some produced unstable outputs, and some simply failed to generalize in useful ways.
That period taught me an important lesson:
Not every published pretrained model is necessarily well-trained or practically usable.
At that point, the project direction changed again.
Instead of trying to fine-tune weak Turkish models, I decided to focus on a different problem entirely:
First, I needed to build a GPT-style model that could actually learn Turkish reasonably well on constrained hardware.
Once I realized the main issue was not fine-tuning but the lack of a strong Turkish base model, my focus shifted completely.
The new goal became much simpler in theory:
Train a small GPT-style model that could at least learn Turkish properly.
Since GPT-2-sized architectures were the only realistic option for my hardware, I started exploring how small-scale Turkish pretraining could work locally.
That search eventually led me to Wikipedia datasets.
At the time, Wikipedia looked manageable compared to the massive coding datasets I had attempted earlier. But even then, almost every stage introduced new problems.
Tokenizer training became one of the first major bottlenecks.
I experimented with training tokenizers directly on Turkish Wikipedia text, but repeatedly ran into issues related to vocabulary size, preprocessing stability, and memory constraints. I also discovered that many failures happened before actual model training even started.
In practice, preparing the data pipeline turned out to be harder than launching the training itself.
At various stages, I experimented with streaming-based approaches and alternative loading strategies to avoid memory problems. Some workflows partially worked, others failed unpredictably, and some became too slow to manage comfortably on my hardware.
The deeper I went, the more I realized that low-resource experimentation was fundamentally a workflow engineering problem.
The model architecture itself was only one part of the system.
One of the biggest turning points came when I started splitting the datasets into smaller parts.
Initially, I experimented with a small number of dataset chunks (around six parts), hoping sequential training could reduce runtime pressure and memory instability.
The idea worked better than expected.
Instead of treating training as one uninterrupted process, I could continue training progressively across multiple dataset parts.
That realization slowly evolved into a chained training workflow.
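As a rough illustration of the splitting step (not the toolkit's actual code), the sketch below cuts a line-oriented corpus into numbered part files. The function name `split_into_parts`, the `part_0001.txt` naming scheme, and the idea of splitting by line count are all assumptions made for this example.

```python
from pathlib import Path


def split_into_parts(source: Path, out_dir: Path, lines_per_part: int) -> list[Path]:
    """Split a line-oriented text corpus into numbered part files.

    The source is read line by line, so memory use stays flat
    regardless of how large the corpus is.
    """
    if lines_per_part <= 0:
        raise ValueError("lines_per_part must be positive")
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    writer = None
    with source.open("r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            if index % lines_per_part == 0:
                # Time to start a new part file.
                if writer is not None:
                    writer.close()
                # Hypothetical naming: part_0001.txt, part_0002.txt, ...
                part_path = out_dir / f"part_{len(parts) + 1:04d}.txt"
                parts.append(part_path)
                writer = part_path.open("w", encoding="utf-8")
            writer.write(line)
    if writer is not None:
        writer.close()
    return parts
```

Each returned path can then be handed to one training session, so a crash costs at most one part's worth of work.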
However, new problems appeared immediately.
Some training parts took an extremely long time to finish. In several experiments, a single dataset part could run for dozens of hours. On my hardware, that created practical problems beyond machine learning itself.
A training session running continuously for two or three days carried the risks mentioned earlier: overheating, system instability, and power interruptions.
At that point, the workflow design started changing again.
I increased the number of dataset parts repeatedly:
Counterintuitively, increasing the number of parts actually made experimentation more manageable.
Smaller parts reduced runtime risk, improved recovery speed, and made debugging significantly easier.
The project gradually shifted from:
“Train a model.”
to:
“Build a system capable of surviving long-running local experiments.”
That distinction ended up shaping the entire toolkit architecture.
As the number of dataset parts increased, manually managing training sessions became increasingly impractical.
At first, I was launching training runs individually and monitoring them manually. But once the workflow expanded toward dozens of dataset parts, and eventually more than a hundred, that approach stopped scaling.
I needed a system capable of continuing training automatically while still allowing me to inspect intermediate results when necessary.
That requirement eventually led to the first version of the loop-based training system.
The original idea was relatively simple:
This became especially important because many training sessions were running overnight. Smaller dataset parts reduced risk, but they also created a new operational problem:
The system now needed to coordinate large numbers of sequential experiments reliably.
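A minimal sketch of such a loop might look like the following. It is an illustration under stated assumptions rather than the actual loop system: `run_chain`, the `{"done": [...]}` state layout, and the `train_part` callback are hypothetical names standing in for whatever launches one training session.

```python
import json
from pathlib import Path
from typing import Callable


def run_chain(parts: list[str], state_file: Path,
              train_part: Callable[[str], None]) -> list[str]:
    """Train on each part in order, skipping parts already recorded as done."""
    done: list[str] = []
    if state_file.exists():
        done = json.loads(state_file.read_text(encoding="utf-8"))["done"]
    for part in parts:
        if part in done:
            continue  # finished in an earlier session
        train_part(part)  # may raise; the state file then still lists only good parts
        done.append(part)
        # Write to a temporary file and rename it, so a crash
        # never leaves a half-written state file behind.
        tmp = state_file.with_suffix(".tmp")
        tmp.write_text(json.dumps({"done": done}), encoding="utf-8")
        tmp.replace(state_file)
    return done
```

Restarting the same call after an interruption resumes from the first part that is not yet recorded, which is the property an overnight run needs.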
Over time, the loop system evolved far beyond a simple automation script.
I started adding:
The goal was no longer just training.
The goal became maintaining long-running experimental continuity.
Several runtime files gradually emerged from that process:
Each file solved a different operational problem.
Some tracked:
One particularly important improvement was introducing temporary runtime checkpoints.
Instead of writing directly into permanent checkpoints during training, the workflow first used temporary working directories. Once training completed successfully, the results could then be promoted into stable checkpoints.
This reduced the risk of corrupted progression after crashes, interruptions, or unstable runs.
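The promotion idea can be sketched roughly as follows. This is an assumed implementation for illustration only: `train_with_promotion`, the `tmp_ckpt_` prefix, and the `.old` suffix are hypothetical, and `train` stands in for the real training step.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def train_with_promotion(stable_dir: Path, train: Callable[[Path], None]) -> None:
    """Run `train` in a scratch directory and promote it only on success."""
    stable_dir.parent.mkdir(parents=True, exist_ok=True)
    # The scratch directory sits next to the stable one so that the
    # final rename stays on the same filesystem.
    work_dir = Path(tempfile.mkdtemp(prefix="tmp_ckpt_", dir=stable_dir.parent))
    try:
        if stable_dir.exists():
            # Start from a copy of the last good checkpoint.
            shutil.copytree(stable_dir, work_dir, dirs_exist_ok=True)
        train(work_dir)  # writes new weights into work_dir; may raise
    except BaseException:
        shutil.rmtree(work_dir, ignore_errors=True)  # stable_dir is untouched
        raise
    # Promotion: move the old checkpoint aside, then rename the new one in.
    old_dir = stable_dir.with_name(stable_dir.name + ".old")
    shutil.rmtree(old_dir, ignore_errors=True)
    if stable_dir.exists():
        stable_dir.rename(old_dir)
    work_dir.rename(stable_dir)
    shutil.rmtree(old_dir, ignore_errors=True)
```

One caveat of this sketch: between the two renames there is a brief moment with no directory at the stable path, although the previous checkpoint still exists under the `.old` name.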
At that point, the project had clearly evolved beyond a simple training script.
It was slowly becoming an orchestration system designed specifically for constrained long-running local experiments.
Once the chained workflow became more stable, I started focusing more heavily on dataset structure itself.
One of the most interesting questions became:
Does dataset order affect learning quality?
At the time, I was experimenting with three primary dataset categories:
My initial intuition was heavily influenced by human learning patterns.
I wondered whether a GPT-style model might learn more naturally if the datasets followed a progression similar to language acquisition:
That idea sounded intuitively correct.
So I reorganized the workflow accordingly.
The chained system eventually evolved into a multi-stage schedule-driven structure where different dataset groups could be trained sequentially using predefined progression rules.
This experimentation phase was one of the reasons the `schedule.json` system became increasingly important.
The workflow was no longer managing only dataset parts.
It now needed to coordinate:
However, the results were not what I expected.
Despite the pedagogically appealing structure, some of the generated outputs became noticeably worse. In several experiments, the models struggled to learn Turkish fluently during the early stages.
That forced another major rethink.
After further experimentation and research, I gradually reached a different conclusion:
The earliest training stages appeared disproportionately important.
As a result, I reorganized the dataset ordering again.
Wikipedia moved back to the beginning of the pipeline, while smaller specialized datasets became later-stage refinement layers instead.
That restructuring improved the overall language quality significantly compared to several earlier experiments.
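To make the ordering concrete, here is a hypothetical schedule with Wikipedia first and a smaller specialized stage afterwards, plus a helper that flattens it into training steps. The real `schedule.json` format is not shown in this article, so the field names (`stages`, `name`, `parts`, `epochs`) and the numbers are assumptions.

```python
import json

# Hypothetical layout; the actual schedule.json format may differ.
SCHEDULE_JSON = """
{
  "stages": [
    {"name": "wikipedia", "parts": 3, "epochs": 1},
    {"name": "specialized", "parts": 2, "epochs": 2}
  ]
}
"""


def expand_schedule(schedule: dict) -> list[tuple[str, int, int]]:
    """Flatten stages into an ordered list of (stage, part, epoch) steps."""
    steps = []
    for stage in schedule["stages"]:
        for part in range(1, stage["parts"] + 1):
            for epoch in range(1, stage["epochs"] + 1):
                steps.append((stage["name"], part, epoch))
    return steps


if __name__ == "__main__":
    for step in expand_schedule(json.loads(SCHEDULE_JSON)):
        print(step)
```

Reordering the stages then becomes a one-line change in the JSON rather than a change to the training code.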
As training quality gradually improved, I became increasingly interested in understanding why certain experiments performed better than others.
At first, most of my attention had been focused on:
But over time, I realized the structure of the dataset itself was just as important as the training configuration.
One of the first questions I started asking was surprisingly simple:
Was the model actually receiving enough contextual structure during training?
Initially, many preprocessing stages only extracted raw Wikipedia article text. Titles were often discarded during earlier dataset preparation experiments.
Later, I realized this may have unintentionally weakened contextual learning.
Without article titles, the model was effectively seeing disconnected blocks of text without strong topic anchors. I started experimenting with preserving titles alongside article content so the model could learn stronger contextual associations.
Another major realization involved token boundaries.
At first, many dataset lines had inconsistent lengths and incomplete structures. Some samples ended mid-sentence, others cut paragraphs unpredictably, and many training segments were not aligned cleanly with the tokenizer structure.
This raised an important concern:
The model might be learning fragmented language patterns instead of coherent textual structure.
That observation eventually led to tokenizer-aware chunking experiments.
Instead of splitting data arbitrarily, I began restructuring datasets around token-safe windows designed to preserve cleaner contextual continuity.
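One way to picture a token-safe window is to pack whole sentences greedily until a token budget is reached. The sketch below assumes a naive punctuation-based sentence splitter and, by default, a whitespace word count as a stand-in for the real tokenizer; `pack_sentences` and `count_tokens` are hypothetical names.

```python
import re
from typing import Callable


def pack_sentences(text: str, max_tokens: int,
                   count_tokens: Callable[[str], int] = lambda s: len(s.split())) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens tokens."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_len + n > max_tokens:
            # Adding this sentence would overflow, so close the chunk first.
            chunks.append(" ".join(current))
            current, current_len = [], 0
        # A single sentence longer than max_tokens becomes its own chunk.
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a subword tokenizer, the token count of a joined chunk can differ slightly from the sum of its sentences, so a real pipeline would likely leave some headroom below the model's sequence length.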
This significantly changed the preprocessing pipeline.
The workflow evolved toward:
At the same time, I also started investigating the linguistic quality of the datasets themselves.
Questions emerged such as:
Some normalization decisions became surprisingly nuanced.
For example, Turkish distinguishes between words such as kar (snow) and kâr (profit), which are often typed identically in everyday keyboard usage despite having different meanings depending on context.
At that stage, I intentionally avoided over-engineering certain linguistic corrections. My assumption was that some contextual distinctions might be handled later during fine-tuning rather than fully enforced during preprocessing.
The deeper I went into preprocessing and restructuring, the clearer one thing became:
Training quality was not only about the model architecture.
Data organization, contextual continuity, and preprocessing strategy were shaping the behavior of the system just as heavily as the training loop itself.
After multiple redesigns of the preprocessing pipeline, tokenizer workflow, dataset ordering, and training settings, the outputs finally started improving in noticeable ways.
Even relatively early into some experiments, the model began producing recognizable Turkish sentence structures.
Over time:
For the first time, the outputs started feeling less random and more language-like.
The models were no longer generating completely broken text.
They were beginning to form coherent Turkish paragraphs.
That moment was extremely important psychologically because many earlier experiments had failed before reaching that stage.
At the same time, major limitations were still obvious.
The models could imitate style and structure surprisingly well, but factual reliability remained weak.
For example:
But the factual correctness was often unreliable.
The models sounded plausible far more often than they were actually correct.
This became especially visible during larger evaluation runs.
As training continued through dozens of chained dataset parts, I started building batch-style inference and evaluation workflows to compare checkpoints against one another.
Different checkpoints often exhibited noticeably different strengths:
At one stage, I also realized that later dataset parts sometimes disproportionately influenced final outputs.
That observation further reinforced the importance of:
The training process was gradually becoming less about “running epochs” and more about understanding how workflow decisions shaped emergent behavior across long experimental chains.
As the experiments progressed, one conclusion became increasingly difficult to ignore:
Language fluency and factual reliability were not the same problem.
By the later training stages, several checkpoints could already generate reasonably structured Turkish paragraphs. In some cases, the outputs even appeared surprisingly convincing at first glance.
But deeper evaluation revealed major weaknesses.
The models frequently produced:
The system could imitate language patterns far more easily than it could preserve factual accuracy.
At first, I assumed additional training might solve most of these problems.
However, after comparing many checkpoints across long chained runs, another pattern became obvious:
The issue was not simply insufficient training duration.
The issue was also dataset scale.
Compared to modern large-scale pretraining systems, my datasets were extremely small. Even after extensive preprocessing and restructuring, the corpus remained tiny relative to the scale used in major language model training efforts.
That realization changed the direction of the project once again.
Instead of trying to force a single small GPT-style model to become both:
I started thinking more about layered systems.
This gradually led me toward ideas involving:
The project was slowly evolving from:
“Train one good GPT model.”
into something closer to:
“Build cooperating systems around constrained local models.”
Once the Turkish language quality reached a more stable level, I returned to the original motivation behind the project:
Tax-oriented language experimentation.
Instead of attempting to prepare every legal document immediately, I started with a smaller and more controlled subset focused on corporate tax law.
The dataset preparation process became significantly more structured during this stage.
The legal documents were transformed into multiple dataset formats containing combinations of:
At the same time, I realized another practical limitation:
The legal corpus itself was far too small to support meaningful behavior on its own.
To compensate, I began generating additional structured datasets.
This eventually expanded into:
I also experimented heavily with different formatting structures.
The same legal content was reorganized into multiple dataset variants using different:
Another important issue involved context window limitations.
Some legal articles exceeded the practical token limits of the training setup. To preserve continuity while avoiding abrupt truncation, I started experimenting with sliding-window style segmentation approaches.
The goal was to maintain contextual overlap while still respecting tokenizer boundaries and sequence constraints.
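A sliding-window split over already tokenized text can be sketched like this. The function name and the `window` and `overlap` parameters are assumptions for illustration; the token IDs would come from whichever tokenizer the training setup uses.

```python
def sliding_windows(token_ids: list[int], window: int, overlap: int) -> list[list[int]]:
    """Cut a token sequence into windows of `window` tokens sharing `overlap` tokens."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if not token_ids:
        return []
    stride = window - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window reached the end of the sequence
        start += stride
    return windows


if __name__ == "__main__":
    # Ten tokens, windows of four, two tokens shared between neighbours.
    for w in sliding_windows(list(range(10)), window=4, overlap=2):
        print(w)
```

A larger overlap preserves more context across segment boundaries at the cost of repeating more tokens during training.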
At this point, the project had clearly evolved far beyond simple Wikipedia pretraining.
The workflow now included:
What originally started as a local curiosity project was gradually becoming a much broader low-resource experimentation framework.
The fine-tuning experiments produced mixed but extremely informative results.
Some outputs became noticeably more relevant to tax-related prompts, especially when the questions closely resembled the structured examples used during training.
In certain cases, the models could generate partially correct explanations or produce answers that contained fragments of accurate legal reasoning.
However, the limitations remained obvious.
The models often generated responses that sounded convincing while still containing factual inaccuracies or fabricated details.
At times, the outputs were approximately correct in structure but unreliable in substance.
This became especially concerning because tax and legal domains are highly sensitive to hallucinations.
A confidently incorrect answer in those areas can easily become more dangerous than a clearly failed response.
During evaluation, another pattern also became increasingly visible:
The most successful outputs often came from:
In contrast, raw legal text alone produced much weaker behavior than I initially expected.
That observation reinforced an important lesson:
Small models benefit enormously from structure.
The experiments also revealed another major weakness:
Mathematics and deterministic reasoning remained unstable.
Even when the language generation improved significantly, arithmetic consistency was still unreliable.
After researching this further, I realized this limitation was not unique to my experiments. Even much larger language models often struggle with exact mathematical consistency without external systems or tool-assisted reasoning layers.
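As a small illustration of what a tool-assisted layer could mean in practice, the sketch below evaluates plain arithmetic deterministically instead of asking a model to generate the result. It is a hypothetical helper, not part of the project: it walks Python's `ast` tree and accepts only numbers and the four basic operators.

```python
import ast
import operator

# Only these binary operators are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str) -> float:
    """Evaluate +, -, *, / arithmetic without calling eval()."""
    def walk(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        # Names, calls, attribute access and so on are rejected.
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)
```

In a layered setup, the language model would only need to produce or recognize the expression, and the exact result would come from code like this.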
That realization shifted my thinking once again.
Instead of trying to force one model to solve every problem alone, I began exploring the idea of cooperative architectures.
I started imagining systems where:
At one stage, I even considered combining:
with:
The idea was that a classifier-oriented model could first analyze and structure the request before routing it into a generative model more effectively.
At the same time, retrieval-augmented approaches became increasingly appealing.
Instead of expecting a tiny GPT-style model to memorize large legal corpora perfectly, a retrieval layer could supply grounded contextual information dynamically during inference.
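A toy version of that retrieval step is sketched below. It ranks passages by simple word overlap, which is only a crude stand-in for a real retriever, and the names `retrieve` and `build_prompt`, the prompt layout, and the placeholder passages are all assumptions made for this example.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {w.strip(".,;:!?()\"'").lower() for w in text.split()} - {""}


def retrieve(query: str, passages: list[str], top_k: int = 1) -> list[str]:
    """Return up to top_k passages that share the most words with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(p)), p) for p in passages]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [p for score, p in ranked[:top_k] if score > 0]


def build_prompt(query: str, passages: list[str]) -> str:
    """Prepend retrieved passages so the generator can answer from supplied text."""
    context = "\n".join(retrieve(query, passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    # Placeholder passages, not real legal text.
    docs = [
        "Passage about corporate tax returns and filing.",
        "Passage about invoices and bookkeeping records.",
    ]
    print(build_prompt("When are corporate tax returns due?", docs))
```

For Turkish text, plain `str.lower()` does not handle the dotted and dotless I correctly, so a real implementation would need locale-aware normalization before matching.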
This represented another major conceptual transition in the project:
The goal was no longer building a single intelligent model.
The goal was designing systems capable of combining:
within constrained local environments.
Eventually, I decided to prepare a smaller public-facing version of the project.
At that stage, my objective was no longer maximizing general intelligence or building a perfectly accurate legal assistant.
Instead, I wanted to demonstrate:
One of the final experiments involved intentionally overfitting a smaller fine-tuned model in a controlled way.
Rather than forcing the model to answer everything, the idea was to encourage behavior where:
Surprisingly, this produced some of the most stable behavior observed during the project.
The resulting models were still highly limited, but they demonstrated an important point:
Even constrained local systems could become useful within carefully designed boundaries.
That realization eventually led to the first public repository release.
The public version focused on:
rather than presenting itself as a production-grade AI system.
In many ways, the most valuable outcome of the project was not the final model itself.
It was understanding how hardware constraints, preprocessing decisions, orchestration design, and dataset structure collectively reshape the entire engineering process around local AI experimentation.
This project was never intended to compete with large-scale commercial systems.
The primary goal was understanding what constrained local experimentation could realistically achieve.




