Spam detection datasets are surprisingly bad once you move outside English.
Most public datasets are:
- tiny,
- outdated,
- English-only,
- SMS-only,
- or missing real-world spam patterns.
Meanwhile, actual spam today is multilingual, code-mixed, obfuscated, and platform-adaptive.
So I built SpamShield Datasets — a multilingual spam detection corpus designed for real-world NLP systems.
It currently contains 149,359 messages across 23 languages, with support for both binary spam detection and category-level classification.
- Dataset: SpamShield Datasets
Why I Built This
I was experimenting with multilingual moderation systems and quickly realized something:
Most spam datasets completely fail at:
- Hinglish/code-mixed text
- Unicode obfuscation
- multilingual phishing
- scam-style promotions
- adversarial spam formatting
Real spam does not look clean.
People intentionally distort words using:
- leetspeak
- invisible Unicode characters
- mixed scripts
- emoji stuffing
- transliterated language
- fake urgency patterns
And almost no open dataset covered this properly.
So I started collecting, cleaning, normalizing, and structuring multilingual spam corpora into a single unified dataset.
That eventually became SpamShield Datasets.
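To make the obfuscation tricks listed above concrete, here is a minimal sketch of a normalizer that undoes a few of them before matching or feature extraction. It is an illustration only and not the pipeline used to build the dataset; the `INVISIBLE` set, the `LEET` map, and the `normalize` function are hypothetical names, and real spam uses far more substitutions than this covers.

```python
import unicodedata

# Invisible code points commonly used to split keywords (illustrative, not exhaustive).
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff", "\u00ad"}

# Hypothetical leetspeak map; extend it for your own traffic.
LEET = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize(text: str) -> str:
    """Return a canonical form of `text` for matching purposes."""
    # NFKC folds compatibility forms (fullwidth letters, styled digits) to plain ones.
    text = unicodedata.normalize("NFKC", text)
    # Drop zero-width and other invisible characters.
    text = "".join(ch for ch in text if ch not in INVISIBLE)
    # Lowercase, then reverse the leetspeak substitutions.
    return text.lower().translate(LEET)
```

Because the leetspeak map also rewrites genuine digits (phone numbers, prices), a form like this is better kept as an extra feature next to the raw text than used as a replacement for it.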
Dataset Overview
The dataset currently contains:
| Metric | Value |
|---|---|
| Total Messages | 149,359 |
| Ham Messages | 72,439 |
| Spam Messages | 76,920 |
| Languages | 23 |
| Formats | JSONL + Parquet |
| License | CC-BY-4.0 |
The schema is intentionally simple:
```json
{
  "text": "Congratulations! You've won a free iPhone.",
  "label": 1,
  "category": "spam"
}
```
Where:
- `label = 0` → ham
- `label = 1` → spam
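Since the schema is this small, it is cheap to validate every line before training. The sketch below assumes the three fields shown above with the types implied by the example; `REQUIRED` and `parse_record` are hypothetical names, not utilities shipped with the dataset.

```python
import json

# Field names and types taken from the example record above.
REQUIRED = {"text": str, "label": int, "category": str}


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check it against the three-field schema."""
    record = json.loads(line)
    for field, expected in REQUIRED.items():
        value = record.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"bad or missing field: {field}")
    if record["label"] not in (0, 1):
        raise ValueError("label must be 0 (ham) or 1 (spam)")
    return record
```

Failing loudly on a malformed line is a deliberate choice here: silently skipping records makes label imbalance harder to notice later.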
Supported Languages
SpamShield currently includes:
- Arabic
- Bengali
- Chinese
- Dutch
- English
- French
- German
- Hinglish
- Indonesian
- Italian
- Japanese
- Javanese
- Korean
- Marathi
- Norwegian
- Portuguese
- Punjabi
- Russian
- Spanish
- Swedish
- Turkish
- Ukrainian
- Urdu
I specifically wanted the dataset to include:
- low-resource languages,
- mixed-script content,
- and code-mixed communication styles.
Because that is how people actually communicate online.
How the Dataset Is Structured
The dataset repository contains:
- `README.md`
- language-wise JSONL files
- `combined.parquet`
- filtering scripts
- metadata and processing utilities
I provided two formats intentionally.
1. JSONL Files
Each language has its own JSONL file.
This is useful when:
- training language-specific models,
- debugging,
- or performing dataset analysis.
Example:
```json
{
  "text": "Free recharge available now!",
  "label": 1,
  "category": "marketing"
}
```
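A per-language file can be streamed line by line with the standard library alone. This is a sketch under the assumption that each file holds one JSON object per line with the fields shown above; `iter_jsonl` and `label_counts` are hypothetical helper names, and the actual file names in the repository may differ from whatever you pass in.

```python
import json
from collections import Counter
from pathlib import Path


def iter_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def label_counts(path):
    """Count ham (0) and spam (1) records in a single language file."""
    return Counter(record["label"] for record in iter_jsonl(path))
```

Calling `label_counts` on one language file gives a quick check of its ham/spam balance before you commit to a language-specific model.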
2. Combined Parquet File
The repository also includes:
combined.parquet
This is the recommended format for large-scale training.
Why Parquet?
Because:
- it loads faster,
- uses less storage,
- supports columnar access,
- and works extremely well with ML pipelines.
Especially when training multilingual transformers.
Synthetic Augmentation
One thing I want to mention honestly:
About 20% of the dataset is synthetically augmented.
I used techniques like:
- paraphrasing,
- translation,
- back-translation,
- Unicode variation,
- and leetspeak mutation.
Why?
Because modern spam constantly mutates itself.
If you only train on perfectly clean spam examples, your model performs badly against real-world adversarial spam.
The goal was robustness — not just benchmark accuracy.
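As a rough illustration of what leetspeak mutation and Unicode variation can look like, here is a small seeded mutator. It is a sketch only, not the augmentation code behind the dataset; the `LEET` map, the `rate` parameter, and the choice of a zero-width space are assumptions made for the example.

```python
import random

# Hypothetical substitution table for the example.
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}
ZERO_WIDTH_SPACE = "\u200b"


def mutate(text, rate=0.3, seed=None):
    """Randomly apply leetspeak swaps and insert zero-width spaces."""
    rng = random.Random(seed)  # a seed makes the augmentation reproducible
    out = []
    for ch in text:
        # Swap a letter for its leetspeak digit with probability `rate`.
        if ch.lower() in LEET and rng.random() < rate:
            out.append(LEET[ch.lower()])
        else:
            out.append(ch)
        # Less often, hide an invisible character after a letter.
        if ch.isalpha() and rng.random() < rate / 3:
            out.append(ZERO_WIDTH_SPACE)
    return "".join(out)
```

Mutated copies should keep the label of the original message, and it helps to record which rows are synthetic so they can be excluded from evaluation splits.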
Spam Categories
Instead of only binary labels, I also included category-level labels like:
- phishing
- scam
- crypto
- marketing
- giveaway
- promo
- adult
- job_scam
This makes the dataset useful for:
- moderation systems,
- risk scoring,
- scam-type classification,
- and advanced filtering pipelines.
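One way category labels could feed a risk score is to weight a classifier's spam probability by category. Everything numeric below is hypothetical: the weights in `RISK`, the default of 0.5, and the thresholds in `action` are placeholders chosen for illustration and should be tuned for each platform.

```python
# Hypothetical per-category weights; only the category names come from the dataset.
RISK = {
    "phishing": 1.0, "job_scam": 0.9, "scam": 0.9, "crypto": 0.7,
    "adult": 0.6, "giveaway": 0.5, "promo": 0.3, "marketing": 0.2,
}


def risk_score(category, spam_probability):
    """Scale a model's spam probability by how harmful the category is."""
    weight = RISK.get(category, 0.5)  # unknown categories get a neutral weight
    return round(weight * spam_probability, 3)


def action(score):
    """Map a risk score to a moderation decision (thresholds are placeholders)."""
    if score >= 0.7:
        return "block"
    if score >= 0.3:
        return "review"
    return "allow"
```

With these placeholder numbers, a confident phishing prediction is blocked outright, while an equally confident marketing prediction is allowed through.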
Loading the Dataset
Using the Parquet file is very straightforward.
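```python
import pandas as pd

# Reading Parquet requires a Parquet engine such as pyarrow to be installed.
df = pd.read_parquet("combined.parquet")

print(df.shape)                    # (rows, columns)
print(df["label"].value_counts())  # ham (0) vs spam (1) counts
```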
Filtering by language:
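```python
# Assumes combined.parquet carries a "language" column in addition to
# the three fields shown in the schema above.
english = df[df["language"] == "English"]
print(len(english))
```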
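Going one step further than a single-language filter, a per-language summary shows how balanced each language is. This sketch assumes the same `language` and `label` columns used in the snippets above; `language_summary` is a hypothetical helper name.

```python
import pandas as pd


def language_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Message count and spam share per language, largest languages first."""
    summary = df.groupby("language")["label"].agg(
        messages="count",   # number of rows for the language
        spam_share="mean",  # labels are 0/1, so the mean is the spam fraction
    )
    return summary.sort_values("messages", ascending=False)
```

Running `language_summary(df)` on the loaded frame makes thin or heavily skewed languages easy to spot before training.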
Challenges While Building It
The hardest parts were honestly:
- normalization,
- deduplication,
- and balancing quality across languages.
Spam text is messy.
Different datasets had:
- different schemas,
- different encodings,
- different label styles,
- and inconsistent formatting.
Some datasets had:
- only spam,
- broken Unicode,
- or duplicated messages thousands of times.
A lot of time went into cleaning and standardizing everything.
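Exact and near-exact duplicates can be removed by hashing a canonical form of each message. The following is a minimal sketch of that idea, not the cleaning code used for SpamShield; `fingerprint` and `deduplicate` are hypothetical names, and the canonical form (NFKC, case folding, collapsed whitespace) is one assumption among many possible ones.

```python
import hashlib
import unicodedata


def fingerprint(text):
    """Hash a canonical form so trivially different copies collide."""
    canonical = unicodedata.normalize("NFKC", text).casefold()
    canonical = " ".join(canonical.split())  # collapse runs of whitespace
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = fingerprint(record["text"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

This only catches copies that differ in case, spacing, or compatibility characters; paraphrased duplicates would need a fuzzy method such as MinHash, which is outside the scope of this sketch.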
Acknowledgments
SpamShield Datasets was built using multiple publicly available open-source spam and ham datasets from the NLP and cybersecurity community.
The original datasets were carefully:
- filtered,
- cleaned,
- normalized,
- deduplicated,
- reformatted,
- and curated into a unified multilingual structure.
Additional processing was done to improve consistency across languages, schemas, encodings, and labeling formats.
I would like to thank all researchers, dataset maintainers, and open-source contributors whose work made this project possible. Open datasets are one of the biggest reasons independent research and experimentation can still happen at scale.
This project mainly focuses on:
- multilingual unification,
- dataset curation,
- schema standardization,
- quality filtering,
- and robustness-oriented augmentation for real-world spam detection systems.
If you found this project useful, consider giving it a star. It genuinely helps support future updates and improvements.
Reference Links
- Dataset: SpamShield Datasets
- Dataset Card / README: View Documentation
- License: CC-BY-4.0
- Recommended File: `combined.parquet`
Final Thoughts
Spam detection is becoming much harder.
Modern spam is:
- multilingual,
- adaptive,
- adversarial,
- and increasingly AI-generated.
I wanted to create something that was actually useful for real-world NLP systems instead of another tiny benchmark dataset.
SpamShield Datasets is still evolving, but I hope it helps researchers and developers build stronger multilingual moderation systems.
If you want to experiment with multilingual spam detection, adversarial filtering, or moderation pipelines, feel free to check it out.
Support
Building and maintaining multilingual datasets takes a significant amount of time for:
- cleaning,
- balancing,
- validation,
- augmentation,
- and formatting.
If this dataset helped your project or research, consider starring or sharing it. That support genuinely motivates future development.
Thanks for reading.





















