I Built a Multilingual Spam Detection Dataset with 149K+ Messages Across 23 Languages
Arjun M · 2026-05-26 · via DEV Community

Spam detection datasets are surprisingly bad once you move outside English.

Most public datasets are:

  • tiny,
  • outdated,
  • English-only,
  • SMS-only,
  • or missing real-world spam patterns.

Meanwhile, actual spam today is multilingual, code-mixed, obfuscated, and platform-adaptive.

So I built SpamShield Datasets — a multilingual spam detection corpus designed for real-world NLP systems.

It currently contains 149,359 messages across 23 languages, with support for both binary spam detection and category-level classification.



Why I Built This

I was experimenting with multilingual moderation systems and quickly realized something:

Most spam datasets completely fail at:

  • Hinglish/code-mixed text
  • Unicode obfuscation
  • multilingual phishing
  • scam-style promotions
  • adversarial spam formatting

Real spam does not look clean.

People intentionally distort words using:

  • leetspeak
  • invisible Unicode characters
  • mixed scripts
  • emoji stuffing
  • transliterated language
  • fake urgency patterns

And almost no open dataset covered this properly.
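To make the distortion tricks listed above concrete, here is a minimal normalization sketch using only the Python standard library. It is an illustration, not the pipeline used to build the dataset: the `normalize` function, the `INVISIBLE` character set, and the small `LEET` table are all assumptions chosen for the example.

```python
import re
import unicodedata

# Zero-width and other invisible characters often used to split keywords.
# Caveat: U+200C/U+200D are meaningful in some scripts (e.g. Urdu, Bengali),
# so stripping them is lossy and only suitable for building a matching key.
INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff\u00ad]")

# Hypothetical, deliberately small leetspeak table; a real one would be larger.
LEET = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)


def normalize(text: str) -> str:
    """Return a canonical form of `text` for matching, not for display."""
    text = unicodedata.normalize("NFKC", text)  # fold fullwidth/styled letters
    text = INVISIBLE.sub("", text)              # drop invisible characters
    text = text.lower().translate(LEET)         # undo simple leetspeak swaps
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace


print(normalize("FR\u200bEE  1PH0NE"))  # free iphone
```

The leetspeak step also rewrites genuine digits (phone numbers, prices), so a form like this is better kept as an extra feature next to the raw text than used as a replacement for it.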

So I started collecting, cleaning, normalizing, and structuring multilingual spam corpora into a single unified dataset.

That eventually became SpamShield Datasets.


Dataset Overview

The dataset currently contains:

| Metric | Value |
| --- | --- |
| Total Messages | 149,359 |
| Ham Messages | 72,439 |
| Spam Messages | 76,920 |
| Languages | 23 |
| Formats | JSONL + Parquet |
| License | CC-BY-4.0 |

The schema is intentionally simple:

{
  "text": "Congratulations! You've won a free iPhone.",
  "label": 1,
  "category": "spam"
}


Where:

  • label = 0 → ham
  • label = 1 → spam
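A quick way to sanity-check records against this schema is sketched below. The `parse_record` helper and its error messages are hypothetical and not part of the repository; the field names and the 0/1 label convention come from the schema above.

```python
import json

# Field name -> expected JSON type, taken from the three-field schema.
REQUIRED = {"text": str, "label": int, "category": str}


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check it against the three-field schema."""
    record = json.loads(line)
    for field, expected in REQUIRED.items():
        value = record.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"{field!r} is missing or not {expected.__name__}")
    if record["label"] not in (0, 1):
        raise ValueError("label must be 0 (ham) or 1 (spam)")
    return record


record = parse_record('{"text": "Free recharge!", "label": 1, "category": "marketing"}')
print(record["label"], record["category"])  # 1 marketing
```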

Supported Languages

SpamShield currently includes:

  • Arabic
  • Bengali
  • Chinese
  • Dutch
  • English
  • French
  • German
  • Hinglish
  • Indonesian
  • Italian
  • Japanese
  • Javanese
  • Korean
  • Marathi
  • Norwegian
  • Portuguese
  • Punjabi
  • Russian
  • Spanish
  • Swedish
  • Turkish
  • Ukrainian
  • Urdu

I specifically wanted the dataset to include:

  • low-resource languages,
  • mixed-script content,
  • and code-mixed communication styles.

Because that is how people actually communicate online.


How the Dataset Is Structured

The dataset repository contains:

  • README.md
  • language-wise JSONL files
  • combined.parquet
  • filtering scripts
  • metadata and processing utilities

I provided two formats intentionally.

1. JSONL Files

Each language has its own JSONL file.

This is useful when:

  • training language-specific models,
  • debugging,
  • or performing dataset analysis.

Example:

{
  "text": "Free recharge available now!",
  "label": 1,
  "category": "marketing"
}
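Reading the per-language files could look roughly like the sketch below. The directory layout is an assumption: it supposes one `<Language>.jsonl` file per language in a single folder, which may not match the repository's actual file names, and `load_language_files` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_language_files(directory: str) -> list[dict]:
    """Read every *.jsonl file in `directory`, tagging rows with the file stem."""
    rows = []
    for path in sorted(Path(directory).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # tolerate blank lines between records
                row = json.loads(line)
                # Assumption: the file name (without extension) names the language.
                row["language"] = path.stem
                rows.append(row)
    return rows
```

Calling `load_language_files("data")` on a folder containing `English.jsonl` would return that file's records with `"language": "English"` added to each one.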



2. Combined Parquet File

The repository also includes:

combined.parquet


This is the recommended format for large-scale training.

Why Parquet?

Compared with the JSONL files, Parquet:

  • typically loads faster,
  • uses less storage,
  • supports columnar access,
  • and works well with ML pipelines,

especially when training multilingual transformers.


Synthetic Augmentation

One thing I want to be upfront about:

About 20% of the dataset is synthetically augmented.

I used techniques like:

  • paraphrasing,
  • translation,
  • back-translation,
  • Unicode variation,
  • and leetspeak mutation.
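As an illustration of the last two techniques, the sketch below applies random leetspeak swaps and zero-width insertions. It is not the augmentation code used for the dataset: the `mutate` function, its `rate` parameter, and the substitution table are assumptions made for the example.

```python
import random

# Hypothetical substitution table; not the one used to build the dataset.
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}
ZERO_WIDTH_SPACE = "\u200b"


def mutate(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Randomly apply leetspeak swaps and zero-width insertions to `text`."""
    rng = random.Random(seed)  # seeded so augmentation runs are reproducible
    out = []
    for char in text:
        # Swap a letter for its look-alike digit with probability `rate`.
        if char.lower() in LEET and rng.random() < rate:
            char = LEET[char.lower()]
        out.append(char)
        # Less often, hide an invisible character after a non-space character.
        if char != " " and rng.random() < rate / 3:
            out.append(ZERO_WIDTH_SPACE)
    return "".join(out)


print(mutate("free iphone for you", rate=0.5, seed=42))
```

Keeping the original message next to each mutated copy, with the same label, makes it easy to exclude synthetic rows when you want to evaluate on untouched data only.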

Why?

Because modern spam constantly mutates itself.

If you train only on clean spam examples, your model is likely to perform badly against real-world adversarial spam.

The goal was robustness — not just benchmark accuracy.


Spam Categories

Instead of only binary labels, I also included category-level labels like:

  • phishing
  • scam
  • crypto
  • marketing
  • giveaway
  • promo
  • adult
  • job_scam

This makes the dataset useful for:

  • moderation systems,
  • risk scoring,
  • scam-type classification,
  • and advanced filtering pipelines.

Loading the Dataset

Using the Parquet file is very straightforward.

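import pandas as pd

# read_parquet needs a Parquet engine such as pyarrow or fastparquet installed.
df = pd.read_parquet("combined.parquet")

print(df.shape)                    # (rows, columns)
print(df["label"].value_counts())  # ham (0) vs. spam (1) counts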


Filtering by language, using the language column of the combined file:

english = df[df["language"] == "English"]
print(len(english))
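Building on the same columns, a per-language ham/spam breakdown can be computed as sketched below. The `label_balance` helper is hypothetical, and the tiny `demo` frame is made-up data standing in for `combined.parquet`; the sketch assumes the `language` and `label` columns used in the snippets above.

```python
import pandas as pd


def label_balance(df: pd.DataFrame) -> pd.DataFrame:
    """Count ham (0) and spam (1) rows per language and add the spam share."""
    counts = pd.crosstab(df["language"], df["label"])
    # Make sure both label columns exist even if a language has only one class.
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    counts.columns = ["ham", "spam"]
    counts["spam_share"] = counts["spam"] / (counts["ham"] + counts["spam"])
    return counts


# Made-up rows for illustration only; not taken from the dataset.
demo = pd.DataFrame({
    "language": ["English", "English", "Urdu", "Urdu", "Urdu"],
    "label": [0, 1, 1, 1, 0],
})
print(label_balance(demo))
```

A table like this is a quick way to spot languages where one class dominates before training a model on them.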



Challenges While Building It

The hardest parts were honestly:

  • normalization,
  • deduplication,
  • and balancing quality across languages.

Spam text is messy.

Different datasets had:

  • different schemas,
  • different encodings,
  • different label styles,
  • and inconsistent formatting.

Some datasets had:

  • only spam,
  • broken Unicode,
  • or the same messages duplicated thousands of times.

A lot of time went into cleaning and standardizing everything.
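Two of those cleanup steps, mapping inconsistent label styles to 0/1 and dropping repeated messages, can be sketched as follows. This is a simplified illustration rather than the repository's filtering scripts: the `LABEL_MAP` spellings and the `to_label`, `dedup_key`, and `deduplicate` helpers are assumptions.

```python
import hashlib
import unicodedata

# Hypothetical label spellings that different source datasets might use.
LABEL_MAP = {"ham": 0, "0": 0, "legit": 0, "spam": 1, "1": 1}


def to_label(raw) -> int:
    """Map a source-specific label to 0 (ham) or 1 (spam)."""
    key = str(raw).strip().lower()
    if key not in LABEL_MAP:
        raise ValueError(f"unknown label: {raw!r}")
    return LABEL_MAP[key]


def dedup_key(text: str) -> str:
    """Hash a case- and whitespace-insensitive form of `text`."""
    canonical = " ".join(unicodedata.normalize("NFKC", text).casefold().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(rows):
    """Yield the first row seen for each distinct dedup key, preserving order."""
    seen = set()
    for row in rows:
        key = dedup_key(row["text"])
        if key not in seen:
            seen.add(key)
            yield row


rows = [{"text": "Free  Recharge"}, {"text": "free recharge"}, {"text": "hello"}]
print(len(list(deduplicate(rows))))  # 2
```

Exact-match keys like this only catch trivial repeats. Near-duplicates, such as the same template with a different phone number, would need a fuzzier method.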


Acknowledgments

SpamShield Datasets was built using multiple publicly available open-source spam and ham datasets from the NLP and cybersecurity community.

The original datasets were carefully:

  • filtered,
  • cleaned,
  • normalized,
  • deduplicated,
  • reformatted,
  • and curated into a unified multilingual structure.

Additional processing was done to improve consistency across languages, schemas, encodings, and labeling formats.

I would like to thank all researchers, dataset maintainers, and open-source contributors whose work made this project possible. Open datasets are one of the biggest reasons independent research and experimentation can still happen at scale.

This project mainly focuses on:

  • multilingual unification,
  • dataset curation,
  • schema standardization,
  • quality filtering,
  • and robustness-oriented augmentation for real-world spam detection systems.

If you found this project useful, consider giving it a star. It genuinely helps support future updates and improvements.




Final Thoughts

Spam detection is becoming much harder.

Modern spam is:

  • multilingual,
  • adaptive,
  • adversarial,
  • and increasingly AI-generated.

I wanted to create something that was actually useful for real-world NLP systems instead of another tiny benchmark dataset.

SpamShield Datasets is still evolving, but I hope it helps researchers and developers build stronger multilingual moderation systems.

If you want to experiment with multilingual spam detection, adversarial filtering, or moderation pipelines, feel free to check it out.


Support

Building and maintaining multilingual datasets takes a significant amount of time for:

  • cleaning,
  • balancing,
  • validation,
  • augmentation,
  • and formatting.

If this dataset helped your project or research, consider starring or sharing it. That support genuinely motivates future development.

Thanks for reading.