Accelerating researchers and developers building multilingual AI with a new open dataset
Natalie Guevara · 2026-06-16 · via The GitHub Blog

Software may be written in programming languages, but human language is at the heart of developer collaboration. Developers explain how projects work in READMEs. They ask for help in issues. They review, debate, and improve code in pull requests. That collaboration often happens in English—but not always. As AI becomes a bigger part of how developers build software, multilingual developer content matters more than ever.

Today, GitHub is publishing the GitHub Multilingual Repositories Dataset, a repository-level metadata dataset designed to help researchers and developers discover public GitHub repositories with evidence of non-English natural-language content. When building the dataset, we found that the language distribution differs across READMEs, issues, and pull requests: Korean is the most common non-English language in issue text, but only the fifth-most common in READMEs. Portuguese tops the non-English README list with more than 3 million repositories.

The dataset is now available on GitHub under CC0-1.0. It follows through on a commitment we made in 2025, as part of Microsoft’s European Digital Commitments, to make multilingual data more accessible, including to open source AI developers.

What’s in the dataset

The GitHub Multilingual Repositories Dataset is intentionally not a dump of repository content. Instead, it is a metadata dataset that helps developers and researchers find repositories where multilingual collaboration may be happening. The dataset covers over 80 million classification rows across more than 40 million repositories. For each public repository, we provide:

  • Language classifications of the README, the most-commented issue, and the most-commented pull request, with the first 150 characters of each used as the input sample. We exclude texts under 20 characters.
  • Classifications for each text source from three classifiers (fastText, gcld3, and lingua-py), each with a confidence score. The dataset includes only classifications with a confidence above 0.5.
  • Repository metadata: creation timestamp, disk usage, stars, forks, primary programming language, SPDX license, issue and pull request counts, and the snapshot date.
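
To make the sampling and filtering rules above concrete, here is a minimal Python sketch. The field names, the source and classifier labels, and the choice to count raw characters are assumptions for illustration only; the dataset's published schema is the authority on the actual column names.

```python
from dataclasses import dataclass
from typing import Optional

SAMPLE_CHARS = 150      # sample length stated in the post
MIN_TEXT_CHARS = 20     # texts shorter than this are excluded
MIN_CONFIDENCE = 0.5    # only classifications above this are published


@dataclass(frozen=True)
class Classification:
    # Hypothetical field names; check the dataset's schema before relying on them.
    repo: str          # e.g. "owner/name"
    source: str        # "readme", "issue", or "pull_request"
    classifier: str    # "fasttext", "gcld3", or "lingua-py"
    language: str      # language code predicted by the classifier
    confidence: float  # classifier-reported confidence


def make_sample(text: str) -> Optional[str]:
    """Return the classifier input for a text, or None if the text is too short.

    Assumption: length is counted on the raw text, without stripping whitespace.
    """
    if len(text) < MIN_TEXT_CHARS:
        return None
    return text[:SAMPLE_CHARS]


def is_published(c: Classification) -> bool:
    """Mirror the stated rule: keep only classifications above 0.5 confidence."""
    return c.confidence > MIN_CONFIDENCE
```

A row is the combination of one repository, one text source, and one classifier, which is why there can be several rows per repository.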

We deliberately did not collapse the three classifiers into a single label. Different classifiers have different coverage and confidence calibration, especially for lower-resource languages. By exposing all three, we let you decide how strict you want to be. Want a high-precision Greek subset? Require all three classifiers to agree above some confidence threshold. Want broad recall for an exploratory study of Romance languages? One classifier may be enough.
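
As a rough illustration of that tradeoff, the sketch below selects repositories by how many classifiers agree on a language. The row layout, the `"el"` language code for Greek, and the function name `repos_in_language` are all hypothetical; adapt them to the dataset's real columns and codes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Assumed row layout: (repo, source, classifier, language, confidence).
Row = Tuple[str, str, str, str, float]


def repos_in_language(
    rows: Iterable[Row],
    language: str,
    source: str = "readme",
    min_confidence: float = 0.5,
    min_classifiers: int = 3,
) -> Set[str]:
    """Return repos whose `source` text is labeled `language` by at least
    `min_classifiers` distinct classifiers, each above `min_confidence`."""
    votes: Dict[str, Set[str]] = defaultdict(set)
    for repo, src, classifier, lang, conf in rows:
        if src == source and lang == language and conf > min_confidence:
            votes[repo].add(classifier)
    return {repo for repo, seen in votes.items() if len(seen) >= min_classifiers}


# Made-up rows, for illustration only.
rows = [
    ("a/docs", "readme", "fasttext", "el", 0.97),
    ("a/docs", "readme", "gcld3", "el", 0.91),
    ("a/docs", "readme", "lingua-py", "el", 0.88),
    ("b/tool", "readme", "fasttext", "el", 0.62),
    ("b/tool", "readme", "gcld3", "en", 0.80),
]

strict = repos_in_language(rows, "el", min_confidence=0.8, min_classifiers=3)
broad = repos_in_language(rows, "el", min_classifiers=1)
print(sorted(strict))  # ['a/docs']
print(sorted(broad))   # ['a/docs', 'b/tool']
```

Raising `min_classifiers` and `min_confidence` trades recall for precision; lowering them does the reverse.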

What you can build with it

The dataset is designed for the kind of work that’s hard to do with general web text:

  • Discover repositories likely to contain developer documentation or collaboration in specific languages.
  • Study how non-English developer communities use issues, pull requests, and READMEs.
  • Build evaluation sets for AI coding tools, doc generators, or review assistants that need to behave well across languages.
  • Make data-backed arguments to decision-makers for expanding language coverage in new developer tools and AI features, drawing on the multilingual diversity of developers.
  • Measure representation of European and other underrepresented languages in open source.
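
For the representation and community-study use cases in the list above, one simple starting point is to rank languages per text source by the number of distinct repositories. The sketch below assumes the same hypothetical row layout of (repo, source, classifier, language, confidence) and uses made-up rows; it is not how the figures in this post were computed.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Assumed row layout: (repo, source, classifier, language, confidence).
Row = Tuple[str, str, str, str, float]


def language_ranking(
    rows: Iterable[Row], source: str, exclude: str = "en"
) -> List[Tuple[str, int]]:
    """Rank languages by the number of distinct repos where any classifier
    assigned that language to the given text source."""
    repos_by_language: Dict[str, Set[str]] = defaultdict(set)
    for repo, src, _classifier, lang, _conf in rows:
        if src == source and lang != exclude:
            repos_by_language[lang].add(repo)
    counts = Counter({lang: len(repos) for lang, repos in repos_by_language.items()})
    return counts.most_common()


# Made-up rows, for illustration only.
rows = [
    ("a/x", "readme", "fasttext", "pt", 0.9),
    ("a/x", "readme", "gcld3", "pt", 0.8),
    ("b/y", "readme", "fasttext", "pt", 0.7),
    ("c/z", "readme", "lingua-py", "ko", 0.9),
    ("c/z", "issue", "fasttext", "ko", 0.9),
    ("d/w", "issue", "gcld3", "ko", 0.8),
    ("a/x", "issue", "fasttext", "pt", 0.6),
    ("e/v", "readme", "fasttext", "en", 0.99),
]

print(language_ranking(rows, "readme"))  # [('pt', 2), ('ko', 1)]
print(language_ranking(rows, "issue"))   # [('ko', 2), ('pt', 1)]
```

Because this version counts a repository whenever any single classifier assigns the language, a repository can appear under more than one language when classifiers disagree. Requiring agreement between classifiers before counting would give a stricter ranking.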

Some caveats

Language identification is hard, especially in software repositories. Repository text is often short. It may include badges, templates, installation commands, code snippets, usernames, or mixed-language content. A 150-character sample may not represent the whole repository. Classifiers also vary in coverage and calibration, especially for lower-resource languages.

That is why the dataset should not be treated as a ground-truth benchmark for language identification. Instead, it is designed as a transparent discovery tool. Users can inspect classifications, confidence scores, and sources, then choose the precision and recall tradeoffs that fit their own research or development workflow.

The dataset also should not be used to infer sensitive attributes about repository owners, contributors, or communities. The signals are repository-level metadata, not person-level attributes.

Why open multilingual data matters

Today, many European languages remain underrepresented in the online text used to build and evaluate AI systems. That creates a risk that AI tools work well for some developers, languages, and communities, while leaving others behind. Open data can help close that gap. We built this dataset because developer content is different from general web text. READMEs, issues, and pull requests contain the language of software collaboration: installation instructions, bug reports, feature requests, review comments, and community norms. That context can help build AI systems that better understand how developers actually work.

By making multilingual developer-content signals easier to find and analyze, this dataset gives researchers, open source developers, and model builders another tool for studying language representation in software development. It can help identify gaps, support better evaluation, and inform more inclusive AI tools for developers across Europe and beyond. It also reflects a broader principle: Building AI for developers should include the communities, languages, and workflows developers actually use.

What’s next

We’ll be discussing the dataset, and the broader importance of open data for multilingual AI, at the Open Innovation Dialogue Hub in Strasbourg on June 16. The event is co-organized by the Microsoft Open Innovation Center, the Council of Europe, and GitHub, and will bring together policymakers, researchers, cultural institutions, and open innovation leaders to discuss AI, linguistic diversity, cultural heritage, and open data.

Multilingual AI needs multilingual developer communities. We hope this dataset helps more people study, support, and build for them. By releasing it under CC0-1.0 on GitHub, we’re inviting researchers, open source maintainers, and model builders to use it, critique it, extend it, and build evaluation sets and tools on top of it.

If you do something interesting with it, we’d love to hear about it.

Written by Kevin Xu, Staff Software Engineer, CELA
