GitHub is making it easier for developers and researchers to build AI that understands code collaboration across languages. The company has released a new open, repository-level dataset designed to identify and categorize multilingual content found within public GitHub repositories.
This new multilingual AI dataset, published under CC0-1.0, covers over 40 million repositories. It provides metadata on the language of README files, the most-commented issue, and the most-commented pull request, along with repository statistics like stars, forks, and license information.
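As a rough illustration of how repository-level metadata like this could be explored, here is a minimal Python sketch that tallies detected languages for one field and computes the non-English share. The field names (`readme_lang`, `issue_lang`, `stars`, and so on), the language codes, and the sample records are all hypothetical; they are not the dataset's actual schema, which should be checked against its documentation.

```python
from collections import Counter

# Hypothetical records. Field names and values are illustrative only and
# do not reflect the real dataset's schema or contents.
SAMPLE_REPOS = [
    {"repo": "example/a", "readme_lang": "en", "issue_lang": "en", "stars": 120, "forks": 8, "license": "mit"},
    {"repo": "example/b", "readme_lang": "pt", "issue_lang": "ko", "stars": 45, "forks": 3, "license": "apache-2.0"},
    {"repo": "example/c", "readme_lang": "pt", "issue_lang": None, "stars": 7, "forks": 0, "license": None},
    {"repo": "example/d", "readme_lang": None, "issue_lang": "ko", "stars": 2, "forks": 1, "license": "mit"},
]


def language_counts(repos, field):
    """Count repositories per detected language, skipping missing values."""
    return Counter(r.get(field) for r in repos if r.get(field))


def non_english_share(repos, field, english_code="en"):
    """Fraction of repositories with a detected language that is not English."""
    counts = language_counts(repos, field)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - counts.get(english_code, 0) / total


if __name__ == "__main__":
    print(language_counts(SAMPLE_REPOS, "readme_lang").most_common())
    print(round(non_english_share(SAMPLE_REPOS, "readme_lang"), 2))
```

Skipping records where the language is missing keeps repositories without a README or without issues from skewing the share, which seems like a sensible default for data of this kind.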
Bridging the Language Gap in AI Development
While English dominates developer communication, the dataset reveals significant non-English activity. Among non-English languages, Portuguese leads in READMEs, while Korean is prevalent in issue discussions. This data is crucial for building AI tools that don't leave non-English-speaking developers behind.
