The dataset was extracted from Public Git Archive and consists of:
- 49 million distinct identifiers - 1 GB
- identifiers per language - 1 GB, same processing as (1) but extracted only from files in specific programming languages: Python, JavaScript, C, C++, PHP, Ruby, C#, Java, Shell, Go, Objective-C.
Format: CSV with the following columns:
- `num_files` - number of files where the identifier was found
- `num_occ` - number of times the identifier was found overall
- `num_repos` - number of repositories in which the identifier was found
- `token` - the value of the identifier
- `token_split` - the parts of the identifier after splitting with the sourced-ml heuristics
All the stats correspond to the HEAD revision of each repository in PGA.
- Jupyter notebook which reads the per-language identifiers (2) and plots the statistics.
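
For readers who want to inspect the CSV files without the notebook, the column layout described above can be read with the Python standard library. This is only a minimal sketch under stated assumptions: it assumes the file starts with a header row naming the five columns and that `token_split` holds space-separated parts, neither of which is confirmed here. The helper names `read_identifiers` and `top_identifiers` and the sample rows are made up for illustration.

```python
import csv
import io

# Count columns listed in the dataset description.
COUNT_COLUMNS = ("num_files", "num_occ", "num_repos")


def read_identifiers(fileobj):
    """Yield one dict per identifier, with the count columns converted to int.

    Assumption: the CSV begins with a header row that names the columns.
    """
    for row in csv.DictReader(fileobj):
        for key in COUNT_COLUMNS:
            row[key] = int(row[key])
        yield row


def top_identifiers(rows, n=10):
    """Return the n rows with the highest overall occurrence count."""
    return sorted(rows, key=lambda r: r["num_occ"], reverse=True)[:n]


# Made-up sample rows, NOT taken from the real dataset.
SAMPLE = """num_files,num_occ,num_repos,token,token_split
3,12,2,getValue,get value
5,40,4,i,i
1,2,1,parse_http_header,parse http header
"""

rows = list(read_identifiers(io.StringIO(SAMPLE)))
for row in top_identifiers(rows, n=2):
    # Assumption: token_split parts are separated by single spaces.
    print(row["token"], row["num_occ"], row["token_split"].split(" "))
```

For a real file, replace `io.StringIO(SAMPLE)` with `open(path, newline="", encoding="utf-8")`. Because the files are about 1 GB each, iterating over `read_identifiers` row by row, or using `heapq.nlargest` instead of `sorted`, avoids loading everything into memory at once.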