Identifiers

Paper (accepted to ML4P'18).

The dataset was extracted from Public Git Archive and consists of:

1. 49 million distinct identifiers - 1 GB
2. identifiers per language - 1 GB, same processing as (1) but extracted from files in specific programming languages: Python, JavaScript, C, C++, PHP, Ruby, C#, Java, Shell, Go, Objective-C.

CSV, columns:

All the stats correspond to the HEAD revision of each repository in PGA.

Jupyter notebook which reads the per-language identifiers (2) and plots the statistics.
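
A first look at either CSV can start from its own header row rather than from assumed column names. The snippet below is a minimal Python sketch, not the notebook's code: `summarize_csv` is a hypothetical helper, and the column names in the in-memory sample are made up for the demo. It streams the rows so that a 1 GB file does not have to fit in memory.

```python
import csv
import io
from collections import Counter


def summarize_csv(handle, max_rows=None):
    """Return (column names, rows read, non-empty cell count per column).

    Column names come from the file's header row; nothing about the
    dataset's real schema is assumed here.
    """
    reader = csv.DictReader(handle)
    non_empty = Counter()
    rows = 0
    for row in reader:
        if max_rows is not None and rows >= max_rows:
            break
        rows += 1
        for name, value in row.items():
            if value:  # skip empty or missing cells
                non_empty[name] += 1
    return reader.fieldnames or [], rows, dict(non_empty)


# Tiny in-memory stand-in for the real file (hypothetical columns).
sample = io.StringIO("col_a,col_b\nfoo,1\nbar,\n")
columns, n_rows, filled = summarize_csv(sample)
print(columns, n_rows, filled)
```

For the real data, pass an open file instead of the sample, for example `open(path, newline="", encoding="utf-8")` where `path` points at the downloaded CSV, and set `max_rows` to inspect only the first rows.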
