This repository provides a complete pipeline to collect and store data from 345 curated code sample repositories. By executing the Jupyter notebooks in the `notebook/` directory, you can store structured data locally in a PostgreSQL database, enabling analysis and research on the evolution of code sample software.
The collected data is organized across the following relational tables:
- Ecosystems – Software development ecosystems (e.g., Spring, AWS, Azure).
- Organizations – GitHub organizations owning the repositories.
- Repositories – Basic data for each repository.
- Commits – Commit data such as SHA, message, and timestamp.
- Files – General data about repository files, such as name and type.
- Commit Files – Per-commit file data, linking files to the commits that changed them.
- Hunks – Code-level changes (diffs) between commit versions.
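As an illustration of how these tables relate, the sketch below counts commits per repository. The table and column names (`repositories`, `commits`, `repository_id`) and the connection settings are assumptions based on the description above, not the exact schema; adjust them to match your database.

```python
# Hypothetical query sketch: count commits per repository.
# Table and column names are assumed from the description above.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    dbname="code_samples",  # placeholder database name
    user="postgres",
    password="your_password",  # use the value from your .env
)
cur = conn.cursor()
cur.execute("""
    SELECT r.name, COUNT(c.sha) AS commit_count
    FROM repositories r
    JOIN commits c ON c.repository_id = r.id
    GROUP BY r.name
    ORDER BY commit_count DESC
    LIMIT 10;
""")
for name, commit_count in cur.fetchall():
    print(name, commit_count)
cur.close()
conn.close()
```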
To create the dataset locally:
- Install PostgreSQL on your machine.
- Clone this repository.
- Add your local PostgreSQL database password to a `.env` file (see `.env.example`); a connection-check sketch follows the notebook list below.
- Navigate to the `notebook/` folder.
- Execute the notebooks in the following order:
  - `0_setup.ipynb`
  - `1_ecosystems.ipynb`
  - `2_organizations.ipynb`
  - `3_repositories.ipynb`
  - `4_commits.ipynb`
  - `5_files.ipynb`
  - `6_cfs.ipynb`
  - `7_hunks.ipynb`
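Before launching the long-running notebooks, it can help to confirm that the credentials in `.env` are picked up correctly. This is a minimal sketch, assuming the password is stored under a variable named `POSTGRES_PASSWORD` (check `.env.example` for the variable names the project actually uses):

```python
# Connection-check sketch; POSTGRES_PASSWORD is an assumed variable name,
# see .env.example for the names used by the project.
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current directory

conn = psycopg2.connect(
    host="localhost",
    dbname="postgres",  # placeholder; connect to your target database
    user="postgres",
    password=os.environ["POSTGRES_PASSWORD"],
)
print("Connected, server version:", conn.server_version)
conn.close()
```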
Total runtime: between 4 and 7 hours, depending on your system. Basic data (from ecosystems to files) is usually available within 10 minutes.
- To extract the data, repositories are cloned in bare mode, reducing the storage needed (see the sketch after this list).
- The resulting database is approximately 1.5 GB in size.
- Most of the Jupyter notebooks use all available CPU cores through multi-threading.
- You can customize the `playground.py` file to get data from repositories without processing the full dataset.
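For reference, the sketch below shows what a bare clone looks like in Python using GitPython. The repository URL and destination path are placeholders, and this illustrates the general technique rather than the project's own extraction code:

```python
# Bare-clone sketch: no working tree is checked out, only the Git object
# database, which is what keeps the storage footprint small.
from git import Repo

repo = Repo.clone_from(
    "https://github.com/spring-guides/gs-rest-service.git",  # placeholder URL
    "/tmp/gs-rest-service.git",  # conventional .git suffix for bare clones
    bare=True,  # passed through to `git clone --bare`
)
print(repo.bare)  # True: commit history and diffs remain readable for analysis
```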