This repository provides a complete pipeline to collect and store data from 345 curated code sample repositories. By executing the Jupyter notebooks in the `notebook/` directory, you can store structured data locally in a PostgreSQL database, enabling analysis and research on the evolution of code sample software.
The collected data is organized across the following relational tables:
- Ecosystems – Software development ecosystems (e.g., Spring, AWS, Azure).
- Organizations – GitHub organizations owning the repositories.
- Repositories – Basic data for each repository.
- Commits – Commit data such as SHA, message, and timestamp.
- Files – General data about repository files, such as name and type.
- Commit Files – Per-commit file data, linking files to the commits that changed them.
- Hunks – Code-level changes (diffs) between commit versions.
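As an illustration of how these tables relate, the sketch below counts commits per repository. The table and column names (`repositories`, `commits`, `repository_id`) and the connection settings are assumptions based on the description above, not the exact schema; adjust them to match your database.

```python
# Hypothetical query sketch: count commits per repository.
# Table and column names are assumed from the description above.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    dbname="code_samples",  # placeholder database name
    user="postgres",
    password="your_password",  # use the value from your .env
)
cur = conn.cursor()
cur.execute("""
    SELECT r.name, COUNT(c.sha) AS commit_count
    FROM repositories r
    JOIN commits c ON c.repository_id = r.id
    GROUP BY r.name
    ORDER BY commit_count DESC
    LIMIT 10;
""")
for name, commit_count in cur.fetchall():
    print(name, commit_count)
cur.close()
conn.close()
```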
To create the dataset locally:
- Install PostgreSQL on your machine.
- Clone this repository.
- Add your local PostgreSQL database password to a `.env` file (see `.env.example`); a connection-check sketch follows the notebook list below.
- Navigate to the `notebook/` folder.
- Execute the notebooks in the following order:
  - `0_setup.ipynb`
  - `1_ecosystems.ipynb`
  - `2_organizations.ipynb`
  - `3_repositories.ipynb`
  - `4_commits.ipynb`
  - `5_files.ipynb`
  - `6_cfs.ipynb`
  - `7_hunks.ipynb`
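Before launching the long-running notebooks, it can help to confirm that the credentials in `.env` are picked up correctly. This is a minimal sketch, assuming the password is stored under a variable named `POSTGRES_PASSWORD` (check `.env.example` for the variable names the project actually uses):

```python
# Connection-check sketch; POSTGRES_PASSWORD is an assumed variable name,
# see .env.example for the names used by the project.
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current directory

conn = psycopg2.connect(
    host="localhost",
    dbname="postgres",  # placeholder; connect to your target database
    user="postgres",
    password=os.environ["POSTGRES_PASSWORD"],
)
print("Connected, server version:", conn.server_version)
conn.close()
```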
Total runtime: between 4 and 7 hours, depending on your system. Basic data (from ecosystems to files) is usually available within 10 minutes.
- To extract the data, repositories are cloned in bare mode, reducing the storage needed (see the sketch after this list).
- The resulting database is approximately 1.5 GB in size.
- Most of the Jupyter notebooks use all available CPU cores through multi-threading.
- You can customize the `playground.py` file to get data from repositories without processing the full dataset.
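For reference, the sketch below shows what a bare clone looks like in Python using GitPython. The repository URL and destination path are placeholders, and this illustrates the general technique rather than the project's own extraction code:

```python
# Bare-clone sketch: no working tree is checked out, only the Git object
# database, which is what keeps the storage footprint small.
from git import Repo

repo = Repo.clone_from(
    "https://github.com/spring-guides/gs-rest-service.git",  # placeholder URL
    "/tmp/gs-rest-service.git",  # conventional .git suffix for bare clones
    bare=True,  # passed through to `git clone --bare`
)
print(repo.bare)  # True: commit history and diffs remain readable for analysis
```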