
Scientific Software Knowledge Graphs (SSKG)

README IN PROGRESS

Introduction

This tool verifies the link between a scientific paper and a software repository. It does so by locating the software repository's URL within the paper, then extracting the repository's metadata to find any URLs that point to scientific papers and checking whether they lead back to the original paper. If the link holds in both directions, it is marked as "bidirectional".

Dependencies

  • Python 3.9

Usage

Install the required dependencies by running:

pip install -r requirements.txt

Highly recommended: configure SOMEF by running:

somef configure

You will be asked to provide a GitHub authentication token (so that SOMEF does not hit the GitHub API rate limit) and the path to its trained classifiers (leave blank to use the default).

The repository is divided into 3 directories:

1.  Apis
2.  Pdf_info_extractor
3.  Modelado

Apis

The directory is organized into subdirectories, one for the API of each tool used: GitHub, Zenodo, OpenAIRE, and OpenAlex. These APIs supply the necessary DOIs, which are processed by api_doi_collector.py; the transformed and filtered DOIs are written to doi_filtered.csv, with the first column holding the DOI and the second its provenance.
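For reference, a minimal sketch of reading that file (the two-column layout is taken from the description above; the variable names are illustrative):

import csv

# doi_filtered.csv: column 1 is the DOI, column 2 is its provenance
with open("doi_filtered.csv", newline="") as f:
    for row in csv.reader(f):
        doi, provenance = row[0], row[1]
        print(doi, provenance)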

Arxiv_downloader.py takes doi_filtered.csv and downloads the PDF file whenever an arXiv ID is found in the DOI field. If no arXiv ID is found, the DOI is added to dois_not_arxiv.csv.
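A minimal sketch of that arXiv check, assuming arXiv-issued DOIs of the form 10.48550/arXiv.<id> (the function name is illustrative, not the script's actual interface):

import re

def arxiv_pdf_url(doi):
    # arXiv-issued DOIs look like 10.48550/arXiv.2101.12345
    m = re.search(r"10\.48550/arxiv\.(.+)", doi, re.IGNORECASE)
    if m:
        return f"https://arxiv.org/pdf/{m.group(1)}.pdf"
    return None  # no arXiv ID: the DOI goes to dois_not_arxiv.csv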

The DOIs in dois_not_arxiv.csv are then used by oa_url_pdf_extractor.py, which locates an Unpaywall URL for each and creates a new CSV, doi_not_arxiv_with_pdf_url.csv, containing the DOI, the Unpaywall URL, and other information.
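The lookup presumably goes through Unpaywall's REST API; a minimal sketch of such a query (Unpaywall requires an email address as a query parameter; the function name is illustrative):

import requests

def unpaywall_pdf_url(doi, email):
    # Query Unpaywall's v2 endpoint for the best open-access location of a DOI
    r = requests.get(f"https://api.unpaywall.org/v2/{doi}", params={"email": email})
    r.raise_for_status()
    loc = r.json().get("best_oa_location") or {}
    return loc.get("url_for_pdf")  # None if no open-access PDF is known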

Finally, all PDFs are downloaded through oa_pdf_downloader.py, which also generates a download list.
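A step like this typically reduces to fetching each URL and writing it to disk; a minimal, illustrative sketch (the helper below is not oa_pdf_downloader.py's actual interface):

import requests

def download_pdf(url, path):
    # Fetch the PDF and write it to disk; a failed request raises,
    # so the caller can record the outcome in the download list
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    with open(path, "wb") as f:
        f.write(r.content)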

To view an example, please refer to dois.ipynb.

Pdf_info_extractor

After the PDF files have been downloaded to the pdf_info_extractor/data_pdf/ directory, we can proceed with extracting the GitHub URLs. To accomplish this, the tool uses either GROBID or Tika.

The URL extraction process is performed by the pipeline.py script, which selects either Tika or GROBID based on the user's preference. It extracts the URLs from each PDF file and generates a separate text file for each PDF within the ./data_github_urls directory. These text files are then passed to SOMEF, which generates a JSON file containing metadata for the repositories.
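As an illustration of the Tika branch, a minimal sketch assuming the tika Python package is installed (pipeline.py's actual logic may differ):

import re
from tika import parser  # Apache Tika client; talks to a local Tika server

def github_urls(pdf_path):
    # Convert the PDF to plain text, then keep anything that looks like a
    # GitHub repository URL (owner/repo)
    text = parser.from_file(pdf_path).get("content") or ""
    return sorted(set(re.findall(r"https?://github\.com/[\w.-]+/[\w.-]+", text)))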

To execute the pipeline, make sure you are in the pdf_info_extractor directory.
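For example (illustrative; the exact options depend on pipeline.py itself):

cd pdf_info_extractor
python pipeline.py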

Modelado (Modelling)

Once the URLs have been extracted and the repositories downloaded, we can then proceed to check whether or not the PDF/software-repository link is "bidirectional". This is done using bidireccional.py.
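Conceptually, the check compares the paper's identifier with the identifiers found in the repository's metadata; a minimal sketch of the idea (the function and argument names are illustrative, not bidireccional.py's actual interface):

def is_bidirectional(paper_doi, repo_paper_dois):
    # The paper already links to the repository (that is how the repository
    # was found); the link is bidirectional if the repository's metadata
    # links back to the paper.
    return paper_doi.lower() in (d.lower() for d in repo_paper_dois)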

License

This project is licensed under the MIT License.