3. Saffron outputs

This page describes the files generated by a Saffron run (also described in FORMATS.md). If Saffron is run from the command line interface, these files are written to the output folder specified in the run command. If the Web interface is used, they are generated in the ./web/data folder.

Terms (terms.json)

Each element in this file represents a single term extracted from the corpus. The file contains the following annotations:

term_string: The string that names the term (must be unique)
occurrences: The total number of occurrences of a term in the corpus
matches: The number of documents in the corpus containing this term
score: The importance of the term to this corpus, based on the scoring functions chosen in the configuration settings
status: #deprecated. Whether the term was validated or not (see the Review mode documentation). The default is none
morphologicalVariationList: A list of alternative (morphological variants) forms of this term string found in the corpus
string: The form of this variant
original_term: The form of this variant
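
As an illustration of how these fields can be consumed, the sketch below ranks terms by score. It assumes that terms.json parses to a JSON array of objects carrying the fields listed above; the sample records and the helper name top_terms are made up for the example and are not part of Saffron.

```python
import json

# Hypothetical sample standing in for the contents of terms.json.
# Assumption: the file is a JSON array of term objects.
SAMPLE = """
[
  {"term_string": "machine learning", "occurrences": 12, "matches": 4, "score": 0.91},
  {"term_string": "neural network", "occurrences": 7, "matches": 3, "score": 0.78},
  {"term_string": "data set", "occurrences": 20, "matches": 9, "score": 0.35}
]
"""


def top_terms(terms, n=10, min_matches=1):
    """Return the n highest-scoring term strings found in at least min_matches documents."""
    kept = [t for t in terms if t.get("matches", 0) >= min_matches]
    kept.sort(key=lambda t: t.get("score", 0.0), reverse=True)
    return [t["term_string"] for t in kept[:n]]


terms = json.loads(SAMPLE)
print(top_terms(terms, n=2))
```

For a real run, the json.loads(SAMPLE) call would be replaced by loading terms.json from the output folder.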

Doc-Terms (doc-terms.json)

This file shows all the relationships between documents and terms.

document_id: A unique string to identify the document, made up of the document filename (preceded by _zip_filename if the dataset is submitted as a .zip file)
term_string: The string that names the term (must be unique)
occurrences: The number of occurrences of the term in the single document
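
Because each record links one document to one term, a common first step is to regroup the records per document. The following is a minimal sketch under the assumption that doc-terms.json parses to a list of objects with the three fields above; the sample data and the function name terms_by_document are hypothetical.

```python
from collections import defaultdict

# Hypothetical records in the shape described above (not real Saffron output).
doc_terms = [
    {"document_id": "a.txt", "term_string": "machine learning", "occurrences": 3},
    {"document_id": "a.txt", "term_string": "data set", "occurrences": 5},
    {"document_id": "b.txt", "term_string": "machine learning", "occurrences": 1},
]


def terms_by_document(records):
    """Map each document_id to a {term_string: occurrences} dictionary."""
    index = defaultdict(dict)
    for record in records:
        index[record["document_id"]][record["term_string"]] = record["occurrences"]
    return dict(index)


index = terms_by_document(doc_terms)
print(index["a.txt"])
```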

Term-Sim (term-sim.json)

This file gathers and compares all pairs of terms extracted in the previous stage. Each element describes one edge, i.e. a relation between two terms, together with their similarity score (see the pairwise scoring step for more explanation of how this is calculated).

term1_id: The first term's term string
term2_id: The second term's term string
similarity: The similarity of the two terms
status: #deprecated
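
The edges can be queried to find the terms most similar to a given term. The sketch below assumes term-sim.json parses to a list of edge objects with the fields above, and it treats each edge as undirected, which is an assumption of this example rather than something the format guarantees. The sample edges, the threshold value and the helper name similar_terms are all hypothetical.

```python
# Hypothetical edges in the shape described above (not real Saffron output).
term_sim = [
    {"term1_id": "machine learning", "term2_id": "neural network", "similarity": 0.82},
    {"term1_id": "machine learning", "term2_id": "data set", "similarity": 0.40},
    {"term1_id": "neural network", "term2_id": "data set", "similarity": 0.15},
]


def similar_terms(edges, term, threshold=0.3):
    """Return (other_term, similarity) pairs for `term`, most similar first.

    Edges are treated as undirected (an assumption of this sketch), and edges
    whose similarity is below `threshold` are ignored.
    """
    found = []
    for edge in edges:
        if edge["similarity"] < threshold:
            continue
        if edge["term1_id"] == term:
            found.append((edge["term2_id"], edge["similarity"]))
        elif edge["term2_id"] == term:
            found.append((edge["term1_id"], edge["similarity"]))
    return sorted(found, key=lambda pair: pair[1], reverse=True)


print(similar_terms(term_sim, "machine learning"))
```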

If authors are present in the original corpus as metadata, the following two files will also be generated:

Author-Terms (authors-terms.json)

An edge linking an author to a term.

author_id: The ID of the author
term_id: The term string of the term
matches: The number of times this term is used in documents by this author
occurrences: The number of occurrences of the term by this author
paper_count: The number of documents from the author containing this term
tfidf: The Term Frequency-Inverse Research Frequency (see Georgeta Bordea (2013), "Domain adaptive extraction of topical hierarchies for Expertise Mining", for evaluations of different methods)
score: The score of this linking
researcher_score: The score for author's ranking for this particular term
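
One way to read this file is as an expertise profile per author. The sketch below lists an author's terms ordered by the score field; it assumes the file parses to a list of edge objects with the fields above, and it only uses author_id, term_id and score. The sample data and the function name expertise are hypothetical.

```python
# Hypothetical edges in the shape described above (not real Saffron output).
author_terms = [
    {"author_id": "alice", "term_id": "machine learning", "score": 0.9},
    {"author_id": "alice", "term_id": "data set", "score": 0.2},
    {"author_id": "bob", "term_id": "neural network", "score": 0.7},
]


def expertise(edges, author_id, n=5):
    """Return up to n term strings linked to author_id, highest score first."""
    mine = [e for e in edges if e["author_id"] == author_id]
    mine.sort(key=lambda e: e["score"], reverse=True)
    return [e["term_id"] for e in mine[:n]]


print(expertise(author_terms, "alice"))
```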

Author-Sim (author-sim.json)

An edge linking pairs of authors together.

author1_id: The ID of the first author
author2_id: The ID of the second author
similarity: The similarity score between these authors

Taxonomy (taxonomy.json)

This file is generated if the corresponding line is uncommented in the saffron.sh file, or if the user interface is used. Note that the algorithms used to create the taxonomy are different from the ones used to create the knowledge graph (output as kg.json and kg.rdf). This file represents the whole taxonomy. Each element describes a term and how it is related to other terms in the taxonomy. The file contains the following attributes:

root: The term string of this term or "HEAD_TERM" for the root of the taxonomy
score: The weighting given to the root term
linkScore: The likelihood of the link from this term to its root being correct
children: A list of children of this node (these are also Taxonomy objects)
status: #deprecated
parent: #deprecated
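
Since every child is itself a Taxonomy object, the file is naturally processed with a recursive traversal. The sketch below walks the tree depth-first and prints it as an indented outline. It assumes taxonomy.json parses to a single nested object with the attributes above; the sample tree and the function name walk are hypothetical.

```python
# Hypothetical taxonomy in the shape described above (not real Saffron output).
taxonomy = {
    "root": "HEAD_TERM",
    "score": 0.0,
    "linkScore": 0.0,
    "children": [
        {
            "root": "machine learning",
            "score": 0.9,
            "linkScore": 0.8,
            "children": [
                {"root": "neural network", "score": 0.7, "linkScore": 0.6, "children": []},
            ],
        },
        {"root": "data set", "score": 0.3, "linkScore": 0.5, "children": []},
    ],
}


def walk(node, depth=0):
    """Yield (depth, term_string) for every node, each parent before its children."""
    yield depth, node["root"]
    for child in node.get("children", []):
        yield from walk(child, depth + 1)


for depth, term in walk(taxonomy):
    print("  " * depth + term)
```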

Knowledge Graph (kg.json)

This file contains only the taxonomy extracted by the algorithms used for the whole knowledge graph creation, and therefore represents the skos:broader relations only (see the description of kg.rdf below for the full representation of the knowledge graph).

root: The term string of this term or "HEAD_TERM" for the root of the taxonomy
score: The weighting given to the root term
linkScore: The likelihood of the link from this term to its root being correct
children: A list of children of this node (these are also Taxonomy objects)
status: #deprecated
parent: #deprecated
synonymyClusters: Clusters together terms categorized as synonyms by the algorithm. Only one term from each cluster is represented in the taxonomy tree.

Knowledge Graph (kg.rdf)

skos:Concept: The uri of the term, in the form "http://saffron.insight-centre.org/rdf/term/TERM"
rdfs:label: The label used for the term

Knowledge Graph relations extracted:

saffron:partOf: Term identified as sharing a relation "part-of" (partonomy relationship) with the skos:Concept it is included in, in the graph.
saffron:wholeOf: Term identified as sharing a relation "whole-of" (partonomy relationship) with the skos:Concept it is included in, in the graph.
skos:broader: Term identified as sharing a taxonomic relation with the skos:Concept it is included in, in the graph.
saffron:synonymy: Identifies a term as a synonym of the skos:Concept it is included in, in the graph.

This resource has been funded by Science Foundation Ireland under Grant SFI/12/RC/2289_P2 for the Insight SFI Research Centre for Data Analytics. © 2020 Data Science Institute - National University of Ireland Galway
