3. Saffron outputs

cecrob edited this page Dec 1, 2022 · 6 revisions

This page describes the files generated by a Saffron run (also described in FORMATS.md). If Saffron is run using the command line interface, these files will be located in the output folder specified in the run command. If using the Web interface, they will be generated in the ./web/data folder.

Terms (terms.json)

Each element in this file represents a single term extracted from the corpus. The file contains the following annotations:

  • term_string: The string that names the term (must be unique)
  • occurrences: The total number of occurrences of a term in the corpus
  • matches: The number of documents in the corpus containing this term
  • score: The importance of the term to this corpus, based on the scoring functions chosen in the configuration settings
  • status: #deprecated. Whether the term was validated or not (see the Review mode documentation). The default is none
  • morphologicalVariationList: A list of alternative forms (morphological variants) of this term string found in the corpus. Each variant has the following fields:
      • string: The form of this variant
      • original_term: The form of this variant
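
As an illustration of how these annotations might be consumed, the sketch below ranks terms by score. It assumes terms.json is a JSON array of objects carrying the fields listed above and that a higher score means a more important term. The sample records and the helper name top_terms are hypothetical, not taken from a real Saffron run.

```python
import json

# Hypothetical sample in the assumed shape of terms.json; real values will differ.
SAMPLE_TERMS = """
[
  {"term_string": "machine learning", "occurrences": 12, "matches": 4, "score": 0.91},
  {"term_string": "neural network", "occurrences": 7, "matches": 3, "score": 0.78},
  {"term_string": "data set", "occurrences": 20, "matches": 5, "score": 0.40}
]
"""


def top_terms(terms, n=2):
    """Return the n highest-scoring term strings, best first (hypothetical helper)."""
    ranked = sorted(terms, key=lambda t: t["score"], reverse=True)
    return [t["term_string"] for t in ranked[:n]]


terms = json.loads(SAMPLE_TERMS)
print(top_terms(terms))
```

To run this against a real output folder, replace json.loads(SAMPLE_TERMS) with json.load on the opened terms.json file.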

Doc-Terms (doc-terms.json)

This file shows all the relationships between documents and terms.

  • document_id: A unique string to identify the document, made up of the document filename (preceded by _zip_filename if the dataset is submitted as a .zip file)
  • term_string: The string that names the term (must be unique)
  • occurrences: The number of occurrences of the term in the single document
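
A minimal sketch of turning these document-term records into a per-document index is given here. It assumes doc-terms.json is a JSON array of objects with the three fields above. The sample records and the helper name terms_by_document are hypothetical.

```python
from collections import defaultdict

# Hypothetical records in the assumed shape of doc-terms.json.
doc_terms = [
    {"document_id": "a.txt", "term_string": "machine learning", "occurrences": 3},
    {"document_id": "a.txt", "term_string": "data set", "occurrences": 1},
    {"document_id": "b.txt", "term_string": "machine learning", "occurrences": 2},
]


def terms_by_document(records):
    """Map each document_id to a {term_string: occurrences} dictionary."""
    index = defaultdict(dict)
    for rec in records:
        index[rec["document_id"]][rec["term_string"]] = rec["occurrences"]
    return dict(index)


index = terms_by_document(doc_terms)
for doc_id, counts in index.items():
    print(doc_id, counts)
```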

Term-Sim (term-sim.json)

This file gathers and compares all pairs of terms extracted in the previous stage. Each element describes one edge, i.e. a relation between two terms, together with their similarity score (see the pairwise scoring step for more explanation on how this is calculated).

  • term1_id: The first term's term string
  • term2_id: The second term's term string
  • similarity: The similarity of the two terms
  • status: #deprecated
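
The edge list can be queried for the nearest neighbours of a term, as in the following sketch. It assumes term-sim.json is a JSON array of objects with the fields above, and it treats each edge as undirected, which is an assumption rather than something this page states. The sample edges, the threshold value and the helper name similar_terms are hypothetical.

```python
# Hypothetical edges in the assumed shape of term-sim.json.
term_sim = [
    {"term1_id": "machine learning", "term2_id": "neural network", "similarity": 0.82},
    {"term1_id": "machine learning", "term2_id": "data set", "similarity": 0.35},
    {"term1_id": "neural network", "term2_id": "data set", "similarity": 0.10},
]


def similar_terms(edges, term, threshold=0.3):
    """Return (other_term, similarity) pairs for `term`, most similar first.

    Edges are treated as undirected (an assumption), and edges whose
    similarity is below `threshold` are ignored.
    """
    hits = []
    for edge in edges:
        if edge["similarity"] < threshold:
            continue
        if edge["term1_id"] == term:
            hits.append((edge["term2_id"], edge["similarity"]))
        elif edge["term2_id"] == term:
            hits.append((edge["term1_id"], edge["similarity"]))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


print(similar_terms(term_sim, "machine learning"))
```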

If authors are present in the original corpus as metadata, the following two files will be generated:

Author-Terms (authors-terms.json)

An edge linking an author to a term.

  • author_id: The ID of the author
  • term_id: The term string of the term
  • matches: The number of times this term is used in documents by this author
  • occurrences: The number of occurrences of the term by this author
  • paper_count: The number of documents from the author containing this term
  • tfidf: The Term Frequency-Inverse Research Frequency; see "Domain adaptive extraction of topical hierarchies for Expertise Mining" (Georgeta Bordea, 2013) for evaluations of different methods
  • score: The score of this linking
  • researcher_score: The score for author's ranking for this particular term
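
One possible use of these author-term edges is to list the authors most strongly associated with a given term. The sketch below assumes authors-terms.json is a JSON array of objects with the fields above and that a higher researcher_score means a better rank for that term. The sample records and the helper name experts_for are hypothetical.

```python
# Hypothetical records in the assumed shape of authors-terms.json
# (only the fields used by this sketch are included).
author_terms = [
    {"author_id": "alice", "term_id": "machine learning", "paper_count": 3, "researcher_score": 0.9},
    {"author_id": "bob", "term_id": "machine learning", "paper_count": 1, "researcher_score": 0.4},
    {"author_id": "bob", "term_id": "data set", "paper_count": 2, "researcher_score": 0.7},
]


def experts_for(records, term, n=3):
    """Return up to n author IDs for `term`, highest researcher_score first."""
    matching = [r for r in records if r["term_id"] == term]
    matching.sort(key=lambda r: r["researcher_score"], reverse=True)
    return [r["author_id"] for r in matching[:n]]


print(experts_for(author_terms, "machine learning"))
```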

Author-Sim (author-sim.json)

An edge linking pairs of authors together.

  • author1_id: The ID of the first author
  • author2_id: The ID of the second author
  • similarity: The similarity score between these authors

Taxonomy (taxonomy.json)

This file is generated if the corresponding line in the saffron.sh file is uncommented or if the user interface is used. Note that the algorithms used to create the taxonomy (described here) are different from the ones used to create the knowledge graph (generated in kg.json and kg.rdf). This file represents the whole taxonomy. Each element describes a term and how it is related to other terms in the taxonomy. The file contains the following attributes:

  • root: The term string of this term or "HEAD_TERM" for the root of the taxonomy
  • score: The weighting given to the root term
  • linkScore: The likelihood of the link from this term to its root being correct
  • children: A list of children of this node (these are also Taxonomy objects)
  • status: #deprecated
  • parent: #deprecated

Knowledge Graph (kg.json)

This file contains only the taxonomy extracted by the algorithms used for the whole knowledge graph creation, and therefore represents the skos:broader relations only (see the description of kg.rdf below for the full representation of the knowledge graph).

  • root: The term string of this term or "HEAD_TERM" for the root of the taxonomy
  • score: The weighting given to the root term
  • linkScore: The likelihood of the link from this term to its root being correct
  • children: A list of children of this node (these are also Taxonomy objects)
  • status: #deprecated
  • parent: #deprecated
  • synonymyClusters: Clusters together terms categorized as synonyms by the algorithm. Only one term from each cluster is represented in the taxonomy tree.

Knowledge Graph (kg.rdf)

Knowledge Graph relations extracted:

  • saffron:partOf: Term identified as sharing a relation "part-of" (partonomy relationship) with the skos:Concept it is included in, in the graph.
  • saffron:wholeOf: Term identified as sharing a relation "whole-of" (partonomy relationship) with the skos:Concept it is included in, in the graph.
  • skos:broader: Term identified as sharing a taxonomic relation with the skos:Concept it is included in, in the graph.
  • saffron:synonymy: Identifies a term as a synonym of the skos:Concept it is included in, in the graph.