KG Construction

Knowledge Graph Construction

The primary steps involved in this process include:

Create Knowledge Representation
Build Knowledge Graph

Create Knowledge Representation

The first step to building a knowledge graph is to design the blueprint or knowledge representation. For PheKnowLator, we consulted with a PhD-level biologist when developing our knowledge representation of the mechanisms underlying human disease. An example knowledge representation is shown in the figure below.

Build Knowledge Graph

The knowledge graph build algorithm has been designed to run from three different stages of development: full (runs the full knowledge graph build, except graph closure), partial (runs the build algorithm up through merging ontologies and adding edge data, which excludes closure, the removal of metadata, and the creation of edge lists), and post-closure (searches for a closed knowledge graph .owl file and then performs the steps to remove OWL semantics metadata and create edge lists).

Select a Build Type:

Build Type	Description	Use Cases
`full`	Runs all build steps in the algorithm	You want to build a knowledge graph and will not use a reasoner.
`partial`	Runs all of the build steps in the algorithm through adding edges. Node metadata can always be added to a `partial`-built knowledge graph by running the build as `post-closure`.	You want to build a knowledge graph and plan to run a reasoner over it. You want to build a knowledge graph, but do not want to include node metadata, filter OWL semantics, or generate triple lists.
`post-closure`	Assumes that a reasoner was run over a knowledge graph and that the remaining build steps should be applied to a closed knowledge graph. The remaining build steps include determining whether OWL semantics should be filtered and creating and writing triple lists.	You have run the `partial` build, run a reasoner over it, and now want to complete the algorithm. You want to use the algorithm to process metadata and OWL semantics for an externally built knowledge graph.
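The relationship between the three build types can be sketched as slices of one ordered list of steps. This is only an illustration: the step names below are hypothetical labels paraphrased from the table above, not identifiers from the PheKnowLator code base.

```python
from typing import List

# Hypothetical step names paraphrased from the table above; they are not
# identifiers from the PheKnowLator code base.
BUILD_STEPS: List[str] = [
    "download_and_process_data",
    "merge_ontologies",
    "add_edges",
    "add_node_metadata",
    "filter_owl_semantics",
    "write_triple_lists",
]
LAST_PARTIAL_STEP = "add_edges"


def steps_for(build_type: str) -> List[str]:
    """Return the (assumed) steps executed for a given build type."""
    split = BUILD_STEPS.index(LAST_PARTIAL_STEP) + 1
    if build_type == "full":
        return list(BUILD_STEPS)
    if build_type == "partial":
        # Stops after edges are added, leaving the graph ready for a reasoner.
        return BUILD_STEPS[:split]
    if build_type == "post-closure":
        # Picks up where `partial` stopped, starting from a closed graph.
        return BUILD_STEPS[split:]
    raise ValueError(f"unknown build type: {build_type!r}")


for build_type in ("full", "partial", "post-closure"):
    print(build_type, "->", steps_for(build_type))
```

Under this reading, `partial` followed by an external reasoner and then `post-closure` covers the same steps as `full`, with closure added in between.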

STEP 1: Prepare Input Dependency Documents

Wiki Page: Dependencies

The current system uses three documents as instructions for building the knowledge graph. For detailed information on these documents, including examples, please see the Dependencies Wiki page. The primary dependency document is resource_info.txt.

STEP 2: Download and Process Input Data

Ontology Data
Wiki Page: Dependencies
Jupyter Notebook: Ontology_Cleaning.ipynb

All ontology data sources listed in the ontology_source_list.txt dependency document will be automatically downloaded. The OWLTools command-line tool is used to download all ontologies. This tool is useful because it ensures that all secondary ontologies imported by the primary ontology are also downloaded and merged.
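As a rough sketch of what such a download looks like, the snippet below assembles an OWLTools command that fetches an ontology and merges its import closure. The executable path, ontology URL, and output file name are placeholders, and the use of the `--merge-import-closure` option reflects our understanding of the OWLTools command line rather than a confirmed detail of the build code.

```python
import subprocess
from typing import List


def owltools_download_command(owltools_path: str, ontology_url: str, out_file: str) -> List[str]:
    """Assemble the argument list for downloading one ontology with its imports merged."""
    return [owltools_path, ontology_url, "--merge-import-closure", "-o", out_file]


def download_ontology(owltools_path: str, ontology_url: str, out_file: str) -> None:
    """Run OWLTools; requires a local OWLTools install and network access."""
    subprocess.check_call(owltools_download_command(owltools_path, ontology_url, out_file))


# Placeholder values for illustration only.
cmd = owltools_download_command(
    "./owltools", "http://purl.obolibrary.org/obo/hp.owl", "hp_with_imports.owl"
)
print(" ".join(cmd))
```

The command is only printed here; calling `download_ontology` with real paths would perform the download.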

Linked Open Data
Wiki Page: Dependencies
Jupyter Notebook: Data_Preparation.ipynb

All non-ontology data sources listed in the edge_source_list.txt file will be automatically downloaded and pre-processed.

STEP 3: Merge Ontologies

Merge ontologies using the OWLTools API. Sometimes errors only exist in the presence of other ontologies. The most common error after merging ontology files is punning.
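Punning occurs when the same identifier is declared as more than one kind of entity, for example as both an `owl:Class` and an `owl:NamedIndividual`. A minimal way to look for it after merging is sketched below. Triples are represented as plain tuples of strings, the set of types checked is an assumption, and the `ex:` identifiers are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Assumed set of declarations whose overlap counts as punning in this sketch.
PUN_TYPES = {"owl:Class", "owl:NamedIndividual", "owl:ObjectProperty"}


def find_punned_entities(triples: Iterable[Triple]) -> Dict[str, List[str]]:
    """Return entities declared with more than one of the types in PUN_TYPES."""
    declared = defaultdict(set)
    for subj, pred, obj in triples:
        if pred == "rdf:type" and obj in PUN_TYPES:
            declared[subj].add(obj)
    return {entity: sorted(kinds) for entity, kinds in declared.items() if len(kinds) > 1}


# Hypothetical merged graph: ex:A is typed as a class by one ontology and as
# an individual by another, so the conflict only appears after merging.
merged = [
    ("ex:A", "rdf:type", "owl:Class"),
    ("ex:A", "rdf:type", "owl:NamedIndividual"),
    ("ex:B", "rdf:type", "owl:Class"),
]
print(find_punned_entities(merged))
```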

STEP 4: Build Edge Lists from Non-Ontology Data

Wiki Page: Dependencies
Data: subclass_construction_map.pkl

New edges can be added to the knowledge graph using two different approaches: (1) Instance-based - Asserting a new relation between an individual data point and an instance of an ontology class; (2) Subclass-based - Asserting a new relation between the subclass of the ontology class and an individual of type owl:Class. Please see the README (resources/construction_approach/README.md) for specific details regarding this method.

Instance-based
Data that is not part of an existing ontology is connected to an existing ontology class by creating an instance of that class via rdf:type and then connecting the data to that instance.

EXAMPLE: Adding the edge: Morphine ➞ isSubstanceThatTreats ➞ Migraine

Would require adding:

isSubstanceThatTreats(Morphine, x1)
Type(x1, Migraine)

In this example, Morphine is an ontology data node from ChEBI and Migraine is a Human Phenotype Ontology term. This would result in the following triples, assuming that both Morphine and Migraine are existing ontology concepts:

UUID1 = MD5(Morphine + isSubstanceThatTreats + Migraine + "subject")
UUID2 = MD5(Morphine + isSubstanceThatTreats + Migraine + "object")

UUID1, rdf:type, Morphine
UUID1, rdf:type, owl:NamedIndividual

UUID2, rdf:type, Migraine
UUID2, rdf:type, owl:NamedIndividual

UUID1, isSubstanceThatTreats, UUID2
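The instance-based pattern above can be sketched as a small function. This is an illustration only: the way the strings are concatenated before hashing and the use of bare hex digests as identifiers are assumptions, and the real build may add a namespace or prefix to these identifiers.

```python
import hashlib
from typing import List, Tuple

Triple = Tuple[str, str, str]


def md5_id(*parts: str) -> str:
    """Hash the concatenated parts into a deterministic identifier."""
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()


def instance_based_triples(subj: str, relation: str, obj: str) -> List[Triple]:
    """Build the five triples used to assert `subj relation obj` via instances."""
    uuid1 = md5_id(subj, relation, obj, "subject")
    uuid2 = md5_id(subj, relation, obj, "object")
    return [
        (uuid1, "rdf:type", subj),
        (uuid1, "rdf:type", "owl:NamedIndividual"),
        (uuid2, "rdf:type", obj),
        (uuid2, "rdf:type", "owl:NamedIndividual"),
        (uuid1, relation, uuid2),
    ]


for triple in instance_based_triples("Morphine", "isSubstanceThatTreats", "Migraine"):
    print(triple)
```

Because the identifiers are hashes of the edge itself, rebuilding the same edge yields the same instance nodes.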

Subclass-based
Data that is not part of an existing ontology is connected to an existing ontology class via rdfs:subClassOf. This method allows the newly added data to have rdf:type owl:Class.

EXAMPLE: Adding the edge: TGFB1 ➞ participatesIn ➞ Influenza Virus Induced Apoptosis

Would require adding:

participatesIn(TGFB1, Influenza Virus Induced Apoptosis)
subClassOf(Influenza Virus Induced Apoptosis, Influenza A pathway)
Type(Influenza Virus Induced Apoptosis, owl:Class)

Where TGFB1 is a Protein Ontology term and Influenza Virus Induced Apoptosis is a non-ontology data node from Reactome. In this example, Influenza A Pathway is an existing Pathway Ontology class. This would result in the following triples, assuming that TGFB1 is an existing ontology concept:

UUID1 = MD5(TGFB1 + participatesIn + Influenza Virus Induced Apoptosis)
UUID2 = MD5(TGFB1 + participatesIn + Influenza Virus Induced Apoptosis + owl:Restriction)

Influenza Virus Induced Apoptosis, rdfs:subClassOf, Influenza A Pathway
Influenza Virus Induced Apoptosis, rdf:type, owl:Class

UUID1, rdfs:subClassOf, TGFB1
UUID1, rdfs:subClassOf, UUID2
UUID2, rdf:type, owl:Restriction
UUID2, owl:someValuesFrom, Influenza Virus Induced Apoptosis
UUID2, owl:onProperty, participatesIn
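The subclass-based pattern can be sketched in the same way. As before, the string concatenation used for hashing and the bare hex identifiers are assumptions made for illustration; the triples mirror the listing above.

```python
import hashlib
from typing import List, Tuple

Triple = Tuple[str, str, str]


def md5_id(*parts: str) -> str:
    """Hash the concatenated parts into a deterministic identifier."""
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()


def subclass_based_triples(subj: str, relation: str, obj: str, parent_class: str) -> List[Triple]:
    """Build the triples asserting `subj relation obj`, where `obj` is new non-ontology data."""
    uuid1 = md5_id(subj, relation, obj)
    uuid2 = md5_id(subj, relation, obj, "owl:Restriction")
    return [
        # Anchor the new data node under an existing ontology class.
        (obj, "rdfs:subClassOf", parent_class),
        (obj, "rdf:type", "owl:Class"),
        # Express the relation as an existential restriction.
        (uuid1, "rdfs:subClassOf", subj),
        (uuid1, "rdfs:subClassOf", uuid2),
        (uuid2, "rdf:type", "owl:Restriction"),
        (uuid2, "owl:someValuesFrom", obj),
        (uuid2, "owl:onProperty", relation),
    ]


triples = subclass_based_triples(
    "TGFB1", "participatesIn", "Influenza Virus Induced Apoptosis", "Influenza A Pathway"
)
for triple in triples:
    print(triple)
```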

A table is provided below showing the different triples that are added as a function of edge type (i.e., class-class vs. class-instance vs. instance-instance) and relation strategy (i.e., relations only or relations + inverse relations).

STEP 5: Handling Knowledge Graph Relations and Entity Metadata

Jupyter Notebook: Data_Preparation.ipynb

Relations
Wiki Page: Dependencies

PheKnowLator can be built using a single set of provided relations with or without the inclusion of each relation's inverse by leveraging the owl:inverseOf property. For example:

location of owl:inverseOf located in
located in owl:inverseOf location of
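A minimal sketch of the "relations + inverse relations" strategy is shown below: for every edge whose relation has a known inverse, the reversed edge is added as well. The lookup table holds only the pair from the example above, and the entity names `A` and `B` are placeholders.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]

# Inverse pairs taken from the example above; a real build would load these
# from its relation dependency files.
INVERSE_RELATIONS: Dict[str, str] = {
    "location of": "located in",
    "located in": "location of",
}


def with_inverse_edges(edges: List[Edge], inverses: Dict[str, str]) -> List[Edge]:
    """Return the edges plus, where an inverse relation is known, the reversed edge."""
    result: List[Edge] = []
    for subj, relation, obj in edges:
        result.append((subj, relation, obj))
        inverse = inverses.get(relation)
        if inverse is not None:
            result.append((obj, inverse, subj))
    return result


edges = [("A", "location of", "B"), ("A", "interacts with", "B")]
print(with_inverse_edges(edges, INVERSE_RELATIONS))
```

Relations without an entry in the table, such as the hypothetical `interacts with` here, are passed through unchanged.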

Entity Metadata
Wiki Page: Dependencies
Jupyter Notebook: Data_Preparation.ipynb

Before building a knowledge graph, one may need to prepare files needed to create mappings between identifiers and/or to filter input edge data sources. The Jupyter Notebook referenced above provides several detailed examples of how these data were created for the knowledge graphs available for the v2.0.0 build.

The knowledge graph can be built with or without the inclusion of instance entity metadata (i.e. labels, descriptions or definitions, and synonyms).

{
    'nodes': {
        'http://www.ncbi.nlm.nih.gov/gene/1': {
            'Label': 'A1BG',
            'Description': "A1BG has locus group 'protein-coding' and is located on chromosome 19 (19q13.43).",
            'Synonym': 'HYST2477alpha-1B-glycoprotein|HEL-S-163pA|ABG|A1B|GAB'} ... },
    'relations': {
        'http://purl.obolibrary.org/obo/RO_0002533': {
            'Label': 'sequence atomic unit',
            'Description': 'Any individual unit of a collection of like units arranged in a linear order',
            'Synonym': 'None'} ... }
}
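Given a metadata dictionary shaped like the example above, labels and synonyms can be looked up with a couple of small helpers. This is a sketch only: the helper names are hypothetical, the dictionary is a trimmed copy of the example, and treating the string 'None' as "no synonyms" and `|` as the synonym separator are assumptions based on that example.

```python
from typing import Dict, List, Optional

# Trimmed copy of the example dictionary shown above.
metadata: Dict[str, Dict[str, Dict[str, str]]] = {
    "nodes": {
        "http://www.ncbi.nlm.nih.gov/gene/1": {
            "Label": "A1BG",
            "Synonym": "HYST2477alpha-1B-glycoprotein|HEL-S-163pA|ABG|A1B|GAB",
        }
    },
    "relations": {
        "http://purl.obolibrary.org/obo/RO_0002533": {
            "Label": "sequence atomic unit",
            "Synonym": "None",
        }
    },
}


def find_entry(uri: str) -> Optional[Dict[str, str]]:
    """Look the URI up among nodes first, then relations."""
    for section in ("nodes", "relations"):
        entry = metadata.get(section, {}).get(uri)
        if entry is not None:
            return entry
    return None


def label_for(uri: str) -> Optional[str]:
    entry = find_entry(uri)
    return entry.get("Label") if entry else None


def synonyms_for(uri: str) -> List[str]:
    entry = find_entry(uri)
    raw = entry.get("Synonym", "None") if entry else "None"
    return [] if raw == "None" else raw.split("|")


print(label_for("http://www.ncbi.nlm.nih.gov/gene/1"))
print(synonyms_for("http://www.ncbi.nlm.nih.gov/gene/1"))
```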

STEP 6: Remove OWL Semantics

Wiki Page: Dependencies

The knowledge graph can be built with or without the inclusion of edges that contain OWL Semantics. For information on how OWL-encoded classes and triples are filtered, please see the OWL-NETS 2.0 wiki.

STEP 7: Generate Knowledge Graph Output

We provide several different types of output, each of which is described briefly below. Please note that in order to create the logic (XXXX_OWL_LogicOnly.nt) and annotation (XXXX_OWL_AnnotationsOnly.nt) subsets of each graph and be able to combine them (XXXX_OWL.nt) we have added a namespace to all BNode or anonymous nodes. More specifically, there are two kinds of pkt namespaces you will find within these files:

https://github.com/callahantiff/PheKnowLator/pkt/. This namespace is used for all non-ontology data defined owl:Class and owl:NamedIndividual objects that are added in order to integrate non-ontological entities (see here for more information).
https://github.com/callahantiff/PheKnowLator/pkt/bnode/. This namespace is used for all existing BNode or anonymous nodes and is applied to these types of entities prior to subsetting an input graph.

To remove the second type of namespacing from BNode that are part of the original ontologies used in each build, you can run the code shown below:

from pkt.utils import removes_namespace_from_bnodes

# remove bnode namespaces
updated_graph = removes_namespace_from_bnodes(org_graph)
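To check which of the two pkt namespaces a given node carries, a simple prefix test is enough. The helper below is a hypothetical illustration and is not part of `pkt.utils`; the sample URIs are taken from the example output in the table further down this page.

```python
PKT_NS = "https://github.com/callahantiff/PheKnowLator/pkt/"
PKT_BNODE_NS = PKT_NS + "bnode/"


def pkt_namespace_kind(uri: str) -> str:
    """Classify a URI as a namespaced BNode, an added pkt entity, or anything else."""
    uri = uri.strip("<>")  # tolerate N-Triples style angle brackets
    # The bnode namespace is nested inside the pkt namespace, so test it first.
    if uri.startswith(PKT_BNODE_NS):
        return "bnode"
    if uri.startswith(PKT_NS):
        return "pkt-entity"
    return "other"


samples = [
    "<https://github.com/callahantiff/PheKnowLator/pkt/bnode/Naaf1e1ac9eb14bae931889cdfadf1fb2>",
    "<https://github.com/callahantiff/PheKnowLator/pkt/N1008c5d52d72c407c8e1fe6960cc079c>",
    "<http://purl.obolibrary.org/obo/HP_0025154>",
]
for sample in samples:
    print(pkt_namespace_kind(sample), sample)
```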

Please also note that for all builds prior to v3.0.2, there are 2,008 nodes in the NodeLabels.txt files that contain foreign characters. While there is now code in place to prevent this error from happening in the future, there is also a solution to account for the prior builds. The bad_node_patch.json file contains a dictionary where the outer keys are entity_uri values and the outer values are dictionaries whose keys are label and description/definition and whose values are the updated strings without foreign characters. An example of this dictionary is shown below:

key = '<http://purl.obolibrary.org/obo/UBERON_0000468>'

print(bad_node_patch[key])
>>> {'label': 'multicellular organism', 'description/definition': 'Anatomical structure that is an individual member of a species and consists of more than one cell.'}

The code to identify the nodes with erroneous foreign characters is shown below:

import re
import pandas as pd

# link to downloaded `NodeLabels.txt` file
input_file = 'NodeLabels.txt'

# load data as Pandas DataFrame
nodedf = pd.read_csv(input_file, sep='\t', header=0)

# identify bad nodes and filter DataFrame so it only contains these rows
nodedf['bad'] = nodedf['label'].apply(lambda x: re.search("[\u4e00-\u9FFF]", x) if not pd.isna(x) else None)
nodedf_bad_nodes = nodedf[~pd.isna(nodedf['bad'])].drop_duplicates()
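Once the affected nodes are known, the patch dictionary can be applied to the node label rows. The sketch below uses plain dictionaries instead of a DataFrame so that it runs without pandas. The function name and the garbled label in the sample row are made up for illustration; the patch entry is the one shown in the example above.

```python
from typing import Dict, List

Row = Dict[str, str]

# Patch entry copied from the example above.
bad_node_patch: Dict[str, Dict[str, str]] = {
    "<http://purl.obolibrary.org/obo/UBERON_0000468>": {
        "label": "multicellular organism",
        "description/definition": "Anatomical structure that is an individual member "
        "of a species and consists of more than one cell.",
    }
}


def apply_bad_node_patch(rows: List[Row], patch: Dict[str, Dict[str, str]]) -> List[Row]:
    """Return copies of the rows, replacing the patched fields where a patch entry exists."""
    patched = []
    for row in rows:
        fix = patch.get(row["entity_uri"], {})
        # Fields in the patch overwrite the originals; other columns are kept.
        patched.append({**row, **fix})
    return patched


# Hypothetical rows: the first label carries made-up stray characters.
rows = [
    {"entity_uri": "<http://purl.obolibrary.org/obo/UBERON_0000468>",
     "label": "multicellular organism\u4e2d\u6587",
     "description/definition": "NA"},
    {"entity_uri": "<http://purl.obolibrary.org/obo/RO_0002002>",
     "label": "has boundary",
     "description/definition": "NA"},
]
for row in apply_bad_node_patch(rows, bad_node_patch):
    print(row["entity_uri"], "->", row["label"])
```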

Table. Knowledge Graph Build Output

File	Details
`PheKnowLator_MergedOntologies.owl`	Description	This RDF/XML formatted file only contains the baseline set of cleaned merged ontologies.
`PheKnowLator_MergedOntologies.owl`	Example Output	<?xml version="1.0"?> <rdf:RDF xmlns="http://purl.obolibrary.org/obo/chebi.owl#" xml:base="http://purl.obolibrary.org/obo/chebi.owl" xmlns:chebi="http://purl.obolibrary.org/obo/chebi/" xmlns:refont="http://purl.obolibrary.org/obo/uberon/refont/" xmlns:obo2="http://www.geneontology.org/formats/oboInOwl#http://purl.obolibrary.org/obo/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:cellline1="http://www.ebi.ac.uk/cellline#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" xmlns:swrlb="http://www.w3.org/2003/11/swrlb#" ... >

OWL Builds The OWL builds store the complete expressive graph for either the `subclass` or `instance` builds.

`XXXX_OWL_LogicOnly.nt`	Description	This N-Triples formatted file contains the logical axioms for the baseline set of cleaned merged ontologies and all non-ontology edges. It does not contain any annotation assertions (i.e., metadata like labels, definitions, and synonyms). This file contains the minimum logical subset needed to run a deductive logic reasoner.
`XXXX_OWL_LogicOnly.nt`	Example Output	<https://github.com/callahantiff/PheKnowLator/pkt/N1008c5d52d72c407c8e1fe6960cc079c> <http://purl.obolibrary.org/obo/RO_0002511> <https://github.com/callahantiff/PheKnowLator/pkt/Ndd2e5c34e5200f57748b92ce48e01e97> . <http://purl.obolibrary.org/obo/HP_0025154> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> . <https://github.com/callahantiff/PheKnowLator/pkt/N354f816a252cbb880e55791e2f6c6c57> <http://purl.obolibrary.org/obo/RO_0002606> <https://github.com/callahantiff/PheKnowLator/pkt/N3390f9ec251ef7dc03acc8f7131f44dd> . <https://github.com/callahantiff/PheKnowLator/pkt/N99e5d2b45fed4e35dfeca4adc3efd5f6> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#NamedIndividual> . <http://purl.obolibrary.org/obo/UBERON_0034871> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <https://github.com/callahantiff/PheKnowLator/pkt/bnode/Naaf1e1ac9eb14bae931889cdfadf1fb2> . ...

`XXXX_OWL_AnnotationsOnly.nt`	Description	This N-Triples formatted file contains annotation assertions (i.e., metadata like labels, definitions, and synonyms) for the baseline set of cleaned merged ontologies and all non-ontology edges.
`XXXX_OWL_AnnotationsOnly.nt`	Example Output	<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000504742> <http://www.w3.org/2000/01/rdf-schema#label> "SLC4A9-202" . <http://purl.obolibrary.org/obo/CLO_0017167> <http://www.w3.org/2000/01/rdf-schema#seeAlso> "OMIM: 168600"^^<http://www.w3.org/2001/XMLSchema#string> . <https://github.com/callahantiff/PheKnowLator/pkt/bnode/N3794abf456e345b3bb974563deb1e42d> <http://www.geneontology.org/formats/oboInOwl#hasDbXref> "MESH:D000820"^^<http://www.w3.org/2001/XMLSchema#string> . <http://purl.obolibrary.org/obo/GO_1990556> <http://www.geneontology.org/formats/oboInOwl#created_by> "vw" . <http://purl.obolibrary.org/obo/UBERON_0004784> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> "lower chamber of heart anatomical wall" . ...

`XXXX_OWL.nt`	Description	This N-Triples formatted file contains the baseline set of cleaned merged ontologies and all non-ontology edges. It contains the minimum logical subset (`XXXX_OWL_LogicOnly.nt`) and all annotation assertions (`XXXX_OWL_AnnotationsOnly.nt`). This file contains all OWL semantics needed to run a deductive logic reasoner.
`XXXX_OWL.nt`	Example Output	<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000504742> <http://www.w3.org/2000/01/rdf-schema#label> "SLC4A9-202" . <http://purl.obolibrary.org/obo/CLO_0017167> <http://www.w3.org/2000/01/rdf-schema#seeAlso> "OMIM: 168600"^^<http://www.w3.org/2001/XMLSchema#string> . <https://github.com/callahantiff/PheKnowLator/pkt/bnode/N3794abf456e345b3bb974563deb1e42d> <http://www.geneontology.org/formats/oboInOwl#hasDbXref> "MESH:D000820"^^<http://www.w3.org/2001/XMLSchema#string> . <http://purl.obolibrary.org/obo/GO_1990556> <http://www.geneontology.org/formats/oboInOwl#created_by> "vw" . <http://purl.obolibrary.org/obo/UBERON_0004784> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> "lower chamber of heart anatomical wall". ...

`XXXX_OWL_NetworkxMultiDiGraph.gpickle`	Description	This file is a NetworkX MultiDiGraph representation of the same content that is stored in the `XXXX_OWL.nt` file. Note that this representation includes keys for nodes and edges (`node: key = URI; edge: predicate_key = MD5hash("s_uri" + "p_uri" + "o_uri")`). Each edge also has a default weight of 0.0.
`XXXX_OWL_NetworkxMultiDiGraph.gpickle`	Example Output	Encoded file format; no preview. See https://networkx.org/documentation/stable/reference/readwrite/gpickle.html for more details, or see the following example (note that `nx.read_gpickle` was removed in NetworkX 3.0, so newer installations need to load the file with Python's `pickle` module instead): `import networkx as nx from rdflib import URIRef # read in graph f = 'XXXX_OWL_NetworkxMultiDiGraph.gpickle' kg = nx.read_gpickle(f) # look up nodes kg[URIRef('http://purl.obolibrary.org/obo/CHEBI_73558')] kg[URIRef('http://purl.obolibrary.org/obo/CHEBI_28940')]`

`XXXX_OWL_Triples_Identifiers.txt`	Description	This tab-delimited text file contains the same information as the `.nt` and `.gpickle` files, but is organized into a common format used by many graph representation learning algorithms. The file contains three columns, one for each part of a triple (i.e., subject, predicate, object), where each identifier is the full resolvable URI.
`XXXX_OWL_Triples_Identifiers.txt`	Example Output	subject predicate object <https://github.com/callahantiff/PheKnowLator/pkt/N1f1d61aed39aa7c2fd9ad2b40a23dce0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#NamedIndividual> <https://github.com/callahantiff/PheKnowLator/pkt/N4e21b014fe4347facaec2a309eafcf3b> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/UBERON_0008952> <https://github.com/callahantiff/PheKnowLator/pkt/N4e756731643dcfdc7fbb6cc6aa898b59> <http://purl.obolibrary.org/obo/RO_0002200> <https://github.com/callahantiff/PheKnowLator/pkt/N855ce51e1cbada67ff58bac057e628cc> <https://github.com/callahantiff/PheKnowLator/pkt/N016dbc163f9535349d961768267afe35> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#NamedIndividual> <http://purl.obolibrary.org/obo/CHEBI_154851> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.obolibrary.org/obo/CHEBI_50699> ...

`XXXX_OWL_Triples_Integers.txt`	Description	This tab-delimited text file contains the same information as the `.nt` and `.gpickle` files but is organized into a common format used by many graph representation learning algorithms. The file contains three columns, one for each part of a triple (i.e., subject, predicate, object). The primary difference between this file and the `XXXX_OWL_Triples_Identifiers.txt` file is that the identifier URIs have been mapped to integers.
`XXXX_OWL_Triples_Integers.txt`	Example Output	`subject predicate object 1 2 3 4 2 5 6 7 8 9 2 3 10 11 12 ...`

`XXXX_OWL_Triples_Integer_Identifier_Map.json`	Description	This JSON file contains a dictionary where the keys are node identifiers and the values are integers. It stores the conversion from the `XXXX_OWL_Triples_Identifiers.txt` file to the `XXXX_OWL_Triples_Integers.txt` file.
`XXXX_OWL_Triples_Integer_Identifier_Map.json`	Example Output	`{"<https://github.com/callahantiff/PheKnowLator/pkt/N55ef15b2f8a12726db7caa5567c2632f>": 398640, "<https://github.com/callahantiff/PheKnowLator/pkt/Nd3e86eb584157041fa49617139ce5d4c>": 398641, "<https://github.com/callahantiff/PheKnowLator/pkt/N36f2ad24b20497da3d219f819ee7e37c>": 398642, "<https://github.com/callahantiff/PheKnowLator/pkt/N13fc5884c3f7e166f5bc2469f79f4b01>": 398643, "<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000409200>": 398644 ...}`

`XXXX_OWL_NodeLabels.txt`	Description	This tab-delimited `.txt` file contains metadata on all nodes and relations in the N-Triples, gpickle, and `XXXX_OWL_Triples_Identifiers.txt` files. It contains the following columns: entity_type (e.g., "NODES", "RELATIONS", or "NA" if not a `owl:Class`, `owl:NamedIndividual`, `owl:ObjectProperty`, or `owl:AnnotationProperty`) integer_id (e.g., 1 - the integer used to represent this URI in the Edge List output -- matches the integer assignment from the `XXXX_OWL_Triples_Integers.txt` file) entity_uri (e.g., "GO_0048252") label (e.g. "lauric acid metabolic process") description/definition (e.g., "The chemical reactions and pathways involving lauric acid, a fatty acid with the formula CH3(CH2)10COOH. Derived from vegetable sources.") synonym (e.g., "lauric acid metabolism\|n-dodecanoic acid metabolic process\|n-dodecanoic acid metabolism") NOTE. There will be entries in this file that contain values of "NA" for the `entity_type` column. This is expected for these types of builds; a value of "NA" is used for all nodes and relations that are not an `owl:Class`, `owl:NamedIndividual`, `owl:ObjectProperty` or `owl:AnnotationProperty`.
`XXXX_OWL_NodeLabels.txt`	Example Output	entity_type integer_id entity_uri label description/definition synonym NODES 375312 <http://www.ncbi.nlm.nih.gov/gene/58155> PTBP2 (human) A protein coding gene PTBP2 in human. None NODES 6297907 <https://www.ncbi.nlm.nih.gov/snp/rs10902762> NM_000203.5(IDUA):c.60G>A (p.Ala20=) This variant is a germline/unknown single nucleotide variant located on chromosome 4 (NC_000004.12, start:987144/stop:987144 positions, cytogenetic location:4p16.3) and has clinical significance 'Benign'. This entry is for the GRCh38 and was last reviewed on Nov 26, 2020 with review status 'criteria provided, multiple submitters, no conflicts'None NA 7892255 <https://github.com/callahantiff/PheKnowLator/pkt/N707b36b2731f5ca97561eeb17e1fb039> NA NA NA RELATIONS 2057563 <http://purl.obolibrary.org/obo/RO_0002002> has boundary a relation between a material entity and a 2D immaterial entity (the boundary), in which the boundary delimits the material entity None RELATIONS 958453 <http://purl.obolibrary.org/obo/RO_0002444> parasite of None direct parasite of ...

OWL-NETS Builds The OWL-NETS files have undergone a transformation that decodes all OWL semantics in order to create a graph that only contains biologically relevant nodes and edges and is much more useful for inductive types of machine learning. For more information on this transformation see: https://github.com/callahantiff/PheKnowLator/wiki/OWL-NETS-2.0
`XXXX_OWLNETS.nt`	Description	This N-Triples formatted file contains the OWL-NETS transformed build.
`XXXX_OWLNETS.nt`	Example Output	<http://purl.obolibrary.org/obo/MONDO_0014305> <http://purl.obolibrary.org/obo/RO_0002200> <http://purl.obolibrary.org/obo/HP_0001336> . <https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000649544> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.obolibrary.org/obo/SO_0000673> . <http://purl.obolibrary.org/obo/CHEBI_154851> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.obolibrary.org/obo/CHEBI_50699> . <https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000422544> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.obolibrary.org/obo/SO_0001217> . <http://purl.obolibrary.org/obo/CHEBI_50131> <http://purl.obolibrary.org/obo/RO_0002436> <http://purl.obolibrary.org/obo/GO_0010604> . ...

`XXXX_OWLNETS_NetworkxMultiDiGraph.gpickle`	Description	This file is a NetworkX MultiDiGraph representation of the same content that is stored in the `XXXX_OWLNETS.nt` file. Note that this representation includes keys for nodes and edges (`node: key = URI; edge: predicate_key = MD5hash("s_uri" + "p_uri" + "o_uri")`). Each edge also has a default weight of 0.0.
`XXXX_OWLNETS_NetworkxMultiDiGraph.gpickle`	Example Output	Encoded file format; no preview. See https://networkx.org/documentation/stable/reference/readwrite/gpickle.html for more details, or see the following example: `import networkx as nx from rdflib import URIRef # read in graph f = 'XXXX_OWLNETS_NetworkxMultiDiGraph.gpickle' kg = nx.read_gpickle(f) # look up nodes kg[URIRef('http://purl.obolibrary.org/obo/CHEBI_73558')] kg[URIRef('http://purl.obolibrary.org/obo/CHEBI_28940')]`

`XXXX_OWLNETS_decoding_dict.pkl`	Description	This dictionary stores details about the OWL-NETS transformation. Specifically, it contains metadata that can be used to reverse the transformation.
`XXXX_OWLNETS_decoding_dict.pkl`	Example Output	{disjointWith: (rdflib.term.BNode('N0fb945ed26b14180907e29b5ffa1403e'), rdflib.term.URIRef('http://www.w3.org/2002/07/owl#disjointWith'), rdflib.term.BNode('N4a694a93a05843c3a0492587318538ca')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/MONDO_0033946'), rdflib.term.URIRef('http://www.w3.org/2002/07/owl#disjointWith'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/MONDO_0033947')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/UBERON_0001628'), rdflib.term.URIRef('http://www.w3.org/2002/07/owl#disjointWith'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/UBERON_0006764')) filtered_triples: (rdflib.term.URIRef('https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000641330'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0001025'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/UBERON_0000178')), (rdflib.term.URIRef('http://www.ncbi.nlm.nih.gov/gene/23362'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002511'), rdflib.term.URIRef('https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000518315')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_Q6ZVK8'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002436'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_63715')) ... }

`XXXX_OWLNETS_Triples_Identifiers.txt`	Description	This tab-delimited text file contains the same information as the `.nt` and `.gpickle` files, but is organized into a common format used by many graph representation learning algorithms. The file contains three columns, one for each part of a triple (i.e., subject, predicate, object), where each identifier is the full resolvable URI.
`XXXX_OWLNETS_Triples_Identifiers.txt`	Example Output	subject predicate object <http://purl.obolibrary.org/obo/MONDO_0014305> <http://purl.obolibrary.org/obo/RO_0002200> <http://purl.obolibrary.org/obo/HP_0001336> <https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000649544> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.obolibrary.org/obo/SO_0000673> <http://purl.obolibrary.org/obo/CHEBI_154851> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.obolibrary.org/obo/CHEBI_50699> <https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000422544> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.obolibrary.org/obo/SO_0001217> <http://purl.obolibrary.org/obo/CHEBI_50131> <http://purl.obolibrary.org/obo/RO_0002436> <http://purl.obolibrary.org/obo/GO_0010604> ...

`XXXX_OWLNETS_Triples_Integers.txt`	Description	This tab-delimited text file contains the same information as the `.nt` and `.gpickle` files, but is organized into a common format used by many graph representation learning algorithms. The file contains three columns, one for each part of a triple (i.e., subject, predicate, object). The primary difference between this file and the `XXXX_OWLNETS_Triples_Identifiers.txt` file is that the identifier URIs have been mapped to integers.
`XXXX_OWLNETS_Triples_Integers.txt`	Example Output	`subject predicate object 1 2 3 4 5 6 7 5 8 9 5 10 11 12 13 ...`

`XXXX_OWLNETS_Triples_Integer_Identifier_Map.json`	Description	This JSON file contains a dictionary where the keys are node identifiers and the values are integers. It stores the conversion from the `XXXX_OWLNETS_Triples_Identifiers.txt` file to the `XXXX_OWLNETS_Triples_Integers.txt` file.
`XXXX_OWLNETS_Triples_Integer_Identifier_Map.json`	Example Output	`{"<http://purl.obolibrary.org/obo/CHEBI_59626>": 763807, "<http://purl.obolibrary.org/obo/CHEBI_138446>": 763808, "<http://purl.obolibrary.org/obo/GO_0039685>": 763809, "<http://purl.obolibrary.org/obo/CHEBI_37269>": 763810, "<http://purl.obolibrary.org/obo/HP_0025531>": 763811 ...}`

`XXXX_OWLNETS_NodeLabels.txt`	Description	This tab-delimited `.txt` file contains metadata on all nodes and relations in the N-Triples, gpickle, and `XXXX_OWLNETS_Triples_Identifiers.txt` files. It contains the following columns: entity_type (e.g., "NODES", "RELATIONS", or "NA" if not a `owl:Class`, `owl:NamedIndividual`, `owl:ObjectProperty`, or `owl:AnnotationProperty`); integer_id (e.g., 1 - the integer used to represent this URI in the Edge List output -- matches the integer assignment from the `XXXX_OWLNETS_Triples_Integers.txt` file); entity_uri (e.g., "GO_0048252"); label (e.g., "lauric acid metabolic process"); description/definition (e.g., "The chemical reactions and pathways involving lauric acid, a fatty acid with the formula CH3(CH2)10COOH. Derived from vegetable sources."); synonym (e.g., "lauric acid metabolism\|n-dodecanoic acid metabolic process\|n-dodecanoic acid metabolism").
`XXXX_OWLNETS_NodeLabels.txt`	Example Output	entity_type integer_id entity_uri label description/definition synonym NODES 260743 <https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000651614> IL6ST-216 Transcript IL6ST-216 is classified as type 'nonsense_mediated_decay'. None NODES 289592 <https://www.ncbi.nlm.nih.gov/snp/rs116659770> NM_001440.4(EXTL3):c.1324G>C (p.Val442Leu) This variant is a germline single nucleotide variant located on chromosome 8 (NC_000008.11, start:28717383/stop:28717383 positions, cytogenetic location:8p21.1) and has clinical significance 'Benign/Likely benign'. This entry is for the GRCh38 and was last reviewed on Nov 20, 2020 with review status 'criteria provided, multiple submitters, no conflicts'. None NODES 45199 <https://www.ncbi.nlm.nih.gov/snp/rs375573986> NM_000098.3(CPT2):c.399A>G (p.Pro133=) This variant is a germline single nucleotide variant located on chromosome 1 (NC_000001.11, start:53210073/stop:53210073 positions, cytogenetic location:1p32.3) and has clinical significance 'Likely benign'. This entry is for the GRCh38 and was last reviewed on Aug 30, 2020 with review status 'criteria provided, multiple submitters, no conflicts'. None RELATIONS 107080 <http://purl.obolibrary.org/obo/RO_0002492> existence ends during Relation between continuant c and occurrent s, such that every instance of c ceases to exist during some s, if it does not die prematurely. ceases_to_exist_during RELATIONS 189912 <http://purl.obolibrary.org/obo/VO_0000529> has vaccine adjuvant a type of 'has vaccine component' relation that is specifically for vaccine adjuvant component None ...

Purified OWL-NETS Builds The purified version of an OWL-NETS build is designed to convert the base OWL-NETS build into a version that is completely consistent with a specific construction approach. For example, if the build is `instance`-based, then all `rdfs:subClassOf` relations are converted to `rdf:type` and for all triples where an `rdfs:subClassOf` relation occurred we add `rdf:type` relations between the object of this triple and all of its ancestors. For a `subclass`-based build, we implement the same procedure but replace all occurrences of `rdf:type` with `rdfs:subClassOf`. Please note that these build types are considered experimental as we are still in the process of fully testing them.

`XXXX_OWLNETS_XXXX_purified_OWLNETS.nt`	Description	This N-Triples formatted file contains the purified OWL-NETS transformed build.
`XXXX_OWLNETS_XXXX_purified_OWLNETS.nt`	Example Output	<http://purl.obolibrary.org/obo/MONDO_0014305> <http://purl.obolibrary.org/obo/RO_0002200> <http://purl.obolibrary.org/obo/HP_0001336> . <http://purl.obolibrary.org/obo/CHEBI_50131> <http://purl.obolibrary.org/obo/RO_0002436> <http://purl.obolibrary.org/obo/GO_0010604> . <https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000551077> <http://purl.obolibrary.org/obo/RO_0001025> <http://purl.obolibrary.org/obo/CLO_0000652> . <https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000672281> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/SO_0001503> . <http://purl.obolibrary.org/obo/MONDO_0011070> <http://purl.obolibrary.org/obo/RO_0002200> <http://purl.obolibrary.org/obo/HP_0002714> . ...

`XXXX_OWLNETS_XXXX_purified_NetworkxMultiDiGraph.gpickle`	Description	This file is a NetworkX MultiDiGraph representation of the same content that is stored in the `XXXX_OWLNETS_XXXX_purified_OWLNETS.nt` file. Note that this representation includes keys for nodes and edges (`node: key = URI; edge: predicate_key = MD5hash("s_uri" + "p_uri" + "o_uri")`). Each edge also has a default weight of 0.0.
`XXXX_OWLNETS_XXXX_purified_NetworkxMultiDiGraph.gpickle`	Example Output	Encoded file format; no preview. See https://networkx.org/documentation/stable/reference/readwrite/gpickle.html for more details, or see the following example: `import networkx as nx from rdflib import URIRef # read in graph f = 'XXXX_OWLNETS_XXXX_purified_NetworkxMultiDiGraph.gpickle' kg = nx.read_gpickle(f) # look up nodes kg[URIRef('http://purl.obolibrary.org/obo/CHEBI_73558')] kg[URIRef('http://purl.obolibrary.org/obo/CHEBI_28940')]`

`XXXX_OWLNETS_XXXX_purified_decoding_dict.pkl`	Description	This dictionary stores details about the purified OWL-NETS transformation. Specifically, it contains metadata that can be used to reverse the transformation.
`XXXX_OWLNETS_XXXX_purified_decoding_dict.pkl`	Example Output	{disjointWith: (rdflib.term.BNode('N0fb945ed26b14180907e29b5ffa1403e'), rdflib.term.URIRef('http://www.w3.org/2002/07/owl#disjointWith'), rdflib.term.BNode('N4a694a93a05843c3a0492587318538ca')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/MONDO_0033946'), rdflib.term.URIRef('http://www.w3.org/2002/07/owl#disjointWith'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/MONDO_0033947')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/UBERON_0001628'), rdflib.term.URIRef('http://www.w3.org/2002/07/owl#disjointWith'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/UBERON_0006764')) filtered_triples: (rdflib.term.URIRef('https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000641330'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0001025'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/UBERON_0000178')), (rdflib.term.URIRef('http://www.ncbi.nlm.nih.gov/gene/23362'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002511'), rdflib.term.URIRef('https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000518315')), (rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_Q6ZVK8'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002436'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_63715')) ... }

`XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt`	Description	This tab-delimited text file contains the same information as the `.nt` and `.gpickle` files, but is organized into a common format used by many graph representation learning algorithms. The file contains three columns, one for each part of a triple (i.e., subject, predicate, object), where each identifier is the full resolvable URI.
`XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt`	Example Output	subject predicate object <http://purl.obolibrary.org/obo/MONDO_0014305> <http://purl.obolibrary.org/obo/RO_0002200> <http://purl.obolibrary.org/obo/HP_0001336> <https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000649544> <http://www.w3.org/2000/01/rdf#type> <http://purl.obolibrary.org/obo/SO_0000673> <http://purl.obolibrary.org/obo/CHEBI_154851> <http://www.w3.org/2000/01/rdf#type> <http://purl.obolibrary.org/obo/CHEBI_50699> <https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000422544> <http://www.w3.org/2000/01/rdf#type> <http://purl.obolibrary.org/obo/SO_0001217> <http://purl.obolibrary.org/obo/CHEBI_50131> <http://purl.obolibrary.org/obo/RO_0002436> <http://purl.obolibrary.org/obo/GO_0010604> ...

`XXXX_OWLNETS_XXXX_purified_Triples_Integers.txt`	Description	This tab-delimited text file contains the same information as the `.nt` and `.gpickle` files, but is organized into a common format used by many graph representation learning algorithms. The file contains three columns, one for each part of a triple (i.e., subject, predicate, object). It differs from the `XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt` file only in that each identifier URI has been mapped to an integer.
`XXXX_OWLNETS_XXXX_purified_Triples_Integers.txt`	Example Output	`subject predicate object 1 2 3 4 5 6 7 5 8 9 5 10 11 12 13 ...`

`XXXX_OWLNETS_XXXX_purified_Triples_Integer_Identifier_Map.json`	Description	This JSON file contains a dictionary where the keys are node identifiers and the values are integers. It stores the conversion from the `XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt` file to the `XXXX_OWLNETS_XXXX_purified_Triples_Integers.txt` file.
`XXXX_OWLNETS_XXXX_purified_Triples_Integer_Identifier_Map.json`	Example Output	`{"<http://purl.obolibrary.org/obo/CHEBI_59626>": 763807, "<http://purl.obolibrary.org/obo/CHEBI_138446>": 763808, "<http://purl.obolibrary.org/obo/GO_0039685>": 763809, "<http://purl.obolibrary.org/obo/CHEBI_37269>": 763810, "<http://purl.obolibrary.org/obo/HP_0025531>": 763811 ...}`
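
The relationship between the `Triples_Identifiers`, `Triples_Integers`, and `Integer_Identifier_Map` files can be sketched as follows. The helper name `build_integer_map` is hypothetical, and numbering identifiers from 1 in order of first appearance is an assumption chosen because it is consistent with the example output shown above; the actual build may assign integers differently.

```python
import csv
import io
import json


def build_integer_map(rows):
    """Assign each distinct identifier the next unused integer, starting at 1."""
    mapping = {}
    for row in rows:
        for identifier in row:
            if identifier not in mapping:
                mapping[identifier] = len(mapping) + 1
    return mapping


# in-memory stand-in for the first rows of a Triples_Identifiers file
identifiers_txt = (
    "subject\tpredicate\tobject\n"
    "<http://purl.obolibrary.org/obo/MONDO_0014305>\t"
    "<http://purl.obolibrary.org/obo/RO_0002200>\t"
    "<http://purl.obolibrary.org/obo/HP_0001336>\n"
    "<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000649544>\t"
    "<http://www.w3.org/2000/01/rdf#type>\t"
    "<http://purl.obolibrary.org/obo/SO_0000673>\n"
    "<http://purl.obolibrary.org/obo/CHEBI_154851>\t"
    "<http://www.w3.org/2000/01/rdf#type>\t"
    "<http://purl.obolibrary.org/obo/CHEBI_50699>\n"
)

reader = csv.reader(io.StringIO(identifiers_txt), delimiter="\t")
header = next(reader)  # skip the "subject predicate object" header row
rows = [tuple(row) for row in reader]

id_map = build_integer_map(rows)  # content of the Integer_Identifier_Map file
integer_rows = [tuple(id_map[x] for x in row) for row in rows]  # Triples_Integers rows
map_json = json.dumps(id_map)
```

Nodes and relations share a single integer space here, which is why the repeated predicate in the second and third rows maps to the same integer.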

`XXXX_OWLNETS_XXXX_purified_NodeLabels.txt`	Description	This tab-delimited `.txt` file contains metadata on all nodes and relations in the N-Triples, gpickle, and `XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt` files. It contains the following columns: `entity_type` (e.g., "NODES", "RELATIONS", or "NA" if not an `owl:Class`, `owl:NamedIndividual`, `owl:ObjectProperty`, or `owl:AnnotationProperty`); `integer_id` (e.g., 1, the integer used to represent this URI in the Edge List output, which matches the integer assignment from the `XXXX_OWLNETS_XXXX_purified_Triples_Integers.txt` file); `entity_uri` (e.g., "GO_0048252"); `label` (e.g., "lauric acid metabolic process"); `description/definition` (e.g., "The chemical reactions and pathways involving lauric acid, a fatty acid with the formula CH3(CH2)10COOH. Derived from vegetable sources."); `synonym` (e.g., "lauric acid metabolism|n-dodecanoic acid metabolic process|n-dodecanoic acid metabolism").
`XXXX_OWLNETS_XXXX_purified_NodeLabels.txt`	Example Output	entity_type integer_id entity_uri label description/definition synonym NODES 260743 <https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000651614> IL6ST-216 Transcript IL6ST-216 is classified as type 'nonsense_mediated_decay'. None NODES 289592 <https://www.ncbi.nlm.nih.gov/snp/rs116659770> NM_001440.4(EXTL3):c.1324G>C (p.Val442Leu) This variant is a germline single nucleotide variant located on chromosome 8 (NC_000008.11, start:28717383/stop:28717383 positions, cytogenetic location:8p21.1) and has clinical significance 'Benign/Likely benign'. This entry is for the GRCh38 and was last reviewed on Nov 20, 2020 with review status 'criteria provided, multiple submitters, no conflicts'. None NODES 45199 <https://www.ncbi.nlm.nih.gov/snp/rs375573986> NM_000098.3(CPT2):c.399A>G (p.Pro133=) This variant is a germline single nucleotide variant located on chromosome 1 (NC_000001.11, start:53210073/stop:53210073 positions, cytogenetic location:1p32.3) and has clinical significance 'Likely benign'. This entry is for the GRCh38 and was last reviewed on Aug 30, 2020 with review status 'criteria provided, multiple submitters, no conflicts'. None RELATIONS 107080 <http://purl.obolibrary.org/obo/RO_0002492> existence ends during Relation between continuant c and occurrent s, such that every instance of c ceases to exist during some s, if it does not die prematurely. ceases_to_exist_during RELATIONS 189912 <http://purl.obolibrary.org/obo/VO_0000529> has vaccine adjuvant a type of 'has vaccine component' relation that is specifically for vaccine adjuvant component None ...
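
A minimal sketch of reading this file with the standard library is shown below, using two of the rows from the example output as in-memory input. The helper name `split_synonyms` is hypothetical, and treating the literal string `None` as "no synonyms" and `|` as the synonym separator are assumptions based on the example output and column description above.

```python
import csv
import io

# in-memory stand-in for a NodeLabels file (two rows from the example output)
node_labels_txt = (
    "entity_type\tinteger_id\tentity_uri\tlabel\tdescription/definition\tsynonym\n"
    "RELATIONS\t107080\t<http://purl.obolibrary.org/obo/RO_0002492>\t"
    "existence ends during\t"
    "Relation between continuant c and occurrent s, such that every instance of c "
    "ceases to exist during some s, if it does not die prematurely.\t"
    "ceases_to_exist_during\n"
    "RELATIONS\t189912\t<http://purl.obolibrary.org/obo/VO_0000529>\t"
    "has vaccine adjuvant\t"
    "a type of 'has vaccine component' relation that is specifically for vaccine "
    "adjuvant component\tNone\n"
)


def split_synonyms(value):
    """Split a synonym cell into a list; 'None' is treated as no synonyms."""
    return [] if value == "None" else value.split("|")


reader = csv.DictReader(
    io.StringIO(node_labels_txt), delimiter="\t", quoting=csv.QUOTE_NONE
)
rows = list(reader)

# look up a human-readable label from the integer used in the Triples_Integers file
labels = {int(row["integer_id"]): row["label"] for row in rows}
synonyms = {row["entity_uri"]: split_synonyms(row["synonym"]) for row in rows}
```

Quoting is disabled so that quote characters inside descriptions are read as ordinary text.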