RFC: JSON-LD / Schema.org mapping #378

Closed
mojodna opened this issue Jan 25, 2019 · 20 comments

@mojodna
Collaborator

mojodna commented Jan 25, 2019

As part of my work on STAC Browser, I've just merged preliminary JSON-LD support intended to facilitate indexing, searching, and display by Google Dataset Search.

I've tried to follow their guidelines, mapping Catalogs and Collections to schema.org DataCatalogs and Items to Datasets.

Catalog / Collection → DataCatalog

{
  "@context": "https://schema.org/",
  "@type": "DataCatalog",

  // required
  name: catalog.title,
  description: catalog.description, // as HTML

  // recommended
  identifier: catalog.properties["sci:doi"] || catalog.id,
  citation: catalog.properties["sci:citation"], // if available
  keywords: catalog.keywords,
  isBasedOn: catalog.url, // canonical STAC catalog URL (JSON)
  version: catalog.version,
  url: <STAC Browser URL>,
  // if available
  workExample: catalog.properties["sci:publications"].map(p => ({
    identifier: p.doi,
    citation: p.citation
  })),

  // if license is "proprietary"
  license: catalog.links.find(x => x.rel === "license").href,
  // if license is SPDX-compatible
  license: `https://spdx.org/licenses/${catalog.license}.html`,

  // if a spatial extent is available
  spatialCoverage: {
    "@type": "Place",
    geo: {
      "@type": "GeoShape",
      box: catalog.extent.spatial.join(" ")
    }
  },

  // if a temporal extent is available
  temporalCoverage: catalog.extent.temporal.map(x => x || "..").join("/"),

  // if a parent catalog is defined:
  isPartOf: {
    "@type": "DataCatalog",
    name: parent.title || parent.id, // if available
    isBasedOn: parent.url,
    url: <STAC Browser URL>
  },

  // for each child catalog:
  hasPart: {
    "@type": "DataCatalog",
    name: child.title,
    isBasedOn: child.url,
    url: <STAC Browser URL>
  },

  // for each referenced item:
  dataset: {
    identifier: item.id, // if available; requires loading the Item
    name: item.properties.title || item.id, // if available; requires loading the Item
    isBasedOn: item.url,
    url: <STAC Browser URL>
  }
}

providers are mapped according to roles (when multiple roles are specified, the provider is duplicated):

  • licensor → copyrightHolder
  • producer → producer
  • processor → contributor
  • host → provider

and rendered as:

{
  // ...
  [mapped role]: {
    description: provider.description, // if available
    name: provider.name,
    url: provider.url // if available
  }
}
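
A minimal sketch of the role mapping described above, in case it helps anyone reimplementing it. The helper name `mapProviders` is hypothetical, and collecting several providers with the same role into an array is an assumption (the snippet above shows a single object per mapped role).

```javascript
// STAC provider role → schema.org property, as listed above.
const ROLE_TO_PROPERTY = {
  licensor: "copyrightHolder",
  producer: "producer",
  processor: "contributor",
  host: "provider"
};

// Hypothetical helper: groups providers under their mapped schema.org property.
// A provider with several roles is duplicated, once per role.
function mapProviders(providers = []) {
  const result = {};
  for (const provider of providers) {
    for (const role of provider.roles || []) {
      const property = ROLE_TO_PROPERTY[role];
      if (!property) continue; // skip roles without a mapping

      const entry = { name: provider.name };
      if (provider.description) entry.description = provider.description;
      if (provider.url) entry.url = provider.url;

      // Assumption: multiple providers per property are kept as an array.
      (result[property] = result[property] || []).push(entry);
    }
  }
  return result;
}
```

The result can be spread into the DataCatalog object, e.g. `{ ...jsonLd, ...mapProviders(catalog.providers) }`.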

Item → Dataset

{
  "@context": "https://schema.org/",
  "@type": "Dataset",

  // required
  name: item.properties.title || item.id,
  description: item.properties.description, // if available

  // recommended
  identifier: item.properties["sci:doi"] || item.id,
  citation: item.properties["sci:citation"], // if available
  keywords: collection.keywords || rootCatalog.keywords, // inherit collection / root catalog keywords, if available
  // if license is "proprietary"
  license: [...item.links, ...collection.links, ...rootCatalog.links].find(x => x.rel === "license").href, // first license link found on the item, collection, or root catalog
  // if license is SPDX-compatible
  license: `https://spdx.org/licenses/${item.properties["item:license"] || collection.license || rootCatalog.license}.html`,
  isBasedOn: item.url, // canonical STAC item URL (JSON)
  url: <STAC Browser URL>,
  // if available
  workExample: item.properties["sci:publications"].map(p => ({
    identifier: p.doi,
    citation: p.citation
  })),
  image: item.assets.thumbnail.href, // if a thumbnail asset is available

  // for associated collections + parent catalogs
  includedInDataCatalog: {
    isBasedOn: c.href,
    url: <STAC Browser URL>
  },

  spatialCoverage: {
    "@type": "Place",
    geo: {
      "@type": "GeoShape",
      box: item.bbox.join(" ")
    }
  },

  temporalCoverage: item.properties["dtr:start_datetime"]
    ? [
        item.properties["dtr:start_datetime"],
        item.properties["dtr:end_datetime"]
      ]
        .map(x => x || "..")
        .join("/")
    : item.properties.datetime,

  // for each asset in item.assets
  distribution: {
    contentUrl: asset.href,
    fileFormat: asset.type,
    name: asset.title
  }
};

This implementation is live (with pre-rendered HTML) at https://planet.stac.cloud. Hopefully in the coming days it will be better indexed by Google (I've submitted the sitemap), including by Dataset Search, at which point we can see how well this mapping does at being rendered.

Meanwhile, the OpenLink Structured Data Sniffer extension for Chrome will extract JSON-LD to allow inspection.

Thoughts?

Refs #285

@m-mohr
Collaborator

m-mohr commented Jan 25, 2019

Great work. Will dig deeper into it next week.

One quick note: If the license field is set to proprietary the url from the links should be taken.
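
For illustration, the rule could look roughly like this. It's only a sketch: `resolveLicense` is a hypothetical name, and it assumes the entity carries a `license` string and a `links` array.

```javascript
// Hypothetical helper: returns a license URL for a Catalog/Collection-like object.
function resolveLicense(entity) {
  if (!entity.license) return undefined;

  if (entity.license === "proprietary") {
    // "proprietary" is not an SPDX identifier, so use the rel="license" link instead.
    const link = (entity.links || []).find(l => l.rel === "license");
    return link ? link.href : undefined;
  }

  // Otherwise treat the value as an SPDX identifier.
  return `https://spdx.org/licenses/${entity.license}.html`;
}
```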

@mojodna
Collaborator Author

mojodna commented Jan 26, 2019

Good call, thanks. I've updated the mapping above as well as STAC Browser.

@m-mohr
Collaborator

m-mohr commented Jan 26, 2019

Great!

Another idea I had: The temporalCoverage for the Item could also be composed as an interval from the fields provided by the Datetime-Range-Extension (dtr).

The citation field from schema.org and the publications field from our sci extension seem to map quite well for both Items and Catalogs.
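
A rough sketch of the interval idea, assuming the `dtr:start_datetime` / `dtr:end_datetime` property names from the Datetime-Range-Extension and using ".." for an open end. The function name `temporalCoverage` is just illustrative.

```javascript
// Hypothetical helper: builds a schema.org temporalCoverage value from Item properties.
function temporalCoverage(properties) {
  const start = properties["dtr:start_datetime"];
  const end = properties["dtr:end_datetime"];

  if (start || end) {
    // ISO 8601 interval "start/end"; a missing bound is written as "..".
    return [start, end].map(x => x || "..").join("/");
  }

  // No range available: fall back to the single datetime.
  return properties.datetime;
}
```

Items without the dtr fields keep their plain `datetime`, so nothing changes for them.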

@mojodna
Collaborator Author

mojodna commented Jan 28, 2019

I've mapped sci:citation to citation and publications to workExample (my read is that the relationship is the inverse of citation).

@joshfix
Contributor

joshfix commented Feb 4, 2019

If we're interested in natively supporting json-ld in the STAC JSON output:

http://geojson.org/geojson-ld/

@m-mohr
Collaborator

m-mohr commented Feb 4, 2019

Is GeoJSON-LD compatible with what Google is asking for?

@joshfix
Contributor

joshfix commented Feb 4, 2019

Good question. Also, are they looking for JSON-LD only in HTML files?

@mojodna
Collaborator Author

mojodna commented Feb 7, 2019

> Is GeoJSON-LD compatible with what Google is asking for?

I don't think so. The docs refer to https://schema.org/GeoShape. It'd be nice if they supported both though (and maybe they do).

> Also, are they looking for JSON-LD only in HTML files?

Yes, in the body of <script type="application/ld+json"> tags. I don't think they crawl JSON at the moment.
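
For anyone pre-rendering HTML themselves, embedding could look something like the sketch below. `toJsonLdScript` is a hypothetical helper, not something STAC Browser exposes.

```javascript
// Hypothetical helper: wraps a JSON-LD object in the script tag crawlers look for.
function toJsonLdScript(jsonLd) {
  // Escape "<" so a "</script>" inside a string value cannot close the tag early.
  // "\u003c" is a valid JSON escape, so the payload still parses to the same value.
  const body = JSON.stringify(jsonLd).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${body}</script>`;
}
```

The returned string can be placed anywhere in the page's `<head>` or `<body>`.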

@cholmes cholmes added this to the 0.8.0 milestone Apr 20, 2019
@cholmes
Contributor

cholmes commented Apr 22, 2019

Discussed on the call on April 22.
Status: Google Dataset Search doesn't actually use all of the tags that its guidelines imply. The consensus is that only 'catalogs' should be exposed, as that's the level of 'search' they do there now.

It isn't clear exactly how to put this into the spec, as it is just one implementation for now.
Decision was made to keep this issue around, but to update the spec to talk at a more basic level about exposing one's data as HTML, etc. Likely in the catalog best practices document.

@m-mohr
Collaborator

m-mohr commented Apr 23, 2019

We always claim in presentations that "Portals > Google", so I feel this is quite an important point that we should add to the repo at some point and make it part of the releases. Maybe it could just start as an extension, maybe an NPM script, that just easily converts STAC repos or files to JSON-LD.

@dazza-codes

Apologies for cross-posting here from another related issue to keep track of this development too.

I have experience with ontologies and linked-data solutions, having worked with the ontology group at Stanford and the Stanford library on several linked-data projects. However, there is a lot of work to be done on GIS ontologies and linked-data for catalog systems. The technology that is most aligned with STAC is obviously JSON-LD, but defining the context for STAC needs some work. I'm generally open to interests and discussions about EOS metadata standards (CF) and GIS metadata as ontologies and linked data. Related projects:

Also, with regard to linking data with publications:

@dazza-codes

https://ai.google/research/people/NatalyaNoy is the person to contact at Google for more insights and related research projects on dataset discovery.

@dazza-codes

dazza-codes commented Apr 23, 2019

Relevant search for any 'Geo' concepts in schema.org:

I don't know specifically if there is any explicit alignment or mapping between schema.org and geojson-ld, but it is always possible to publish using any relevant ontology. If they are aligned, it makes it easier.

If not, it might require separate documents for each of schema.org and geojson-ld. AFAICT, there is no explicit identification of geojson-ld concepts with a broader ontology in the docs at http://geojson.org/geojson-ld/vocab.html - maybe it's operating like a vocabulary rather than an ontology? It's flat - http://geojson.org/geojson-ld/vocab.rdf

See also:

@gkellogg - any thoughts on json-ld recommendations for Geo/GIS?

@gkellogg

Note that GeoJSON-LD notes problems with JSON-LD 1.0 for representing coordinates which are defined using lists of lists, which is unsupported in JSON-LD 1.0. JSON-LD 1.1 does support lists of lists. But, JSON-LD 1.1 is not yet a recommendation, although the feature is widely implemented.

@m-mohr
Collaborator

m-mohr commented Nov 6, 2019

We should throw this in a best practice document for 1.0

@m-mohr m-mohr modified the milestones: future, 1.0.0 Nov 6, 2019
@cholmes cholmes modified the milestones: 1.0.0-beta1, future Apr 6, 2020
@m-mohr m-mohr self-assigned this Aug 24, 2020
@m-mohr
Collaborator

m-mohr commented Aug 24, 2020

I'll take this over and update the schema.org mapping as I'll want to make this available for each Collection on STAC Index.

@m-mohr
Collaborator

m-mohr commented Apr 4, 2023

@m-mohr m-mohr closed this as completed Apr 4, 2023
@cboettig

cboettig commented Apr 4, 2023

Thanks @m-mohr !

Just a note: schema.org is consumed by many tools besides search engines, and it would be really desirable for these communities to have a simple utility that goes from STAC JSON -> schema.org JSON without websites / HTML involved (e.g. see https://github.com/ESIPFed/science-on-schema.org).

Happy to continue the discussion over on stac-browser if that's best, but I don't want to drag in issues that are of little interest to the browser part of stac-browser.
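
To make the idea concrete, here is a very small sketch of such a standalone conversion, following the Collection → DataCatalog mapping from the top of this issue. The function name is hypothetical, only a handful of fields are covered, and falling back to `id` for `name` is an assumption.

```javascript
// Hypothetical helper: converts a STAC Collection object to a schema.org DataCatalog.
// Works on plain JSON, no HTML involved.
function collectionToDataCatalog(collection) {
  const jsonLd = {
    "@context": "https://schema.org/",
    "@type": "DataCatalog",
    name: collection.title || collection.id, // assumption: fall back to the id
    description: collection.description,
    identifier: collection.id
  };

  // Optional fields are only emitted when present in the source.
  if (collection.keywords) jsonLd.keywords = collection.keywords;
  if (collection.version) jsonLd.version = collection.version;

  return jsonLd;
}
```

Usage would be along the lines of `JSON.stringify(collectionToDataCatalog(collection), null, 2)` on an already parsed Collection.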

@m-mohr
Collaborator

m-mohr commented Apr 4, 2023

@cboettig No one except STAC Browser has come up with an implementation in recent years afaik, so we thought it was of little interest (last comment: 2020) and closed it here. Having a standalone mapping implementation would be great though, but it likely won't live in this repo; it would live elsewhere, e.g. stac-utils, stac-extensions or so. Feel free to take over the work, and if there's a mapping separate from STAC Browser, I'm happy to adopt it. :-) (There are some open questions anyway, e.g. GDS doesn't use the DataCatalog, it seems?)

@rob-metalinkage

Part of the GeoDCAT work is defining JSON-LD mappings from OGC API Records. A logical extension would be a GeoDCAT-STAC profile with a context to map STAC elements to GeoDCAT. We would seek to inform GeoDCAT design with the elements that are of general utility and cannot be mapped to DCAT already.
