Link INSDC to LOD federations #429

Open
pbuttigieg opened this issue May 13, 2024 · 6 comments

Comments

@pbuttigieg
Collaborator

pbuttigieg commented May 13, 2024

https://github.com/enasequence/ena-experiment-checklist/tree/main/data%2Fschema

Build ODIS-compatible specifications based on ENA's exploratory work, linked above.

@pbuttigieg
Collaborator Author

pbuttigieg commented Jun 13, 2024

@Woolly-at-EBI pinging you to follow up on our plan formed at DOF2024

Let's think about how to get ENA sequence metadata linked in via a schema.org/JSON-LD model.

Do you have example records that conform to your schemata?

@Woolly-at-EBI

Yesterday, I talked with Colman (ENA product owner and also my line manager). He is happy for me to investigate and explore what needs to be done, so that is obviously good news.

At ENA there is a direction of travel to use much more JSON for input and output. The Swagger API allows JSON output. Bioschemas was created and is maintained by our sister group (we share offices).

The JSON schema for the sequence experiment metadata in the GitHub repository from the top link was created from a manually created JSON file (YAML would probably have been a better option!). That work is, as you note, exploratory. It is continuing in collaboration with the GA4GH team, and I will be running a short hackathon at the GSC conference to work on the core terms more. David B. and I think that we are almost there with the core terms now.

FYI: a small, related piece of active development at ENA is a replacement checklist editor. This will make it easier to keep maintaining ENA sample checklists, with the requirement that we can also create and maintain other checklists, e.g. EGA sample checklists, Biosample sample checklists and, more relevant to this ticket, the ENA sequence experiment checklist. Indeed, I will be reviewing the requirements list again with the developers later today.

@Woolly-at-EBI

So @pbuttigieg, I am thinking: what is the next useful step?
I have sequence experiment examples that pass this exploratory JSON-LD schema. The final solution may be similar or rather different. You are right that we have to start somewhere.
I will provide two mocked-up "real" examples here for two different experiment types, with accompanying references to the actual sequence records. I am aiming to do this by the end of 19th July.

@pbuttigieg
Collaborator Author

@Woolly-at-EBI

> I am thinking: what is the next useful step?

The steps to link to ODIS (and similar LOD-driven federations) are summarised here: https://book.oceaninfohub.org/gettingStarted.html
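
One part of that workflow can be sketched in code: exposing a sitemap that lists the resources carrying JSON-LD, so that a harvester can find them. This is a minimal sketch only, assuming a sitemap-based discovery step; the URLs, the date, and the `build_sitemap` helper are hypothetical placeholders, not anything ENA or ODIS publishes.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(record_urls, lastmod):
    """Return a sitemap.xml string with one <url> entry per record URL."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in record_urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record URLs; real ones would point at pages or files
# that carry the JSON-LD for each study or run.
urls = [
    "https://example.org/records/study-0001.jsonld",
    "https://example.org/records/run-0001.jsonld",
]
sitemap_xml = build_sitemap(urls, "2024-07-19")
print(sitemap_xml)
```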

> I have sequence experiment examples that pass this exploratory JSON-LD schema. The final solution may be similar or rather different. You are right that we have to start somewhere.

I think we have to convert/adapt the JSON-LD schema you have to interoperate with schema.org semantics, as specified in the schema:Dataset type.

Some examples here, here and here.
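
To make the target concrete, here is a minimal sketch of how a study record might be mapped onto schema:Dataset. The input field names (`accession`, `title`, and so on), the `study_to_dataset` helper, and all values are assumptions for illustration; they are not the actual ENA study schema or a real record.

```python
import json


def study_to_dataset(study):
    """Map a (hypothetical) study dict onto schema.org Dataset properties."""
    return {
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "@id": study["url"],
        "name": study["title"],
        "description": study["description"],
        "identifier": study["accession"],
        "url": study["url"],
        "keywords": study["keywords"],
    }


# Placeholder values only; the keys on the left are assumed field names.
mock_study = {
    "accession": "STUDY-PLACEHOLDER-0001",
    "title": "Example marine metagenome study",
    "description": "Placeholder description of a sequencing study.",
    "url": "https://example.org/studies/STUDY-PLACEHOLDER-0001",
    "keywords": ["metagenome", "marine"],
}

dataset_jsonld = study_to_dataset(mock_study)
print(json.dumps(dataset_jsonld, indent=2))
```

Only vanilla schema.org properties are used here, so a generic harvester should be able to read the output without custom coding.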

> At ENA there is a direction of travel to use much more JSON for input and output. The Swagger API allows JSON output. Bioschemas was created and is maintained by our sister group (we share offices).

Bioschemas should be compatible with vanilla schema.org, but I've noticed some odd modelling in Bioschemas. I'd be a little careful, and use vanilla schema.org wherever possible.

> The JSON schema for the sequence experiment metadata in the GitHub repository from the top link was created from a manually created JSON file (YAML would probably have been a better option!). That work is, as you note, exploratory. It is continuing in collaboration with the GA4GH team, and I will be running a short hackathon at the GSC conference to work on the core terms more.

Perhaps we can align these activities: if we make sure the further development of this exploratory work dovetails with JSON-LD/schema.org compliance, we'll be creating something widely interoperable.

> FYI: a small, related piece of active development at ENA is a replacement checklist editor. This will make it easier to keep maintaining ENA sample checklists, with the requirement that we can also create and maintain other checklists, e.g. EGA sample checklists, Biosample sample checklists and, more relevant to this ticket, the ENA sequence experiment checklist. Indeed, I will be reviewing the requirements list again with the developers later today.

This is both promising and a little concerning. If tooling is developed before a good data exchange model (i.e. data formats and semantics) is set, then organisations like the ENA are loth to change things, even if they don't interoperate with others. I would strongly encourage that we get the JSON-LD/schema.org modelling and templates settled first, and avoid INSDC- or ENA-specific types, properties, etc. which no other systems will understand without custom coding (xref the missing value story at the GSC).
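
One way to avoid custom properties, sketched below, is to carry each domain-specific checklist field as a generic schema:PropertyValue and point `propertyID` at a term IRI, rather than minting a new property name. This is an illustration under assumptions: the field names, the `https://example.org/terms/...` IRIs, the values, and the choice of `variableMeasured` as the attachment point are all placeholders, not an agreed model.

```python
import json


def checklist_field_to_property_value(name, value, term_iri=None, unit=None):
    """Express one checklist field as a schema.org PropertyValue node."""
    node = {"@type": "PropertyValue", "name": name, "value": value}
    if term_iri is not None:
        node["propertyID"] = term_iri  # IRI of the term defining the field
    if unit is not None:
        node["unitText"] = unit
    return node


# Placeholder fields: (name, value, term IRI, unit). None of these are real
# ENA checklist entries or real ontology IRIs.
fields = [
    ("library strategy", "PLACEHOLDER-STRATEGY",
     "https://example.org/terms/library_strategy", None),
    ("depth", 10, "https://example.org/terms/depth", "m"),
]

dataset = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "Placeholder sequencing experiment",
    "variableMeasured": [checklist_field_to_property_value(*f) for f in fields],
}
print(json.dumps(dataset, indent=2))
```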

@Woolly-at-EBI

Finally started...
https://docs.google.com/document/d/19IoPj-Y0_J2ZRr6zr5jhJb438d_CjsNv9sjG53TEdhI/edit?usp=sharing

JSON has been generated for study (= project) and read_run objects.

Next steps:
- Make pilot JSON-LD for study
- Make pilot JSON-LD for read_run
- Generate sample-level metadata as JSON and then JSON-LD (we will have to be selective about the metadata)
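
The last step could look roughly like the sketch below: filter a sample record down to an allow-list of fields, then wrap what remains as JSON-LD. The allow-list, the field names, the values, and the use of `Dataset` as the `@type` are all assumptions made for illustration; which fields to expose and how to type a sample are still open questions.

```python
import json

# Hypothetical allow-list; which sample fields to expose is undecided.
SELECTED_FIELDS = ("collection_date", "geographic_location", "depth")


def select_sample_metadata(sample, selected=SELECTED_FIELDS):
    """Keep only allow-listed fields that actually carry a value."""
    return {k: sample[k] for k in selected if sample.get(k) not in (None, "")}


def sample_to_jsonld(sample_id, sample):
    """Wrap the selected fields as PropertyValue nodes on one JSON-LD node."""
    selected = select_sample_metadata(sample)
    return {
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",  # assumption: the right type is not settled
        "identifier": sample_id,
        "variableMeasured": [
            {"@type": "PropertyValue", "name": k, "value": v}
            for k, v in selected.items()
        ],
    }


# Placeholder sample record, not a real ENA sample.
mock_sample = {
    "collection_date": "2024-01-01",
    "geographic_location": "PLACEHOLDER-LOCATION",
    "depth": "",                      # empty, so it is dropped
    "internal_note": "not exported",  # not on the allow-list
}

sample_jsonld = sample_to_jsonld("SAMPLE-PLACEHOLDER-0001", mock_sample)
print(json.dumps(sample_jsonld, indent=2))
```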

@pbuttigieg
Collaborator Author

Thanks @Woolly-at-EBI. Could we have examples with real values to develop from? I can then fit them into the right schema.org slots, and they'll be more spottable.
