MVP of taxonomical+packaging strategy to handle collections of same concept (but incompatible or overly complex to exchange together) #47

Open
fititnt opened this issue Jul 22, 2022 · 0 comments

fititnt commented Jul 22, 2022

TL;DR of relevant context

  • Taxonomical defaults greatly simplify reuse of tooling even for unknown data
    • Even in specialized areas with complex rules, this approach could allow others to "shape/redistribute" a recommended collection pointing to that default for their user base, instead of each user learning to configure a new tool
      • This approach also means (at least for public data) that as soon as a new world region has a more specialized use case, describing it with a stricter ontology makes the logic reusable for other places
  • Both tabular/traditional databases and graph databases need consistency (for graph databases, because RDF "default IRI subjects" are far more useful when data is merged with everything else)
    • e.g. the user doesn't need to like the provided data (or might have their own data), but decent defaults simplify things A LOT
  • Some providers of thematic data naturally have their own conventions (and they may already distribute such data for several regions)
    • This alone means the average end user would love to "swap" data providers without needing to learn too many of the details
      • These details are still necessary for those who republish, but their work is simplified by end users testing things fast
  • Breaking up "same" collections (from either a different provider or a heterogeneous form of publishing) so that the end result still lands on the same final entry points also allows comparability for those who republish
    • This also reduces package size. This will matter a lot.
      • and if we do it somewhat smarter, even for RDF formats, instead of repeating EVERY time the source, the time it was revalidated, etc., in addition to the data, we could simply distribute an "update query" that expands the dataset with such metadata
      • The analogy here is what a user would otherwise need to read in the documentation about minor details, but instead of using natural language we add such rules as another machine-readable file
        • However, this would require some advanced way to "find" related compatible data, which cannot rely only on the "default" path
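
The "update query" idea from the list above can be sketched in a few lines. This is only an illustration in plain Python (a real package would more likely ship a SPARQL update file); the `ex:` names, the rule format and the function `expand_with_metadata` are all hypothetical, not an existing convention of this project.

```python
# Sketch: ship the data once, plus a tiny machine-readable rule that
# re-creates the repetitive provenance metadata on import.
# All identifiers below (ex:..., METADATA_RULE) are hypothetical.

Triple = tuple[str, str, str]

# Package contents: data only, with no per-subject source/revalidation triples.
DATA: list[Triple] = [
    ("ex:place/A", "ex:label", "Place A"),
    ("ex:place/A", "ex:partOf", "ex:place/B"),
    ("ex:place/B", "ex:label", "Place B"),
]

# The rule file: predicates and values to attach to every subject.
METADATA_RULE: dict[str, str] = {
    "ex:source": "ex:provider/example",
    "ex:revalidated": "2022-07-22",
}


def expand_with_metadata(data: list[Triple], rule: dict[str, str]) -> list[Triple]:
    """Return the data plus one metadata triple per (subject, rule predicate)."""
    subjects = sorted({subject for subject, _, _ in data})
    extra = [(s, p, o) for s in subjects for p, o in rule.items()]
    return list(data) + extra


expanded = expand_with_metadata(DATA, METADATA_RULE)
```

The package carries three triples and a two-entry rule, while the imported dataset has the metadata on every subject; the saving grows with the number of subjects.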

1. The challenge

As the title says: we have a very rudimentary skeleton on @MDCIII, for now only for Places. Unlike #43 and #44, which are about creating other conventions for default entry points, the challenge here is that we will have data in the same collection that is "incompatible" with other data in the same group, AND this incompatibility is not really political (the kind where agreeing with a point of view lets us decide the source of truth).

By incompatible, I don't mean strictly incompatible. RDF would allow pretty much anything, but most users will not want too many levels of detail, because even if we could document them in natural language, that would not allow automatic validation. And on SQL tables, even if we could automate merging the data, the user would need to do advanced data cleaning to, for example, detect duplicates.

2. The plan

We need to improve the way we describe URNs which, if imported, would point to the same entry point targets (aka RDF nodes or SQL database names). This needs to be done smartly enough that old content can be kept updated with as little effort as possible AND new additions do not break existing things. Obviously we can make global schema changes (it's optimized for that), but we still need to plan for cases where even the latest usable version would not make sense in tabular format.

Note that at this point we're already going to allow implementers to initialize massive amounts of very structured data in ways that let them start doing queries and making inferences. But the title points to something we can't fully solve, though we can automate more than half of the work users would have.

In practice, this likely means planning how the folders of the packages (like the GitHub repositories) will be named. We might need some suffix or some way to index variants and allow users to find them.
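
A minimal sketch of what indexing variants by suffix could look like. The `__` separator, the package names and both function names are assumptions made up for illustration; nothing here is a naming convention already decided in this issue.

```python
from collections import defaultdict

# Hypothetical convention: "<collection>__<variant>", where a name without
# the separator is the default package for that collection.
SEPARATOR = "__"


def split_package_name(name: str) -> tuple[str, str]:
    """Split a folder/repository name into (collection, variant)."""
    collection, _, variant = name.partition(SEPARATOR)
    return collection, variant or "default"


def index_variants(names: list[str]) -> dict[str, list[str]]:
    """Group package names by the collection (entry point) they import into."""
    index: dict[str, list[str]] = defaultdict(list)
    for name in names:
        collection, variant = split_package_name(name)
        index[collection].append(variant)
    return {collection: sorted(variants) for collection, variants in index.items()}


# Hypothetical package names, not real repositories.
packages = ["places-br", "places-br__provider-a", "poi-br__osm"]
variants = index_variants(packages)
```

With this shape, every variant of `places-br` would target the same entry point when imported, and a user (or tool) can list the alternatives by looking up the collection name.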

2.1 Why it's better to do this sooner

If we don't break things up this way, they will soon get too complicated to explain to end users. And we're heavily optimizing to make things understandable.

I know we could eventually automate even the creation of documentation with example queries, but since this type of challenge is not about a mere political decision but about heterogeneity, it would make the documentation unviable to automate. Incompatible packages of the same collection could each have their own documentation on how to query them, but the mere addition of one heterogeneous dataset could make all the other documentation too hard to tell apart.

3. Example use cases

3.1 Places

Turns out it was naive to think that even the same official reference would have only one recommendation. Using Brazil alone (which in theory does not even have significant territorial disputes), things like calculating the centroid are already impracticable. For example, there's an island over 1000 km out in the ocean, and if we use a map that includes it, the Brazil centroid changes.
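
A toy illustration of the centroid problem. How far the result moves depends on which definition of "centroid" is used; the sketch below uses the bounding-box midpoint, which is one common cheap definition and is very sensitive to outlying territory. The coordinates are rough placeholder values chosen for the example, not survey data, and `bbox_center` is a hypothetical helper.

```python
# (longitude, latitude) pairs; rough illustrative values only.
Point = tuple[float, float]

# A crude box standing in for the mainland outline.
MAINLAND: list[Point] = [(-74.0, -34.0), (-74.0, 5.0), (-35.0, 5.0), (-35.0, -34.0)]

# A single point standing in for a far offshore island.
ISLAND: list[Point] = [(-29.0, -20.5)]


def bbox_center(points: list[Point]) -> Point:
    """Midpoint of the bounding box around all points."""
    lons = [lon for lon, _ in points]
    lats = [lat for _, lat in points]
    return ((min(lons) + max(lons)) / 2, (min(lats) + max(lats)) / 2)


without_island = bbox_center(MAINLAND)
with_island = bbox_center(MAINLAND + ISLAND)
```

Here the midpoint moves three degrees of longitude just because the second map includes the island, so two packages built from the same official reference can still disagree on a derived value.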

Obviously it's a good idea to take into account disputes or a "world view based on the reference of a country", but the average case already needs to be flexible. Also, very often users would prefer to take only one of the versions that co-exist as official if they don't want too many levels of detail.

3.1.1. ...statistics by places

Pretty much any statistics would already need some way to be broken into different collections. But if we take this approach here, it means we would likely also break them up by package instead of pushing everything into the same groups.

3.1.2. Points of interest (like hospitals, schools, etc)

Turns out different sources will not only have different metadata and be community-based or based directly on government data, but... it will be unviable to enforce that they point to the exact same "node".

In practice, this means that the best guaranteed consistency will always be the collection entry point (e.g. database name or RDF prefix), but how would a user be sure two collections are talking about the same subject... That's complicated. It's very complicated.

3.1.3. ... OpenStreetMap POIs outdated / data hoarding / lack of engagement with locals

Despite the humanitarian non-profits being considered "a partner" (which often means European or North American), the actual heavy work tends to be done directly on OpenStreetMap and by people not related to them at all. There's even a call to action to decolonize aid inside the mapping community, and I understand why. The Asians in particular are very upset with how things are done.

By breaking things into packages, we would also allow bridges such as OpenStreetMap and Wikidata to be revalidated, even if the country level is likely to have far more details. I say this because the current "humanitarian-like" use of OpenStreetMap seems to be more focused on discourse than on updatability or on relevance for actual use by the local community.

While streets, rivers and other geographic features tend to be stable (maybe name changes) and are taken seriously by other uses of OSM, POIs (points of interest) used by humanitarians are not. Even if volunteers could keep them up to date, it is less clear to me how new people can deal with data added several years ago, or whether the approach used would scale in the long term. Places like Brazil tend to synchronize data with official sources or engage local developers when they have to do something similar, but I wouldn't be surprised if the international-level discourse about engaging locals left OSM data even less up to date than any locally led initiative. And I don't think this is a problem of the OSM community, but of how internationals love the idea of taking credit without caring about the medium to long term.
