
New data warehouse strategy [tabular]: SQL database populated with dictionaries data (experimental feature) #37

Open
fititnt opened this issue Apr 29, 2022 · 5 comments
Labels
archiva-farmatis: archīva fōrmātīs; /formats of files/@eng-Latn; About (new) data formats to package dictionaries
librarium-formato: librārium fōrmātō; /library format/@eng-Latn; Related to storage of entire referential data

Comments

fititnt commented Apr 29, 2022


Current known context at the moment

  • The New exported format: JSON-LD metadata to explain the CSVs, using W3C Tabular Data (Basic implementation only) #36, if followed strictly, would allow creating a package importable into some database. But if we did it that way, it would require duplicating more CSVs on each focused base of dictionaries.
  • The New exported format: frictionlessdata Tabular Data Package + Data Package Catalogs #35, from frictionless, has an experimental feature (I just did a quick test, and it somewhat works) which allows writing a populated SQLite database from a datapackage.json (see the sketch after this list).
  • The entire 1603 is already designed to be friendly to users who want everything as a local copy.
    • Different from the generic datasets most data portals ingest, the dictionaries we produce are very structured.
      • Since we use 1603 as a global prefix, once the dictionaries are already in a database, users could use other global prefixes to ingest actual data and then use SQL to manipulate/transform real-world data (an alternative to working with the CSVs directly).
  • The way we already structured the dictionaries, some from [1603:1] are already required to generate each Cōdex. They already somewhat have an implicit schema, but the CLIs can work with plain text (the CSVs).
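
As a rough illustration of that experimental frictionless feature, a minimal sketch (assuming the Package.to_sql call from the frictionless framework SQL tutorial; the exact signature varies between framework versions, and the file names are illustrative):

    from frictionless import Package

    # Sketch: write a populated SQLite database from a datapackage.json.
    # This relies on the experimental SQL support of the frictionless
    # framework; depending on the version, this may instead require a
    # SQLAlchemy engine, or be exposed under a different method name.
    package = Package('datapackage.json')
    package.to_sql(url='sqlite:///1603.sqlite')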

Idea of this issue

TODO: Experimental CLI feature to bootstrap a database from selected dictionaries (...somewhat equivalent to bootstrapping a data warehouse)

It does not make sense to pre-generate binary databases for end users; that would mostly be a waste of space. Also, users may be more interested in some dictionaries than others, so even a near-complete single global database would be too big, would potentially be in an inconsistent state from time to time, and would obviously make compilation times absurdly long.

However, sooner or later people (or at least we, for our internal use) will want to ingest everything of interest into some relational database. In fact, this would be a side effect of better data formats to explain the datasets, such as frictionless or W3C Tabular Data.

However, we can save a lot of time (and avoid a lot of pain, like commands to re-ingest dictionaries one by one) by simply allowing the CLI (even if using the experimental features of frictionlessdata) to create the full database from already selected groups of dictionaries. This would also be more aligned with the philosophy of automating what would otherwise take more documentation, AND it could help get a better overview of the datasets without going through them one by one.

Other comments

The common use case here assumes that data related to dictionaries can be re-bootstrapped and that, when the bootstrap is finished, no more writes occur (at least not on the reference tables). So SQLite would be a perfect fit (even for production use and huge databases, as long as no concurrent writes are necessary). However, PostgreSQL (or whatever the user would want to convert the SQLite to) would be another alternative.
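
Since after bootstrap the reference tables would be write-once, consumers could even open the file strictly read-only; a minimal sketch using only the Python standard library (the file name is illustrative):

    import sqlite3

    # Sketch: open the bootstrapped dictionary database read-only,
    # matching the "no more writes after bootstrap" use case above.
    con = sqlite3.connect('file:apothecae.sqlite?mode=ro', uri=True)
    for (name,) in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"):
        print(name)
    con.close()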

Open room for conventions to store Common Operational Datasets (at least COD-ABs)

While the dictionaries we're building have their index handcrafted (even if the terminology translations are compiled with software), the perfect first candidates to optimize for users to ingest in a predictable way would be the CODs.

Note: in case we fetch data from other sources (such as @digital-guard), the actual use case here would focus on live data, not archived data.

Before going to the CODs, we must optimize the dictionaries that explain them

To have a sane way to ingest data, we would first need the dictionaries from [1603:??] Geographia (create base numerospace) #31 already done.

Our dictionaries can reuse other dictionaries (so things get better over time), and at least for concepts related to places, the number used to access the dictionary can actually encode the country.

fititnt added the archiva-farmatis (archīva fōrmātīs; /formats of files/@eng-Latn; About (new) data formats to package dictionaries) label Apr 29, 2022
fititnt added a commit that referenced this issue Apr 29, 2022
fititnt added a commit that referenced this issue Apr 30, 2022

fititnt commented Apr 30, 2022

hummmmmmmmmmm

[Screenshot of 2022-04-29 23-45-23]


fititnt commented Apr 30, 2022

The CLI we use to query the main 1603_1_1.py (and glue together other operations, including exporting the Cōdex) is, not surprisingly, getting a lot of options. For the sake of future reference, I will leave here what it looks like at this moment.

Since some options like --codex-de are shared, I think it is better to create options that state explicitly which operation is in active use, instead of implicitly inferring it from the most common uses.
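
For illustration, a minimal argparse sketch of that direction, with an explicit operation selector (the option name --methodus matches one that appears later in this thread, but the actual wiring inside 1603_1.py may differ):

    import argparse

    # Sketch: select the operation explicitly instead of inferring it
    # from whichever shared options (like --codex-de) happen to be set.
    parser = argparse.ArgumentParser(prog='1603_1')
    parser.add_argument(
        '--methodus',
        choices=['de-archivum', 'data-apothecae', 'codex', 'status-quo'],
        help='Operation to run (explicit, not guessed from other options)')
    parser.add_argument('--codex-de', help='Shared option: focused dictionaries')

    args = parser.parse_args(['--methodus', 'data-apothecae'])
    assert args.methodus == 'data-apothecae'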

The full example:

fititnt@bravo:/workspace/git/EticaAI/multilingual-lexicography-automation/officinam$ ./999999999/0/1603_1.py --help
usage: 1603_1 [-h] [--punctum-separato-de-resultatum [RESULTATUM_SEPARATO]]
              [--punctum-separato-de-fontem [FONTEM_SEPARATO]] [--de-archivum]
              [--data-apothecae-ad [DATA_APOTHECAE_AD]] [--data-apothecae-ex DATA_APOTHECAE_EX]
              [--data-apothecae-ex-archivo DATA_APOTHECAE_EX_ARCHIVO]
              [--data-apothecae-formato [{datapackage,sqlite}]] [--dictionaria-numerordinatio]
              [--codex-de [CODEX_DE]] [--objectivum-linguam [OBJECTIVUM_LINGUAM]]
              [--auxilium-linguam AUXILIUM_LINGUAM] [--codex-copertae] [--codex-in-tabulam-json]
              [--status-quo] [--status-in-markdown] [--status-in-datapackage]
              [--ex-librario [EX_LIBRARIO]] [--ex-opere-temporibus [EX_OPERE_TEMPORIBUS]]
              [--quaero-ix_n1603ia [QUAERO_IX_N1603IA]]
              [--quaero-numerordinatio [QUAERO_NUMERORDINATIO]] [--in-limitem [IN_LIMITEM]]
              [--in-ordinem [{numerordinatio,chaos}]] [--objectivum-formatum-asciidoctor]
              [infile]

Explain the dictionaries

positional arguments:
  infile                HXL file to read (if omitted, use standard input).

optional arguments:
  -h, --help            show this help message and exit
  --punctum-separato-de-resultatum [RESULTATUM_SEPARATO]
                        Character(s) used as separator for generate output. Used only for tabular
                        results. Defaults to tab " "
  --punctum-separato-de-fontem [FONTEM_SEPARATO]
                        Character(s) used as separator from input file Used only for tabular
                        results. Defaults to comma ","

Archivum:
  (DEFAULT USE) Use archive as source (directory not ready yet)

  --de-archivum         Parse single archive

Data apothēcae:
  data apothēcae. (One) Warehouse of datasets. Compile selected dictionaries to a single place (likely single database entry point)

  --data-apothecae-ad [DATA_APOTHECAE_AD]
                        Path to file (or reference to database) to store result
  --data-apothecae-ex DATA_APOTHECAE_EX
                        Comma-separated list of dictionaries to initialize
  --data-apothecae-ex-archivo DATA_APOTHECAE_EX_ARCHIVO
                        Path to file with list (one item per line) of dictionaries to initialize
  --data-apothecae-formato [{datapackage,sqlite}]
                        Output format. Default will try make a guess from --data-apothecae-ad
                        pattern.

Dictionaria:
  Generate dictionaries. No input required (uses disk 1603 and 999999999/1603 data files)

  --dictionaria-numerordinatio
                        Dictionary of all possible values on stricter Numerordĭnātĭo (HXLStantad
                        container)
  --objectivum-formatum-asciidoctor
                        (Default) Output Asciidoctor format

Codex:
  Book/manual creation

  --codex-de [CODEX_DE]
                        Generate documentation of dictionaries
  --objectivum-linguam [OBJECTIVUM_LINGUAM]
                        Target natural language (use if not auto-detected). Must be like {ISO
                        639-3}-{ISO 15924}. Example: arb-Arab. Default: mul-Zyyy
  --auxilium-linguam AUXILIUM_LINGUAM
                        Define auxiliary languages Must be like {ISO 639-3}-{ISO 15924}. Example:
                        "ina-Latn,ile-Latn" Accepts multiple values.
  --codex-copertae      Pre-calculate the codex, but only generate Codex cover (SVG)
  --codex-in-tabulam-json
                        Pre-calculate the codex, but only generate Tabular Data (MediaWiki syntax 1)
                        (JSON). See https://www.mediawiki.org/wiki/Help:Tabular_Data

Status quō:
  Calculate current situation. Used to take other actions. Requires --codex-de 1603_NN_NN (focused Codex). Works with --quaero-ix_n1603ia.

  --status-quo          Compute the status quo, using a codex as initial reference
  --status-in-markdown  Return status in Markdown (instead of YAML)
  --status-in-datapackage
                        Return status in frictionless datapackage.json. With --ex-librario returns
                        profile data-package-catalog. (low level of details)
  --ex-librario [EX_LIBRARIO]
                        Status novō. New state. Persist changes if necessary

Opus temporibus:
  Crontab/cronjob information 

  --ex-opere-temporibus [EX_OPERE_TEMPORIBUS]
                        ex opere temporibus. Out of work times (crontab)
  --quaero-ix_n1603ia [QUAERO_IX_N1603IA]
                        Query ix_n1603ia. Rudimentar && (AND) and || (OR). Use var<1 to test 0 or
                        undefined. Query ix_n1603ia. Filter. Ex. "{publicum}>10 && {internale}<1"
  --quaero-numerordinatio [QUAERO_NUMERORDINATIO]
                        Query Numerordĭnātĭo. Additional filter list for focused base of
                        dictionaries. Ideal to check if some groups meet other filters. Example: if
                        result return empty and other queries are to check if need to fetch again
                        from Wikidata Q, then you assume no new fetch is necessary
  --in-limitem [IN_LIMITEM]
                        /Against the limit of/. Limit maximum number of cron jobs to show.
  --in-ordinem [{numerordinatio,chaos}]
                        /Against arrangement (ordering) of/. Sort result list to this rule. Options:
                        numerordinatio=natural order; chaos=random order

Exemplōrum gratiā:
    printf "#item+conceptum+codicem,#item+rem+i_qcc+is_zxxx+ix_wikiq" | ./999999999/0/1603_1.py --de-archivum
    cat 1603/1/1/1603_1_1.no1.tm.hxl.csv | ./999999999/0/1603_1.py --de-archivum
    ./999999999/0/1603_1.py --de-archivum 1603/1/1/1603_1_1.no1.tm.hxl.csv

    ./999999999/0/1603_1.py --dictionaria-numerordinatio

    ./999999999/0/1603_1.py --codex-de 1603_63_101

    ./999999999/0/1603_1.py --codex-de 1603_63_101 --codex-copertae

    ./999999999/0/1603_1.py --codex-de 1603_63_101 --codex-in-tabulam-json

    ./999999999/0/1603_1.py --codex-de 1603_63_101 --status-quo

    ./999999999/0/1603_1.py --codex-de 1603_63_101 --status-quo --ex-librario="cdn"

    ./999999999/0/1603_1.py --codex-de 1603_63_101 --status-quo --ex-librario="locale" --status-in-markdown

    ./999999999/0/1603_1.py --codex-de 1603_63_101 --ex-opere-temporibus='cdn'

    ./999999999/0/1603_1.py --ex-opere-temporibus='cdn' --quaero-ix_n1603ia='({publicum}>=9)&&({victionarium_q}>9)'

    ./999999999/0/1603_1.py --data-apothecae-ex='1603_45_1,1603_45_31' --data-apothecae-ad='apothecae.datapackage.json'

    ./999999999/0/1603_1.py --data-apothecae-ex='1603_45_1,1603_45_31' --data-apothecae-ad='apothecae.sqlite'

Edited: text formatting.

fititnt added a commit that referenced this issue Apr 30, 2022
fititnt added the librarium-formato (librārium fōrmātō; /library format/@eng-Latn; Related to storage of entire referential data) label May 14, 2022
fititnt changed the title from "New exported format: SQL database populated with dictionaries data (experimental feature)" to "New exported format + data warehouse strategy: SQL database populated with dictionaries data (experimental feature)" May 18, 2022
fititnt added a commit that referenced this issue May 24, 2022
…es to SQLite (still with some off-by-one error)
fititnt added a commit that referenced this issue May 24, 2022
…ite (still need to fix issue when tables have over 200 columns, e.g. with translations)
fititnt changed the title from "New exported format + data warehouse strategy: SQL database populated with dictionaries data (experimental feature)" to "New data warehouse strategy [tabular]: SQL database populated with dictionaries data (experimental feature)" Jun 4, 2022

fititnt commented Jun 27, 2022

Oh f**k me.

Except for SQLite (which is quite flexible), PostgreSQL (and the same likely happens with other databases) is okay with # at the start and +'s (e.g. it allows entire HXL hashtags as column names), but the columns we're adding to allow RDF mappings without any additional metadata are already getting to over at least 100 characters (like #item+rem+i_qcc+is_zxxx+rdf_a_obo_bfo29+rdf_p_obo_bfo124_s5002+rdf_p_obo_bfo171_s5000+rdf_s_u2200_s5001), and we could need even more than this.

TODO

In addition to RDF with HXL and RDF with BCP47 language tags, we will need... something which could still be compatible, but where, if users ingest it into SQL databases, the new identifiers are predictable and reversible.

For the database format, we may need some additional way to store the relations of the pivots without this needing to become part of the header alone (as we do with the tabular format on CSV).
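
A hypothetical sketch of such an identifier scheme (NOT the project's actual convention): encode '+' as '__' and abbreviate known tokens, which stays a pure, reversible function of the hashtag as long as no single token itself contains '__' or collides with an abbreviation:

    # Hypothetical scheme: make long HXL+RDF column names fit SQL
    # identifier limits (PostgreSQL truncates at 63 bytes by default)
    # while staying predictable and reversible.
    ABBR = {'item': 'i', 'rem': 'r', 'conceptum': 'c', 'codicem': 'cc'}
    REV = {v: k for k, v in ABBR.items()}

    def shorten(hashtag: str) -> str:
        tokens = hashtag.lstrip('#').split('+')
        return '__'.join(ABBR.get(t, t) for t in tokens)

    def restore(identifier: str) -> str:
        tokens = identifier.split('__')
        return '#' + '+'.join(REV.get(t, t) for t in tokens)

    full = '#item+rem+i_qcc+is_zxxx+ix_wikiq'
    assert restore(shorten(full)) == full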


fititnt commented Jun 27, 2022

By the way, the naming strategy for the tables is already very, very compact and predictable. The names may be weird for humans, but as most users would use better interfaces, the use of numbers is a non-issue.

However, the way to encode the columns (at least outside SQLite or plain CSV) needs more fine-tuning. Anyway, we can still take more time on the imports to PostgreSQL/MySQL (traditional databases).

Anyway, we're already looking into R2RML (https://www.w3.org/TR/r2rml/) or, more generally, Ontology-Based Data Access (OBDA), so it would naturally take some time anyway to think about how to name the columns in the relational database format.

In most strategies, the R2RML is created by humans (even if with graphical interfaces), but we cannot do it that way. So the columns might also be named in such a way that it simplifies generating the R2RML.
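
For reference, a rough sketch of machine-generating a minimal R2RML TriplesMap (the rr: terms are from the W3C spec; the table, key and predicate below are invented for illustration):

    # Hypothetical generator of a minimal R2RML TriplesMap; predictable
    # column names make this template trivial for a machine to fill.
    R2RML = """\
    @prefix rr: <http://www.w3.org/ns/r2rml#> .

    <#TriplesMap_{table}> a rr:TriplesMap ;
        rr:logicalTable [ rr:tableName "{table}" ] ;
        rr:subjectMap [ rr:template "urn:1603:{table}:{{{key}}}" ] ;
        rr:predicateObjectMap [
            rr:predicate <http://www.w3.org/2000/01/rdf-schema#label> ;
            rr:objectMap [ rr:column "{column}" ]
        ] .
    """

    print(R2RML.format(table='1603_45_16',
                       key='item__conceptum__codicem',
                       column='item__rem__i_qcc__is_zxxx'))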

fititnt added a commit that referenced this issue Jun 28, 2022
fititnt added a commit that referenced this issue Jun 28, 2022
fititnt added a commit that referenced this issue Jun 28, 2022

fititnt commented Jun 30, 2022

First test done. The FKs are not optimized, but it is now viable to also load some more standard SQL databases.

TL;DR:

  • At the moment, we're using the frictionless framework (https://framework.frictionlessdata.io/docs/tutorials/formats/sql-tutorial/#sqlite) to use datapackages as a signal of WHAT should be exported. Actually, this could be used to export to many, many more formats (beyond the frictionless standards).
  • We have several versions of the CSVs on disk, but while we could load them all into the databases, we're likely to only import what the user actually wants.
    • This means that even if we give users some default datapackages, they would very likely want to customize them.
  • 140 UN m49 entry points already generate over 470 tables (one per administrative division). This was just a quick test (but it works).

1. Generate the datapackage (this step allows selecting what should be imported from the CSVs into other storages)

DATA_APOTHECAE_MINIMIS=1 ./999999999/0/1603_1.py --methodus='data-apothecae' --data-apothecae-ad-stdout --data-apothecae-formato='datapackage' --data-apothecae-ex-suffixis='no1.bcp47.csv' --data-apothecae-ex-praefixis='1603_45_16' > ./apothecae~1603_45_16.datapackage.json

1.1 SQLite (already working)

SQLite is quite permissive in what it accepts. It was already allowing column names with over 60 characters before (PostgreSQL and others don't; by default, PostgreSQL truncates identifiers to 63 bytes).
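
A quick demonstration of that permissiveness, using only the Python standard library (the table name and value are illustrative; the column name is the HXL-style example from earlier in this thread):

    import sqlite3

    # SQLite accepts quoted identifiers far longer than PostgreSQL's
    # 63-byte default, including '#' and '+' characters, so entire HXL
    # hashtags can be used as column names.
    col = '#item+rem+i_qcc+is_zxxx+rdf_a_obo_bfo29+rdf_p_obo_bfo124_s5002'
    con = sqlite3.connect(':memory:')
    con.execute(f'CREATE TABLE t ("{col}" TEXT)')
    con.execute('INSERT INTO t VALUES (?)', ('exemplum',))
    print(con.execute('SELECT * FROM t').fetchall())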

./999999999/0/frictionless_to_sqlite.py --datapackage='apothecae~1603_45_16.datapackage.json' --sqlite='999999/0/apothecae~1603_45_16.sqlite'

[Screenshot of 2022-06-30 03-34-30]

1.2 PostgreSQL

Note: this requires PostgreSQL to be already installed (unlike SQLite, which can work with file-based access)

./999999999/0/frictionless_to_postgresql.py --datapackage='apothecae~1603_45_16.datapackage.json' --postgresql='fititnt:pass@localhost/mdciii'

[Screenshot of 2022-06-30 03-33-06]

1.3 Microsoft Excel (frictionless actually also accepts LibreOffice formats, but this is not implemented yet)

Turns out that Excel can load over 470 sheets just fine. The file on disk is 17.3 MB, but I think it uses over 470 MB of RAM via LibreOffice.

However, at least if the final intent is to use the data from the command line afterwards, I would really recommend exporting every sheet to CSV or some other storage, because opening and closing sheet by sheet via the command line requires a very, very significant startup time on each run (the format is stored as a zip of XMLs, not as optimized as CSVs or SQLite would be). A sketch of such an export is at the end of this comment.

time ./999999999/0/frictionless_to_excel.py --datapackage='apothecae~1603_45_16.datapackage.json' --excel='999999/0/apothecae~1603_45_16.xlsx'

# real	1m56,411s
# user	1m55,135s
# sys	0m1,042s

[Screenshot of 2022-06-30 03-46-11]
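
A rough sketch of the sheet-by-sheet export recommended above (assuming openpyxl for reading the xlsx; paths are illustrative):

    import csv
    from openpyxl import load_workbook

    # Sketch: dump each sheet of the generated workbook to its own CSV,
    # so later command-line passes do not pay the zip-of-XMLs startup
    # cost on every run.
    wb = load_workbook('999999/0/apothecae~1603_45_16.xlsx', read_only=True)
    for sheet_name in wb.sheetnames:
        with open(f'{sheet_name}.csv', 'w', newline='') as fp:
            writer = csv.writer(fp)
            for row in wb[sheet_name].iter_rows(values_only=True):
                writer.writerow(row)
    wb.close()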
