Mapping MD to CW #125

aarppe · 2024-07-12T01:30:18Z

When the extensions to the ALTLab version of the Maskwacîs Dictionary (MD) were planned, two fields were intended to help establish that an entry in MD can be matched to an entry in CW (at some fixed point in time).

CW_lemma indicates that the MD entry has (at some point) been mapped to an entry in CW. "Lemma" here means "entry head" in the lexicographical sense, rather than "baseform" in the computational sense. There are 7241 such MD entries, cf.

cat crk/dicts/Maskwacis_altlab.tsv | gawk -F"\t" 'NR>=2 { if($12!="") print; }' | wc -l
    7241

Sometimes CW_lemma by itself is not sufficient to identify a single CW entry unambiguously, so some additional information is needed. Early on, we used the full English definitions as manually copied from CW into the MD database, but since those are under continuous editing, they are not reliable in the long term. The following counts show how many CW entries each mapped CW_lemma matches:

cat crk/dicts/Maskwacis_altlab.tsv | gawk -F"\t" 'BEGIN { while((getline < "crk/dicts/Wolvengrey.tsv")!=0) { gsub("ý","y",$1); entry[$1]++; } } { if($12 in entry) print entry[$12]; }' | sort | uniq -c
4282 1
 144 2
  12 3
   1 4
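
The same tally can be sketched in Python for anyone who prefers it to gawk. This is only a minimal sketch on made-up rows: the function names and the toy data are hypothetical, and the only thing taken from the command above is the idea of normalizing ý to y on the CW side before counting.

```python
from collections import Counter


def normalize(head: str) -> str:
    # Mirrors the gsub("ý", "y", $1) step in the gawk command above.
    return head.replace("ý", "y")


def match_histogram(cw_heads, md_cw_lemmas):
    """Return {n: number of MD entries whose CW_lemma matches n CW entries}."""
    cw_counts = Counter(normalize(h) for h in cw_heads)
    hist = Counter()
    for lemma in md_cw_lemmas:
        # Like `$12 in entry`: MD rows whose CW_lemma is absent from CW are skipped.
        if lemma in cw_counts:
            hist[cw_counts[lemma]] += 1
    return dict(hist)


# Toy data for illustration only, not taken from the real TSV files.
cw_heads = ["atim", "âhkosiw", "âhkosiw", "âcimêw"]
md_cw_lemmas = ["atim", "âhkosiw", "âcim", ""]

print(match_histogram(cw_heads, md_cw_lemmas))
```

Any key greater than 1 in the result marks MD entries that need something beyond CW_lemma to pick the right CW entry.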

Part of speech is another option, but the MD part-of-speech codes (noun, verb, etc.) might not be specific enough to distinguish subtypes of verbs that can share the same baseform lemma, so we might need to add this information manually to the ALTLab version of MD, perhaps in the field MD_class.
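
To illustrate why a coarse code does not help while a finer class does, here is a small sketch. It assumes, hypothetically, that MD_class would hold the same kind of code as the \ps field in CW (VAI, VTI, and so on); the function name, the tuple layout, and the placeholder heads are all made up for the example.

```python
def candidates(cw_entries, cw_lemma, md_class=None):
    """Return the CW (sro, ps) pairs that an MD entry could map to.

    cw_entries: list of (sro, ps) tuples (hypothetical layout).
    md_class:   optional class code; only used to narrow an ambiguous match.
    """
    hits = [(sro, ps) for sro, ps in cw_entries if sro == cw_lemma]
    if md_class and len(hits) > 1:
        narrowed = [(sro, ps) for sro, ps in hits if ps == md_class]
        if narrowed:
            # Only narrow when the class actually matches something.
            hits = narrowed
    return hits


# Placeholder heads, not real dictionary entries.
cw_entries = [("head1", "VAI"), ("head1", "VTI"), ("head2", "NA")]

print(candidates(cw_entries, "head1"))          # still ambiguous
print(candidates(cw_entries, "head1", "verb"))  # coarse code: still ambiguous
print(candidates(cw_entries, "head1", "VTI"))   # specific class: one candidate
```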

MD_lemma (and its associates MD_stem and MD_class) were created to provide the necessary ingredients for including in the LEXC code those MD entries that could not be mapped to CW. There are 2565 such cases:

cat crk/dicts/Maskwacis_altlab.tsv | gawk -F"\t" 'NR>=2 { if($16!="") print; }' | wc -l
    2565

Increasingly, entries that were originally missing from CW have since been added there, so for FST generation our script checks whether the combination of MD_lemma and MD_class matches the \sro and \ps fields of an entry in CW; if it does, the MD entry is not added to the LEXC code.

Besides the above, there are a number of entries in MD that are neither mapped to CW nor provided with an MD_lemma, etc. While these would not be included in the FST, they could still be included in the *.importjson, though without getting a paradigm:

cat crk/dicts/Maskwacis_altlab.tsv | gawk -F"\t" 'NR>=2 { if($12=="" && $16=="") print; }' | wc -l
     140
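
Putting the groups above together, the routing of a single MD row could be sketched roughly as follows. This is not the actual script: the function name, the dict-based row layout, and the group labels are hypothetical, and giving CW_lemma priority when a row has both CW_lemma and MD_lemma is an assumption.

```python
from collections import Counter


def classify(md_row, cw_keys):
    """Decide how one MD row would be handled.

    md_row:  dict with the fields CW_lemma, MD_lemma, MD_class.
    cw_keys: set of (sro, ps) pairs currently present in CW.
    """
    if md_row.get("CW_lemma"):
        return "mapped_to_cw"
    if md_row.get("MD_lemma"):
        key = (md_row["MD_lemma"], md_row.get("MD_class", ""))
        if key in cw_keys:
            # Added to CW in the meantime, so it is left out of the LEXC code.
            return "now_in_cw"
        return "lexc_candidate"
    # Neither mapped nor given an MD_lemma: importjson only, no paradigm.
    return "importjson_only"


# Placeholder rows for illustration only.
cw_keys = {("head3", "NA")}
rows = [
    {"CW_lemma": "head1", "MD_lemma": "", "MD_class": ""},
    {"CW_lemma": "", "MD_lemma": "head2", "MD_class": "VAI"},
    {"CW_lemma": "", "MD_lemma": "head3", "MD_class": "NA"},
    {"CW_lemma": "", "MD_lemma": "", "MD_class": ""},
]

print(Counter(classify(row, cw_keys) for row in rows))
```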

Similar comparisons for LEXC inclusion have not yet been completed in the case of AECD to CW (but not MD).


fbanados · 2024-07-15T16:00:00Z

Just a note that some of the entries with a CW_lemma do not have a matching entry in CW, but are an inflected form of a word in CW. For example, âcim has no CW entry of its own, but its CW_Definition matches that of âcimêw, which is sufficient to match the entry as a formOf âcimêw.
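
A rough sketch of that fallback, under the assumption that the copied definition is compared by exact string equality (which will be fragile wherever the CW definitions have been edited since). The function name, the head-to-definition dict, and the glosses are placeholders, not real data.

```python
def link(cw_lemma, cw_definition, cw):
    """Link an MD entry to CW by head, falling back to its copied definition.

    cw: dict mapping CW entry head -> definition (hypothetical layout).
    Returns a (kind, head) pair.
    """
    if cw_lemma in cw:
        return ("entry", cw_lemma)
    # No entry with that head: look for a CW entry with the same definition.
    heads = [head for head, definition in cw.items() if definition == cw_definition]
    if len(heads) == 1:
        return ("formOf", heads[0])
    # Zero or several definition matches: leave for manual review.
    return ("unmatched", None)


# Placeholder glosses for illustration only.
cw = {"âcimêw": "illustrative gloss A", "atim": "illustrative gloss B"}

print(link("atim", "illustrative gloss B", cw))
print(link("âcim", "illustrative gloss A", cw))
print(link("âcim", "a gloss edited since copying", cw))
```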

fbanados · 2024-07-24T23:45:43Z

Most of the entries have been merged. Current status is:

30094 entries from CW
6060 CW entries updated with MD data
3045 new entries from MD
5510 CW+MD entries updated with AECD data
1707 new entries from AECD
113 AECD entries skipped (likely spelling alternatives)
3145 AECD entries that have no standardized lemma (+?) are currently being discarded.

fbanados · 2024-07-24T23:57:29Z

Some extra work could be done to merge entries.

aarppe · 2024-07-26T15:34:23Z

> 3045 new entries from MD

Are there any entries from MD that are not incorporated at all? In my scripting I noticed that there were perhaps just above 300 cases where an MD entry maps to multiple CW entries, which would need manual disambiguation.

> 3145 AECD entries that have no standardized lemma (+?) are currently being discarded.

Concerning these, we'd want to run an old script of mine scrutinizing AECD with the FST again, and see if we could reduce the number of unanalyzed cases (with +?).

fbanados · 2024-07-26T17:04:35Z

> Are there any entries from MD that are not incorporated at all? In my scripting I noticed that there were perhaps just above 300 cases where an MD entry maps to multiple CW entries, which would need manual disambiguation.

For the importjson currently in itwewina.app, yes: any entry that does not have a unique mapping to CW (!=1) is left out. For the new importjson, all entries are added, and entries that need manual disambiguation are included as separate entries. An example of this change in behaviour is âhkosiw, which in itwewina.app only has two CW entries (one with two senses and the other one without). Because there are two, the old behaviour would just give up merging, and thus the definitions in MD and AECD would not be merged. The new behaviour is to have 4 entries. Eventually we would have a way to force them to merge correctly, but I believe 4 entries to be preferable to missing entries (especially when someone deselects CW as a source).
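
The difference between the old and new behaviour can be sketched as follows. This is an illustration under assumptions, not the real aggregation code: the function name, the entry layout, and the definitions are placeholders; only the rule (merge on a unique CW match, otherwise either drop or keep as a separate entry) comes from the description above.

```python
import copy


def merge_source(entries, source, incoming, keep_unmerged):
    """Merge (head, definition) pairs from one source into a list of entries.

    keep_unmerged=False sketches the old behaviour (give up unless exactly
    one CW entry matches); True sketches the new one (add a separate entry).
    """
    merged = copy.deepcopy(entries)
    for head, definition in incoming:
        # Only entries that carry CW data count as merge targets.
        targets = [e for e in merged if e["head"] == head and "CW" in e["defs"]]
        if len(targets) == 1:
            targets[0]["defs"].setdefault(source, []).append(definition)
        elif keep_unmerged:
            merged.append({"head": head, "defs": {source: [definition]}})
    return merged


# Two CW entries with the same head, as in the âhkosiw case (placeholder text).
cw = [
    {"head": "âhkosiw", "defs": {"CW": ["CW sense a", "CW sense b"]}},
    {"head": "âhkosiw", "defs": {"CW": ["CW sense c"]}},
]

for keep in (False, True):
    result = merge_source(cw, "MD", [("âhkosiw", "MD definition")], keep)
    result = merge_source(result, "AECD", [("âhkosiw", "AECD definition")], keep)
    print(keep, len(result))
```

With the placeholder data, the old policy leaves the two CW entries untouched and the MD and AECD definitions are lost, while the new policy yields four entries.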

> Concerning these, we'd want to run an old script of mine scrutinizing AECD with the FST again, and see if we could reduce the number of unanalyzed cases (with +?).

That would definitely be helpful.

aarppe · 2024-08-11T11:34:41Z

I'm also wondering whether we should introduce persistent unique identifiers (PIDs) for the entries of each dictionary, which would allow us to link entries unambiguously. A PID could be based on existing information, such as the entry head and lexical category, plus an index to deal with ambiguity, or it could be just a numeric code. I'd be inclined toward a transparent PID, but I can be convinced otherwise.

fbanados · 2024-08-12T20:26:32Z

We can definitely add these identifiers to the dictionaries that do not change (MD, AECD, etc.). To account for ambiguities, the index should be recorded in the entries themselves to avoid issues like swapping entries in the source files. We can just add a column on the source TSV files for the identifier, whichever we choose. I do not have particular preferences either way.

I am less certain about introducing identifiers in the CW toolbox files; we could have a discussion about that as well.
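
As a starting point for that discussion, a transparent identifier built from entry head, lexical category, and an index could be sketched like this. The `MD:head:class#n` format, the function name, and the `pid` key are all hypothetical; the one point taken from the comments above is that the index is stored with the entry, so reordering the source rows does not change it.

```python
from collections import defaultdict


def assign_pids(rows, prefix):
    """Give every row a transparent PID; rows that already have one keep it.

    rows: list of dicts with 'head' and 'class', optionally 'pid'.
    """
    highest = defaultdict(int)
    # First pass: note the indices already recorded in the entries.
    for row in rows:
        if row.get("pid"):
            base, _, index = row["pid"].rpartition("#")
            highest[base] = max(highest[base], int(index))
    # Second pass: hand out the next free index to rows without a PID.
    for row in rows:
        if not row.get("pid"):
            base = f"{prefix}:{row['head']}:{row['class']}"
            highest[base] += 1
            row["pid"] = f"{base}#{highest[base]}"
    return rows


# Placeholder rows for illustration only.
rows = [
    {"head": "head1", "class": "VAI", "pid": "MD:head1:VAI#1"},
    {"head": "head1", "class": "VAI"},
    {"head": "head2", "class": "NA"},
]
assign_pids(rows, "MD")
print([row["pid"] for row in rows])

# Swapping rows and running again leaves the stored PIDs unchanged.
rows.reverse()
assign_pids(rows, "MD")
print([row["pid"] for row in rows])
```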

aarppe added the documentation, meta, source:CW, source:MD, and aggregation labels Jul 12, 2024

aarppe mentioned this issue Jul 12, 2024

Refactor: Make process of collection independent of the choice of main dictionary #124

Open

7 tasks

fbanados added the ready-for-review label Jul 24, 2024

fbanados mentioned this issue Jul 25, 2024

AECD atim not merged with CW and MD #128

Open
