Improve MARC Author name importing #9797

hornc · 2024-08-25T21:36:03Z

Adds more tests and improves string handling on MARC imports.

Fixes #9789 (Some org author names are being incorrectly rearranged around commas on import) -- fixes the alternate script issues around 7XX fields. The org splitting was already fixed in #9601 (Sort out Author types on import).
Fixes MARC 100 vs 700 author / contributor inconsistency #7723
Removes redundant personal_name field if it is an exact duplicate of name
Removes contributions: str field and uses author, and role (when it is specified in source)
Uses original script for author names, and places transliterations under alternate_names
Replaces Unicode \u00AB strings with UTF-8 in test data for ease of verification / understanding the expected results
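
On the last point, JSON escape sequences and literal UTF-8 characters decode to the same string, so only the readability of the expectation files changes. A minimal sketch using Python's standard json module; the sample record is illustrative, and this is not necessarily how the test data in this PR was converted:

```python
import json

# Illustrative record only, not a copy of a real expectation file.
record = {"personal_name": "Yehudai ben Na\u1e25man"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False writes the characters as-is, which is far easier to
# compare by eye against the source MARC record.
readable = json.dumps(record, ensure_ascii=False)

# Both serialisations load back to exactly the same data.
assert json.loads(escaped) == json.loads(readable) == record
```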

It seems 7xx has no direct relation to a "contributor" in the sense of a second-class author. Most of the existing examples represent equal co-authors.

Also, it looks like contributor: str is a deprecated field given internetarchive/openlibrary-client@7e45d51

It's not yet clear to me how to identify a statement of responsibility as either an Author or Contributor. The current Author schema supports "roles", which matches source data better, e.g. 700$e "Relator term".

Initial Questions, and this PR's answers:

Who should be in authors? : 1xx (non-repeatable) and 7xx (repeatable) values (without duplicates)
Who should be in contributions? ... no one, deprecate this field. The contributions: str field isn't used by the UI(?), and should be replaced by the more informative contributor -- see internetarchive/openlibrary-client@7e45d51
- I question whether this new field should be pluralised because it is a list.
- also, what makes a name a contributor? Authors already have roles, and source data doesn't appear to make author/contributor distinctions either.
In what script should the main entries be? : Main name in original script, transliterations/variations in alternate_names
Are these requirements consistent, and implemented consistently? I think this PR makes things more consistent than before.
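
The answers above can be sketched roughly as follows. This is an illustration under assumptions, not the code in parse.py: build_author and merge_authors are hypothetical names, and duplicates are detected by exact name match only.

```python
from __future__ import annotations

from typing import Any


def build_author(
    name: str,
    original_script: str | None = None,
    role: str | None = None,
) -> dict[str, Any]:
    """Build one author dict.

    `name` is the (possibly transliterated) value of the 1xx/7xx field and
    `original_script` is the linked original-script form, if the record has one.
    """
    author: dict[str, Any] = {'name': original_script or name}
    if original_script and original_script != name:
        # Keep the transliteration available as an alternate name.
        author['alternate_names'] = [name]
    if role:
        # Only set a role when the source record specifies one.
        author['role'] = role
    return author


def merge_authors(
    main_entry: dict[str, Any] | None,
    added_entries: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Combine the single 1xx entry with the 7xx entries, dropping repeats."""
    authors: list[dict[str, Any]] = []
    seen: set[str] = set()
    for author in ([main_entry] if main_entry else []) + added_entries:
        if author['name'] not in seen:
            seen.add(author['name'])
            authors.append(author)
    return authors
```

With this shape there is no separate contributions list: every name ends up in authors, in source order, with a role attached only where one was given.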

Stakeholders

@tfmorris

openlibrary/catalog/marc/parse.py

openlibrary/catalog/add_book/tests/test_load_book.py

openlibrary/catalog/marc/tests/test_data/bin_expect/uoft_4351105_1626.json

hornc · 2024-09-19T01:59:31Z

mypy is wasting my time on this.

It's ridiculous that these tests can't run locally in Docker, which is how everything else in the project runs. mypy is available in the container via requirements_test.txt, but it's not functional. Its presence in the container seems to be an accident.

This project uses Docker to take advantage of a controlled dev and test environment. I do not use venv for this project, because there is no need to. I'm not even a great fan of mypy type checking; it's a bit of a disaster on this project, which still has tons of legacy code that has barely been upgraded from Python 2. It's forcing me to put bandaids on unrelated code that functions, when much of it needs a deep refactor. It's frustrating to have to type hint everything when you've added a few hints to code you understand and are actively changing. I have backed out type hints on some methods before because it opened a can of worms of errors I just couldn't untangle all the way back to source.

There seems to be no integration between the Docker env, the git pre-commit env, and the local test env (which apparently is something that is required to be outside of Docker). It's not documented how to run the full test suite locally, and I think having a separate test/dev environment to run half the tests is a mistake, even if it were documented.

hornc · 2024-09-19T02:14:08Z

mypy's error messages aren't even very good.

The last error is:

openlibrary/catalog/marc/parse.py:501: error: Incompatible types in assignment
(expression has type "list[str]", target has type "str")  [assignment]
                author['alternate_names'] = [alt_name]
                                            ^~~~~~~~~~

It seems to be incorrectly inferring that the target has type "str"; I can't find anything that explicitly claims that.

author['alternate_names'] is supposed to be list[str]. Python can handle that, but mypy has a problem. We have a JSON schema to set this all out, and Python dicts used to be good enough to represent structures like this. Are we supposed to refactor everything to use statically typed objects now to represent these JSON-like objects we've inherited from the codebase?
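
One possible explanation, offered as an assumption because the surrounding code in parse.py isn't shown here: when a dict is created from a literal whose values are all strings and carries no annotation, mypy infers dict[str, str], and a later list-valued assignment is then reported with this kind of message. The sketch below shows two common ways to state the intent; AuthorDict and both function names are hypothetical, not part of Open Library.

```python
from typing import Any, TypedDict


class AuthorDict(TypedDict, total=False):
    """Hypothetical typed view of an author record; every key is optional."""

    name: str
    entity_type: str
    alternate_names: list[str]


def author_loose(name: str, alt_name: str) -> dict[str, Any]:
    # Without this annotation, mypy would infer dict[str, str] from the
    # literal below and reject the list assignment that follows.
    author: dict[str, Any] = {'name': name, 'entity_type': 'person'}
    author['alternate_names'] = [alt_name]
    return author


def author_typed(name: str, alt_name: str) -> AuthorDict:
    # A TypedDict is still a plain dict at runtime, but gives each key its
    # own value type for the checker.
    author: AuthorDict = {'name': name, 'entity_type': 'person'}
    author['alternate_names'] = [alt_name]
    return author
```

The first form is the smallest change; the second mirrors what a JSON schema already describes, at the cost of maintaining the type alongside it.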

hornc · 2024-09-19T03:12:32Z

idk, now these tests are failing non-deterministically because of 500s from some external Vercel web app I've never heard of because ... javascript bundle size. What kind of testing hell is this? How many more gates?

mypy test already run as part of the pre-commit hooks, and the output is easier to read there. It just adds duplicate noise to the tests-py output

fixes #9789

for #7723

hornc · 2024-09-19T23:07:34Z

I have rebased against the main branch and that has fixed the javascript tests. I have also squashed the pre-commit CI test noise changes, and now all tests are passing.

tfmorris

I took a quick pass through and it's definitely an improvement. The two issues that I see are:

- figuring out which 7xx's are authors and which aren't
- normalizing contributor roles / relators

I don't claim to have magic solutions for either, so just highlighting the issues.

tfmorris · 2024-09-20T18:14:07Z

openlibrary/catalog/marc/tests/test_data/xml_expect/cu31924091184469.json

      "name": "Homer",
      "entity_type": "person"
+    },
+    {
+      "name": "Buckley, Theodore William Aldis",


The by_statement indicates that this is a translator, who probably shouldn't be promoted to author, but I'm not sure how this case can be detected without natural language processing of the by statement.

tfmorris · 2024-09-20T18:22:53Z

openlibrary/catalog/marc/tests/test_data/bin_expect/warofrebellionco1473unit_meta.json

+      "birth_date": "1849",
+      "name": "Cowles, Calvin D.",
+      "entity_type": "person",
+      "role": "comp"


Standard MARC relator is "com". Is this an OpenLibrary specific code or are these not being normalized?

tfmorris · 2024-09-20T18:23:44Z

openlibrary/catalog/marc/tests/test_data/bin_expect/zweibchersatir01horauoft_meta.json

+      "birth_date": "1787",
+      "death_date": "1855",
+      "entity_type": "person",
+      "role": "tr. [and] ed"


For these to be translatable, they're going to need to be standardized/normalized. MARC relators are "trl" and "edt"

tfmorris · 2024-09-20T18:25:21Z

openlibrary/catalog/marc/tests/test_data/xml_expect/00schlgoog.json

      "title": "gaon",
-      "personal_name": "Yehudai ben Na\u1e25man",
+      "personal_name": "Yehudai ben Naḥman",
      "role": "supposed author",


I think the standard relator for this might be "att" for "Attributed name"

tfmorris · 2024-09-20T18:25:33Z

openlibrary/catalog/marc/tests/test_data/xml_expect/00schlgoog.json

+      "name": "Schlosberg, Leon",
+      "date": "d. 1899",
+      "entity_type": "person",
+      "role": "ed"


tfmorris · 2024-09-20T18:25:55Z

openlibrary/catalog/marc/tests/test_data/xml_expect/warofrebellionco1473unit.json

+      "birth_date": "1849",
+      "name": "Cowles, Calvin D.",
+      "entity_type": "person",
+      "role": "comp"


tfmorris · 2024-09-20T18:26:25Z

openlibrary/catalog/marc/tests/test_data/xml_expect/zweibchersatir01horauoft.json

+      "birth_date": "1787",
+      "death_date": "1855",
+      "entity_type": "person",
+      "role": "tr. [and] ed"


hornc · 2024-09-23T03:11:57Z

@tfmorris good comments on the various role abbreviations. I was trying to get this merged to get the test files mostly stable before addressing the author role expansion.

There was existing functionality that took 100$e "Relator term"s and populated the role as entered in the MARC, but it was barely ever encountered on a 100 field. By reusing the same code for 700$e these are getting picked up now, which is what is reflected in the test expectations.

The $e field appears to be unstructured text, with non-standard abbreviations, as you note above. There is also a $4 "Relationship" subfield that uses the controlled 3-char codes. I've pushed a draft PR, #9901, that begins to make use of $4 as well, as a new feature.

I think most of your comments are covered by the new PR, except maybe "att" for "Attributed name" -- I'll look into that further. Again, I don't really understand why there are two almost identical relator / relationship subfields. The $4 looks more useful and controlled for our purposes.

My plan was to:

- use $4 codes and expand them to standard terms if present
- otherwise make a minimal effort to recognise and convert some non-standard $e terms
- fall back to importing $e literally as presumably human-readable terms or abbreviations
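
The three-step plan above could be sketched as follows. This is only an illustration under assumptions: resolve_role and both lookup tables are hypothetical names, the first table holds a tiny subset of the MARC relator list, and the $e abbreviations are the ones raised in this review rather than an agreed mapping.

```python
from __future__ import annotations

# Small subset of the controlled MARC relator codes ($4) and their terms.
RELATOR_TERMS = {
    'att': 'Attributed name',
    'com': 'Compiler',
    'edt': 'Editor',
    'trl': 'Translator',
}

# Hypothetical mapping from free-text $e abbreviations to relator codes.
NONSTANDARD_TERMS = {
    'comp': 'com',
    'ed': 'edt',
    'tr': 'trl',
}


def resolve_role(relator_code: str | None, relator_term: str | None) -> str | None:
    """Return a display role from the $4 code and/or the $e term."""
    # 1. Prefer the controlled $4 code when it is one we know.
    if relator_code:
        code = relator_code.strip().lower()
        if code in RELATOR_TERMS:
            return RELATOR_TERMS[code]
    if relator_term:
        literal = relator_term.strip()
        # 2. Minimal effort: ignore case and trailing punctuation.
        key = literal.rstrip('.,').strip().lower()
        if key in NONSTANDARD_TERMS:
            return RELATOR_TERMS[NONSTANDARD_TERMS[key]]
        # 3. Fall back to the $e text exactly as catalogued.
        return literal or None
    return None
```

Compound values such as "tr. [and] ed" are deliberately left to step 3 in this sketch; splitting them would need its own rules.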

If this PR is merged, the rebased version of #9901 should make these changes clear, and we can discuss role expansion in detail there?

scottbarnes

@hornc, everything looks good to me, save for the removal of mypy running as part of the GitHub workflow. I will ask about that today as I don't think that's a unilateral decision I can make, though I will note that it has caused you a lot of trouble here.

cdrini · 2024-09-30T19:39:43Z

We just chatted about this and:

Can remove mypy; already in pre-commit
Likely can’t remove doctest

@scottbarnes to expand :P

scottbarnes · 2024-10-01T13:56:08Z

.github/workflows/python_tests.yml

@@ -48,8 +48,6 @@ jobs:
        run: |
          git fetch --no-tags --prune --depth=1 origin master
          make test-py
-          source scripts/run_doctests.sh


@hornc, I brought this up at the weekly ABC call and as Drini mentioned removing mypy is okay here, but we'd like to keep running the doctests.

hornc · 2024-10-01

@scottbarnes the majority of the tests run in source scripts/run_doctests.sh are already running in make test-py

Running them twice does not add value, it simply takes longer and gives a false impression of 'doing more testing'. It also makes it harder to debug and locate when a test fails if it is failing in two locations.

In general, I don't think the OL project's doctests are maintained or current. The majority of the tests run in /run_doctests.sh are not doctests anyway.

The correct place to put OL Python tests is in the standard test files to be run via make test-py

I believe that this has already been done, but it's hard to review since the bulk of what runs is simple duplication.

The ignore list in scripts/run_doctests.sh seems totally arbitrary. This script should be deprecated, and if there is anything still of value in doctests, they should be highlighted and moved to more appropriate locations, so they can be maintained properly.

scottbarnes · 2024-10-01

Ah, thank you for pointing out that they're already running from make test-py. I will try to look at this more closely tomorrow.

scottbarnes · 2024-10-03

@hornc, I created #9929 to address this. We will discuss this during milestone planning, and I will mention that I think that issue is blocking this PR, which is itself blocking another PR, if I understand correctly.

hornc force-pushed the MARC_tests branch 3 times, most recently from 77b4d4c to 4f18edd Compare August 25, 2024 22:47

hornc force-pushed the MARC_tests branch 4 times, most recently from 3171df3 to 3d44aab Compare August 26, 2024 02:48

hornc added the Theme: MARC records label Aug 26, 2024

hornc mentioned this pull request Aug 26, 2024

Improve MARC string importing, part A #9806

Merged

hornc force-pushed the MARC_tests branch 4 times, most recently from 9fc7c52 to f5dc6c5 Compare September 16, 2024 02:09

hornc changed the title ~~Improve MARC string importing~~ Improve MARC Author name importing Sep 16, 2024

hornc force-pushed the MARC_tests branch 13 times, most recently from ba029fb to 08c92bd Compare September 19, 2024 01:47

hornc force-pushed the MARC_tests branch 2 times, most recently from 6c1813b to 55c8002 Compare September 19, 2024 03:02

hornc added 11 commits September 20, 2024 10:51

remove duplicate running github tests

d59e917

mypy test already run as part of the pre-commit hooks, and the output is easier to read there. It just adds duplicate noise to the tests-py output

add failing test for #9789 original language issue

38b9cff

consolidate contributer name fn and refactor

b0e2875

fix tests

279bc27

remove contributor already reflected as an author

cfdb3cd

original script for entity name

05e1379

fixes #9789

rearrange to fix 880_Nihon_no_chasho.mrc test

102cf35

don't duplicate name + personal_name if identical

9fa4e50

update tests for "name" in original script

4307bad

update expectations for 880_arabic_french_many_linkages

503f989

for #7723

deprecate contributions(str), use authors w/ role

8583d8c

hornc force-pushed the MARC_tests branch from 55c8002 to 8583d8c Compare September 19, 2024 22:52

hornc marked this pull request as ready for review September 19, 2024 22:54

hornc requested review from cclauss and removed request for cclauss September 19, 2024 23:08

tfmorris reviewed Sep 21, 2024

hornc mentioned this pull request Sep 23, 2024

Expand author roles #9901

Draft

cdrini assigned scottbarnes Sep 23, 2024

scottbarnes reviewed Sep 30, 2024

scottbarnes reviewed Oct 1, 2024

scottbarnes mentioned this pull request Oct 3, 2024

Move doctests into Makefile and run with test-py #9929

Open

4 tasks
