
Add Open Library record for every work/edition in Internet Archive 2020 Wishlist #869

Closed
mekarpeles opened this issue Mar 15, 2018 · 16 comments

mekarpeles (Member) commented Mar 15, 2018

There are 2.6M items on the Internet Archive Open Libraries Wishlist. We want to make sure each of these books has a corresponding catalog entry on openlibrary.org.

Steps

See: https://archive.org/details/open_libraries_wish_list
https://archive.org/download/open_libraries_wish_list/wish_list_isbn13_ver_1.csv.zip

@thisismattmiller -- can you generate a master CSV that collates isbn10, isbn13, and oclc (into one row)?

We want to add all of these records into Open Library so we can query by any of these identifier fields and retrieve the metadata for the book.

Also, do we have metadata on these books?
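One existing way to check whether any of these identifiers already resolves to a catalog entry is Open Library's Books API, which accepts ISBN and OCLC bibkeys. A minimal sketch of building such a lookup URL (the identifier values here are placeholders, not items from the wishlist):

```python
from urllib.parse import urlencode

def books_api_url(**ids):
    """Build an Open Library Books API URL from ISBN/OCLC bibkeys.

    ids maps a bibkey namespace ("ISBN", "OCLC") to an identifier string.
    """
    bibkeys = ",".join(f"{ns}:{value}" for ns, value in ids.items())
    query = urlencode({"bibkeys": bibkeys, "format": "json", "jscmd": "data"})
    return f"https://openlibrary.org/api/books?{query}"

# Example lookup by both ISBN-13 and OCLC number at once:
url = books_api_url(ISBN="9780140328721", OCLC="297222669")
print(url)
```

Fetching that URL returns a JSON object keyed by bibkey for every identifier Open Library already knows, which makes it easy to split a wishlist into "already cataloged" and "needs import" buckets.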

@mekarpeles mekarpeles added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] 2018 Goals labels Mar 15, 2018
mekarpeles (Member Author) commented:

We may want to add all of these to a subject a la https://openlibrary.org/subjects/internet_archive_wishlist

thisismattmiller commented Mar 20, 2018

All data for the wish list is in the SQLite database in my home directory (/home/mattmiller/sqlite_database); there are example scripts in there for serialization.
Field information is in this Google sheet: https://docs.google.com/spreadsheets/d/1GDATWbgncmQzDaTJVdJU1kVcRhJIuMs0zHUnoCITED0/edit?usp=sharing

The basic SQL query to get everything that will appear on the wish list is:

```sql
SELECT * FROM data
WHERE flagged_author = 0
  AND flagged_publisher = 0
  AND no_author = 0
  AND no_publisher = 0;
```
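As a sketch, that filter can be run from Python's sqlite3 module. The table name and flag columns come from the comment above; the in-memory database here is a stand-in for the one in /home/mattmiller/sqlite_database (exact filename unknown):

```python
import sqlite3

# In-memory stand-in for the wishlist database, with the same flag columns.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE data (
    isbn13 TEXT,
    flagged_author INTEGER, flagged_publisher INTEGER,
    no_author INTEGER, no_publisher INTEGER)""")
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)", [
    ("9780306406157", 0, 0, 0, 0),  # clean row: appears on the wish list
    ("9783161484100", 1, 0, 0, 0),  # flagged author: excluded by the filter
])

rows = conn.execute("""SELECT * FROM data
    WHERE flagged_author = 0 AND flagged_publisher = 0
      AND no_author = 0 AND no_publisher = 0""").fetchall()
print(len(rows))
```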

One thing to know: the wish list tried to be as inclusive as possible, so it used the field classify_related_print_editions to add additional ISBNs (often for non-English editions) to the final list. These ISBNs do not have their own row in the DB but do appear in the wish list.

Basic JSON serialization of the wish list can also be found in /home/mattmiller/wish_list

@mekarpeles mekarpeles added this to the 2018 Q2 milestone Mar 21, 2018
tfmorris (Contributor) commented Apr 3, 2018

> We may want to add all of these to a subject a la https://openlibrary.org/subjects/internet_archive_wishlist

Noooo. No more fake subjects! They're worse than fake news. "In my little sparkly library" is not a subject. A subject is something that a book is about like "French politics" or "Himalayan mountains."

tfmorris (Contributor) commented Apr 3, 2018

What is the basis for these 2.6 million wishes? I.e., where did they come from?

mekarpeles (Member Author) commented Apr 10, 2018

@tfmorris -- got it, no subject :)

We've been using a subject instead of a list because our lists are unscalable/broken/challenging to fix: a new db entry is added every time a seed is added to a list, causing huge db bloat. We've been experimenting with a new lists db (which powers our "Want to Read", "Already Read", etc. Reading Log features) that is not backed by infogami.

The list comes from here: https://blog.archive.org/2018/03/14/lets-build-a-great-digital-library-together-starting-with-a-wishlist/

mekarpeles (Member Author) commented:

> One thing to know is that the wish list tried to be as inclusive as possible, so it used the field classify_related_print_editions to add additional ISBNs (often of non-English languages) to the final list. These ISBNs would not have their own row in the DB but will appear in the wish list.

@thisismattmiller, is this also the case (i.e. isbn synonyms are included) in the wish_list_march_2018.ndjson?

If so, would you be able to generate a copy of the JSON that does not have synonyms? In order to import into Open Library, we'll need the exact book metadata / ISBN, etc.

thisismattmiller commented:

Sorry I missed this notification.

Yes, it would be possible to exclude the synonyms. In /home/mattmiller/sqlite_database/scripts/serialize_basic.py, starting on line 200, there is:

```python
    # see if we have other print versions available to add to this record
    if row['classify_related_print_editions'] is not None:
        row['classify_related_print_editions'] = json.loads(row['classify_related_print_editions'])
        for e in row['classify_related_print_editions']:
            for isbn in e['isbn']:
                if isbn not in added_isbn:
                    if int(isbn) in have_lookup:
                        skiped_isbns += 1
                    else:
                        # overwrite the obj and add this new one in
                        obj['isbn13'] = isbn
                        obj['isbn10'] = to_isbn10(isbn)
                        obj['oclc'] = e['oclc']
                        obj['language'] = e['language']
                        added_isbn[isbn] = True
                        # write it out
                        out_json.write(json.dumps(obj) + '\n')
                        out_csv_writer.writerow([obj['isbn13'], obj['isbn10'], obj['oclc'], obj['language'], obj['title'], obj['date'], " | ".join(obj['author'])])
                else:
                    # print('already added', isbn)
                    pass

else:
    # this was not a print version, but there might be related print versions we have collected
    # see if we have other print versions available to add to this record
    if row['classify_related_print_editions'] is not None:
        row['classify_related_print_editions'] = json.loads(row['classify_related_print_editions'])
        for e in row['classify_related_print_editions']:
            for isbn in e['isbn']:
                if isbn not in added_isbn:
                    if int(isbn) in have_lookup:
                        skiped_isbns += 1
                    else:
                        # overwrite the obj and add this new one in
                        obj['isbn13'] = isbn
                        obj['isbn10'] = to_isbn10(isbn)
                        obj['oclc'] = e['oclc']
                        obj['language'] = e['language']
                        added_isbn[isbn] = True
                        # write it out
                        out_json.write(json.dumps(obj) + '\n')
                        out_csv_writer.writerow([obj['isbn13'], obj['isbn10'], obj['oclc'], obj['language'], obj['title'], obj['date'], " | ".join(obj['author'])])
                else:
                    # print('already added', isbn)
                    pass
```

Just comment that out and run it, and it will not populate the classify-related titles (a.k.a. ISBN synonyms).
-Matt
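The to_isbn10 helper the excerpt calls isn't shown; a minimal sketch of the standard ISBN-13 to ISBN-10 conversion it presumably performs (a hypothetical implementation, not Matt's actual code):

```python
def to_isbn10(isbn13):
    """Convert a 978-prefixed ISBN-13 string to ISBN-10, recomputing the check digit."""
    if len(isbn13) != 13 or not isbn13.startswith("978"):
        return None  # 979-prefixed ISBNs have no ISBN-10 equivalent
    body = isbn13[3:12]  # drop the 978 prefix and the ISBN-13 check digit
    # ISBN-10 check digit: weighted sum with weights 10..2, mod 11.
    total = sum((10 - i) * int(d) for i, d in enumerate(body))
    check = (11 - total % 11) % 11
    return body + ("X" if check == 10 else str(check))

print(to_isbn10("9780306406157"))  # 0306406152
```

Note the None case: any row whose synonym ISBN falls in the 979 range simply has no ISBN-10 form, so the isbn10 column would be empty for it.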

sbshah97 (Contributor) commented:

@thisismattmiller what would be an easy way to separate Editions from Works on the Wishlist, such that we include just a single Work (from multiple Editions)?

thisismattmiller commented:

In the SQLite DB you would need to limit it to only rows where has_classify = 1, and then collapse rows that have the same classify_work_id value.

The classify_work_id is a work identifier, so if two editions have the same id, they are the same work.
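The collapse Matt describes can be sketched as a GROUP BY. A self-contained example, assuming the column names above (has_classify, classify_work_id), dummy ISBNs, and MIN(isbn13) as an arbitrary way to pick one representative edition per work:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE data (
    isbn13 TEXT, has_classify INTEGER, classify_work_id TEXT)""")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", [
    ("9780000000002", 1, "work-A"),  # two editions of the same work...
    ("9780000000019", 1, "work-A"),  # ...collapse to one row below
    ("9780000000026", 1, "work-B"),
    ("9780000000033", 0, None),      # no classify data: excluded
])

# One row per classify_work_id; MIN(isbn13) picks a representative edition.
works = conn.execute("""SELECT classify_work_id, MIN(isbn13)
    FROM data
    WHERE has_classify = 1
    GROUP BY classify_work_id""").fetchall()
print(works)
```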

mekarpeles (Member Author) commented:

Reopening until we've had a chance to process the remaining Wishlist works (and add them to OL).

@mekarpeles mekarpeles reopened this Jun 19, 2018
tfmorris (Contributor) commented:

The more I look at this list, the more suspicious of it I become. I was going to say it was biased toward English and North America, but I haven't found any evidence that it includes anything except English. At Open Library we've been pushing for increased diversity, so it would be sad to see Internet Archive's influence reverse that. Diversity is important.

There are a bunch of references to a private database which apparently contains more information than the public CSVs. Is there a reason for this lack of transparency? Can we get a dump of all the metadata available?

In looking at https://archive.org/download/open_libraries_wish_list/wish_list_isbn13_ver_3_provenance.tsv.zip, which seems to contain the most information available, the ISBNs from Library Link seem to be completely disjoint from all the other sources, which is exceedingly odd. Is it accurate?

I have many more questions and concerns, but with access to the metadata I could probably answer the questions myself (and figure out if the concerns are justified).

mekarpeles (Member Author) commented Sep 15, 2018

> At OpenLibrary we've been pushing for increased diversity

Agreed -- p.s. one experiment I'm trying to push is on-the-fly per-page translation of each of our books using a translation API.

We have several different projects in the mix which are not mutually exclusive w/ this Open Libraries Wishlist. This wishlist is for a very specific program (openlibraries.online).

We are independently funding large collections, e.g. LGBT collections, Indian collections, etc.; these just happen to be separate, independently funded initiatives.

I think we're unlikely to see a huge multi-lingual emphasis in the openlibraries-specific push. I think IA is focusing on books a lot of libraries have for this initiative, as a way to potentially help them migrate online. It would be great if we could build a defensible library system that any library could contribute to, a la the Open Content Alliance back in 2008, circa OL's inception:
http://www.infotoday.com/searcher/jan08/Ashmore_Grogg.shtml

@mekarpeles mekarpeles assigned hornc and unassigned sbshah97 Oct 29, 2018
tfmorris (Contributor) commented:

OK, so it's US English only and that's not going to change.
Can we at least get access to the metadata so we can judge its quality?

tfmorris (Contributor) commented:

I propose that Open Library defer this task until Internet Archive is more transparent about this list. I heard a (hyperlocal) podcast just the other day touting how Open Library could benefit low-income, marginalized communities in Africa, Asia, etc. An American community library "wish" list isn't at all relevant to them.

hornc (Collaborator) commented Feb 15, 2019

Some stats:

Out of 2,000 randomly sampled English books on the Wishlist:
- 239 NOT FOUND (12%): not found on OL or elsewhere
- 203 CREATED (10%): record created from new metadata
- 1,558 FOUND (78%): OL already had a record for the ISBN

Out of 2,000 randomly sampled non-English ISBNs on the Wishlist:
- 822 NOT FOUND (40%): not found on OL or elsewhere
- 289 CREATED (15%): record created from new metadata
- 888 FOUND (45%): OL already had a record for the ISBN

75% of the ~1.5M ISBNs are English; 12% are German, which seems to be the next biggest category. I'm planning to create a tool to analyse ISBN lists by allocated agency (which in most cases equates to country), so we can get a better indication of book sources from any large list of ISBNs.
These are some other counts, sampling various countries:
- fra = 38,516
- ger = 184,027
- jap = 6,922
- former-ussr = 15,114
- ind = 6,725
- ita = 26,585
- mex = 3,352
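The agency analysis described above amounts to longest-prefix matching on the digits after 978. A minimal sketch, using a tiny hand-coded subset of registration groups (the authoritative assignments come from the International ISBN Agency's range data, not this dict, and the sample ISBNs are illustrative):

```python
from collections import Counter

# Small illustrative subset of ISBN registration groups (978 prefix only).
GROUPS = {
    "0": "English", "1": "English", "2": "French",
    "3": "German", "4": "Japan", "5": "former USSR", "88": "Italy",
}

def isbn_group(isbn13):
    """Return the registration group for a 978-prefixed ISBN-13, if known."""
    rest = isbn13[3:]
    # Registration groups vary in length, so try longer prefixes first.
    for length in (2, 1):
        group = GROUPS.get(rest[:length])
        if group:
            return group
    return "unknown"

sample = ["9780306406157", "9783161484100", "9788804668237"]
counts = Counter(isbn_group(i) for i in sample)
print(counts)
```

Run over the full wishlist CSV, a Counter like this yields per-agency totals comparable to the country counts above.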

Importing the non-English ISBNs into OL seems to give a slightly higher new-item rate. The most obvious source of non-English items on the wishlist is probably the international Wikipedia citation lists that were used (items with more than one citation). The bulk of the wishlist is already on OL, so this task is about filling in the gaps, and the biggest gap is in the non-English books.

If there are good quality sources of diverse bulk ISBNs (or full metadata!) we can import, point me to them and I can run the imports in parallel.

mekarpeles (Member Author) commented:

I think @hornc finished this!

No branches or pull requests · 5 participants