
Add Open Library record for every work/edition in Internet Archive 2020 Wishlist #869

Closed
mekarpeles opened this issue Mar 15, 2018 · 16 comments

mekarpeles (Member) commented Mar 15, 2018

There are 2.6M items on the Internet Archive Open Libraries Wishlist. We want to make sure each of these books has a corresponding catalog entry on openlibrary.org.

Steps

See: https://archive.org/details/open_libraries_wish_list
https://archive.org/download/open_libraries_wish_list/wish_list_isbn13_ver_1.csv.zip

@thisismattmiller -- can you generate a master CSV that collates isbn10, isbn13, and oclc (into one row)?

We want to add all of these records into Open Library so we can query by any of these identifier fields and retrieve the metadata for the book.

Also, do we have metadata on these books?
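One existing way to check whether any of these identifiers already resolves to a catalog entry is Open Library's Books API, which accepts ISBN and OCLC bibkeys. A minimal sketch of building such a lookup URL (the identifier values here are placeholders, not items from the wishlist):

```python
from urllib.parse import urlencode

def books_api_url(**ids):
    """Build an Open Library Books API URL from ISBN/OCLC bibkeys.

    ids maps a bibkey namespace ("ISBN", "OCLC") to an identifier string.
    """
    bibkeys = ",".join(f"{ns}:{value}" for ns, value in ids.items())
    query = urlencode({"bibkeys": bibkeys, "format": "json", "jscmd": "data"})
    return f"https://openlibrary.org/api/books?{query}"

# Example lookup by both ISBN-13 and OCLC number at once:
url = books_api_url(ISBN="9780140328721", OCLC="297222669")
print(url)
```

Fetching that URL returns a JSON object keyed by bibkey for every identifier Open Library already knows, which makes it easy to split a wishlist into "already cataloged" and "needs import" buckets.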

@mekarpeles mekarpeles added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] 2018 Goals labels Mar 15, 2018
mekarpeles (Member Author) commented:

We may want to add all of these to a subject a la https://openlibrary.org/subjects/internet_archive_wishlist

thisismattmiller commented Mar 20, 2018

All data for the wish list is in the SQLite database in my home directory (/home/mattmiller/sqlite_database); there are example scripts in there for serialization.
Field information is in this Google sheet: https://docs.google.com/spreadsheets/d/1GDATWbgncmQzDaTJVdJU1kVcRhJIuMs0zHUnoCITED0/edit?usp=sharing

The basic SQL query to get everything that will appear on the wish list is:

```sql
SELECT * FROM data
WHERE flagged_author = 0
  AND flagged_publisher = 0
  AND no_author = 0
  AND no_publisher = 0;
```
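As a sketch, that filter can be run from Python's sqlite3 module. The table name and flag columns come from the comment above; the in-memory database here is a stand-in for the one in /home/mattmiller/sqlite_database (exact filename unknown):

```python
import sqlite3

# In-memory stand-in for the wishlist database, with the same flag columns.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE data (
    isbn13 TEXT,
    flagged_author INTEGER, flagged_publisher INTEGER,
    no_author INTEGER, no_publisher INTEGER)""")
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)", [
    ("9780306406157", 0, 0, 0, 0),  # clean row: appears on the wish list
    ("9783161484100", 1, 0, 0, 0),  # flagged author: excluded by the filter
])

rows = conn.execute("""SELECT * FROM data
    WHERE flagged_author = 0 AND flagged_publisher = 0
      AND no_author = 0 AND no_publisher = 0""").fetchall()
print(len(rows))
```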

One thing to know: the wish list tried to be as inclusive as possible, so it used the field classify_related_print_editions to add additional ISBNs (often for non-English editions) to the final list. These ISBNs do not have their own row in the DB but do appear in the wish list.

Basic JSON serialization of the wish list can also be found in /home/mattmiller/wish_list

@mekarpeles mekarpeles added this to the 2018 Q2 milestone Mar 21, 2018
tfmorris (Contributor) commented Apr 3, 2018

> We may want to add all of these to a subject a la https://openlibrary.org/subjects/internet_archive_wishlist

Noooo. No more fake subjects! They're worse than fake news. "In my little sparkly library" is not a subject. A subject is something that a book is about like "French politics" or "Himalayan mountains."

tfmorris (Contributor) commented Apr 3, 2018

What is the basis for these 2.6 million wishes? I.e., where did they come from?

mekarpeles (Member Author) commented Apr 10, 2018

@tfmorris -- got it, no subject :)

We've been using a subject instead of a list because our lists are unscalable/broken/challenging to fix: a new db entry is added every time a seed is added to a list, causing huge db bloat. We've been experimenting with a new lists db (which powers our "Want to Read", "Already Read", etc. Reading Log features) that is not backed by infogami.

The list comes from here: https://blog.archive.org/2018/03/14/lets-build-a-great-digital-library-together-starting-with-a-wishlist/

mekarpeles (Member Author) commented:

> One thing to know is that the wish list tried to be as inclusive as possible, so it used the field classify_related_print_editions to add additional ISBNs (often of non-English languages) to the final list. These ISBNs would not have their own row in the DB but will appear in the wish list.

@thisismattmiller, is this also the case (i.e. isbn synonyms are included) in the wish_list_march_2018.ndjson?

If so, would you be able to generate a copy of the JSON that does not have synonyms? In order to import into Open Library, we'll need the exact book metadata / ISBN, etc.

thisismattmiller commented:

Sorry I missed this notification.

Yes, it would be possible to exclude the synonyms. In /home/mattmiller/sqlite_database/scripts/serialize_basic.py, starting on line 200, there is:

```python
    # see if we have other print versions available to add to this record
    if row['classify_related_print_editions'] is not None:
        row['classify_related_print_editions'] = json.loads(row['classify_related_print_editions'])
        for e in row['classify_related_print_editions']:
            for isbn in e['isbn']:
                if isbn not in added_isbn:
                    if int(isbn) in have_lookup:
                        skiped_isbns += 1
                    else:
                        # overwrite the obj and add this new one in
                        obj['isbn13'] = isbn
                        obj['isbn10'] = to_isbn10(isbn)
                        obj['oclc'] = e['oclc']
                        obj['language'] = e['language']
                        added_isbn[isbn] = True
                        # write it out
                        out_json.write(json.dumps(obj) + '\n')
                        out_csv_writer.writerow([obj['isbn13'], obj['isbn10'], obj['oclc'], obj['language'], obj['title'], obj['date'], " | ".join(obj['author'])])
                else:
                    # print('already added', isbn)
                    pass

else:
    # this was not a print version, but there might be related print versions we have collected
    # see if we have other print versions available to add to this record
    if row['classify_related_print_editions'] is not None:
        row['classify_related_print_editions'] = json.loads(row['classify_related_print_editions'])
        for e in row['classify_related_print_editions']:
            for isbn in e['isbn']:
                if isbn not in added_isbn:
                    if int(isbn) in have_lookup:
                        skiped_isbns += 1
                    else:
                        # overwrite the obj and add this new one in
                        obj['isbn13'] = isbn
                        obj['isbn10'] = to_isbn10(isbn)
                        obj['oclc'] = e['oclc']
                        obj['language'] = e['language']
                        added_isbn[isbn] = True
                        # write it out
                        out_json.write(json.dumps(obj) + '\n')
                        out_csv_writer.writerow([obj['isbn13'], obj['isbn10'], obj['oclc'], obj['language'], obj['title'], obj['date'], " | ".join(obj['author'])])
                else:
                    # print('already added', isbn)
                    pass
```

Just comment that out and run it, and it will not populate the classify-related titles (a.k.a. ISBN synonyms).
-Matt
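The to_isbn10 helper the excerpt calls isn't shown; a minimal sketch of the standard ISBN-13 to ISBN-10 conversion it presumably performs (a hypothetical implementation, not Matt's actual code):

```python
def to_isbn10(isbn13):
    """Convert a 978-prefixed ISBN-13 string to ISBN-10, recomputing the check digit."""
    if len(isbn13) != 13 or not isbn13.startswith("978"):
        return None  # 979-prefixed ISBNs have no ISBN-10 equivalent
    body = isbn13[3:12]  # drop the 978 prefix and the ISBN-13 check digit
    # ISBN-10 check digit: weighted sum with weights 10..2, mod 11.
    total = sum((10 - i) * int(d) for i, d in enumerate(body))
    check = (11 - total % 11) % 11
    return body + ("X" if check == 10 else str(check))

print(to_isbn10("9780306406157"))  # 0306406152
```

Note the None case: any row whose synonym ISBN falls in the 979 range simply has no ISBN-10 form, so the isbn10 column would be empty for it.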

sbshah97 (Contributor) commented:

@thisismattmiller what would be an easy way to separate Editions from Works on the Wishlist, such that we include just a single Work (from multiple Editions)?

thisismattmiller commented:

In the SQLite DB you would need to limit it to only rows where has_classify = 1, and then collapse rows that have the same classify_work_id value.

The classify_work_id is a work identifier, so if two editions have the same id, they are the same work.
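The collapse Matt describes can be sketched as a GROUP BY. A self-contained example, assuming the column names above (has_classify, classify_work_id), dummy ISBNs, and MIN(isbn13) as an arbitrary way to pick one representative edition per work:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE data (
    isbn13 TEXT, has_classify INTEGER, classify_work_id TEXT)""")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", [
    ("9780000000002", 1, "work-A"),  # two editions of the same work...
    ("9780000000019", 1, "work-A"),  # ...collapse to one row below
    ("9780000000026", 1, "work-B"),
    ("9780000000033", 0, None),      # no classify data: excluded
])

# One row per classify_work_id; MIN(isbn13) picks a representative edition.
works = conn.execute("""SELECT classify_work_id, MIN(isbn13)
    FROM data
    WHERE has_classify = 1
    GROUP BY classify_work_id""").fetchall()
print(works)
```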

mekarpeles (Member Author) commented:

Reopening until we've had a chance to process the remaining Wishlist works (and add them to OL).

@mekarpeles mekarpeles reopened this Jun 19, 2018
tfmorris (Contributor) commented:

The more I look at this list, the more suspicious of it I become. I was going to say it was biased toward English and North America, but I haven't found any evidence that it includes anything except English. At Open Library we've been pushing for increased diversity, so it would be sad to see Internet Archive's influence reverse that. Diversity is important.

There are a bunch of references to a private database which apparently contains more information than the public CSVs. Is there a reason for this lack of transparency? Can we get a dump of all the metadata available?

In looking at https://archive.org/download/open_libraries_wish_list/wish_list_isbn13_ver_3_provenance.tsv.zip, which seems to contain the most information available, the ISBNs from Library Link seem to be completely disjoint from all the other sources, which is exceedingly odd. Is it accurate?

I have many more questions and concerns, but with access to the metadata I could probably answer the questions myself (and figure out if the concerns are justified).

mekarpeles (Member Author) commented Sep 15, 2018

> At OpenLibrary we've been pushing for increased diversity

Agreed -- p.s. one experiment I'm trying to push is on-the-fly per-page translation of each of our books using a translation API.

We have several different projects in the mix which are not mutually exclusive w/ this Open Libraries Wishlist. This wishlist is for a very specific program (openlibraries.online).

We are independently funding large collections, e.g. LGBT collections, Indian collections, etc.; these just happen to be separate, independently funded initiatives.

I think we're unlikely to see a huge multi-lingual emphasis in the openlibraries-specific push. I think IA is focusing on books a lot of libraries have for this initiative, as a way to potentially help them migrate online. It would be great if we could build a defensible library system that any library could contribute to, a la the Open Content Alliance back in 2008, circa OL's inception:
http://www.infotoday.com/searcher/jan08/Ashmore_Grogg.shtml

@mekarpeles mekarpeles assigned hornc and unassigned sbshah97 Oct 29, 2018
tfmorris (Contributor) commented:

OK, so it's US English only and that's not going to change.
Can we at least get access to the metadata so we can judge its quality?

tfmorris (Contributor) commented:

I propose that Open Library defer this task until Internet Archive is more transparent about this list. I heard a (hyperlocal) podcast just the other day touting how Open Library could benefit low-income, marginalized communities in Africa, Asia, etc. An American community library "wish" list isn't at all relevant to them.

hornc (Collaborator) commented Feb 15, 2019

Some stats:

Out of 2,000 randomly sampled English books on the Wishlist:
- 239 NOT FOUND (12%): not found on OL or elsewhere
- 203 CREATED (10%): record created from new metadata
- 1,558 FOUND (78%): OL already had a record for the ISBN

Out of 2,000 randomly sampled non-English ISBNs on the Wishlist:
- 822 NOT FOUND (40%): not found on OL or elsewhere
- 289 CREATED (15%): record created from new metadata
- 888 FOUND (45%): OL already had a record for the ISBN

75% of the ~1.5M ISBNs are English; 12% are German, which seems to be the next biggest category. I'm planning to create a tool to analyse ISBN lists by allocated agency (which in most cases equates to country), so we can get a better indication of book sources from any large list of ISBNs.
These are some other counts, sampling various countries:
- fra = 38,516
- ger = 184,027
- jap = 6,922
- former-ussr = 15,114
- ind = 6,725
- ita = 26,585
- mex = 3,352
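The agency analysis described above amounts to longest-prefix matching on the digits after 978. A minimal sketch, using a tiny hand-coded subset of registration groups (the authoritative assignments come from the International ISBN Agency's range data, not this dict, and the sample ISBNs are illustrative):

```python
from collections import Counter

# Small illustrative subset of ISBN registration groups (978 prefix only).
GROUPS = {
    "0": "English", "1": "English", "2": "French",
    "3": "German", "4": "Japan", "5": "former USSR", "88": "Italy",
}

def isbn_group(isbn13):
    """Return the registration group for a 978-prefixed ISBN-13, if known."""
    rest = isbn13[3:]
    # Registration groups vary in length, so try longer prefixes first.
    for length in (2, 1):
        group = GROUPS.get(rest[:length])
        if group:
            return group
    return "unknown"

sample = ["9780306406157", "9783161484100", "9788804668237"]
counts = Counter(isbn_group(i) for i in sample)
print(counts)
```

Run over the full wishlist CSV, a Counter like this yields per-agency totals comparable to the country counts above.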

Importing the non-English ISBNs into OL seems to give a slightly higher new-item rate. The most obvious source of non-English items on the wishlist is probably the international Wikipedia citation lists that were used (items with more than one citation). The bulk of the wishlist is already on OL, so this task is about filling in the gaps, and the biggest gap is in the non-English books.

If there are good quality sources of diverse bulk ISBNs (or full metadata!) we can import, point me to them and I can run the imports in parallel.

mekarpeles (Member Author) commented:

I think @hornc finished this!

No branches or pull requests · 5 participants