Add Open Library record for every work/edition in Internet Archive 2020 Wishlist #869
Comments
We may want to add all of these to a |
All data for the wish list is found in the sqlite database in my home directory ( The basic SQL query to get everything that will appear on the wish list is: One thing to know is that the wish list tried to be as inclusive as possible, so it used the field Basic JSON serialization of the wish list can also be found in |
Noooo. No more fake subjects! They're worse than fake news. "In my little sparkly library" is not a subject. A subject is something that a book is about like "French politics" or "Himalayan mountains." |
What is the basis for these 2.6 million wishes? I.e., where did they come from? |
@tfmorris -- got it, no We've been using The list comes from here: https://blog.archive.org/2018/03/14/lets-build-a-great-digital-library-together-starting-with-a-wishlist/ |
@thisismattmiller, is this also the case (i.e. isbn synonyms are included) in the If so, would you be able to generate a copy of the json that does not have synonyms? In order to import into Open Library, we'll need the exact book metadata / isbn, etc |
Sorry I missed this notification. Yes, that would be possible to exclude the synonyms. In
Just comment that out and run it and it will not populate out the classify related titles aka isbn synonyms. |
@thisismattmiller what would be an easy way to separate out Editions from Works on the Wishlist such that we include just a single Work (from Multiple Editions). |
In the SQLite DB you would need to limit it to only things that: The classify_work_id is a work identifier, so if two editions have the same id it is the same work. |
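The grouping idea in the comment above could be sketched as follows. This is only an illustration: the table name `wishlist` and the column name `isbn` are assumptions (only `classify_work_id` is named in this thread), and the additional filter conditions mentioned above are not reproduced because they are not shown here.

```python
import sqlite3


def one_edition_per_work(conn, table="wishlist", isbn_col="isbn"):
    """Return one (classify_work_id, isbn) pair per work.

    `table` and `isbn_col` are hypothetical names; adjust them to the real
    schema. Rows without a classify_work_id are skipped because they cannot
    be grouped into a work.
    """
    query = (
        f"SELECT classify_work_id, MIN({isbn_col}) "
        f"FROM {table} "
        "WHERE classify_work_id IS NOT NULL "
        "GROUP BY classify_work_id "
        "ORDER BY classify_work_id"
    )
    return conn.execute(query).fetchall()
```

Picking `MIN(isbn)` is an arbitrary but deterministic way to choose a representative edition; any other rule (earliest publication date, most holdings) would work equally well if that data is available.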
Opening until we've had a chance to process the remaining Wishlist works (and add them to OL) |
The more I look at this list, the more suspicious of it I become. I was going to say it was English & North American biased, but I haven't found any evidence that it includes anything except English. At OpenLibrary we've been pushing for increased diversity, so it'd be sad to see Internet Archive's influence reverse that. Diversity is important. There are a bunch of references to a private database which apparently contains more information than the public CSVs. Is there a reason for this lack of transparency? Can we get a dump of all the metadata available? In looking at https://archive.org/download/open_libraries_wish_list/wish_list_isbn13_ver_3_provenance.tsv.zip, which seems to contain the most information available, the ISBNs from Library Link seem to be completely disjoint from all the other sources, which is exceedingly odd. Is it accurate? I have many more questions and concerns, but with access to the metadata I could probably answer the questions myself (and figure out if the concerns are justified). |
Agreed -- p.s. one experiment I'm trying to push is on-the-fly per-page translation of each of our books using a translation API. We have several different projects in the mix which are not mutually exclusive w/ this Open Libraries Wishlist. This wishlist is for a very specific program (openlibraries.online). We are independently funding large collections, e.g. LGBT collections, Indian collections, etc. These just happen to be separate initiatives which are independently funded. I think we're unlikely to see a huge multi-lingual emphasis in the openlibraries-specific push. I think IA is focusing on books a lot of libraries have for this initiative, as a way to potentially help them migrate online. It would be great if we could build a defensible library system any library could contribute to, a la the Open Content Alliance back in 2008, circa OL's inception: |
OK, so it's US English only and that's not going to change. |
I propose the OpenLibrary defer this task until Internet Archive is more transparent about this list. I heard a (hyperlocal) podcast just the other day touting how OpenLibrary could benefit the low income marginalized in Africa, Asia, etc. An American community library "wish" list isn't at all relevant to them. |
Some stats, Out of 2000 randomly sampled Non-English ISBNs on the Wishlist: 75% of the ~1.5M ISBNs are English, 12% are German, which seems to be the next biggest category. I'm planning on creating a tool to analyse ISBN lists by allocated Agency (which in most cases equates to country), so we can get a better indication of book sources from any large list of ISBNs. Importing the Non-English ISBNs into OL seems to give a slightly higher new item rate. The most obvious source of non-English items on the wishlist is probably the international Wikipedia citation lists that were used (items with more than one citation). The bulk of the wishlist is already on OL, so this task is about filling in the gaps, and the biggest gap is in the non-English books. If there are good quality sources of diverse bulk ISBNs (or full metadata!) we can import, point me to them and I can run the imports in parallel. |
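A rough sketch of the kind of agency analysis described above, assuming ISBN-13 input. It only looks at the single-digit registration groups under the 978 prefix (0 and 1 English, 2 French, 3 German, 4 Japan, 5 former USSR, 7 China); longer group prefixes and all 979 ISBNs are lumped into "other", so a real tool would need the full range tables published by the International ISBN Agency. The function names are illustrative, not part of any existing tool.

```python
from collections import Counter

# Single-digit registration groups under the 978 prefix.
SINGLE_DIGIT_GROUPS = {
    "0": "English",
    "1": "English",
    "2": "French",
    "3": "German",
    "4": "Japan",
    "5": "former USSR",
    "7": "China",
}


def registration_group(isbn13):
    """Classify an ISBN-13 by its registration group (coarse approximation)."""
    digits = isbn13.replace("-", "").strip()
    if len(digits) != 13 or not digits.isdigit():
        return "invalid"
    if not digits.startswith("978"):
        return "other"
    return SINGLE_DIGIT_GROUPS.get(digits[3], "other")


def tally_groups(isbns):
    """Count how many ISBNs fall into each registration group."""
    return Counter(registration_group(isbn) for isbn in isbns)
```

Running `tally_groups` over a wishlist file and over a sample of the Open Library dump would give comparable breakdowns for both lists.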
I think @hornc finished this! |
There are 2.6M items on the Internet Archive Openlibraries Wishlist. We want to make sure each of these books has a corresponding catalog entry in openlibrary.org.
Steps
wishlist dataset from Add Open Library record for every work/edition in Internet Archive 2020 Wishlist #869 (@mekarpeles will provide an updated link)
books API: https://openlibrary.org/dev/docs/api/books. This won't scale to 2.6M editions, which is how many are on the wishlist. So, instead, we'll need to download + process the editions data dump: https://openlibrary.org/data/ol_dump_editions_latest.txt.gz
See: https://archive.org/details/open_libraries_wish_list
https://archive.org/download/open_libraries_wish_list/wish_list_isbn13_ver_1.csv.zip
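One way the dump-based check might look, as a minimal sketch. It assumes the editions dump is tab-separated with the JSON record in the fifth column and that edition records carry `isbn_10` and `isbn_13` lists; the function names are made up for illustration, and loading the wishlist ISBNs from the CSV is left out.

```python
import gzip
import json


def isbns_in_dump(lines):
    """Yield every ISBN found in an iterable of editions-dump lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t", 4)
        if len(parts) != 5:
            continue  # not a well-formed dump row
        try:
            record = json.loads(parts[4])
        except json.JSONDecodeError:
            continue
        for field in ("isbn_10", "isbn_13"):
            for isbn in record.get(field, []):
                if isinstance(isbn, str):
                    yield isbn.replace("-", "")


def missing_from_open_library(wishlist_isbns, dump_lines):
    """Return the wishlist ISBNs that never appear in the dump."""
    missing = set(wishlist_isbns)
    for isbn in isbns_in_dump(dump_lines):
        missing.discard(isbn)
    return missing


def open_dump(path):
    """Open a gzipped dump file for line-by-line streaming as text."""
    return gzip.open(path, "rt", encoding="utf-8")
```

Usage would be along the lines of `missing_from_open_library(wishlist, open_dump("ol_dump_editions_latest.txt.gz"))`. Only the wishlist set is held in memory; the dump itself is streamed, which is what makes this approach feasible where one API call per ISBN is not.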
@thisismattmiller -- can we get a master CSV which collates isbn10, isbn13, and oclc into one row per book?
We want to add all of these records into Open Library so we can query by any of these identifier fields and retrieve the metadata for the book.
Also, do we have metadata on these books?
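If the master CSV does not arrive in collated form, ISBN-10 values can at least be normalized to ISBN-13 so that both lists share a single key. A small sketch using the standard conversion (prefix 978, then recompute the check digit with alternating weights 1 and 3); the function name is illustrative:

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to its 978-prefixed ISBN-13 equivalent."""
    core = isbn10.replace("-", "").strip()
    if len(core) != 10 or not core[:9].isdigit():
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    # Drop the old check digit (which may be 'X') and add the 978 prefix.
    body = "978" + core[:9]
    # ISBN-13 check digit: weights alternate 1, 3 across the 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body))
    check = (10 - total % 10) % 10
    return body + str(check)
```

OCLC numbers have no such algorithmic mapping to ISBNs, so that column would still need to come from the source data.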