Use GROBID for extraction of metadata from PDFs #6158

Closed
tobiasdiez opened this issue Mar 22, 2020 · 7 comments

@tobiasdiez
Member

tobiasdiez commented Mar 22, 2020

Now that we have the GROBID server up and running, we can also use it to extract bibliographic metadata from PDFs.

Service documentation: https://grobid.readthedocs.io/en/latest/Grobid-service/
Relevant endpoint: /api/processHeaderDocument

Old PR (using CERMINE instead of GROBID): #2474
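
For illustration, here is a minimal sketch of how a client could call the header endpoint mentioned above. It assumes a GROBID server reachable at `http://localhost:8070` (an assumption, not necessarily the server referred to in this issue) and that the PDF is sent as a multipart form field named `input`, as described in the GROBID service documentation. The function names and the explicit `Accept: application/xml` header are illustrative choices; the response format may depend on the server version.

```python
import uuid
import urllib.request

# Assumption: a GROBID server running locally on its usual port.
GROBID_URL = "http://localhost:8070"


def build_header_request(pdf_bytes, filename="paper.pdf", base_url=GROBID_URL):
    """Build a multipart/form-data POST for /api/processHeaderDocument."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # The PDF goes into the form field named "input".
        f'Content-Disposition: form-data; name="input"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url}/api/processHeaderDocument",
        data=body,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Accept": "application/xml",  # ask for TEI XML
        },
    )


def extract_header_tei(pdf_path, base_url=GROBID_URL):
    """Send the PDF to the server and return the response body as text."""
    with open(pdf_path, "rb") as f:
        request = build_header_request(f.read(), base_url=base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it keeps the network call isolated, so the request construction can be checked without a running server.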

@koppor
Member

koppor commented Mar 23, 2020

Currently, it returns TEI XML format only, not BibTeX. I can try to patch the server accordingly. (Refs kermitt2/grobid#532 (comment))
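
As a rough sketch of what a TEI-to-BibTeX conversion involves, the snippet below pulls the title and authors out of a TEI header and formats them as a BibTeX entry. The sample TEI is a simplified, hand-written fragment (an assumption about the general shape of GROBID's header output, not a captured response), and `tei_header_to_bibtex` is a hypothetical helper, not part of GROBID or JabRef.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Simplified, hand-written TEI header used only for illustration.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">An Example Paper</title></titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
            <author><persName><forename type="first">Alan</forename><surname>Turing</surname></persName></author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def tei_header_to_bibtex(tei_xml, citation_key="grobid"):
    """Convert the title and authors of a TEI header into a BibTeX entry."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS).strip()
    authors = []
    for pers in root.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS).strip()
        forenames = " ".join((f.text or "").strip() for f in pers.findall("tei:forename", TEI_NS))
        # BibTeX convention: "Surname, Forenames", joined by " and ".
        authors.append(f"{surname}, {forenames}".strip(", "))
    fields = {"author": " and ".join(authors), "title": title}
    body = ",\n".join(f"  {key} = {{{value}}}" for key, value in fields.items() if value)
    return f"@misc{{{citation_key},\n{body}\n}}"


print(tei_header_to_bibtex(SAMPLE_TEI))
```

A real conversion would also need to handle the journal, year, DOI, and entry type, which this sketch leaves out.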

I am curious in which cases GROBID is better than JabRef's custom implementation. The custom implementation worked fine for me for IEEE and Springer LNCS papers, though I still need to add more test cases.

@tobiasdiez
Member Author

It would be nice if you could change the server accordingly. GROBID is the de facto standard for metadata extraction from PDFs (and is used by ResearchGate, Mendeley, etc.). Our implementation was really naïve and only works for a few publishers.

koppor self-assigned this Apr 9, 2020
@github-actions
Contributor

github-actions bot commented Dec 8, 2020

This issue has been inactive for half a year. Since JabRef is constantly evolving, this issue may no longer be relevant, and it will be closed in two weeks if no further activity occurs.

As part of an effort to ensure that the JabRef team is focusing on important and valid issues, we would like to ask if you could update the issue if it still persists. This could be in the following form:

  • If there has been a longer discussion, add a short summary of the most important points as a new comment (if not yet existing).
  • Provide further steps or information on how to reproduce this issue.
  • Upvote the initial post if you would like to see it implemented soon. Votes are not the only metric that we use to determine which requests are implemented, but they do factor into our decision-making process.
  • If all information is provided and still up-to-date, then just add a short comment that the issue is still relevant.

Thank you for your contribution!

@DesBw

DesBw commented Aug 29, 2021

> GROBID is the de facto standard for metadata extraction from PDFs (and is used by ResearchGate, Mendeley, etc.). Our implementation was really naïve and only works for a few publishers.

Mendeley gives junk. I haven't finished cleaning up the junk Mendeley gave me 10 years ago. Using the system that Mendeley is using is a really bad idea. It never gets it right.

  • It is better to improve other aspects of JabRef than to waste resources on a system that will produce gibberish and unclean reference data.

@Siedlerchr
Member

JabRef now uses several sources for extracting metadata from PDFs (XMP, embedded BibTeX, DOI, GROBID) and allows comparing them.
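
To make the idea of comparing sources concrete, here is a small sketch that lists the fields on which several extraction results disagree. This is not JabRef's implementation; the function name, the source labels, and the sample values are all hypothetical.

```python
def differing_fields(candidates):
    """Return {field: {source: value}} for fields on which sources disagree.

    `candidates` maps a source name to a dict of field name -> value.
    Values are compared case-insensitively after trimming whitespace.
    """
    by_field = {}
    for source, entry in candidates.items():
        for field, value in entry.items():
            by_field.setdefault(field, {})[source] = value
    return {
        field: values
        for field, values in by_field.items()
        if len({v.strip().lower() for v in values.values()}) > 1
    }


# Hypothetical extraction results for one PDF.
candidates = {
    "xmp": {"title": "An Example Paper", "year": "2020"},
    "grobid": {"title": "An Example Paper", "year": "2021"},
}
print(differing_fields(candidates))
```

Only the conflicting fields are returned, so a user interface would need to show just those for manual resolution.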

Thank you for reporting this issue. We think that this is already fixed in our development version; consequently, the change will be included in the next release.

We would like to ask you to use a development build from https://builds.jabref.org/main and report back if it works for you. Please remember to make a backup of your library before trying out this version.

@koppor
Member

koppor commented Dec 6, 2021

Fixed by #2838

koppor pushed a commit that referenced this issue Aug 1, 2022
c750b6e APA: Put conditional event-title logic in a macro (#6161)
a87414f Remove month from association-for-compuational-linguistics.csl (#6158)
6153db0 Remove issue numbers from BJOC style (#6155)
e231ea3 Bug fix for `event` regression (#6154)
0dab651 Add event-title to other APA styles (#6153)
698cf1c APA: `event-title` and conditional `event` (#6152)
58d3f8f Update vancouver-author-date.csl (#6148)
f1638a9 add substitute to Vancouver author date (#6147)
39fede5 Update associacao-brasileira-de-normas-tecnicas.csl (#6138)
fde7695 Include chapter title (#6140)
1e3d8b4 Update n.d. abbreivation for DGP style (#6136)
ebb728b suffix '.' after first group; changed e-mail (#6135)
eed4f07 Update and rename sciences-po-ecole-doctorale-note-french.csl to scie… (#6127)
f194647 Delete TU Dresden Medizin as requested by library (#6131)
d8423d8 Create entomological-review.csl (#6120)
064a394 Create australasian-journal-of-philosophy.csl (#6063)
a998ded Add composer.json (#5668)
37083c9 Update copernicus-publications.csl (#6062)
694c97b Create chaucer review (#6061)
625a424 Create haffner-style-manual.csl (#6054)
8b7224b make annals-of-allergy-asthma-and-immunology independent (#6041)
710748c Create university-of-pretoria-harvard-theology-religion.csl (#6106)
d16dffd Create health-physics.csl (#6040)
ca9e184 Update style-manual-australian-government.csl (#6119)
e412277 Create chemical-engineering-technology.csl (#6039)
bebdb48 Create bibliothek-forschung-und-praxis.csl (#6038)
29e49cd Update nature.csl (#6117)
891897d fix short title for SBL (#6118)

git-subtree-dir: buildres/csl/csl-styles
git-subtree-split: c750b6e