Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance update: Refactor loader add json, yamllib support #471

Merged
merged 3 commits into from
Mar 30, 2023

Conversation

bosd
Copy link
Collaborator

@bosd bosd commented Feb 12, 2023

This PR is a big refactor of the loader.
Sorry for that ;)

Invoice2data was very slow to use. The main cause is the loading of the yaml templates by pyyaml.
This pr adds support for the c library libyaml. Which speeds up the template loading by 10x
To use it, one needs to install this library.

$ apt-get install libyaml-dev

The usage of yaml templates was an design choice. They offer good human readability and might be easier to write then json files.
yaml is also very flexible, as it is able to package json inside.
(Discovered that, while developing the camelot plugin)

However there is a good case to be made to support json templates as well.

  • natively supported by python no additional library needed.
  • one of the most popular data formats
  • json can be 223x faster then yaml.
  • Some invoice2data plugins might require json templates (like camelot which is still in development).
  • Yaml files can be converted to json.

@bosd bosd marked this pull request as ready for review February 12, 2023 23:21
@bosd bosd mentioned this pull request Feb 12, 2023
@rmilecki
Copy link
Collaborator

I'm still thinking about this... give me a bit more time please.

@rmilecki
Copy link
Collaborator

Some first review note: I don't much like this one big commit handling few independent tasks. You get rid of ordered_load (without any real explanation), you start using SafeLoader and then you add support for JSON. For cleanliness I'd rather see those as at least 3 different commits.

Now on the real topic.

I totally agree we need to improve invoice2data performance. Loading is too slow. It seems that SafeLoader is a good idea (I didn't test it yet).

My another idea would be to add cache support. It should be quite simple:

  1. For each loaded template file calculate its checksum (SHA-256 probably?). Use cache with checksum (key) and dictionary (value).
  2. When loading a template first check if it's cached already
  3. We could use built-in pickle or diskcache (is Apache 2.0 compatible with MIT?)

Finally I don't really like idea of adding new templates format. I understand JSON is faster to load but with cache implemented (which we need anyway as we want to keep YAML support) it shouldn't matter. I'd rather stick to one format to avoid any compatibility issue later, to avoid increasing maintenance cost and to avoid discussions on further formats. We already had a request for storing templates as database tables. Someone may prefer XML over YAML/JSON and request to support it too.
So I'd suggest to stick to support for YAML only.

@bosd
Copy link
Collaborator Author

bosd commented Feb 19, 2023

Let's first respond to the json support.
Will give you the rationale behind this change.

TL;DR

Yaml and json are basically the same thing. If the Pyyaml library gets updated to support the "new" (from 2009) yaml 1.2 standard. Invoice2data is getting support for Json documents either way. (whether we are concious about it or not) But then it would be realized by the external lib. (note the performance issue).
We could wait for that change to happen. But it does'nt make a lot of sense as we have a better performing native solution at our disposal. This change is only fasttracking the support of json documents and implementing a better performance wise alternative.

Rationale:
Because of the design choice to use yaml templates an external library was needed as it was not natively supported by python. For invoice2data that library became the well known pyyaml library.
(which is also the main problem for the slow performance)
Development of the pyyaml library slowed down. (The focus seemed to be more on the libyaml c variant.)
Or whatever background it is. the Pyyaml library is currently still only supporting theolder YAML 1.1 specification.
However yaml 1.2 specification has been out since 2009!

Starting from YAML 1.2, the format officially became a strict superset of JSON, meaning that every valid JSON document also happens to be a YAML document.

source: https://realpython.com/python-yaml/

Basically, if Pyyaml started to support the yaml 1.2 standard. It could parse json documents as well.
Meaning invoice2data would support json templates.
If/when that is going to happen with pyyaml is a big question markt. There have been several attempts to update the library with this support. Recently there is been a new attempt / pr.

(if we really needed/wanted we could exchange the pyyaml library for some other).
But better prevent the hassle.

Conclusion
Json support will happen either way, if we are conscious about it or not. Better to realize it by using core python modules instead of depending on (slower) 3rd party modules.

@bosd
Copy link
Collaborator Author

bosd commented Feb 19, 2023

My another idea would be to add cache support. It should be quite simple:

I think this is a different topic. Can be quite interesting, I personally don't have a opinion on this now.

We already had a request for storing templates as database tables.

I have seen an implementation of this. The database handling was handled by a third party application.
It stored the templates in their own database. Depending on the type of document to be parsed it called invoice2data with the corresponding dict with a subset of templates.
e.g.

  1. Input pdf = type invoice; Fetch all the templates of the type purchase_invoice
    call extract_data(file_name, templates=purchase_invoice)

  2. Input pdf = a purchase order; Fetch all the templates of the type po
    call extract_data(file_name, templates=po)

Very nice solution. IMO this is something for the 3rd party application. Not for the "core" invoice2data parser.

@bosd
Copy link
Collaborator Author

bosd commented Feb 19, 2023

You get rid of ordered_load (without any real explanation),

Orderedload is a leftover for python 2.7 support. It is no longer needed. Since python 3, there are cleaner/better options.

you start using SafeLoader

Rationale behind this is that the Loader has some security concerns associated with it.
(Have to google again what it was exactly)

The best option (performance wise) would be to use base loader.
As we don't have a very advanced or difficult subset in our yaml files.
Using the baseloader would increase performance by yet another 20%.
Sadly, we use booleans in the yaml files.
remove_accents: True.
Switching to the baseloader would break that functionality. So decided, not to do that (yet??)

If we really wanted to we could use an alternative implementation. if the key remove_accents is present we could start that function. It does'nt make sense to use remove_accents: False in a template. (but it can happen and it supported)

@bosd
Copy link
Collaborator Author

bosd commented Feb 20, 2023

some more rationale behind the removal of ordered load:

In python 2 the dictionaries were not ordered.
In Python 3.6, dictionaries were redesigned to improve their performance (their memory usage was decreased by around 20–25%). This change had an interesting side-effect — dictionaries became ordered

OrderedDict is over 80% slower than the standard Python dictionary.

source: https://python.plainenglish.io/should-we-still-use-ordereddict-in-python-f223c85a01d5

Propably there are some more places, where we can move from ordereddict.

@bosd bosd force-pushed the refactor-loader branch 10 times, most recently from 9597a9e to b7bcaa2 Compare March 6, 2023 20:21
@bosd
Copy link
Collaborator Author

bosd commented Mar 6, 2023

@rmilecki I've split the commits so it is easier to review. Hope we can merge this soon.

@bosd
Copy link
Collaborator Author

bosd commented Mar 12, 2023

@rmilecki Can you have a look at this one??
It is a very important pr. This makes using the lib an absolute pleasure!

@bosd
Copy link
Collaborator Author

bosd commented Mar 19, 2023

This is an important pr. As it increases speed and usbility a lot.
I'm keen for getting it into production. (Which pulls invoice2data form pypi)

We can depate further to drop the commit 'native json' support and wait until it's merged in pyyaml.
(but I expect that's slower)

Can either @rmilecki or @m3nu approve so we can merge this.

Ordereddict no longer needed in python3
No coverage
@bosd
Copy link
Collaborator Author

bosd commented Mar 30, 2023

This one is open for some time. 2 weeks passed since my last question.
So following better ask for Forgiveness then wait for permission model.
So mergiing this, We can always revert.

@bosd bosd merged commit 0d2d16c into invoice-x:master Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants