-
Raw output
Code from test/test_pdf.py creates HTML output with coordinates but no flow.
-
Command line
Creates output for each page but NOT for the whole document,
e.g.
Note that the layout is determined by the coordinate-based text.
-
Processing of command
goes through
PDF commands
-
Flow example (imperfect, needs debugging but creates whole chapter)
THIS ONE NEARLY WORKS example (being gradually transferred to new
input arguments
Debug calls removed for clarity.
chunks to remove
These are chunks to remove. Not sure they work. They are mainly inter-page cruft.
convert and write (messy)
This does several things controlled by args and class variables. Needs cleaning?
This is a messy routine - see below - needs more switches.
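For the "chunks to remove" step, here is a minimal sketch of one way to strip inter-page cruft such as running headers. It is not the code from this project: the function name `remove_repeated_chunks`, the `min_fraction` parameter, and the "pages as lists of text lines" input are all assumptions made for illustration. The idea is simply that a line repeated verbatim on most pages is probably not body text.

```python
from collections import Counter


def remove_repeated_chunks(pages, min_fraction=0.5):
    """Drop lines that recur verbatim on more than min_fraction of pages.

    pages: list of pages, each page a list of text lines (hypothetical input shape).
    """
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page})

    # A line is treated as cruft if it appears on at least two pages
    # and on more than min_fraction of all pages.
    cruft = {
        text
        for text, n in counts.items()
        if text and n >= 2 and n > len(pages) * min_fraction
    }
    return [[line for line in page if line.strip() not in cruft] for page in pages]


pages = [
    ["Chapter 3", "First line.", "1"],
    ["Chapter 3", "Second line.", "2"],
    ["Chapter 3", "Third line.", "3"],
]
cleaned = remove_repeated_chunks(pages)
```

This only catches exact repeats, so page numbers (which differ on every page) survive and would need a separate rule.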
-
.pdf_to_raw_then_raw_to_tidy (messy)
This should be higher up; it is the major module that creates raw HTML (i.e. with coordinates and unjoined lines).
# program options
These should be switchable by flags.
IDs are added sequentially in the form
Sections are a major task. I think they are joined in the wrong order.
-
Some of the PDF parsers create characters with coordinates ("raw"). These display on separate lines in the HTML. They need to be joined into a single flowing paragraph ("flow").
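As a rough illustration of the raw-to-flow step, here is a minimal sketch that joins coordinate-positioned lines into paragraphs. The `RawLine` class, the `raw_to_flow` function, and the `para_gap` threshold are hypothetical names and values, not part of the project's code. It assumes each raw item is one line of text with a vertical coordinate that increases down the page.

```python
from dataclasses import dataclass


@dataclass
class RawLine:
    text: str
    y: float  # vertical position, assumed to increase down the page


def raw_to_flow(lines, para_gap=18.0):
    """Join raw lines into paragraphs; a vertical gap > para_gap starts a new one."""
    ordered = sorted(lines, key=lambda ln: ln.y)  # reading order, top to bottom
    paragraphs = []
    current = ""
    prev_y = None
    for ln in ordered:
        text = ln.text.strip()
        if not text:
            continue
        if prev_y is not None and ln.y - prev_y > para_gap:
            paragraphs.append(current)
            current = ""
        if not current:
            current = text
        elif current.endswith("-"):
            # Naive de-hyphenation: rejoin a word split across lines.
            current = current[:-1] + text
        else:
            current = current + " " + text
        prev_y = ln.y
    if current:
        paragraphs.append(current)
    return paragraphs


raw = [
    RawLine("Some of the PDF pars-", 100.0),
    RawLine("ers create raw text.", 112.0),
    RawLine("A new paragraph.", 140.0),
]
flow = raw_to_flow(raw)
```

The de-hyphenation here is deliberately naive and would also merge genuine hyphenated compounds that happen to break at a line end. Multi-column layouts would need the x coordinate as well.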