python - PDFMiner - pdf2txt.py parsing text out of order


I'm new to both Python and PDFMiner, although not to programming in general. I'm using Python 2.7 on Windows 7, and in Cygwin.

I'm writing a script with BeautifulSoup to extract specific info from PDFs. For that, I used pdf2txt to create both a .txt and an .html file of the PDF to use in testing. The usual commands:

python pdf2txt.py -o output.txt input.pdf

python pdf2txt.py -o output.html input.pdf

They both have the same problem: lines of text appear out of order. Run pdf2txt on this PDF and you'll see what I mean: (Edit: found the original site.) It's the American Samoa one on this page: https://www.iamovers.org/resourcespublications/shipperguides.aspx?navitemnumber=580

For example, this is the correct layout of one section of page 1 of the PDF:

documents required

copy of passport (some ports require passports family members listed on 3299)
form cf-3299
supplemental declaration (required ports)
detailed inventory in english
copy of visa (if non-us citizen / permanent resident) / copy of permanent resident card
i-94 stamp / card
copy of bill of lading (obl) / air waybill (awb)
form ds-1504 (diplomats)
a-1 visa (diplomats)
importers security filing (isf)

And this is how it comes out in both the txt and html conversions using pdf2txt.py:

documents required

copy of passport (some ports require passports family members listed on 3299)
form cf-3299
supplemental declaration (required ports)
detailed inventory in english
copy of visa (if non-us citizen / permanent resident) / copy of permanent resident card

copy of bill of lading (obl) / air waybill (awb)
form ds-1504 (diplomats)
a-1 visa (diplomats)

importers security filing (isf)

i-94 stamp / card

For some reason, lines beginning with the letter "I" are taken out of place and put either on the next blank line or on the previous blank line. There are other characters this keeps happening with as well.

I've posted the issue here: https://github.com/euske/pdfminer/issues/121

It has to be an issue with PDFMiner. I don't yet know Python well enough to go making changes in packages. Does anyone know how to fix this?

I don't want to give my email address to a file hosting site, so I can't look at the example you provided.

But, generally speaking, even if a PDF file contains text in neatly ordered rows, PDF is still not a text file format. It's more like a script that tells the PDF reader which graphics and text elements to place on the page, and in what order.

You seem to want the plaintext output to match the layout of the PDF file, but if you think about it, that's not such a trivial requirement. You could assume (as pdf2txt apparently does) that the elements are sorted within the PDF file, top-to-bottom, left-to-right, and scan through the list of elements for each page, adding any extractable string or character you come across to the end of a string. But that's a naive assumption, because there's no requirement that the elements in a PDF page be ordered in any way (except back-to-front in case of overlaps).
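To make the difference concrete, here is a minimal sketch in plain Python. It does not use PDFMiner at all: the element list, its coordinates, and both function names are invented for illustration, and the strings are borrowed from the question's sample. It only shows how concatenating in stream order can differ from ordering by position.

```python
# Hypothetical page elements in the order they occur in the content stream.
# Each entry is (x, y, text). In PDF coordinates y grows upward, so a larger
# y means the element sits higher on the page. The coordinates are made up.
stream_elements = [
    (72, 700, "detailed inventory in english"),
    (72, 686, "copy of visa"),
    (72, 658, "copy of bill of lading (obl) / air waybill (awb)"),
    (72, 644, "importers security filing (isf)"),
    (72, 672, "i-94 stamp / card"),  # emitted last, but drawn between the visa and bill of lading lines
]


def naive_text(elements):
    """Concatenate the strings in stream order, ignoring their positions."""
    return "\n".join(text for _, _, text in elements)


def position_sorted_text(elements):
    """Order elements top-to-bottom, then left-to-right, before joining."""
    ordered = sorted(elements, key=lambda e: (-e[1], e[0]))
    return "\n".join(text for _, _, text in ordered)


print(naive_text(stream_elements))
print("---")
print(position_sorted_text(stream_elements))
```

In the naive output the "i-94" line lands at the end, much like in the question. Sorting by position puts it back in place, but only because this toy page has a single column.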

Personally, I've used the C# version of PDFTron's PDFNet a lot. It's able to produce a list of the words on a text page, with each word represented as a string and a bounding box. You can then create rules for chaining the words into lines and ordering the lines, although this gets more complicated the more types of layout you want to take into account (e.g. text arranged in two columns, table layouts, vertical text along the borders of the page, etc.).

There's a Python version, too, so maybe have a look at that. The bottom line is that what you're trying to do is not simple.
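The idea of chaining words into lines from their bounding boxes can be sketched in plain Python. This is not PDFNet's API: the `Word` record, the `group_into_lines` function, and the `y_tolerance` parameter are all hypothetical, and the sketch assumes a single-column page with horizontal text.

```python
from collections import namedtuple

# Hypothetical word record: the text plus a bounding box, with (x0, y0) the
# bottom-left corner and (x1, y1) the top-right corner, y growing upward.
Word = namedtuple("Word", "text x0 y0 x1 y1")


def group_into_lines(words, y_tolerance=3.0):
    """Chain words into lines and return the lines top-to-bottom.

    A word joins the current line when its vertical centre is within
    y_tolerance of the centre of the word that started that line.
    """
    lines = []
    # Visit words from the top of the page downward.
    for word in sorted(words, key=lambda w: -(w.y0 + w.y1) / 2.0):
        centre = (word.y0 + word.y1) / 2.0
        if lines and abs(lines[-1]["centre"] - centre) <= y_tolerance:
            lines[-1]["words"].append(word)
        else:
            lines.append({"centre": centre, "words": [word]})
    # Within each line, order the words left-to-right.
    return [
        " ".join(w.text for w in sorted(line["words"], key=lambda w: w.x0))
        for line in lines
    ]


# Made-up words, deliberately shuffled and with slightly uneven baselines.
words = [
    Word("card", 130, 670, 155, 682),
    Word("form", 72, 698, 100, 710),
    Word("i-94", 72, 671, 95, 683),
    Word("cf-3299", 104, 698, 150, 710),
    Word("stamp", 99, 670, 126, 682),
]

for line in group_into_lines(words):
    print(line)
```

A fixed tolerance like this breaks down on exactly the layouts mentioned above: two-column text would be merged across columns, and tables or vertical text would need rules of their own.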

