python - PDFMiner - pdf2txt.py parsing text out of order
I'm new to both Python and PDFMiner, although not to programming in general. I'm using Python 2.7 on Windows 7, and in Cygwin.
I'm writing a script with BeautifulSoup to extract specific info from PDFs. For that, I used pdf2txt to create both a .txt and an .html file of a PDF to use in testing. The usual commands:
python pdf2txt.py -o output.txt input.pdf
python pdf2txt.py -o output.html input.pdf
They both have the same problem: lines of text appear out of order. Run pdf2txt on this PDF and you'll see what I mean: (Edit: found the original site.) It's the American Samoa one on this page: https://www.iamovers.org/resourcespublications/shipperguides.aspx?navitemnumber=580
For example, this is the correct layout of one section of page 1 of the PDF:
documents required
copy of passport (some ports require passports family members listed on 3299)
form cf-3299
supplemental declaration (required ports)
detailed inventory in english
copy of visa (if non-us citizen / permanent resident) / copy of permanent resident card
i-94 stamp / card
copy of bill of lading (obl) / air waybill (awb)
form ds-1504 (diplomats)
a-1 visa (diplomats)
importers security filing (isf)
And this is how it comes out in both the txt and html conversions using pdf2txt.py:
documents required
copy of passport (some ports require passports family members listed on 3299)
form cf-3299
supplemental declaration (required ports)
detailed inventory in english
copy of visa (if non-us citizen / permanent resident) / copy of permanent resident cardcopy of bill of lading (obl) / air waybill (awb)
form ds-1504 (diplomats)
a-1 visa (diplomats)importers security filing (isf)
i-94 stamp / card
For some reason, lines beginning with the letter "I" are taken out of place and placed either on the next blank line or on the previous blank line. There are other characters this keeps happening with as well.
I've posted the issue here: https://github.com/euske/pdfminer/issues/121
It has to be an issue with PDFMiner. I don't yet know Python well enough to go making changes in packages. Does anyone know how to fix this?
I don't want to give my email address to a file hosting site, so I can't look at the example you provided.
But, generally speaking, even if a PDF file contains text in neatly ordered rows, PDF is still not a text file format. It's more like a script that tells the PDF reader which graphics and text elements to place on the page, and in what order.
You seem to want the plaintext output to match the layout of the PDF file, but if you think about it, that's not such a trivial requirement. You could assume (as pdf2txt apparently does) that the elements are sorted within the PDF file, top-to-bottom and left-to-right, and scan through the list of elements on each page, adding every extractable string or character you come across to the end of a string. But that's a naive assumption, because there's no requirement that the elements in a PDF page be ordered in that way (except back-to-front in case of overlaps).
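To illustrate why stream order is unreliable, here is a minimal sketch. The element list, its coordinates, and both function names are made up for the example; nothing here is read from the actual PDF or taken from PDFMiner. It only shows that concatenating strings in the order they are stored can differ from the order they appear on the page.

```python
# Hypothetical text elements in the order they might be stored in a page's
# content stream, as (text, x, y). In PDF, y grows upwards from the bottom.
stream_elements = [
    ("detailed inventory in english", 72, 640),
    ("form ds-1504 (diplomats)", 72, 580),
    ("i-94 stamp / card", 72, 620),
    ("importers security filing (isf)", 72, 560),
]


def naive_extract(elements):
    """Concatenate strings in stream order, ignoring their positions."""
    return "\n".join(text for text, _x, _y in elements)


def positional_extract(elements):
    """Order elements by position: top of the page first, then left to right."""
    ordered = sorted(elements, key=lambda e: (-e[2], e[1]))
    return "\n".join(text for text, _x, _y in ordered)
```

With this data, `naive_extract` puts the "i-94" line after the "ds-1504" line, while `positional_extract` restores the visual top-to-bottom order. Real pages are harder, because elements on the same visual line rarely share exactly the same y value.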
Personally, I've used the C# version of PDFTron's PDFNet a lot. It's able to produce a list of the words on a text page, with each word represented by a string and a bounding box. You can then create rules for chaining words into lines and for ordering the lines, although this gets more complicated the more types of layout you want to take into account (e.g. text arranged in two columns, table layouts, vertical text along the borders of the page, etc.).
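As a rough sketch of such rules, the function below chains words into lines and orders them. It does not use PDFNet or PDFMiner: the input is assumed to be a plain list of `(text, x0, y0, x1, y1)` tuples that some library has already produced, and `words_to_lines` and `y_tolerance` are names invented for this example. It only handles a simple single-column page.

```python
def words_to_lines(words, y_tolerance=3.0):
    """Group (text, x0, y0, x1, y1) words into lines of text.

    Assumes PDF-style coordinates (y grows upwards) and a single column.
    A word joins the current line if its bottom edge (y0) is within
    y_tolerance of the bottom edge of the first word on that line.
    """
    lines = []  # each entry is [reference_y0, [words on that line]]

    # Walk the words from the top of the page to the bottom.
    for word in sorted(words, key=lambda w: -w[2]):
        if lines and abs(lines[-1][0] - word[2]) <= y_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append([word[2], [word]])

    # Within each line, order the words left to right by x0.
    return [
        " ".join(w[0] for w in sorted(line_words, key=lambda w: w[1]))
        for _y, line_words in lines
    ]


# Hypothetical words, deliberately listed out of visual order.
sample_words = [
    ("stamp", 100, 619.5, 130, 631),
    ("cf-3299", 105, 640.0, 150, 652),
    ("i-94", 72, 620.0, 95, 632),
    ("form", 72, 640.4, 100, 652),
    ("card", 135, 620.2, 160, 632),
]
```

Calling `words_to_lines(sample_words)` groups the five words into two lines. Two-column text, tables, or rotated text would each need extra rules on top of this, which is where the real effort goes.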
There's a Python version, too, so maybe have a look at that. The bottom line is that what you're trying to do is not simple.