How do I extract rows from a PDF file into a CSV file?
I want to take the list of colleges in the USA from this PDF file and put it into a CSV file. I will then import the CSV file into SQL Server (so that I can run queries easily).
I tried several online PDF-to-CSV converters and Java-based PDF-to-CSV tutorials. Nothing worked. I have spent 6-8 hours on this today and failed. The CSV files were messed up, and I had a lot of nulls in the database when I imported the CSV. I also tried searching for a DHS API that would give me this information, but found none.
Can someone please help me extract the colleges shown in the PDF file?
PS: You can also see the colleges using this URL. But there you have to scroll manually to extract the results, which would take long, and the data is not in the format given in the PDF file.
As claimed in a comment to the question, considering the straightforward page content stream style, the data should be extractable using a not too complicated custom text extractor.
In detail:
The page content stream style
The regular table entry content is drawn entry by entry, and each entry is drawn field by field in reading order. Thus, while going through the content stream, we do not have to try and re-arrange the content to establish the order. This makes the task easy.
So the main work is to ignore the non-entries, i.e. the header on the first page, the bars indicating that a new first letter starts, and the page numbers.
We do this by
- ignoring graphics and non-black text, which takes care of the header and the first-letter bars;
- not accepting entries that do not start with data in the SCHOOL NAME column, which takes care of the page numbers, as they live in the CAMPUS NAME column.
(Other approaches would have done as well, e.g. ignoring everything in the bottom page area to take care of the page numbers.)
Now we merely have to split the entries into their fields.
Again the document structure helps: it is a very uniform document, and the table columns have identical positions and dimensions on each page. So we merely have to dissect at fixed x values.
There is one stumbling block: in some entries, atomic text chunks contain content of different columns. E.g. the contents of the F and M columns are sometimes drawn as a single string "YN", with the optical distance introduced via character spacing.
So we have to process the text chunks character by character, not as a whole.
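To make the idea of dissecting at fixed x values concrete, here is a minimal, library-independent sketch. It only assumes that we already know the x position of every character. The column start values are those used in the sample implementation of this answer, while the class name, the method names and the x positions in `main` are made up for illustration. It is simpler than the actual extractor, which additionally tracks the current field to reject non-entries.

```java
final class ColumnLookup {
    // Column start x coordinates, same values as fieldStarts in the sample implementation.
    static final int[] FIELD_STARTS = {20, 254, 404, 415, 431, 508, 534};

    /** Returns the index of the column containing x, or -1 if x lies left of the first column. */
    static int columnOf(float x) {
        int column = -1;
        for (int i = 0; i < FIELD_STARTS.length; i++) {
            if (x >= FIELD_STARTS[i]) {
                column = i;
            }
        }
        return column;
    }

    /** Inserts a tab wherever two neighbouring characters of one chunk fall into different columns. */
    static String splitByColumn(String text, float[] xs) {
        StringBuilder result = new StringBuilder();
        int current = columnOf(xs[0]);
        for (int i = 0; i < text.length(); i++) {
            int column = columnOf(xs[i]);
            if (column != current) {
                result.append('\t');
                current = column;
            }
            result.append(text.charAt(i));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // "YN" drawn as one chunk; the two x positions are hypothetical.
        String split = splitByColumn("YN", new float[] {406f, 418f});
        System.out.println(split.replace("\t", "<TAB>"));
    }
}
```

With the hypothetical positions above, "Y" lands in the F column and "N" in the M column, so the single chunk is split into two fields.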
A sample implementation
I use Java and the PDF library iText (current version 5.5.7 development snapshot) here. This does not mean at all that it cannot be done using a different setup; it merely is the setup I'm most accustomed to.
As separator I use the tab character because other candidates occur as part of the text and I did not want to have to cope with escaping them.
This custom RenderListener class is introduced to cope with the content as explained above:

import java.io.IOException;

import com.itextpdf.text.BaseColor;
import com.itextpdf.text.pdf.GrayColor;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class CertifiedSchoolListExtractionStrategy implements RenderListener
{
    public CertifiedSchoolListExtractionStrategy(Appendable data, Appendable nonData)
    {
        this.data = data;
        this.nonData = nonData;
    }

    //
    // RenderListener implementation
    //
    @Override
    public void beginTextBlock() { }

    @Override
    public void endTextBlock() { }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) { }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        try
        {
            Vector startPoint = renderInfo.getBaseline().getStartPoint();
            BaseColor fillColor = renderInfo.getFillColor();
            if (fillColor instanceof GrayColor && ((GrayColor)fillColor).getGray() == 0)
            {
                // Black text: table data, processed character by character.
                if (debug)
                    data.append(String.format("%4d\t%3.3f %3.3f\t%s\n", chunk, startPoint.get(Vector.I1), startPoint.get(Vector.I2), renderInfo.getText()));
                for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
                {
                    renderCharacter(info);
                }
            }
            else
            {
                // Non-black text: header or first-letter bar, collected as non-data.
                if (debug)
                    nonData.append(String.format("%4d\t%3.3f %3.3f\t%s\n", chunk, startPoint.get(Vector.I1), startPoint.get(Vector.I2), renderInfo.getText()));
                if (currentField > -1)
                    finishEntry();
                entryBuilder.append(renderInfo.getText());
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        finally
        {
            chunk++;
        }
    }

    public void renderCharacter(TextRenderInfo renderInfo) throws IOException
    {
        Vector startPoint = renderInfo.getBaseline().getStartPoint();
        float x = startPoint.get(Vector.I1);

        if (currentField > -1)
        {
            if (isInCurrentField(x))
            {
                entryBuilder.append(renderInfo.getText());
                return;
            }
            if (isInNextField(x))
            {
                currentField++;
                entryBuilder.append('\t').append(renderInfo.getText());
                return;
            }
            finishEntry();
        }
        if (isInNextField(x))
        {
            finishEntry();
            currentField = 0;
        }
        entryBuilder.append(renderInfo.getText());
    }

    public void close() throws IOException
    {
        finishEntry();
    }

    boolean isInCurrentField(float x)
    {
        if (currentField == -1)
            return false;
        if (x < fieldStarts[currentField])
            return false;
        if (currentField == fieldStarts.length - 1)
            return true;
        return x <= fieldStarts[currentField + 1];
    }

    boolean isInNextField(float x)
    {
        if (currentField == fieldStarts.length - 1)
            return false;
        if (x < fieldStarts[currentField + 1])
            return false;
        if (currentField == fieldStarts.length - 2)
            return true;
        return x <= fieldStarts[currentField + 2];
    }

    void finishEntry() throws IOException
    {
        if (entryBuilder.length() > 0)
        {
            // Only entries that reached the last column count as data.
            if (currentField == fieldStarts.length - 1)
            {
                data.append(entryBuilder).append('\n');
            }
            else
            {
                nonData.append(entryBuilder).append('\n');
            }
            entryBuilder.setLength(0);
        }
        currentField = -1;
    }

    //
    // hidden members
    //
    final Appendable data, nonData;
    boolean debug = false;
    int chunk = 0;
    int currentField = -1;
    StringBuilder entryBuilder = new StringBuilder();
    // x coordinates at which the seven table columns start
    final int[] fieldStarts = {20, 254, 404, 415, 431, 508, 534};
}

(CertifiedSchoolListExtractionStrategy.java)
We can use it like this:

// RESULT_FOLDER is defined elsewhere in the test class.
@Test
public void testCertifiedSchoolList_9_16_2015() throws IOException
{
    try (
        Writer data = new OutputStreamWriter(new FileOutputStream(new File(RESULT_FOLDER, "data.txt")), "UTF-8");
        Writer nonData = new OutputStreamWriter(new FileOutputStream(new File(RESULT_FOLDER, "non-data.txt")), "UTF-8")
    )
    {
        CertifiedSchoolListExtractionStrategy strategy = new CertifiedSchoolListExtractionStrategy(data, nonData);
        PdfReader reader = new PdfReader("certified-school-list-9-16-2015.pdf");
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        for (int page = 1; page <= reader.getNumberOfPages(); page++)
            parser.processContent(page, strategy);
        // Flush the last pending entry.
        strategy.close();
    }
}

(ExtractCertifiedSchoolList.java)
Now data.txt contains all entries as tab-separated lines, and non-data.txt contains everything that was ignored.
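Since the question asks for a CSV file, the tab-separated lines still need to be converted. The following is only a sketch of one way to do that with plain Java, quoting fields in the usual CSV manner (double quotes around fields containing commas or quotes, embedded quotes doubled). The class and method names are hypothetical and not part of the sample implementation; the sample line in `main` is the first entry line shown in this answer.

```java
final class TsvToCsv {
    /** Quotes a field if it contains a comma, a double quote or a line break. */
    static String quote(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Converts one tab-separated line into one CSV line. */
    static String toCsvLine(String tsvLine) {
        // The limit -1 keeps trailing empty fields instead of dropping them.
        String[] fields = tsvLine.split("\t", -1);
        StringBuilder csv = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                csv.append(',');
            }
            csv.append(quote(fields[i]));
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        String line = "\"I Am\" School Inc.\t\"I Am\" School Inc.\tY\tN\tMount Shasta\tCA\t41789";
        System.out.println(toCsvLine(line));
    }
}
```

Whether the SQL Server import accepts this quoting style depends on the import tool used; importing the tab-separated file directly may be an alternative if the tool supports a tab delimiter.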
Behind the scenes
To understand what is happening here, one first has to know how page content in PDFs is organized and how (for the sample code given) iText operates on it.
Inside the PDF
PDF documents are structures built from a number of base object types: some primitive types (numbers, strings, ...) and some more complex ones (arrays or dictionaries of other objects, or streams).
A page in a PDF document is represented by such a dictionary object, containing entries that define some page properties (like the page dimensions) and other entries that reference objects defining what is drawn on the page: the content streams.
Content streams essentially contain a sequence of operations, which may
- select a color (for stroking or filling),
- define a path (move to some point, draw a line to some other point, a curve to yet another one, ...),
- stroke or fill such a path,
- draw a bitmap image somewhere,
- draw some text somewhere, or
- do numerous other things.
For the question at hand we are mostly interested in the operations involved in drawing text. In contrast to word processors, these operations do not take a long string of text and arrange it as a paragraph. Instead they work more primitively: move the text position here, draw this short string here, move the text position again, and draw another string there.
E.g. in the sample PDF the operations for drawing the table header and the first entry line are these:
/TT2 1 Tf
Select font TT2 at size 1.
9.72 0 0 9.72 20.16 687.36 Tm
Set the text matrix to move the text insertion coordinates to 20.16, 687.36 and scale everything following by a factor of 9.72.
0 g
Select the grayscale fill color black.
0 Tc 0 Tw
Select an additional character and word spacing of 0.
(SCHOOL)Tj
Draw "SCHOOL" here.
/TT1 1 Tf
Select font TT1.
3.4082 0 Td
Move the text insertion point by 3.4082 in x direction.
<0003>Tj
Draw a space character (the current font uses a different encoding which uses 16 bits per character, not 8, and the string is represented hexadecimally here).
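As a small illustration of the hexadecimal notation, the following sketch splits such a hex string into 16-bit character codes. It assumes a font that uses two bytes per character, as described above; the class name is hypothetical. Note that the resulting number is a code in the font's own encoding, so mapping it to an actual character (here: a space) still requires the font information from the PDF, which a PDF library takes care of.

```java
final class HexStringCodes {
    /** Splits the content of a PDF hex string (without the angle brackets) into 16-bit codes. */
    static int[] decode16(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a multiple of 4 hex digits: " + hex);
        }
        int[] codes = new int[hex.length() / 4];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Integer.parseInt(hex.substring(4 * i, 4 * i + 4), 16);
        }
        return codes;
    }

    public static void main(String[] args) {
        // <0003> from the content stream above: a single code with the value 3.
        System.out.println(decode16("0003")[0]);
    }
}
```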
/TT2 1 Tf .2261 0 Td [(NAME)-17887.4(CAMPUS)]TJ
Select the font, move the text insertion point, and draw the string "NAME", then leave a gap of 17887.4 (given in thousandths of a text unit), then draw "CAMPUS".
/TT1 1 Tf 24.1809 0 Td <0003>Tj /TT2 1 Tf .2261 0 Td [(NAME)-8986.6(F)-923.7(M)-459.3(CITY)-6349.9(ST)-1390.2(CAMPUS)]TJ /TT1 1 Tf 28.5147 0 Td <0003>Tj /TT2 1 Tf .2261 0 Td (ID)Tj
Draw the rest of the header line.
/TT4 1 Tf -56.782 -1.3086 Td
Move left by 56.782 and down by 1.3086, i.e. to the start of the first entry line.
("I)Tj /TT3 1 Tf .6528 0 Td <0003>Tj /TT4 1 Tf .2261 0 Td (Am")Tj /TT3 1 Tf 1.7783 0 Td <0003>Tj /TT4 1 Tf .2261 0 Td (School)Tj /TT3 1 Tf 2.6919 0 Td <0003>Tj /TT4 1 Tf .2261 0 Td [(Inc.)-16894.2("I)]TJ /TT3 1 Tf 18.9997 0 Td <0003>Tj /TT4 1 Tf .2261 0 Td (Am")Tj /TT3 1 Tf 1.7783 0 Td <0003>Tj /TT4 1 Tf .2261 0 Td (School)Tj /TT3 1 Tf 2.6919 0 Td <0003>Tj /TT4 1 Tf .2261 0 Td [(Inc.)-8239.9(Y)-1018.9(N)-576.7(Mount)]TJ /TT3 1 Tf 15.189 0 Td <0003>Tj /TT4 1 Tf .2261 0 Td [(Shasta)-2423.3(CA)-2443.7(41789)]TJ
And draw the first entry line.
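The arithmetic behind these Td moves can be checked with a few lines of code. In PDF, Td moves relative to the start of the current line (not relative to the end of the text drawn last), and the offsets are scaled by the text matrix. The sketch below, with hypothetical class and method names, applies the Td operations of the header line to the matrix set by Tm and then the final move to the first entry line.

```java
import java.util.Locale;

final class TdWalk {
    /** Applies "tx ty Td" to a text line matrix {a, b, c, d, e, f} and returns the new matrix. */
    static double[] td(double[] m, double tx, double ty) {
        return new double[] {
            m[0], m[1], m[2], m[3],
            tx * m[0] + ty * m[2] + m[4],
            tx * m[1] + ty * m[3] + m[5]
        };
    }

    /** Replays the moves of the header line and the move to the first entry line. */
    static double[] startOfFirstEntryLine() {
        // 9.72 0 0 9.72 20.16 687.36 Tm
        double[] m = {9.72, 0, 0, 9.72, 20.16, 687.36};
        // The x offsets of all "tx 0 Td" operations in the header line, in order.
        double[] headerMoves = {3.4082, .2261, 24.1809, .2261, 28.5147, .2261};
        for (double tx : headerMoves) {
            m = td(m, tx, 0);
        }
        // -56.782 -1.3086 Td
        return td(m, -56.782, -1.3086);
    }

    public static void main(String[] args) {
        double[] m = startOfFirstEntryLine();
        System.out.println(String.format(Locale.ROOT, "x=%.2f y=%.2f", m[4], m[5]));
    }
}
```

The header offsets add up to 56.7821, so moving left by 56.782 indeed returns to roughly x = 20.16, the left edge of the table, while y drops by 1.3086 × 9.72 to about 674.64.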
As you see, and as I had mentioned above, the table content is drawn in reading order. Even multi-line column entries come in the needed order, e.g. the campus name "A F International of Westlake Village":
[(Inc.)-7228.7(A)]TJ /TT3 1 Tf 9.26 0 Td <0003>Tj /TT4 1 Tf .2261 0 Td (F)Tj /TT3 1 Tf .4595 0 Td <0003>Tj /TT4 1 Tf .2261 0 Td (International)Tj /TT3 1 Tf 5.2886 0 Td <0003>Tj /TT4 1 Tf .2261 0 Td (of)Tj /TT3 1 Tf .8325 0 Td <0003>Tj /TT4 1 Tf .2261 0 Td (Westlake)Tj /TT3 1 Tf 3.7739 0 Td <0003>Tj /TT4 1 Tf -11.8374 -1.3086 Td
Move down to the second line of the column.
(Village)Tj 15.4938 1.3086 Td
Move up again to the main line of the entry.
[(Y)-1018.9(N)-576.7(Westlake)]TJ
So we can digest the text as it comes; there is no need for sorting (the content could have been ordered in a completely different way).
But we also see that there are no obvious column start and end points. To associate the text with a column, therefore, we have to calculate the position of each character and compare it with externally given column start positions.
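Part of that position calculation concerns the numbers inside TJ arrays. According to the PDF specification, such a number is given in thousandths of a text space unit and is subtracted from the current position, so a negative number moves the following text to the right. The sketch below (hypothetical names) converts such a number into page units for the header gap between "NAME" and "CAMPUS", assuming font size 1, no horizontal scaling and the text matrix scale of 9.72 from the sample. The widths of the drawn glyphs also contribute to the positions; they come from the font metrics and are left out here.

```java
import java.util.Locale;

final class TjGap {
    /**
     * Horizontal displacement in page units caused by one numeric element of a TJ array.
     * horizontalScaling is 1.0 for the default of 100 percent.
     */
    static double gapInPageUnits(double adjustment, double fontSize,
                                 double horizontalScaling, double matrixScaleX) {
        double textSpaceGap = -adjustment / 1000.0 * fontSize * horizontalScaling;
        return textSpaceGap * matrixScaleX;
    }

    public static void main(String[] args) {
        // [(NAME)-17887.4(CAMPUS)]TJ with font size 1 and a text matrix scale of 9.72
        double gap = gapInPageUnits(-17887.4, 1, 1, 9.72);
        System.out.println(String.format(Locale.ROOT, "%.2f", gap));
    }
}
```

A library like iText does this bookkeeping internally, including the glyph widths, which is why the sample implementation can simply ask for the start point of each character.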
Parsing support by PDF libraries
PDF libraries generally provide some mechanism for parsing such content streams.
There are two basic architectures for this. A library may parse a content stream
- as a whole and provide it as a big array of positioned text chunks, or
- piecewise and forward individual positioned text chunks using a listener pattern.
The former variant at first seems easier to handle but may have big resource requirements (I have come across multi-MB content streams), while the second one seems a bit more difficult to handle but has smaller memory requirements.
The library I used (iText) follows the latter approach, but the problem could also have been solved using a library following the former one.
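The difference between the two architectures can be sketched in plain Java. All types below are hypothetical stand-ins, not the API of iText or any other library, and the "parser" merely emits two hard-coded chunks with made-up positions. The point is only the shape of the two interfaces: a listener receives chunks one at a time, while the whole-page variant has to hold all of them in memory before the caller sees any.

```java
import java.util.ArrayList;
import java.util.List;

final class ParsingStyles {
    /** A positioned text chunk (hypothetical type for illustration). */
    static final class Chunk {
        final float x;
        final float y;
        final String text;

        Chunk(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Listener style: the parser pushes each chunk as soon as it has been parsed. */
    interface ChunkListener {
        void onChunk(Chunk chunk);
    }

    /** Stand-in for a real content stream parser; chunks and positions are made up. */
    static void parse(ChunkListener listener) {
        listener.onChunk(new Chunk(20.16f, 674.64f, "\"I"));
        listener.onChunk(new Chunk(28.70f, 674.64f, "Am\""));
    }

    /** Whole-page style, here built on top of the listener style by collecting everything. */
    static List<Chunk> parseAll() {
        List<Chunk> chunks = new ArrayList<>();
        parse(chunks::add);
        return chunks;
    }

    public static void main(String[] args) {
        // Listener style: process and forget.
        parse(chunk -> System.out.println(chunk.x + "\t" + chunk.text));
        // Whole-page style: everything is available at once.
        System.out.println(parseAll().size() + " chunks");
    }
}
```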
RenderListener is the listener interface to implement here; its renderText method receives the individual text chunks with their positions.
In the implementation above (CertifiedSchoolListExtractionStrategy), the renderText method first checks the fill color associated with the chunk and forwards only black text for further processing in renderCharacter. That method (and some helpers) in turn checks which field the text is in (by hard-coded position boundaries) and accordingly exports tab-separated values. This logic could have been implemented using other libraries, too.