optimization - Python/lxml is eating too much memory
The program is quite simple: it recursively descends directories and extracts an element from each file. There are about 1k directories of 200 files, each around 0.5 MB. After a while the script consumes 2.5 GB of memory, which is unacceptable; the script isn't running alone and can't eat everything. I can't understand why it doesn't release memory. An explicit `del` doesn't help. Are there techniques I should consider?
    from lxml import etree
    import os

    res = set()
    for root, dirs, files in os.walk(basedir):
        for i in files:
            tree = etree.parse(os.path.join(root, i), parser)
            for i in tree.xpath("//a[@class='ctitle']/@href"):
                res.add(i)
            del tree
You're keeping references into the element tree: each `_ElementUnicodeResult` returned by `xpath()` keeps a reference to its parent element, and that back-reference prevents the whole tree from being garbage collected.
Try converting each result to a plain string and storing that instead:
    from lxml import etree
    import os

    titles = set()
    for root, dirs, files in os.walk(basedir):
        for filename in files:
            tree = etree.parse(os.path.join(root, filename), parser)
            for title in tree.xpath("//a[@class='ctitle']/@href"):
                titles.add(str(title))
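Why the `str()` conversion helps can be demonstrated without lxml at all. The sketch below uses a hypothetical `UnicodeResult` class, a `str` subclass that keeps a back-reference to the tree it came from, mimicking what lxml's `_ElementUnicodeResult` does; the names `Tree` and `UnicodeResult` are illustrative, not lxml API:

```python
import weakref

class Tree:
    """Stand-in for a parsed document tree."""
    pass

class UnicodeResult(str):
    """Mimics lxml's _ElementUnicodeResult: a str subclass that
    pins the tree it was extracted from via a back-reference."""
    def __new__(cls, value, parent):
        obj = super().__new__(cls, value)
        obj.parent = parent
        return obj

tree = Tree()
ref = weakref.ref(tree)
result = UnicodeResult("/some/href", tree)

del tree
print(ref() is None)   # False: the result still pins the tree

plain = str(result)    # plain str copy, no back-reference
del result
print(ref() is None)   # True: the tree can now be collected
```

As an alternative to converting each result, a compiled `etree.XPath` evaluator accepts `smart_strings=False`, which makes it return plain strings in the first place.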