optimization - Python/lxml is eating too much memory


The program is quite simple: it recursively descends directories and extracts an element from each file. There are about 1k directories with roughly 200 files of about 0.5 MB each. I see it consuming 2.5 GB of memory after a while, which is unacceptable; the script is not running alone and cannot eat everything. I cannot understand why it doesn't release memory. An explicit del doesn't help. Are there any techniques I should consider?

    from lxml import etree
    import os

    # `basedir` and `parser` (e.g. an etree.HTMLParser instance) are defined earlier in the script
    res = set()
    for root, dirs, files in os.walk(basedir):
        for i in files:
            tree = etree.parse(os.path.join(root, i), parser)
            for i in tree.xpath("//a[@class='ctitle']/@href"):
                res.add(i)
            del tree

You're keeping references into the element tree: the values returned by xpath() are _ElementUnicodeResult objects, and each of them keeps a reference to its parent element. That prevents the whole tree from being garbage collected.
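A minimal sketch (the document and names here are purely illustrative) showing what those "smart strings" look like and why they pin the parsed tree in memory:

    from lxml import etree

    # xpath() on an attribute returns "smart strings" (_ElementUnicodeResult)
    # that still hold a reference back into the parsed tree via getparent().
    doc = etree.HTML('<div><a class="ctitle" href="/post/1">title</a></div>')
    href = doc.xpath("//a[@class='ctitle']/@href")[0]

    print(type(href))            # <class 'lxml.etree._ElementUnicodeResult'>
    print(href.getparent().tag)  # 'a' -- the string keeps its parent element (and tree) alive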

Try converting each result to a plain string and storing that instead:

    from lxml import etree
    import os

    titles = set()
    for root, dirs, files in os.walk(basedir):
        for filename in files:
            tree = etree.parse(os.path.join(root, filename), parser)
            for title in tree.xpath("//a[@class='ctitle']/@href"):
                titles.add(str(title))
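Another option, if you'd rather not deal with the parent-linked strings at all, is lxml's compiled XPath API, which accepts smart_strings=False and then returns plain str values. A sketch under the same assumptions as above (basedir and parser defined elsewhere in the script):

    from lxml import etree
    import os

    # Compiled XPath with smart_strings=False yields plain str objects,
    # so nothing keeps the parsed tree alive once the loop iteration ends.
    find_hrefs = etree.XPath("//a[@class='ctitle']/@href", smart_strings=False)

    titles = set()
    for root, dirs, files in os.walk(basedir):
        for filename in files:
            tree = etree.parse(os.path.join(root, filename), parser)
            titles.update(find_hrefs(tree))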
