Scala - Spark: OutOfMemory despite MEMORY_AND_DISK_SER
I wrote a piece of code that reads multiple Parquet files and caches them for subsequent use. Simplified, the code looks like this:
val data = SparkStartup.sqlContext.read.parquet(...)
data.setName(...).persist(StorageLevel.MEMORY_AND_DISK_SER).collect()
map += data
The Parquet files total 11 GB. I configure the application with:
val sparkConfig = new SparkConf().setAppName(...).setMaster("local[128]")
sparkConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConfig.set("spark.kryoserializer.buffer.max", "512m")
sparkConfig.set("spark.kryoserializer.buffer", "256")
sparkConfig.set("spark.driver.maxResultSize", "0")
sparkConfig.set("spark.driver.memory", "9g")
I thought that with MEMORY_AND_DISK_SER, Spark would spill to disk if too much memory is used. However, I get java.lang.OutOfMemoryError: Java heap space errors at:
at java.util.Arrays.copyOf(Arrays.java:3230)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
at com.esotericsoftware.kryo.io.Output.writeString(Output.java:306)
at com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:105)
at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:81)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
Why does this happen? I start the application with -Xmx9g -Dspark.executor.memory=9g -Dspark.executor.cores=3. For the files that are read before the crash, I can see in the Spark UI that the Parquet data takes about 9x its on-disk size once read into memory.
It is because you are calling collect() in the driver application. This returns an array of all the data items, which needs to fit in the driver's memory. You should instead work with the data RDD: map, reduce, group, etc. the large set of data down to the desired result, and only then collect() that smaller amount of data.
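For example, a minimal sketch of that pattern (the Parquet path, the "key" column, and the count aggregation are illustrative assumptions, not taken from the question):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("example").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Keep the data distributed and persisted; the heavy work stays on the executors.
val data = sqlContext.read.parquet("/path/to/parquet")   // placeholder path
data.persist(StorageLevel.MEMORY_AND_DISK_SER)

// Aggregate the large dataset down to a small result first
// ("key" is a hypothetical column name).
val counts = data.groupBy("key").count()

// Only this reduced result has to fit in driver memory.
val result = counts.collect()

This way the full 11 GB is never materialized in the driver; only the aggregated rows are.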