scala - Spark: OutOfMemory despite MEMORY_AND_DISK_SER


I wrote a piece of code that reads multiple Parquet files and caches them for subsequent use. The code, simplified, looks like this:

    val data = SparkStartup.sqlContext.read.parquet(...)
    data.setName(...).persist(StorageLevel.MEMORY_AND_DISK_SER).collect()
    map += data

The Parquet files are 11 GB in total. I configure the application with:

    val sparkConfig = new SparkConf().setAppName(...).setMaster("local[128]")
    sparkConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    sparkConfig.set("spark.kryoserializer.buffer.max", "512m")
    sparkConfig.set("spark.kryoserializer.buffer", "256")
    sparkConfig.set("spark.driver.maxResultSize", "0")
    sparkConfig.set("spark.driver.memory", "9g")

I thought that by using MEMORY_AND_DISK_SER, Spark would spill to disk if too much memory is used. However, I get `java.lang.OutOfMemoryError: Java heap space` errors at:

    at java.util.Arrays.copyOf(Arrays.java:3230)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
    at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
    at com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
    at com.esotericsoftware.kryo.io.Output.writeString(Output.java:306)
    at com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:105)
    at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:81)
    at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)

Why does this happen? I start the application with `-Xmx9g -Dspark.executor.memory=9g -Dspark.executor.cores=3`. For the files that are read before the crash, I can see in the Spark UI that the Parquet files take up 9x their size when read into memory.

It is because you are calling collect() in your driver application. This returns an array of all the data items, which needs to fit into the driver's memory.
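If the collect() is only there to force the data into the cache, a cheap action such as count() materializes it without shipping every row back to the driver. A minimal sketch, reusing SparkStartup.sqlContext and the elided path from the question:

    import org.apache.spark.storage.StorageLevel

    val data = SparkStartup.sqlContext.read.parquet(...)
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)
    // count() is an action: it evaluates the DataFrame and fills the
    // cached blocks, but only a single Long travels back to the driver.
    data.count()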

More generally, you should work with the data as an RDD/DataFrame: map, reduce, group, etc. the large data set down to the desired result, and only then collect() that smaller amount of data.
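For example, a sketch of that pattern (the column name "category" is a hypothetical one for illustration; it does not appear in the question):

    import org.apache.spark.storage.StorageLevel

    val data = SparkStartup.sqlContext.read.parquet(...)
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // The grouping and counting run distributed on the executors;
    // collect() then only transfers one small Row per distinct
    // "category" value to the driver, not the full 11 GB data set.
    val counts = data.groupBy("category").count().collect()

This way the driver heap only ever has to hold the aggregated result, and the full data set can stay cached (spilling to disk as needed) on the executor side.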

