java - How to read HDFS sequence file in Spark
I am trying to read a file from HDFS (S3 in this case) into a Spark RDD. The file is in SequenceFileInputFormat, and I am not able to decode the contents of the file into a String. I have the following code:
package com.spark.example.examplespark;

import java.util.List;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class RawEventDump {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("atlas_raw_events").setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        JavaPairRDD<String, Byte> file = jsc.sequenceFile("s3n://key_id:secret_key@<file>", String.class, Byte.class);
        List<String> values = file.map(
            new Function<Tuple2<String, Byte>, String>() {
                public String call(Tuple2 row) {
                    return "value: " + row._2.toString() + "\n";
                }
            }).collect();
        System.out.println(values);
    }
}
But I am getting the following output:
value: 7b 22 65 76 65 6e ... , value: 7b 22 65 76 65 6e 74 22 3a ... , value: 7b 22 65 76 65 6... ...
How can I read the contents of the file in Spark?
Sequence files use the Hadoop writable types (Text, BytesWritable, LongWritable, etc.), so the RDD type should be JavaPairRDD<LongWritable, BytesWritable>.
To turn each value into a String, you should then call org.apache.hadoop.io.Text.decode(row._2.getBytes()).
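For reference, the hex pairs in the question's output are BytesWritable's default toString(), which prints the raw bytes in hexadecimal instead of decoding them (7b 22 65 76 65 6e 74 22 3a is {"event": in UTF-8). Below is a minimal sketch of the corrected job, assuming the keys really are LongWritable; the S3 path and credentials are placeholders carried over from the question:

import java.util.List;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import scala.Tuple2;

public class RawEventDump {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("atlas_raw_events").setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Load with the Hadoop writable types the sequence file was written with.
        JavaPairRDD<LongWritable, BytesWritable> file =
            jsc.sequenceFile("s3n://key_id:secret_key@<file>", LongWritable.class, BytesWritable.class);

        List<String> values = file.map(
            new Function<Tuple2<LongWritable, BytesWritable>, String>() {
                public String call(Tuple2<LongWritable, BytesWritable> row) throws Exception {
                    // BytesWritable.getBytes() returns a padded backing buffer,
                    // so decode only the valid portion. Text.decode throws a
                    // checked CharacterCodingException, which Function.call
                    // is declared to propagate.
                    return "value: " + Text.decode(row._2.getBytes(), 0, row._2.getLength()) + "\n";
                }
            }).collect();
        System.out.println(values);
    }
}

The three-argument Text.decode(bytes, start, length) variant is used here because BytesWritable's buffer can be longer than the stored value; the plain Text.decode(row._2.getBytes()) call from the answer works the same way when the buffer carries no padding.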