java - How to read HDFS sequence file in Spark


I am trying to read a file from HDFS (S3 in this case) into a Spark RDD. The file is in SequenceFileInputFormat, and I am not able to decode its contents into a String. I have the following code:

package com.spark.example.ExampleSpark;

import java.util.List;
import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class RawEventDump
{
    public static void main( String[] args )
    {
        SparkConf conf = new SparkConf().setAppName("atlas_raw_events").setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        JavaPairRDD<String, Byte> file = jsc.sequenceFile("s3n://key_id:secret_key@<file>", String.class, Byte.class);
        List<String> values = file.map(
            new Function<Tuple2<String, Byte>, String>() {
            public String call(Tuple2 row) {
                return "Value: " + row._2.toString() + "\n";
            }
        }).collect();
        System.out.println(values);
    }
}

But I am getting the following output:

Value: 7b 22 65 76 65 6e ... , Value: 7b 22 65 76 65 6e 74 22 3a ... , Value: 7b 22 65 76 65 6... ...

How can I read the contents of the file in Spark?

Sequence files store Hadoop Writable types such as Text, BytesWritable, LongWritable, etc., so the RDD type should be JavaPairRDD<LongWritable, BytesWritable>.

Then, to turn each value into a String, you should call org.apache.hadoop.io.Text.decode(row._2.getBytes()).
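For reference, here is a minimal sketch of the job with those two changes applied, assuming the values are UTF-8 text (which the byte dump suggests) and keeping the question's S3 path placeholder as-is:

import java.util.List;
import scala.Tuple2;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class RawEventDump
{
    public static void main(String[] args)
    {
        SparkConf conf = new SparkConf().setAppName("atlas_raw_events").setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Read the sequence file with Hadoop Writable types, not plain Java types
        JavaPairRDD<LongWritable, BytesWritable> file =
            jsc.sequenceFile("s3n://key_id:secret_key@<file>", LongWritable.class, BytesWritable.class);

        List<String> values = file.map(
            new Function<Tuple2<LongWritable, BytesWritable>, String>() {
                public String call(Tuple2<LongWritable, BytesWritable> row) throws Exception {
                    // BytesWritable's backing array may be padded beyond the actual
                    // payload, so decode only the first getLength() bytes
                    return "Value: " + Text.decode(row._2.getBytes(), 0, row._2.getLength()) + "\n";
                }
            }).collect();

        System.out.println(values);
    }
}

With this change the hex dump in the question decodes to readable text: the bytes 7b 22 65 76 65 6e 74 22 3a are the UTF-8 encoding of {"event":, so the values appear to be JSON strings.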

