Apache Spark - Broadcast variable fails to take all data


When I create a broadcast variable from collectAsMap(), not all of the values are included in the broadcast variable. E.g.:

    val emp = sc.textFile("...text1.txt").map(line => (line.split("\t")(3), line.split("\t")(1))).distinct()
    val emp_new = sc.textFile("...text2.txt").map(line => (line.split("\t")(3), line.split("\t")(1))).distinct()
    emp_new.foreach(println)

    val emp_newbc = sc.broadcast(emp_new.collectAsMap())
    println(emp_newbc.value)

When I checked the values in emp_newbc, I saw that not all of the data from emp_new appears. What am I missing?

Thanks in advance.

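The problem is that emp_new is a collection of tuples, while emp_newbc is a broadcasted map. A map can hold only one value per key, so when you collect as a map, tuples with duplicate keys are dropped and you end up with less data. If you want every tuple, broadcast the collected array instead: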

    val emp_newbc = sc.broadcast(emp_new.collect())
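To see the effect in isolation, here is a minimal sketch that runs Spark in local mode. The sample pairs, the object name BroadcastDuplicateKeys, and the helper summarize are made up for illustration and are not part of the original question. It compares collectAsMap() with collect(), and also shows one possible alternative, assuming you still want key lookups: group the values per key before collecting.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDuplicateKeys {

  // Returns (size of collectAsMap result, size of collect result, grouped values for "d1").
  def summarize(sc: SparkContext): (Int, Int, List[String]) = {
    // Hypothetical sample data: the key "d1" appears twice with different values.
    val pairs = sc.parallelize(Seq(("d1", "alice"), ("d1", "bob"), ("d2", "carol")))

    // collectAsMap keeps only one value per key, so one of the "d1" pairs is lost.
    val asMap = pairs.collectAsMap()

    // collect keeps every tuple, duplicates included.
    val asArray = pairs.collect()

    // Grouping first keeps all values for a key and still allows lookups by key.
    val grouped = pairs.groupByKey().mapValues(_.toList).collectAsMap()
    val groupedBc = sc.broadcast(grouped)

    (asMap.size, asArray.length, groupedBc.value("d1").sorted)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-duplicate-keys").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (mapSize, arraySize, d1Values) = summarize(sc)
      println(mapSize)   // 2
      println(arraySize) // 3
      println(d1Values)  // List(alice, bob)
    } finally {
      sc.stop()
    }
  }
}
```

Which of the two "d1" values survives in the plain collectAsMap() result should not be relied on. The grouped variant avoids that question by keeping both.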

