Apache Spark - Broadcast variable fails to take all data
When I broadcast the result of collectAsMap(), not all of the values are included in the broadcast variable. For example:
val emp = sc.textFile("...text1.txt").map(line => (line.split("\t")(3), line.split("\t")(1))).distinct()
val emp_new = sc.textFile("...text2.txt").map(line => (line.split("\t")(3), line.split("\t")(1))).distinct()
emp_new.foreach(println)
val emp_newbc = sc.broadcast(emp_new.collectAsMap())
println(emp_newbc.value)
When I checked the values in emp_newbc, I saw that not all of the data from emp_new appears. What am I missing?
Thanks in advance.
The problem is that emp_new is a collection of tuples, while emp_newbc is a broadcast map. A map holds only one value per key, so when you collect as a map, tuples with duplicate keys are dropped and you end up with less data. If you want a list of all the tuples, use:
val emp_newbc = sc.broadcast(emp_new.collect())
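If you still need lookups by key, one option is to group the values per key before collecting, so that nothing is dropped. The following is only a minimal sketch of that idea and not code from the question: the sample data, the object name BroadcastDuplicateKeys, and the local master setting are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDuplicateKeys {
  // Returns (number of tuples, number of map entries, grouped map) for the sample data.
  def run(sc: SparkContext): (Int, Int, Map[String, List[String]]) = {
    // Hypothetical sample data: the key "d1" appears twice.
    val empNew = sc.parallelize(Seq(("d1", "alice"), ("d1", "bob"), ("d2", "carol")))

    // collect() keeps every tuple.
    val tuples = empNew.collect()

    // collectAsMap() keeps a single value per key, so one "d1" tuple is lost.
    val asMap = empNew.collectAsMap()

    // Grouping first keeps all values per key and still allows lookups by key.
    val grouped = empNew.groupByKey().mapValues(_.toList.sorted).collectAsMap().toMap

    val groupedBc = sc.broadcast(grouped)
    (tuples.length, asMap.size, groupedBc.value)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("broadcast-duplicate-keys").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (tupleCount, mapSize, grouped) = run(sc)
      println(s"tuples: $tupleCount, map entries: $mapSize")
      println(grouped)
    } finally {
      sc.stop()
    }
  }
}
```

With this sample data there are 3 tuples but only 2 map entries after collectAsMap(), which is the same kind of loss seen in the question. The grouped map keeps both values under "d1".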