I have a folder that contains many text files. I need to read these files into one RDD and keep the file name together with each word in it.
Example:
doc1.txt :
" hello my name sam "
doc2.txt :
"hello world"
I need to pass the folder path and get results like:
(hello, doc1), (my, doc1), (world, doc2), ... etc.
I tried this:
val rddWhole = spark.sparkContext.wholeTextFiles("C:/tmp/files/*")
rddWhole.foreach(f => {
  // f._1 is the file path, f._2 is the whole file content
  println(f._1 + "=>" + f._2)
})
but it treats the whole text of each file as one string. Does anyone have an idea how I can solve this?
CodePudding user response:
If I understand correctly, you want to extract every word in each file and pair it with the name of the file that contains it. As you mentioned, Spark gives you the whole content of a file as a single string. For example, if this is the file content:
hello
my name is
John Doe
The value you get would be:
val fileString = "hello\nmy name is\nJohn Doe"
Right? So you need to split the string value by any amount of spaces or new line characters, like so:
val wordsSeparated = fileString.split("\\s+|\\n+") // \\s+ matches one or more whitespace characters, \\n+ one or more newlines (backslashes are doubled because this is a regex inside a string literal)
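As a small aside on the split itself: in Java/Scala regexes `\s` already matches newlines, so `\\s+` alone should be enough, and text that starts with whitespace (like " hello my name sam " in the question) produces an empty first token that you probably want to drop. A minimal sketch in plain Scala, where `SplitDemo` and `words` are made-up names for illustration:

```scala
object SplitDemo {
  // Split on runs of whitespace (spaces, tabs, newlines) and drop empty tokens,
  // which appear when the text starts with whitespace.
  def words(text: String): Array[String] =
    text.split("\\s+").filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val fileString = "hello\nmy name is\nJohn Doe"
    println(words(fileString).mkString(", "))            // hello, my, name, is, John, Doe
    println(words(" hello my name sam ").mkString(", ")) // hello, my, name, sam
  }
}
```

Without the `filter(_.nonEmpty)`, the second call would return an extra "" as its first element.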
So at the end, you'll need something like this:
rddWhole.foreach { f =>
  // split the file content into words and print each one next to its file path
  f._2.split("\\s+|\\n+").foreach(word => println(f._1 + " => " + word))
}
This would be the result:
file:/tmp/spark-test/two.txt => and
file:/tmp/spark-test/two.txt => this
file:/tmp/spark-test/two.txt => would
file:/tmp/spark-test/one.txt => so
file:/tmp/spark-test/one.txt => hello
file:/tmp/spark-test/one.txt => my
file:/tmp/spark-test/one.txt => name
file:/tmp/spark-test/one.txt => is
file:/tmp/spark-test/one.txt => John
file:/tmp/spark-test/one.txt => Doe
file:/tmp/spark-test/two.txt => be
file:/tmp/spark-test/two.txt => the
file:/tmp/spark-test/two.txt => second
file:/tmp/spark-test/two.txt => text
file:/tmp/spark-test/two.txt => file
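If you want the (word, file) pairs as an RDD rather than printed output, as in the question, one option is `flatMap` instead of `foreach`. The following is only a sketch under a few assumptions: it runs with a local master, it takes the last path segment as the file name and strips a ".txt" suffix to get names like doc1, and `WordFilePairs` and `toPairs` are hypothetical names. The folder path is the one from the question.

```scala
import org.apache.spark.sql.SparkSession

object WordFilePairs {
  // Turn one (path, content) record into (word, fileName) pairs.
  // Assumption: the file name is the last path segment without ".txt".
  def toPairs(path: String, content: String): Seq[(String, String)] = {
    val fileName = path.split("/").last.stripSuffix(".txt")
    content.split("\\s+").filter(_.nonEmpty).map(word => (word, fileName)).toSeq
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for trying this out on one machine.
    val spark = SparkSession.builder()
      .appName("word-file-pairs")
      .master("local[*]")
      .getOrCreate()

    // Each record is (file path, whole file content).
    val rddWhole = spark.sparkContext.wholeTextFiles("C:/tmp/files/*")

    // flatMap flattens the per-file sequences into one RDD of pairs.
    val pairs = rddWhole.flatMap { case (path, content) => toPairs(path, content) }

    // collect() brings the pairs to the driver; fine for small folders only.
    pairs.collect().foreach(println)

    spark.stop()
  }
}
```

Keeping the splitting logic in a plain function like `toPairs` also makes it easy to check without starting Spark.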