How should I start solving this with a Scala RDD?

Student marks are stored in hdfs://Hmaster/training/dump/stdmarks1.txt

Input format: sno, name, m1, m2, m3, branch. Create an RDD and display the names of the students who belong to the branch CSE. Display the names using println. Format of output: xxxx yyyy

I have a sample text file:

1,RAMESH,70,52,60,CSE

2,SOMESH,80,69,88,ECE

3,VANITA,90,73,92,CSE

4,KIRAN,74,96,68,IT

The output should contain only the students' names:

RAMESH

VANITA

I have already uploaded the text file to HDFS at the given path, but I am not able to work out the next steps.

CodePudding user response：

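This is an example using the DataFrame API (it assumes a spark-shell session, where `spark` is already defined):

import org.apache.spark.sql.functions.col

// Path from the question. The file has no header row, so the columns are named explicitly.
val hdfsFilePath = "hdfs://Hmaster/training/dump/stdmarks1.txt"

spark
 .read
 .csv(hdfsFilePath)
 .toDF("sno", "name", "m1", "m2", "m3", "branch")
 .where(col("branch") === "CSE") // filter on the branch column, not on m3
 .select("name")
 .distinct()
 .show()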

I recommend reading the Spark documentation.
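Since the question asks specifically for an RDD and for println, while the snippet above uses the DataFrame API, here is a minimal sketch of an RDD-based version. The object name `CseStudents` and the helper `nameIfBranch` are made up for illustration; the path and the column order (sno, name, m1, m2, m3, branch) are taken from the question. It assumes the job is launched with spark-submit, so the master is supplied from outside.

```scala
import org.apache.spark.sql.SparkSession

object CseStudents {
  // Hypothetical helper: returns the student's name when the record belongs
  // to the given branch. Expects the layout sno,name,m1,m2,m3,branch;
  // blank or malformed lines yield None.
  def nameIfBranch(line: String, branch: String): Option[String] = {
    val fields = line.split(",").map(_.trim)
    if (fields.length == 6 && fields(5).equalsIgnoreCase(branch)) Some(fields(1))
    else None
  }

  def main(args: Array[String]): Unit = {
    // Assumption: the master URL comes from spark-submit.
    val spark = SparkSession.builder().appName("CseStudents").getOrCreate()
    val sc = spark.sparkContext

    // Path taken from the question; each element of the RDD is one line.
    val lines = sc.textFile("hdfs://Hmaster/training/dump/stdmarks1.txt")

    // Keep only the names of records whose branch is CSE.
    val cseNames = lines.flatMap(line => nameIfBranch(line, "CSE").toList)

    // collect() brings the names to the driver so println prints them there.
    cseNames.collect().foreach(println)

    spark.stop()
  }
}
```

In spark-shell, where `sc` already exists, the three lines from `val lines` to `foreach(println)` are enough once `nameIfBranch` has been pasted in. For the four sample records in the question, this should print RAMESH and VANITA, one name per line.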