I have a CSV file that I need to load into HDFS directories based on a certain field (year). I am planning to use Java. I have looked at using BufferedReader, but I am having trouble implementing it. Is that the right tool for this task, or is there a better way?
CodePudding user response:
Use Spark to read the CSV into a DataFrame, then call partitionBy("year") on the writer when saving to HDFS. Spark will create a sub-directory under the output path named year=<value> for each distinct value of that column.
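A minimal sketch using Spark's Java API, assuming the CSV has a header row containing a year column; the class name and the HDFS input/output paths are hypothetical placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class PartitionCsvByYear {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("PartitionCsvByYear")
                .getOrCreate();

        // Read the CSV; header=true uses the first row as column names,
        // inferSchema=true lets Spark guess the column types.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs:///data/input/records.csv"); // hypothetical input path

        // partitionBy("year") makes Spark write one sub-directory per
        // distinct value of the column, e.g. .../year=2020/, .../year=2021/
        df.write()
                .partitionBy("year")
                .mode(SaveMode.Overwrite)
                .option("header", "true")
                .csv("hdfs:///data/output/records"); // hypothetical output path

        spark.stop();
    }
}
```

Note that the partition column is not written into the data files themselves; it is encoded in the directory names, and Spark reconstructs it automatically when you read the partitioned path back.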