I have a folder labeled 'input' with multiple CSV files in it. They all have the same column names, but the data in each file is different.
I'm wondering how I can use Spark and Java to go to the 'input' folder, read all the CSV files in it, and merge them into one file.
The files in the folder may change, e.g. one day there might be 4 CSV files and another day 6, and so on.
Dataset<Row> df = spark.read()
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load("/Users/input/*.csv");
but I'm not getting any output; Spark just shuts down.
I don't want to list all the CSV files in the folder explicitly; I want the code to pick up whatever CSV files are present in that folder and merge them. Is this possible?
From there I can convert that one merged CSV file into a data frame.
Thanks in advance
CodePudding user response:
You may be using an older version of the data source: com.databricks.spark.csv was the external package for Spark 1.x, and since Spark 2.x CSV support is built in under the short name csv.
Dataset<Row> df = spark.read()
    .format("csv")                // built-in CSV data source (Spark 2.x+)
    .option("header", true)       // treat the first row of each file as column names
    .load("/Users/input/*.csv");  // the glob matches every CSV in the folder
df.show();
should work.
You can find a complete example here: https://github.com/jgperrin/net.jgp.books.spark.ch01 and one with multiple files here: https://github.com/jgperrin/net.jgp.books.spark.ch15/blob/master/src/main/java/net/jgp/books/spark/ch15/lab300_nyc_school_stats/NewYorkSchoolStatisticsApp.java
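For reference, here is a minimal self-contained sketch of the same idea; the SparkSession setup (local master, app name) and the paths are assumptions for illustration, not details from your environment:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvFolderApp {
  public static void main(String[] args) {
    // Local session for testing; point master at your cluster as needed
    SparkSession spark = SparkSession.builder()
        .appName("Read all CSVs in a folder")
        .master("local[*]")
        .getOrCreate();

    // The glob picks up however many CSV files happen to be in the folder
    Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", true)
        .load("/Users/input/*.csv");

    df.show();
    spark.stop();
  }
}

Passing the folder itself (.load("/Users/input")) also works; Spark reads every file in the directory.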
CodePudding user response:
1. Create a new DataFrame (headerDF) containing the header names.
2. Union it with the DataFrame (dataDF) containing the data.
3. Write the unioned DataFrame to disk with option("header", false).
4. Merge the partition files (part-0000*.csv) using Hadoop's FileUtil.
This way, no partition has a header, except that a single partition's content starts with the row of header names from headerDF. When all the partitions are merged together, there is a single header at the top of the file. A sketch of these steps follows below.
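A minimal sketch of those four steps (the output paths are made up for illustration, and FileUtil.copyMerge assumes Hadoop 2.x; it was removed in Hadoop 3):

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;

public class SingleHeaderMergeApp {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("Merge CSVs with a single header")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> dataDF = spark.read()
        .format("csv")
        .option("header", true) // without inferSchema, every column is a string
        .load("/Users/input/*.csv");

    // 1. One-row DataFrame holding the column names; reusing dataDF's
    //    all-string schema keeps it union-compatible
    Row headerRow = RowFactory.create((Object[]) dataDF.columns());
    Dataset<Row> headerDF = spark.createDataFrame(
        Collections.singletonList(headerRow), dataDF.schema());

    // 2 + 3. Union header and data, then write all partitions without headers
    headerDF.union(dataDF)
        .write()
        .option("header", false)
        .csv("/Users/output/parts");

    // 4. Concatenate the part files into one CSV; the header partition
    //    sorts first, so the header ends up at the top of the merged file
    Configuration conf = spark.sparkContext().hadoopConfiguration();
    FileSystem fs = FileSystem.get(conf);
    FileUtil.copyMerge(fs, new Path("/Users/output/parts"),
        fs, new Path("/Users/output/merged.csv"),
        false, conf, null);

    spark.stop();
  }
}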
Ref 1: https://intellipaat.com/community/16329/merge-spark-output-csv-files-with-a-single-header
Ref 2: Merge Spark output CSV files with a single header