I have multiple CSV files in a Hadoop folder. Each CSV file has a header, and the header is the same in every file.
I am writing these CSV files from a Spark Dataset in Java, like this:
df.write().csv(somePath)
I was also thinking of using coalesce(1), but it is not memory efficient in my case.
I know that this write also creates some extra files in the folder (such as _SUCCESS marker files), so I need to handle those as well.
I want to merge all these CSV files into one big CSV file, but I don't want the header repeated in the combined file. I just want a single header line at the top of the data.
I am using Python to merge these files. I know I can use the hadoop getmerge command, but it would merge the headers as well, since each CSV file contains one.
So I am not able to figure out how to merge all the CSV files without also merging the headers.
CodePudding user response:
coalesce(1) is exactly what you want. Speed and memory usage are the tradeoff you accept for wanting exactly one file.
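A minimal sketch of that single-file write, shown in PySpark since the merging side of the question is in Python (the Java Dataset chain is analogous; somePath is the question's placeholder):

# one partition -> one part file, with the header written exactly once;
# Spark still writes a directory containing part-0000*.csv plus a
# _SUCCESS marker, so the part file has to be picked out afterwards
(df.coalesce(1)
   .write
   .option("header", "true")
   .mode("overwrite")
   .csv("somePath"))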
CodePudding user response:
It seems this will do it for you:
# importing libraries
import pandas as pd
import glob
import os

# build a pattern that matches the CSV part files
joined_files = os.path.join("/hadoop", "*.csv")

# collect the list of matching file paths
joined_list = glob.glob(joined_files)

# read each file (pd.read_csv consumes each file's header)
# and concatenate them into a single DataFrame
df = pd.concat(map(pd.read_csv, joined_list), ignore_index=True)
Edit: I don't know much about Hadoop, but maybe the same logic applies.
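To actually get the single merged CSV from that DataFrame, one more call writes it back out with exactly one header row (merged.csv is a placeholder name; note that pandas reads local paths, so the part files would first need to be copied out of HDFS, e.g. with hdfs dfs -get):

# write the concatenated frame to one CSV file with a single header row
df.to_csv("merged.csv", index=False)

If the combined data is too large to hold in memory with pandas, a rough stream-based sketch of the same idea is to keep the first file's header and skip it in every later file (paths are placeholders again, assuming the part files are on the local filesystem):

import glob
import shutil

files = sorted(glob.glob("/hadoop/*.csv"))

with open("merged.csv", "w") as out:
    for i, path in enumerate(files):
        with open(path) as f:
            header = f.readline()       # consume this file's header line
            if i == 0:
                out.write(header)       # keep the header from the first file only
            shutil.copyfileobj(f, out)  # append the remaining data rows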