I have multiple CSV files in a Hadoop folder. Each CSV file has a header, and the header is the same in every file.
I am writing these CSV files from a Spark Dataset in Java, like this:
df.write().csv(somePath)
I was also thinking of using coalesce(1), but it is not memory efficient in my case.
I know that this write also creates some extra files in the folder (such as _SUCCESS marker files), so I need to handle those as well.
I want to merge all these CSV files into one big CSV file, but I don't want the header repeated in the combined file. I just want a single header line at the top of the data.
I am using Python to merge these files. I know I can use the hadoop getmerge command, but it would merge the headers as well, since each CSV file contains one.
So I am not able to figure out how to merge all the CSV files without also merging the headers.
CodePudding user response:
coalesce(1) is exactly what you want. Speed and memory usage are the tradeoff you accept for wanting exactly one file.
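A minimal sketch of that single-file write, shown in PySpark since the merging side of the question is in Python (the Java Dataset chain is analogous; somePath is the question's placeholder):

# one partition -> one part file, with the header written exactly once;
# Spark still writes a directory containing part-0000*.csv plus a
# _SUCCESS marker, so the part file has to be picked out afterwards
(df.coalesce(1)
   .write
   .option("header", "true")
   .mode("overwrite")
   .csv("somePath"))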
CodePudding user response:
It seems this will do it for you:
# importing libraries
import pandas as pd
import glob
import os

# build a pattern that matches the CSV part files
joined_files = os.path.join("/hadoop", "*.csv")

# collect the list of matching file paths
joined_list = glob.glob(joined_files)

# read each file (pd.read_csv consumes each file's header)
# and concatenate them into a single DataFrame
df = pd.concat(map(pd.read_csv, joined_list), ignore_index=True)
Edit: I don't know much about Hadoop, but maybe the same logic applies.
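To actually get the single merged CSV from that DataFrame, one more call writes it back out with exactly one header row (merged.csv is a placeholder name; note that pandas reads local paths, so the part files would first need to be copied out of HDFS, e.g. with hdfs dfs -get):

# write the concatenated frame to one CSV file with a single header row
df.to_csv("merged.csv", index=False)

If the combined data is too large to hold in memory with pandas, a rough stream-based sketch of the same idea is to keep the first file's header and skip it in every later file (paths are placeholders again, assuming the part files are on the local filesystem):

import glob
import shutil

files = sorted(glob.glob("/hadoop/*.csv"))

with open("merged.csv", "w") as out:
    for i, path in enumerate(files):
        with open(path) as f:
            header = f.readline()       # consume this file's header line
            if i == 0:
                out.write(header)       # keep the header from the first file only
            shutil.copyfileobj(f, out)  # append the remaining data rows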