Here, df is the DataFrame that holds our output. I'm using DataFrameWriter to write the whole output to an HDFS directory, but the data gets split across several part files, as shown below:
$ hdfs dfs -ls /path to hdfs directory..
Found 4 items
-rw-r--r-- 3 xxxxxx xxxxxxx 0 2022-04-28 23:19 path to hdfs directory../_SUCCESS
-rw-r--r-- 3 xxxxxx xxxxxx 238 2022-04-28 23:19 path to hdfs directory../part-00000-4bc48c17-5c85-44be-bf34-3645d2b2e085-c000.csv
-rw-r--r-- 3 xxxxxxx xxxxxxx 6204498 2022-04-28 23:19 path to hdfs directory../part-00043-4bc48c17-5c85-44be-bf34-3645d2b2e085-c000.csv
-rw-r--r-- 3 xxxxxxx xxxxxxx 5875627 2022-04-28 23:19 path to hdfs directory../part-00191-4bc48c17-5c85-44be-bf34-3645d2b2e085-c000.csv
I want all the data in one single CSV file. Is there an option I can add to the code below?
df.write.mode("overwrite").csv('path to hdfs directory', header = True, sep = ',')
The df has about 55k rows.
CodePudding user response:
You can use coalesce(1) to collapse the DataFrame into a single partition before writing, so Spark produces a single CSV part file:
df.coalesce(1).write.mode("overwrite").csv('path to hdfs directory', header = True, sep = ',')
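Keep in mind that even with coalesce(1) the path you pass is still a directory: Spark writes one part-*.csv file plus the _SUCCESS marker inside it. If you need the actual file name afterwards (for example to rename or copy it), here is a rough sketch in plain Python that picks the part file(s) out of a list of paths. The helper name find_part_files and the sample names are only my illustration, not part of Spark, and I'm assuming you already have the directory listing as a list of strings.

```python
import fnmatch


def find_part_files(paths, pattern="part-*.csv"):
    """Return the paths whose file name looks like a Spark part file.

    Marker files such as _SUCCESS are ignored. `paths` is any iterable of
    path strings; only the part after the last '/' is matched.
    """
    matches = []
    for path in paths:
        file_name = path.rsplit("/", 1)[-1]
        if fnmatch.fnmatchcase(file_name, pattern):
            matches.append(path)
    return sorted(matches)


# Hypothetical listing of an output directory written with coalesce(1).
listing = [
    "out_dir/_SUCCESS",
    "out_dir/part-00000-4bc48c17-5c85-44be-bf34-3645d2b2e085-c000.csv",
]

parts = find_part_files(listing)
if len(parts) == 1:
    print("single CSV part file:", parts[0])
else:
    print("expected exactly one part file, found", len(parts))
```

If the check finds more than one part file, the write most likely ran without coalesce(1).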