Apache Spark CSV Write from a DataFrame with Windows New Lines (CRLF)


I'm running Apache Spark 3.1.2 on a Unix-based cluster to prepare CSV files for a Windows-based ingestion system. When the Windows system ingests the CSV file created by the cluster's Spark CSV export, it fails to parse the CSV because the lines end with Unix-style LF (\n) line endings, while the Windows system expects CRLF (\r\n) line endings.

Is there a way to configure the Apache Spark CSV exporter to write Windows-style line endings even though it runs in a Unix environment? Alternatively, is there a Scala tool that can be run after the CSV write to convert the file to Windows line endings before it is sent to the Windows system?

I've seen the .option("lineSep", "\r\n") option, but I believe that's for READING only.

CodePudding user response:

  1. Ugly solution - if your fields don't need escaping, you can append \r to the last field before writing (see the first sketch after this list).
  2. Still ugly - if your CSV fields don't need escaping (no commas, quotes, or newlines in the data), you can build the lines manually: join all columns with commas, append \r at the end, and write with the text datasource (second sketch below).
  3. Post-processing - save as CSV, read the result back as text, append \r to each line, and save as text again (third sketch below).
  4. If the files are not too big - and I guess they are not, since you are transferring them to another machine for processing - you can use Linux tools to add the \r: sed, perl, or just the unix2dos utility.
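A minimal sketch of option 1, assuming none of the fields end up quoted by the CSV writer; the writeCrlfCsv name, the DataFrame, and the output path are placeholders. Note that a header row, if enabled, would still end in a bare \n:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, concat, lit}

// Append "\r" to the last column; the CSV writer then terminates each
// line with "\n", so the file ends up with "\r\n" line endings.
// Nulls are coalesced to "" so concat doesn't null out the field.
def writeCrlfCsv(df: DataFrame, path: String): Unit = {
  val last = df.columns.last
  df.withColumn(last, concat(coalesce(col(last).cast("string"), lit("")), lit("\r")))
    .write
    .csv(path)
}
```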
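Option 2 as a sketch under the same no-escaping assumption; names and paths are again placeholders. concat_ws skips nulls, so each column is coalesced to an empty string first to keep the comma count stable, and note the text datasource writes no header row:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, concat, concat_ws, lit}

// Join all columns with commas ourselves (no quoting or escaping happens
// here, so the data must not contain commas, quotes, or newlines),
// append "\r", and let the text writer add the trailing "\n".
def writeCrlfAsText(df: DataFrame, path: String): Unit = {
  val cols = df.columns.map(c => coalesce(col(c).cast("string"), lit("")))
  val line = concat(concat_ws(",", cols: _*), lit("\r"))
  df.select(line.as("value")).write.text(path)
}
```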
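And option 3 as a sketch, assuming df and spark are in scope and using placeholder /tmp paths. Reading the staged CSV back through the text datasource strips the \n terminators, so appending \r and rewriting as text produces \r\n on every line, header included:

```scala
import org.apache.spark.sql.functions.{col, concat, lit}

// Stage 1: normal CSV write (LF line endings).
df.write.option("header", "true").csv("/tmp/stage_lf")

// Stage 2: read the staged files back as plain text lines, append "\r",
// and rewrite as text; the text writer terminates each line with "\n".
spark.read.text("/tmp/stage_lf")
  .select(concat(col("value"), lit("\r")).as("value"))
  .write
  .text("/tmp/out_crlf")
```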

CodePudding user response:

I had to post-process the file. I coalesced the DataFrame to 1 partition and wrote out the CSV, then used a Java BufferedReader to load the file line by line. I then used a BufferedWriter to write each line back out, injecting \r\n between the lines... SO LAME.
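Roughly, the conversion step looked like the sketch below (placeholder paths; the single part file produced by coalesce(1) has a run-specific name and must be located first):

```scala
import java.io.{BufferedReader, BufferedWriter, FileReader, FileWriter}

// Rewrite an LF-terminated file with CRLF line endings, line by line.
def lfToCrlf(inPath: String, outPath: String): Unit = {
  val reader = new BufferedReader(new FileReader(inPath))
  val writer = new BufferedWriter(new FileWriter(outPath))
  try {
    var line = reader.readLine() // readLine() strips the "\n" terminator
    while (line != null) {
      writer.write(line)
      writer.write("\r\n")       // re-terminate with CRLF
      line = reader.readLine()
    }
  } finally {
    reader.close()
    writer.close()
  }
}
```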
