PySpark: ignore linefeed character within a column in a CSV file


I have a CSV file where one record contains a linefeed character inside a column (COMMENTS). When I read the file with PySpark, that record spans multiple lines (3). Even using the multiLine option, it doesn't work. Below is my code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProviderAnalysis").master("local[*]").getOrCreate()
provider_df = (
    spark.read.csv("./sample_data/pp.csv", header=True, inferSchema=True, multiLine=True)
)

and below is the record from the CSV file that contains the linefeed character:

"A","B","C","D","COMMENTS","E","F","G","H","I","J","K","L","M","N","O","Q","R","S","T","U","V","X","Y","Z","AA","AB","AC","AD","AE","AF","AG","AH"
1,"S","S","R","Pxxxx xxx xxxx. xxxxx xxx ""xxxxx xxx xxxx."" xx xxx xxx xx xxx xxxxx xxxx xx 10/27/24.
xxx xxxxx xxxxx xxxx xxxxx xx 6/30/29 -yyy
10/26/2018 fffff ffffff ff: fffffff-ff","fff",,"","fff","ff","","f","","1","1","","",,"1","","","","","","","","","f","",5,"ffff","",""

If I open the file in LibreOffice Calc it is displayed as one record, but PySpark reads it as 3 lines.
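A quick way to see what Spark actually parsed (a sketch using the provider_df read above) is to count the records and inspect the COMMENTS column:

print(provider_df.count())                            # shows 3 here instead of the expected 1
provider_df.select("COMMENTS").show(truncate=False)   # the comment text is split across rows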

Has anyone faced this issue, and could anyone help me figure out how to fix it? Thanks.

CodePudding user response:

Try adding the escape option. The double-quoted COMMENTS column contains embedded double quotes (written as "" in the file), and Spark's CSV reader defaults to backslash as the escape character, so you need to declare the double quote as the escape character instead.

provider_df = spark.read.csv("./sample_data/pp.csv",
                             header=True,
                             inferSchema=True,
                             multiLine=True,
                             escape='"')  # <-- added