PySpark: ignore linefeed character within a column in a CSV file


I have a CSV file where one record contains a linefeed character inside a column (COMMENTS). When I read the file with PySpark, that record spans multiple lines (3). Even using the multiLine option, it doesn't work. Below is my code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProviderAnalysis").master("local[*]").getOrCreate()
provider_df = (
    spark.read.csv("./sample_data/pp.csv", header=True, inferSchema=True, multiLine=True)
)

and below is the record from the CSV file that contains the linefeed character:

"A","B","C","D","COMMENTS","E","F","G","H","I","J","K","L","M","N","O","Q","R","S","T","U","V","X","Y","Z","AA","AB","AC","AD","AE","AF","AG","AH"
1,"S","S","R","Pxxxx xxx xxxx. xxxxx xxx ""xxxxx xxx xxxx."" xx xxx xxx xx xxx xxxxx xxxx xx 10/27/24.
xxx xxxxx xxxxx xxxx xxxxx xx 6/30/29 -yyy
10/26/2018 fffff ffffff ff: fffffff-ff","fff",,"","fff","ff","","f","","1","1","","",,"1","","","","","","","","","f","",5,"ffff","",""

If I open the file in LibreOffice Calc it is displayed as one record, but PySpark reads it as 3 lines.
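A quick way to see what Spark actually parsed (a sketch using the provider_df read above) is to count the records and inspect the COMMENTS column:

print(provider_df.count())                            # shows 3 here instead of the expected 1
provider_df.select("COMMENTS").show(truncate=False)   # the comment text is split across rows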

Has anyone faced this issue, and could anyone help me figure out how to fix it? Thanks.

CodePudding user response:

Try adding the escape option. The double-quoted COMMENTS column contains embedded double quotes (written as "" in the file), and Spark's CSV reader defaults to backslash as the escape character, so you need to declare the double quote as the escape character instead.

provider_df = spark.read.csv("./sample_data/pp.csv",
                             header=True,
                             inferSchema=True,
                             multiLine=True,
                             escape='"')  # <-- added