I have a CSV file in which one record contains a linefeed character in a column (COMMENTS). When I read the file with PySpark, the record is split across multiple lines (3). Even with the multiLine option, it doesn't work. Below is my code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProviderAnalysis").master("local[*]").getOrCreate()
provider_df = (
    spark.read.csv("./sample_data/pp.csv", header=True, inferSchema=True, multiLine=True)
)
and below is the record from the CSV file containing the linefeed character:
"A","B","C","D","COMMENTS","E","F","G","H","I","J","K","L","M","N","O","Q","R","S","T","U","V","X","Y","Z","AA","AB","AC","AD","AE","AF","AG","AH"
1,"S","S","R","Pxxxx xxx xxxx. xxxxx xxx ""xxxxx xxx xxxx."" xx xxx xxx xx xxx xxxxx xxxx xx 10/27/24.
xxx xxxxx xxxxx xxxx xxxxx xx 6/30/29 -yyy
10/26/2018 fffff ffffff ff: fffffff-ff","fff",,"","fff","ff","","f","","1","1","","",,"1","","","","","","","","","f","",5,"ffff","",""
If I open the file in LibreOffice Calc it is displayed as one record, but PySpark reads it as three lines.
Has anyone faced this issue, and could anyone help me with how to fix it? Thanks
CodePudding user response:
Try adding the escape option. Your COMMENTS column contains doubled double quotes ("") inside a double-quoted field, which is the standard CSV way of embedding a quote character, but Spark's CSV reader defaults its escape character to backslash (\). Setting the escape character to the double quote itself lets the parser recognize the embedded quotes and keep reading the multi-line field as one value:
provider_df = spark.read.csv(
    "./sample_data/pp.csv",
    header=True,
    inferSchema=True,
    multiLine=True,
    escape='"')  # <-- added: treat "" inside a quoted field as an escaped quote