Wrong encoding when reading csv file with pyspark


For a university course, I run the pyspark-notebook Docker image:

docker pull jupyter/pyspark-notebook
docker run -it --rm -p 8888:8888 -v /path/to/my/working/directory:/home/jovyan/work jupyter/pyspark-notebook

and then run the following Python code:

import pyspark 
from pyspark.sql import SparkSession
from pyspark.sql.types import *

sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)
spark

listings_df = spark.read.csv("listings.csv", header=True, mode='DROPMALFORMED') 
# adding encoding="utf8" to the line above doesn't help either
listings_df.printSchema()

The problem appears while reading the file. Spark seems to read it incorrectly (possibly an encoding problem?): after reading, listings_df has 16494 rows, while the correct number is 16478 (checked with pandas.read_csv()). You can also see that something is definitely broken by running
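A common cause of such a row-count mismatch is a quoted CSV field that contains a line break: a line-based reader counts more "rows" than a real CSV parser. A minimal sketch with made-up data (independent of the actual listings.csv), using Python's standard csv module:

```python
import csv
import io

# Made-up CSV where one quoted field contains an embedded newline.
raw = 'room_type,description\n"Private room","cozy\nand quiet"\n'

physical_lines = raw.count("\n")              # 3 physical lines
records = list(csv.reader(io.StringIO(raw)))  # header + 1 data row
print(physical_lines, len(records))           # 3 2
```

A naive splitter would report one extra row here; Spark behaves the same way unless told to handle multi-line records.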

listings_df.groupBy("room_type").count().show()

which gives the following output:

+---------------+-----+
|      room_type|count|
+---------------+-----+
|            169|    1|
|        4.88612|    1|
|        4.90075|    1|
|    Shared room|   44|
|             35|    1|
|            187|    1|
|           null|   16|
|             70|    1|
|             27|    1|
|             75|    1|
|     Hotel room|  109|
|            198|    1|
|             60|    1|
|            280|    1|
|Entire home/apt|12818|
|            220|    1|
|            190|    1|
|            156|    1|
|            450|    1|
|        4.88865|    1|
+---------------+-----+
only showing top 20 rows

while the real room_type values are only ['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'].

Spark info which might be useful:

SparkSession - in-memory

SparkContext

Spark UI

Version: v3.1.2
Master: local[*]
AppName: pyspark-shell

And encoding of the file

!file listings.csv
listings.csv: UTF-8 Unicode text

listings.csv is an Airbnb statistics CSV file downloaded from here

I've also uploaded all the runnable code to Colab

CodePudding user response:

I think specifying the encoding when reading the file should solve the problem, so add encoding="utf8" to the arguments of spark.read.csv() for listings_df.

As shown below:

listings_df = spark.read.csv("listings.csv", encoding="utf8", header=True, mode='DROPMALFORMED')

CodePudding user response:

There are two things that I've found:

  1. Some fields contain quotes that need to be escaped (escape='"')
  2. Also, @JosefZ mentioned unwanted line breaks inside fields (multiLine=True)
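The escape='"' part corresponds to RFC 4180-style quoting, where a double quote inside a quoted field is represented by doubling it (Spark's default escape character is a backslash instead). A quick illustration with Python's standard csv module and made-up data:

```python
import csv
import io

# Made-up row using RFC 4180 quoting: "" inside a quoted field means a literal ".
raw = 'name,note\n"Flat ""Deluxe""","near center"\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows[1][0])  # Flat "Deluxe"
```

With Spark's default settings such a row is mis-parsed, which is why rating and price values leak into the room_type column above.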

This is how you should read it:

input_df = spark.read.csv(path, header=True, multiLine=True, escape='"')

output_df = input_df.groupBy("room_type").count()
output_df.show()
+---------------+-----+
|      room_type|count|
+---------------+-----+
|    Shared room|   44|
|     Hotel room|  110|
|Entire home/apt|12829|
|   Private room| 3495|
+---------------+-----+