For a university course, I run the pyspark-notebook Docker image:
docker pull jupyter/pyspark-notebook
docker run -it --rm -p 8888:8888 -v /path/to/my/working/directory:/home/jovyan/work jupyter/pyspark-notebook
And then run the following Python code:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
sc = pyspark.SparkContext('local[*]')  # local master, using all available cores
spark = SparkSession(sc)
spark
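(For reference, an equivalent and more idiomatic way to create the session is the builder API; a sketch, with master and app name matching the session info shown further below:)

spark = (SparkSession.builder
         .master("local[*]")
         .appName("pyspark-shell")
         .getOrCreate())  # reuses an existing session if one is running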
listings_df = spark.read.csv("listings.csv", header=True, mode='DROPMALFORMED')
# adding encoding="utf8" to the line above doesn't help either
listings_df.printSchema()
The problem appears when reading the file. It seems that Spark reads it incorrectly (possibly an encoding problem?): after reading, listings_df has 16494 rows, while the correct number is 16478 (checked with pandas.read_csv()). You can also see that something is definitely broken by running
listings_df.groupBy("room_type").count().show()
which gives the following output:
+---------------+-----+
| room_type|count|
+---------------+-----+
| 169| 1|
| 4.88612| 1|
| 4.90075| 1|
| Shared room| 44|
| 35| 1|
| 187| 1|
| null| 16|
| 70| 1|
| 27| 1|
| 75| 1|
| Hotel room| 109|
| 198| 1|
| 60| 1|
| 280| 1|
|Entire home/apt|12818|
| 220| 1|
| 190| 1|
| 156| 1|
| 450| 1|
| 4.88865| 1|
+---------------+-----+
only showing top 20 rows
while the real room_type values are only ['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'].
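For reference, this is roughly how I compared the row counts (a minimal sketch; assumes pandas is available in the notebook image):

import pandas as pd

# pandas parses quoted multi-line fields correctly by default
pandas_count = len(pd.read_csv("listings.csv"))  # 16478
spark_count = listings_df.count()                # 16494
print(pandas_count, spark_count)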
Spark info which might be useful:

SparkSession - in-memory
SparkContext
  Version: v3.1.2
  Master: local[*]
  AppName: pyspark-shell
And the encoding of the file:
!file listings.csv
listings.csv: UTF-8 Unicode text
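A quick way to check for embedded line breaks inside quoted fields (a standard-library sketch; assumes the file is plain UTF-8) is to compare the quote-aware record count with the raw line count:

import csv

with open("listings.csv", newline="", encoding="utf-8") as f:
    records = sum(1 for _ in csv.reader(f))  # quote-aware record count
with open("listings.csv", encoding="utf-8") as f:
    raw_lines = sum(1 for _ in f)            # raw newline count
print(records, raw_lines)  # a difference means fields contain newlines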
listings.csv is an Airbnb statistics CSV file downloaded from here.
I've also uploaded all the runnable code to Colab.
CodePudding user response:
I think specifying the encoding when reading the file should solve the problem. Add encoding="utf8" to the arguments of spark.read.csv() for listings_df, as shown below:
listings_df = spark.read.csv("listings.csv", encoding="utf8", header=True, mode='DROPMALFORMED')
CodePudding user response:
There are two things that I've found:
- Some fields contain quotes that need to be escaped (escape='"')
- As @JosefZ mentioned, there are unwanted line breaks inside quoted fields (multiLine=True)

This is how you should read it:
input_df = spark.read.csv(path, header=True, multiLine=True, escape='"')
output_df = input_df.groupBy("room_type").count()
output_df.show()
+---------------+-----+
| room_type|count|
+---------------+-----+
| Shared room| 44|
| Hotel room| 110|
|Entire home/apt|12829|
| Private room| 3495|
+---------------+-----+
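As a sanity check, these four counts sum to 44 + 110 + 12829 + 3495 = 16478, which matches the row count reported by pandas.read_csv(). A minimal sketch to confirm the total:

# per-category counts sum to 16478, the same total pandas reported
print(input_df.count())  # expected: 16478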