When I have a PySpark DataFrame with a column of numbers stored as strings and filter it using an integer, the filter matches against the strings:
from pyspark.sql.functions import col

df = spark.createDataFrame([
    ("a", "1"),
    ("a", "2"),
    ("b", "1"),
    ("b", "2"),
], ["id", "number"])
df.filter(col('number') == 1)
results in
id number
a  1
b  1
whereas, when I convert it to a pandas DataFrame and apply the same filter, the result is an empty DataFrame:
pandas_df = df.toPandas()
pandas_df[pandas_df['number']==1]
# result
id number
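To reproduce the pandas side without a Spark session, here is a minimal sketch using the same data the question's toPandas() call would produce (the "number" column stays a string column):

```python
import pandas as pd

# Same data as df.toPandas() would yield: "number" holds strings, not ints.
pandas_df = pd.DataFrame({"id": ["a", "a", "b", "b"],
                          "number": ["1", "2", "1", "2"]})

# pandas compares element-wise and type-strictly: '1' == 1 is False,
# so filtering with an integer returns no rows.
empty = pandas_df[pandas_df["number"] == 1]

# Convert the column explicitly if an integer comparison is intended.
matches = pandas_df[pandas_df["number"].astype(int) == 1]
```

The explicit astype(int) conversion makes the intended comparison type visible in the code, which is exactly the behavior the question misses in PySpark.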
That leads to two questions:
- Why does the PySpark filter match the strings when I filter using an integer?
- Is there a way to filter in a type-specific way in PySpark, producing the same result as pandas? That would have saved me quite some time searching for an error.
CodePudding user response:
This is the physical plan of your query:
== Physical Plan ==
*(1) Filter (isnotnull(number#18) AND (cast(number#18 as int) = 1))
+- *(1) Scan ExistingRDD[id#17,number#18]
As you can see, Spark is casting the column to integer: cast(number#18 as int) = 1.
You can access the logical and physical plans with .explain().
If you change your query to df.filter(col('number') == "1"), there will be no casting.