Why does a PySpark filter on a string column work with integers, and why does pandas behave the other way?


When I have a PySpark DataFrame with a column of numbers stored as strings and filter it using an integer, the filter matches the string values:

from pyspark.sql.functions import col

df = spark.createDataFrame([
  ("a", "1"),
  ("a", "2"),
  ("b", "1"),
  ("b", "2"),
], ["id", "number"])

df.filter(col('number') == 1).show()

results in

id  number
a   1
b   1

Whereas, when I convert it to a pandas DataFrame and apply the same filter, the result is an empty DataFrame:

pandas_df = df.toPandas()
pandas_df[pandas_df['number'] == 1]

# result: an empty DataFrame
id  number

That leads to two questions:

  1. Why does the PySpark filter function match the strings when I filter using integers?
  2. Is there a way to filter type-specifically in PySpark, producing the same results as in pandas? I could have avoided quite some time searching for an error with that functionality.

CodePudding user response:

This is the physical plan of your query:

== Physical Plan ==
*(1) Filter (isnotnull(number#18) AND (cast(number#18 as int) = 1))
+- *(1) Scan ExistingRDD[id#17,number#18]

As you can see, Spark casts the column to an integer before comparing: cast(number#18 as int) = 1.

You can access logical and physical plans with .explain().
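For example, a minimal sketch (reusing the df from the snippet above; extended=True additionally prints the parsed, analyzed, and optimized logical plans):

from pyspark.sql.functions import col

# Prints all four plans; the Filter node shows the implicit cast to int
df.filter(col('number') == 1).explain(extended=True)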

If you change your query to df.filter(col('number') == "1"), there will be no cast; the comparison stays string-to-string.
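As for the second question, comparing against a string keeps the filter type-consistent. A minimal sketch (lit() is optional here, since a bare Python string is wrapped into a literal column automatically):

from pyspark.sql.functions import col, lit

# String-to-string comparison: no implicit cast, so only rows whose
# string value is exactly '1' match, mirroring pandas' type-strict semantics
df.filter(col('number') == lit('1')).show()

pandas behaves the other way because toPandas() typically yields an object-dtype column holding Python strings, and in Python '1' == 1 is False, so every row is dropped:

pandas_df['number'].dtype          # dtype('O'): Python strings, not ints
(pandas_df['number'] == 1).any()   # False: no row compares equal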
