I have my data stored in df0:
INFERENCE_WEEK='2022-04-23'
schema = StructType(
[
StructField("week_end_date", StringType(), True),
StructField("cust_name", StringType(), True),
        # ... (remaining fields blanked due to privacy concerns)
]
)
df0=spark.read.csv(path,
header=False,
schema=schema).alias('d0')
df1=df0.filter((df0.week_end_date==INFERENCE_WEEK)).select(["cust_name"]).distinct().alias('d1')
What I am noticing is that the filter removes more cust_name values than it should. Any idea what could cause this peculiar PySpark behavior?
CodePudding user response:
That doesn't happen for me.
df.show()
+----+----------+
|   C|         D|
+----+----------+
|foo1|2009-01-05|
|foo2|2009-01-05|
|foo3|2009-01-05|
|foo4|2009-01-06|
|foo5|2009-01-07|
|foo1|2009-01-05|
|foo2|2009-01-05|
|foo3|2009-01-05|
|foo4|2009-01-06|
|foo5|2009-01-07|
+----+----------+
INFERENCE_WEEK='2009-01-05'
df.where(col('D')==f'{INFERENCE_WEEK}').select('C').distinct().show()
+----+
|   C|
+----+
|foo1|
|foo2|
|foo3|
+----+
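One thing that can produce exactly the symptom described is stray whitespace in the CSV values: the equality filter is an exact string comparison, so a padded value silently fails to match. A minimal sketch of the pitfall, using hypothetical values and no Spark at all:

```python
# Hypothetical raw values as they might come out of a CSV file;
# two of them carry stray leading/trailing spaces
raw_dates = ["2022-04-23", " 2022-04-23", "2022-04-23 ", "2022-04-25"]

INFERENCE_WEEK = "2022-04-23"

# Exact equality (what week_end_date == INFERENCE_WEEK does) drops padded rows
exact_matches = [d for d in raw_dates if d == INFERENCE_WEEK]

# Stripping whitespace before comparing recovers them
trimmed_matches = [d for d in raw_dates if d.strip() == INFERENCE_WEEK]

print(len(exact_matches))    # 1
print(len(trimmed_matches))  # 3
```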
CodePudding user response:
It is working fine for me:
>>> df0=spark.read.csv("/path to/sample1.csv",header=False,schema=schema)
>>> df0.show()
+-------------+----------+
|week_end_date| cust_name|
+-------------+----------+
|   2022-04-23|samplename|
|   2022-04-25|   abcrret|
|   2022-04-23|samplename|
|   2022-04-27|  abcrtret|
|   2022-04-28|  abcrtret|
|   2022-04-29|   abctrtr|
|   2022-04-30|   abctgrg|
|   2022-04-31|  abcrttru|
+-------------+----------+
>>> df1=df0.filter((df0.week_end_date==INFERENCE_WEEK)).select(["cust_name"]).distinct().alias('d1')
>>> df1.show()
+----------+
| cust_name|
+----------+
|samplename|
+----------+
Can you check whether week_end_date contains leading or trailing spaces?