I have my data stored in df0:
INFERENCE_WEEK='2022-04-23'
schema = StructType(
[
StructField("week_end_date", StringType(), True),
StructField("cust_name", StringType(), True),
        # ... (remaining fields blanked due to privacy concerns)
]
)
df0=spark.read.csv(path,
header=False,
schema=schema).alias('d0')
df1=df0.filter((df0.week_end_date==INFERENCE_WEEK)).select(["cust_name"]).distinct().alias('d1')
What I am noticing is that the filter removes more cust_name values than it should. Any idea what could cause this peculiar PySpark behavior?
CodePudding user response:
That doesn't happen for me.
df.show()
+----+----------+
|   C|         D|
+----+----------+
|foo1|2009-01-05|
|foo2|2009-01-05|
|foo3|2009-01-05|
|foo4|2009-01-06|
|foo5|2009-01-07|
|foo1|2009-01-05|
|foo2|2009-01-05|
|foo3|2009-01-05|
|foo4|2009-01-06|
|foo5|2009-01-07|
+----+----------+
INFERENCE_WEEK='2009-01-05'
df.where(col('D')==f'{INFERENCE_WEEK}').select('C').distinct().show()
+----+
|   C|
+----+
|foo1|
|foo2|
|foo3|
+----+
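One thing that can produce exactly the symptom described is stray whitespace in the CSV values: the equality filter is an exact string comparison, so a padded value silently fails to match. A minimal sketch of the pitfall, using hypothetical values and no Spark at all:

```python
# Hypothetical raw values as they might come out of a CSV file;
# two of them carry stray leading/trailing spaces
raw_dates = ["2022-04-23", " 2022-04-23", "2022-04-23 ", "2022-04-25"]

INFERENCE_WEEK = "2022-04-23"

# Exact equality (what week_end_date == INFERENCE_WEEK does) drops padded rows
exact_matches = [d for d in raw_dates if d == INFERENCE_WEEK]

# Stripping whitespace before comparing recovers them
trimmed_matches = [d for d in raw_dates if d.strip() == INFERENCE_WEEK]

print(len(exact_matches))    # 1
print(len(trimmed_matches))  # 3
```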
CodePudding user response:
It is working fine for me:
>>> df0=spark.read.csv("/path to/sample1.csv",header=False,schema=schema)
>>> df0.show()
+-------------+----------+
|week_end_date| cust_name|
+-------------+----------+
|   2022-04-23|samplename|
|   2022-04-25|   abcrret|
|   2022-04-23|samplename|
|   2022-04-27|  abcrtret|
|   2022-04-28|  abcrtret|
|   2022-04-29|   abctrtr|
|   2022-04-30|   abctgrg|
|   2022-04-31|  abcrttru|
+-------------+----------+
>>> df1=df0.filter((df0.week_end_date==INFERENCE_WEEK)).select(["cust_name"]).distinct().alias('d1')
>>> df1.show()
+----------+
| cust_name|
+----------+
|samplename|
+----------+
Can you check whether week_end_date contains leading or trailing spaces?