In PySpark, I have a DataFrame with a column 'Date' that contains a date and time (a timestamp) for each row in the table.
I would like to query all the rows in the DataFrame that happened on the previous day (i.e. the day before the query is executed).
I see this doc https://www.datasciencemadesimple.com/subtract-or-add-days-months-and-years-to-timestamp-in-pyspark/ which talks about how to create a column in a DataFrame that subtracts a day from the values in another column.
But what I am looking for is how to subtract a day from the current date and use that in a filter.
CodePudding user response:
@Kashyap's comment says it all, but here is an example:
from pyspark.sql.functions import col
from datetime import date, timedelta

# Compute yesterday's date on the driver
yesterday = date.today() - timedelta(days=1)

# Keep only the rows whose Date equals yesterday
df.filter(col("Date") == yesterday).show()
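If you prefer to keep the whole computation inside Spark, so that "yesterday" is evaluated at query time from the cluster's current date rather than the driver's local clock, a minimal alternative sketch uses date_sub with current_date (this assumes Date is a date column, as above):

from pyspark.sql.functions import col, current_date, date_sub

# date_sub(current_date(), 1) is evaluated by Spark when the query runs
df.filter(col("Date") == date_sub(current_date(), 1)).show()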
If your Date column actually contains full timestamps, truncate them to the start of the day first so that each value compares equal to a plain date.
from pyspark.sql.functions import col, date_trunc
from datetime import date, timedelta

yesterday = date.today() - timedelta(days=1)

# date_trunc("day", ...) zeroes out the time-of-day portion,
# so rows from any time yesterday match the date literal
df.filter(date_trunc("day", col("Date")) == yesterday).show()
Note that date_trunc only works well with timestamp columns, not strings.
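If Date is a string instead, one option is to parse it with to_date before comparing. This is a sketch under an assumption: the format string "yyyy-MM-dd HH:mm:ss" is not taken from the question, so adjust it to match your data.

from pyspark.sql.functions import col, to_date
from datetime import date, timedelta

yesterday = date.today() - timedelta(days=1)

# to_date parses the string and drops the time component;
# the format string here is an assumed example
df.filter(to_date(col("Date"), "yyyy-MM-dd HH:mm:ss") == yesterday).show()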
Make sure you replace df with your DataFrame name.
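For a quick end-to-end check, here is a self-contained sketch (assuming a local Spark session) that builds a tiny DataFrame with one row from yesterday and one from today, then applies the filter. Only the first row should survive:

from datetime import date, datetime, timedelta

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_trunc

spark = SparkSession.builder.master("local[*]").getOrCreate()

yesterday = date.today() - timedelta(days=1)

# Two sample rows: one timestamped 09:00 yesterday, one right now
df = spark.createDataFrame(
    [
        (1, datetime.combine(yesterday, datetime.min.time()) + timedelta(hours=9)),
        (2, datetime.now()),
    ],
    ["id", "Date"],
)

# Expect only id=1 in the output
df.filter(date_trunc("day", col("Date")) == yesterday).show()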