I'm trying to get the day of the week as an integer instead of Mon, Tue, Wed, etc.
I'm using this code right now:
from pyspark.sql.functions import date_format
df_ex1 = df.withColumn("day", date_format('Timestamp', 'E'))
Output:
+-------------------+----+----+----+-----+------------+-----------------+--------------+----+-----+---+
|Timestamp          |Open|High|Low |Close|Volume_(BTC)|Volume_(Currency)|Weighted_Price|Year|Month|day|
+-------------------+----+----+----+-----+------------+-----------------+--------------+----+-----+---+
|2011-12-31 09:52:00|4.39|4.39|4.39|4.39 |0.45558086  |2.0              |4.39          |2011|12   |Sat|
|2011-12-31 09:53:00|NaN |NaN |NaN |NaN  |NaN         |NaN              |NaN           |2011|12   |Sat|
|2011-12-31 09:54:00|NaN |NaN |NaN |NaN  |NaN         |NaN              |NaN           |2011|12   |Sat|
|2011-12-31 09:55:00|NaN |NaN |NaN |NaN  |NaN         |NaN              |NaN           |2011|12   |Sat|
|2011-12-31 09:56:00|NaN |NaN |NaN |NaN  |NaN         |NaN              |NaN           |2011|12   |Sat|
+-------------------+----+----+----+-----+------------+-----------------+--------------+----+-----+---+
only showing top 5 rows
CodePudding user response:
There are two ways to do this. The first is to create a custom UDF (which I do not recommend), shown here in Scala:
import org.apache.spark.sql.functions.{col, udf}

// `match` is a reserved word in Scala, so the UDF needs a different name
val dayToNum = udf((day: String) => day match {
  case "Mon" => 1
  case "Tue" => 2
  case "Wed" => 3
  case "Thu" => 4
  case "Fri" => 5
  case "Sat" => 6
  case "Sun" => 7
})
Then apply it to the column:
df = df.withColumn("dayAsNum", dayToNum(col("day")))
The result will then look as below:
+----------+---+--------+
|date      |day|dayAsNum|
+----------+---+--------+
|2011-12-30|Fri|5       |
|2011-12-31|Sat|6       |
+----------+---+--------+
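Since the question uses PySpark, here is a sketch of the same mapping as a Python UDF (the dictionary and the day_to_num name are illustrative, not from the original answer):

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Illustrative PySpark equivalent of the Scala UDF above;
# returns None for any unexpected abbreviation
day_map = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7}
day_to_num = F.udf(lambda d: day_map.get(d), IntegerType())

df = df.withColumn("dayAsNum", day_to_num(F.col("day")))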
The other way is to use the dayofweek API (applied to the date), as below:
import org.apache.spark.sql.functions.{dayofweek, to_date}

df = df
  .withColumn("date", to_date(col("date"), "yyyy-MM-dd HH:mm:ss"))
  .withColumn("dayAsNum2", dayofweek(col("date")))
The result would then look as below:
+----------+---+---------+
|date      |day|dayAsNum2|
+----------+---+---------+
|2011-12-30|Fri|6        |
|2011-12-31|Sat|7        |
+----------+---+---------+
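If you need Monday = 1 through Sunday = 7 rather than Spark's Sunday = 1 numbering from dayofweek, a small arithmetic shift works; a minimal PySpark sketch (the dayAsNum3 column name is illustrative):

from pyspark.sql import functions as F

# dayofweek: Sun=1 ... Sat=7; shift to Mon=1 ... Sun=7
df = df.withColumn("dayAsNum3", ((F.dayofweek("date") + 5) % 7) + 1)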
Good luck!
CodePudding user response:
There are two Spark-native ways: dayofweek and weekday (the latter is available only through SQL expressions).
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('2022-08-01', 'Mon'),
     ('2022-08-02', 'Tue'),
     ('2022-08-03', 'Wed'),
     ('2022-08-04', 'Thu'),
     ('2022-08-05', 'Fri'),
     ('2022-08-06', 'Sat'),
     ('2022-08-07', 'Sun')],
    ['Timestamp', 'day'])

df = df.withColumn('day_number_1', F.dayofweek('Timestamp'))      # Sunday = 1 ... Saturday = 7
df = df.withColumn('day_number_2', F.expr("weekday(Timestamp)"))  # Monday = 0 ... Sunday = 6
df.show()
# +----------+---+------------+------------+
# | Timestamp|day|day_number_1|day_number_2|
# +----------+---+------------+------------+
# |2022-08-01|Mon|           2|           0|
# |2022-08-02|Tue|           3|           1|
# |2022-08-03|Wed|           4|           2|
# |2022-08-04|Thu|           5|           3|
# |2022-08-05|Fri|           6|           4|
# |2022-08-06|Sat|           7|           5|
# |2022-08-07|Sun|           1|           6|
# +----------+---+------------+------------+
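Since weekday is 0-based (Monday = 0), adding one inside the expression gives the common Monday = 1 numbering; a minimal sketch (the day_number_3 name is illustrative):

df = df.withColumn('day_number_3', F.expr("weekday(Timestamp) + 1"))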