Change Spark day timestamp from name to number


I'm trying to get the day of the week as an int instead of the name (Mon, Tue, Wed, etc.).

I'm using this code right now:

from pyspark.sql.functions import date_format
df_ex1 = df.withColumn("day", date_format('Timestamp', 'E'))

Output:

+-------------------+----+----+----+-----+------------+-----------------+--------------+----+-----+---+
|Timestamp          |Open|High|Low |Close|Volume_(BTC)|Volume_(Currency)|Weighted_Price|Year|Month|day|
+-------------------+----+----+----+-----+------------+-----------------+--------------+----+-----+---+
|2011-12-31 09:52:00|4.39|4.39|4.39|4.39 |0.45558086  |2.0              |4.39          |2011|12   |Sat|
|2011-12-31 09:53:00|NaN |NaN |NaN |NaN  |NaN         |NaN              |NaN           |2011|12   |Sat|
|2011-12-31 09:54:00|NaN |NaN |NaN |NaN  |NaN         |NaN              |NaN           |2011|12   |Sat|
|2011-12-31 09:55:00|NaN |NaN |NaN |NaN  |NaN         |NaN              |NaN           |2011|12   |Sat|
|2011-12-31 09:56:00|NaN |NaN |NaN |NaN  |NaN         |NaN              |NaN           |2011|12   |Sat|
+-------------------+----+----+----+-----+------------+-----------------+--------------+----+-----+---+
only showing top 5 rows

CodePudding user response:

There are two ways to do this. The first is creating a custom UDF (which I do not recommend), as below:

import org.apache.spark.sql.functions.{col, udf}

// "match" is a reserved word in Scala, so the UDF needs a different name
val dayToNum = udf((day: String) => day match {
  case "Mon" => 1
  case "Tue" => 2
  case "Wed" => 3
  case "Thu" => 4
  case "Fri" => 5
  case "Sat" => 6
  case "Sun" => 7
})

then applying this to the column:

df = df.withColumn("dayAsNum", match(col("day")))

The result will then look as below:

+----------+---+--------+
|date      |day|dayAsNum|
+----------+---+--------+
|2011-12-30|Sat|6       |
|2011-12-31|Mon|1       |
+----------+---+--------+
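
Since the question uses PySpark, a rough Python equivalent of the UDF above might look like this (a sketch; day_map and day_to_num are illustrative names, not part of the original answer):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# Same Mon=1 .. Sun=7 mapping as the Scala match expression above
day_map = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7}
day_to_num = udf(lambda d: day_map.get(d), IntegerType())

df = df.withColumn("dayAsNum", day_to_num(col("day")))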

The other way is to use the dayofweek function on the date itself (note that Spark's dayofweek numbers days 1 = Sunday through 7 = Saturday), as below:

import org.apache.spark.sql.functions.{dayofweek, to_date}

df = df
  .withColumn("date", to_date(col("date"), "yyyy-MM-dd HH:mm:ss"))
  .withColumn("dayAsNum2", dayofweek(col("date")))

The result would then look as below:

+----------+---+---------+
|date      |day|dayAsNum2|
+----------+---+---------+
|2011-12-30|Sat|6        |
|2011-12-31|Mon|7        |
+----------+---+---------+

Good luck!

CodePudding user response:

There are two Spark-native ways: dayofweek and weekday (the latter is available only through SQL expressions).

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('2022-08-01', 'Mon'),
     ('2022-08-02', 'Tue'),
     ('2022-08-03', 'Wed'),
     ('2022-08-04', 'Thu'),
     ('2022-08-05', 'Fri'),
     ('2022-08-06', 'Sat'),
     ('2022-08-07', 'Sun')],
    ['Timestamp', 'day'])

df = df.withColumn('day_number_1', F.dayofweek('Timestamp'))
df = df.withColumn('day_number_2', F.expr("weekday(Timestamp)"))

df.show()
# +----------+---+------------+------------+
# | Timestamp|day|day_number_1|day_number_2|
# +----------+---+------------+------------+
# |2022-08-01|Mon|           2|           0|
# |2022-08-02|Tue|           3|           1|
# |2022-08-03|Wed|           4|           2|
# |2022-08-04|Thu|           5|           3|
# |2022-08-05|Fri|           6|           4|
# |2022-08-06|Sat|           7|           5|
# |2022-08-07|Sun|           1|           6|
# +----------+---+------------+------------+
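
If a Monday = 1 through Sunday = 7 numbering is needed (matching the custom UDF in the first answer), a minimal sketch is to shift the 0-based weekday result by one; day_number_3 is just an illustrative name:

# weekday() is 0-based (Mon=0 .. Sun=6), so adding 1 gives Mon=1 .. Sun=7
df = df.withColumn('day_number_3', F.expr("weekday(Timestamp) + 1"))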