Home > database >  Extract Year from (Month Day, Year, HH:MM:SS) datetime format using Regex in Pyspark
Extract Year from (Month Day, Year, HH:MM:SS) datetime format using Regex in Pyspark

Time:03-28

Using Pyspark, I'm querying a data base with some datetime info:

df.select(df.Activity_date).show(5, truncate = False)
 ------------------------- 
|Activity_date            |
 ------------------------- 
|Jul 20, 2015, 1:06:57 PM |
|Oct 11, 2015, 10:03:14 PM|
|Dec 8, 2016, 4:28:12 PM  |
|Dec 12, 2016, 10:22:44 PM|
|Dec 25, 2016, 10:37:32 PM|
 ------------------------- 
only showing top 5 rows

I would like to do a query that extracts the Month, then the Year.

Here's where things stand:

df.select( "Activity_Type", 
regexp_extract("Activity_Date",r"^(\\w{3})",1).alias("Month"), 
regexp_extract("Activity_Date", r"^\\d{4}$",1).alias("Year"), 
).show(5,truncate=False)
 ------------- ----- ---- 
|Activity_Type|Month|Year|
 ------------- ----- ---- 
|Ride         |Jul  |    |
|Ride         |Oct  |    |
|Ride         |Dec  |    |
|Ride         |Dec  |    |
|Ride         |Dec  |    |
 ------------- ----- ---- 
only showing top 5 rows

As you can see, I'm having a hard time writing the regex to extract the Year part.

CodePudding user response:

And, in a classic case of further attempts, here's one possible answer:

df.select(
    "Activity_Type",
    regexp_extract("Activity_Date", r"^(\w{3})", 1).alias("Month"),
    regexp_extract("Activity_Date", r"^\w{3}\s\d{1,2}(,)\s(.{4})",2).alias("Year"),
).show(5,truncate=False)

CodePudding user response:

\w ?\s\d{1,2},\s(\d{4}) This pattern extracts the year into the first capturing group. https://regex101.com/r/Wo0dIm/1

  • Related