Using Pyspark, I'm querying a data base with some datetime info:
df.select(df.Activity_date).show(5, truncate = False)
-------------------------
|Activity_date |
-------------------------
|Jul 20, 2015, 1:06:57 PM |
|Oct 11, 2015, 10:03:14 PM|
|Dec 8, 2016, 4:28:12 PM |
|Dec 12, 2016, 10:22:44 PM|
|Dec 25, 2016, 10:37:32 PM|
-------------------------
only showing top 5 rows
I would like to do a query that extracts the Month, then the Year.
Here's where things stand:
df.select( "Activity_Type",
regexp_extract("Activity_Date",r"^(\\w{3})",1).alias("Month"),
regexp_extract("Activity_Date", r"^\\d{4}$",1).alias("Year"),
).show(5,truncate=False)
------------- ----- ----
|Activity_Type|Month|Year|
------------- ----- ----
|Ride |Jul | |
|Ride |Oct | |
|Ride |Dec | |
|Ride |Dec | |
|Ride |Dec | |
------------- ----- ----
only showing top 5 rows
As you can see, I'm having a hard time writing the regex to extract the Year part.
CodePudding user response:
And, in a classic case of further attempts, here's one possible answer:
df.select(
"Activity_Type",
regexp_extract("Activity_Date", r"^(\w{3})", 1).alias("Month"),
regexp_extract("Activity_Date", r"^\w{3}\s\d{1,2}(,)\s(.{4})",2).alias("Year"),
).show(5,truncate=False)
CodePudding user response:
\w ?\s\d{1,2},\s(\d{4})
This pattern extracts the year into the first capturing group.
https://regex101.com/r/Wo0dIm/1