I have a df in pyspark that has a date/time column. The datetime is stored as a string. I'm trying to extract the hour out of each datetime string and then count how many rows fall in hour 0, hour 1, hour 2, etc. Each string is formatted as 'YYYY-MM-DD HH:MM:SS', ex: '2016-04-26 19:49:16'. Should I split the dates and times into separate strings, or is there a quicker way?
CodePudding user response:
Maybe something like:
>>> from pyspark.sql.functions import to_timestamp, hour
>>> df = spark.createDataFrame([{"dt": "2016-04-26 19:49:16"}])
>>> df.select(hour(to_timestamp("dt", "yyyy-MM-dd HH:mm:ss")), "dt").show()
+-------------------------------------------+-------------------+
|hour(to_timestamp(dt, yyyy-MM-dd HH:mm:ss))|                 dt|
+-------------------------------------------+-------------------+
|                                         19|2016-04-26 19:49:16|
+-------------------------------------------+-------------------+
Since 'yyyy-MM-dd HH:mm:ss' is Spark's default timestamp format, to_timestamp("dt") with no explicit pattern would work here too.
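To get the per-hour counts the question asks for, group on the extracted hour. A minimal sketch, assuming the df and its dt column from above:
>>> with_hour = df.withColumn("hour", hour(to_timestamp("dt", "yyyy-MM-dd HH:mm:ss")))
>>> with_hour.groupBy("hour").count().orderBy("hour").show()
+----+-----+
|hour|count|
+----+-----+
|  19|    1|
+----+-----+
On the full dataset this gives one row per distinct hour (0-23) together with how many timestamps fall in it.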
CodePudding user response:
Try a regex.
import re

# \d+ matches each run of digits; for 'YYYY-MM-DD HH:MM:SS'
# the fourth run (index 3) is the hour.
regex = re.compile(r'\d+')
example_date = '2016-04-26 19:49:16'
print(regex.findall(example_date)[3])
output:
19
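Since the question is about a PySpark DataFrame, the same digit-matching idea can stay inside Spark via regexp_extract instead of Python's re module; a sketch, again assuming the column is named dt:
from pyspark.sql.functions import regexp_extract

# Capture the two digits between the space and the first colon of
# 'YYYY-MM-DD HH:MM:SS'; that capture group is the hour (as a string).
df.withColumn("hour", regexp_extract("dt", r" (\d{2}):", 1)) \
  .groupBy("hour").count().show()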