I have a df in pyspark that has a date/time column. The datetime is stored as a string. I'm trying to extract the hour out of each datetime string and then count how many rows fall in hour 0, hour 1, hour 2, etc. Each string is formatted as 'YYYY-MM-DD HH:MM:SS', ex: '2016-04-26 19:49:16'. Should I split the dates and times into separate strings, or is there a quicker way?
CodePudding user response:
Maybe something like:
>>> from pyspark.sql.functions import to_timestamp, hour
>>> df = spark.createDataFrame([{"dt": "2016-04-26 19:49:16"}])
>>> df.select(hour(to_timestamp("dt", "yyyy-MM-dd HH:mm:ss")), "dt").show()
+-------------------------------------------+-------------------+
|hour(to_timestamp(dt, yyyy-MM-dd HH:mm:ss))|                 dt|
+-------------------------------------------+-------------------+
|                                         19|2016-04-26 19:49:16|
+-------------------------------------------+-------------------+
Since 'yyyy-MM-dd HH:mm:ss' is Spark's default timestamp format, to_timestamp("dt") with no explicit pattern would work here too.
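To get the per-hour counts the question asks for, group on the extracted hour. A minimal sketch, assuming the df and its dt column from above:
>>> with_hour = df.withColumn("hour", hour(to_timestamp("dt", "yyyy-MM-dd HH:mm:ss")))
>>> with_hour.groupBy("hour").count().orderBy("hour").show()
+----+-----+
|hour|count|
+----+-----+
|  19|    1|
+----+-----+
On the full dataset this gives one row per distinct hour (0-23) together with how many timestamps fall in it.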
CodePudding user response:
Try a regex.
import re

# \d+ matches each run of digits; for 'YYYY-MM-DD HH:MM:SS'
# the fourth run (index 3) is the hour.
regex = re.compile(r'\d+')
example_date = '2016-04-26 19:49:16'
print(regex.findall(example_date)[3])
output:
19
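Since the question is about a PySpark DataFrame, the same digit-matching idea can stay inside Spark via regexp_extract instead of Python's re module; a sketch, again assuming the column is named dt:
from pyspark.sql.functions import regexp_extract

# Capture the two digits between the space and the first colon of
# 'YYYY-MM-DD HH:MM:SS'; that capture group is the hour (as a string).
df.withColumn("hour", regexp_extract("dt", r" (\d{2}):", 1)) \
  .groupBy("hour").count().show()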