I am trying to split the UTC offset found in timestamp_value
into a new column called utc
. I tried to use a Python regex but was not able to do it.
Thank you for your answer!
This is how my dataframe looks:
+--------+----------------------------+
|machine |timestamp_value             |
+--------+----------------------------+
|1       |2022-01-06T07:47:37.319+0000|
|2       |2022-01-06T07:47:37.319+0000|
|3       |2022-01-06T07:47:37.319+0000|
+--------+----------------------------+
This is how it should look:
+--------+------------------------+-----+
|machine |timestamp_value         |utc  |
+--------+------------------------+-----+
|1       |2022-01-06T07:47:37.319 |+0000|
|2       |2022-01-06T07:47:37.319 |+0000|
|3       |2022-01-06T07:47:37.319 |+0000|
+--------+------------------------+-----+
CodePudding user response:
You can do this with regexp_extract
and regexp_replace
respectively:
import pyspark.sql.functions as F

(df
 # Capture everything from the '+' onwards into the new utc column
 .withColumn('utc', F.regexp_extract('timestamp_value', r'.*(\+.*)', 1))
 # Strip the '+' and everything after it from timestamp_value
 .withColumn('timestamp_value', F.regexp_replace('timestamp_value', r'\+(.*)', ''))
).show(truncate=False)
+-------+-----------------------+-----+
|machine|timestamp_value        |utc  |
+-------+-----------------------+-----+
|1      |2022-01-06T07:47:37.319|+0000|
|2      |2022-01-06T07:47:37.319|+0000|
|3      |2022-01-06T07:47:37.319|+0000|
+-------+-----------------------+-----+
To better understand what that regular expression means, paste it into an online regex tester.
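The same regex logic can be checked without Spark: a minimal sketch using Python's standard re module on a single sample string (the variable names are illustrative, not part of the Spark API):

```python
import re

# Sample value in the same format as the timestamp_value column
timestamp = "2022-01-06T07:47:37.319+0000"

# Equivalent of regexp_extract: capture the '+' and everything after it
utc = re.match(r".*(\+.*)", timestamp).group(1)

# Equivalent of regexp_replace: remove the '+' and everything after it
local = re.sub(r"\+.*", "", timestamp)

print(local)  # 2022-01-06T07:47:37.319
print(utc)    # +0000
```

Note that the `+` must be escaped as `\+`, since an unescaped `+` is a regex quantifier.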