I have the following dataframe:
Timestamp | Values | Duration | 10MinInterval |
---|---|---|---|
01-01-2019 08:20:00 | xxx | 00:50:00 | 5 |
01-01-2019 09:10:00 | xx | 00:10:00 | 1 |
01-01-2019 09:20:00 | xxxxx | 01:10:00 | 7 |
01-01-2019 10:30:00 | xxxxxx | 00:40:00 | 4 |
01-01-2019 11:10:00 | xx | 01:00:00 | 6 |
01-01-2019 12:10:00 | xxxx | xxxx | xxxx |
I would like to add the 10-minute intervals as new rows. For example, the first timestamp has an interval of 5, so I would like to add 5 new rows for 08:30, 08:40, 08:50 and so on, up to and including 09:10. The values in the other columns should stay the same. Any idea how I can do this?
It should look something like this:
Timestamp | Values | Duration | 10MinInterval |
---|---|---|---|
01-01-2019 08:20:00 | xxx | 00:50:00 | 5 |
01-01-2019 08:30:00 | xxx | 00:10:00 | 1 |
01-01-2019 08:40:00 | xxx | 01:10:00 | 7 |
01-01-2019 08:50:00 | xxx | 00:40:00 | 4 |
01-01-2019 09:00:00 | xxx | 01:00:00 | 6 |
01-01-2019 09:10:00 | xxx | xxx | xxx |
CodePudding user response:
In PySpark, you could probably use sequence with interval 10 minute as the step.
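For intuition, sequence over timestamps just generates an array of equally spaced values between two bounds, which explode then turns into rows. A quick sanity check (a sketch; sequence over timestamps needs Spark 2.4+):
spark.sql(
    "SELECT sequence(timestamp'2019-01-01 08:20:00', timestamp'2019-01-01 08:50:00', interval 10 minute) AS ts"
).show(truncate=False)
# returns one array column with four timestamps, 10 minutes apart: 08:20, 08:30, 08:40, 08:50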
Spark 3.2
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('01-01-2019 08:20:00', 'xxx', '00:50:00', 5),
     ('01-01-2019 09:10:00', 'xx', '00:10:00', 1),
     ('01-01-2019 09:20:00', 'xxxxx', '01:10:00', 7),
     ('01-01-2019 10:30:00', 'xxxxxx', '00:40:00', 4),
     ('01-01-2019 11:10:00', 'xx', '01:00:00', 6),
     ('01-01-2019 12:10:00', 'xxxx', None, None)],
    ['Timestamp', 'Values', 'Duration', '10MinInterval'])
df = (df
    .withColumn('time', F.to_timestamp('Timestamp', 'MM-dd-yyyy HH:mm:ss'))
    .withColumn(
        'Timestamp',
        # expand each row into one timestamp every 10 minutes, from `time` up to `time` + 10MinInterval * 10 minutes
        F.expr("explode(sequence(time, time + make_dt_interval(0, 0, 10 * `10MinInterval`), interval 10 minute))"))
    .drop('time')
)
Result:
df.show()
# +-------------------+------+--------+-------------+
# |          Timestamp|Values|Duration|10MinInterval|
# +-------------------+------+--------+-------------+
# |2019-01-01 08:20:00|   xxx|00:50:00|            5|
# |2019-01-01 08:30:00|   xxx|00:50:00|            5|
# |2019-01-01 08:40:00|   xxx|00:50:00|            5|
# |2019-01-01 08:50:00|   xxx|00:50:00|            5|
# |2019-01-01 09:00:00|   xxx|00:50:00|            5|
# |2019-01-01 09:10:00|   xxx|00:50:00|            5|
# |2019-01-01 09:10:00|    xx|00:10:00|            1|
# |2019-01-01 09:20:00|    xx|00:10:00|            1|
# |2019-01-01 09:20:00| xxxxx|01:10:00|            7|
# |2019-01-01 09:30:00| xxxxx|01:10:00|            7|
# |2019-01-01 09:40:00| xxxxx|01:10:00|            7|
# |2019-01-01 09:50:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:00:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:10:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:20:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:30:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:30:00|xxxxxx|00:40:00|            4|
# |2019-01-01 10:40:00|xxxxxx|00:40:00|            4|
# |2019-01-01 10:50:00|xxxxxx|00:40:00|            4|
# |2019-01-01 11:00:00|xxxxxx|00:40:00|            4|
# +-------------------+------+--------+-------------+
# only showing top 20 rows
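If you prefer the DataFrame API over a SQL expression string, roughly the same thing can be written with F.sequence and F.explode (a sketch, still assuming Spark 3.2+ for make_dt_interval; df_api is just an illustrative name):
df_api = (df
    .withColumn('time', F.to_timestamp('Timestamp', 'MM-dd-yyyy HH:mm:ss'))
    .withColumn(
        'Timestamp',
        F.explode(
            F.sequence(
                F.col('time'),
                F.col('time') + F.expr("make_dt_interval(0, 0, 10 * `10MinInterval`)"),
                F.expr("interval 10 minute"))))
    .drop('time')
)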
Spark 3.0
df = (df
    .withColumn('time', F.to_timestamp('Timestamp', 'MM-dd-yyyy HH:mm:ss'))
    .withColumn(
        'Timestamp',
        # same idea, but make_interval(years, months, weeks, days, hours, mins) is available on Spark 3.0
        F.expr("explode(sequence(time, time + make_interval(0, 0, 0, 0, 0, 10 * `10MinInterval`), interval 10 minute))"))
    .drop('time')
)
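Note that the last row (12:10:00) has a null 10MinInterval, so sequence returns null and explode drops that row from the result. If you want to keep it, swapping explode for explode_outer should do it, e.g. (untested sketch):
F.expr("explode_outer(sequence(time, time + make_interval(0, 0, 0, 0, 0, 10 * `10MinInterval`), interval 10 minute))")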