I have the following dataframe:
Timestamp | Values | Duration | 10MinInterval |
---|---|---|---|
01-01-2019 08:20:00 | xxx | 00:50:00 | 5 |
01-01-2019 09:10:00 | xx | 00:10:00 | 1 |
01-01-2019 09:20:00 | xxxxx | 01:10:00 | 7 |
01-01-2019 10:30:00 | xxxxxx | 00:40:00 | 4 |
01-01-2019 11:10:00 | xx | 01:00:00 | 6 |
01-01-2019 12:10:00 | xxxx | xxxx | xxxx |
I would like to add the 10-minute intervals as new rows. For example, the first timestamp has an interval of 5, so I would like to add 5 new rows for 08:30, 08:40, 08:50 and so on, up to and including 09:10. The values in the other columns should stay the same. Any idea how I can do this?
It should look something like this:
Timestamp | Values | Duration | 10MinInterval |
---|---|---|---|
01-01-2019 08:20:00 | xxx | 00:50:00 | 5 |
01-01-2019 08:30:00 | xxx | 00:10:00 | 1 |
01-01-2019 08:40:00 | xxx | 01:10:00 | 7 |
01-01-2019 08:50:00 | xxx | 00:40:00 | 4 |
01-01-2019 09:00:00 | xxx | 01:00:00 | 6 |
01-01-2019 09:10:00 | xxx | xxx | xxx |
CodePudding user response:
In PySpark, you could probably use sequence with interval 10 minute as the step.
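For intuition, sequence over timestamps just generates an array of equally spaced values between two bounds, which explode then turns into rows. A quick sanity check (a sketch; sequence over timestamps needs Spark 2.4+):
spark.sql(
    "SELECT sequence(timestamp'2019-01-01 08:20:00', timestamp'2019-01-01 08:50:00', interval 10 minute) AS ts"
).show(truncate=False)
# returns one array column with four timestamps, 10 minutes apart: 08:20, 08:30, 08:40, 08:50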
Spark 3.2
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('01-01-2019 08:20:00', 'xxx', '00:50:00', 5),
     ('01-01-2019 09:10:00', 'xx', '00:10:00', 1),
     ('01-01-2019 09:20:00', 'xxxxx', '01:10:00', 7),
     ('01-01-2019 10:30:00', 'xxxxxx', '00:40:00', 4),
     ('01-01-2019 11:10:00', 'xx', '01:00:00', 6),
     ('01-01-2019 12:10:00', 'xxxx', None, None)],
    ['Timestamp', 'Values', 'Duration', '10MinInterval'])
df = (df
    .withColumn('time', F.to_timestamp('Timestamp', 'MM-dd-yyyy HH:mm:ss'))
    .withColumn(
        'Timestamp',
        # expand each row into one timestamp every 10 minutes, from `time` up to `time` + 10MinInterval * 10 minutes
        F.expr("explode(sequence(time, time + make_dt_interval(0, 0, 10 * `10MinInterval`), interval 10 minute))"))
    .drop('time')
)
Result:
df.show()
# +-------------------+------+--------+-------------+
# |          Timestamp|Values|Duration|10MinInterval|
# +-------------------+------+--------+-------------+
# |2019-01-01 08:20:00|   xxx|00:50:00|            5|
# |2019-01-01 08:30:00|   xxx|00:50:00|            5|
# |2019-01-01 08:40:00|   xxx|00:50:00|            5|
# |2019-01-01 08:50:00|   xxx|00:50:00|            5|
# |2019-01-01 09:00:00|   xxx|00:50:00|            5|
# |2019-01-01 09:10:00|   xxx|00:50:00|            5|
# |2019-01-01 09:10:00|    xx|00:10:00|            1|
# |2019-01-01 09:20:00|    xx|00:10:00|            1|
# |2019-01-01 09:20:00| xxxxx|01:10:00|            7|
# |2019-01-01 09:30:00| xxxxx|01:10:00|            7|
# |2019-01-01 09:40:00| xxxxx|01:10:00|            7|
# |2019-01-01 09:50:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:00:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:10:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:20:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:30:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:30:00|xxxxxx|00:40:00|            4|
# |2019-01-01 10:40:00|xxxxxx|00:40:00|            4|
# |2019-01-01 10:50:00|xxxxxx|00:40:00|            4|
# |2019-01-01 11:00:00|xxxxxx|00:40:00|            4|
# +-------------------+------+--------+-------------+
# only showing top 20 rows
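If you prefer the DataFrame API over a SQL expression string, roughly the same thing can be written with F.sequence and F.explode (a sketch, still assuming Spark 3.2+ for make_dt_interval; df_api is just an illustrative name):
df_api = (df
    .withColumn('time', F.to_timestamp('Timestamp', 'MM-dd-yyyy HH:mm:ss'))
    .withColumn(
        'Timestamp',
        F.explode(
            F.sequence(
                F.col('time'),
                F.col('time') + F.expr("make_dt_interval(0, 0, 10 * `10MinInterval`)"),
                F.expr("interval 10 minute"))))
    .drop('time')
)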
Spark 3.0
df = (df
    .withColumn('time', F.to_timestamp('Timestamp', 'MM-dd-yyyy HH:mm:ss'))
    .withColumn(
        'Timestamp',
        # same idea, but make_interval(years, months, weeks, days, hours, mins) is available on Spark 3.0
        F.expr("explode(sequence(time, time + make_interval(0, 0, 0, 0, 0, 10 * `10MinInterval`), interval 10 minute))"))
    .drop('time')
)
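Note that the last row (12:10:00) has a null 10MinInterval, so sequence returns null and explode drops that row from the result. If you want to keep it, swapping explode for explode_outer should do it, e.g. (untested sketch):
F.expr("explode_outer(sequence(time, time + make_interval(0, 0, 0, 0, 0, 10 * `10MinInterval`), interval 10 minute))")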