Add x amount of new rows with different time interval based on the timestamp and the interval value

Time:07-27

I have the following dataframe:

Timestamp            Values  Duration  10MinInterval
01-01-2019 08:20:00  xxx     00:50:00  5
01-01-2019 09:10:00  xx      00:10:00  1
01-01-2019 09:20:00  xxxxx   01:10:00  7
01-01-2019 10:30:00  xxxxxx  00:40:00  4
01-01-2019 11:10:00  xx      01:00:00  6
01-01-2019 12:10:00  xxxx    xxxx      xxxx

I would like to add the 10-minute intervals as new rows. For example, the first timestamp has an interval of 5, so I would like to add 5 new rows: 08:30, 08:40, 08:50, and so on, up to and including 09:10. The values in the other columns should stay the same. Any idea how I can do it?

It should look something like this:

Timestamp            Values  Duration  10MinInterval
01-01-2019 08:20:00  xxx     00:50:00  5
01-01-2019 08:30:00  xxx     00:50:00  5
01-01-2019 08:40:00  xxx     00:50:00  5
01-01-2019 08:50:00  xxx     00:50:00  5
01-01-2019 09:00:00  xxx     00:50:00  5
01-01-2019 09:10:00  xxx     00:50:00  5

CodePudding user response:

In PySpark, you could use sequence with an interval 10 minute step to generate the timestamps, then explode the resulting array into rows.
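As a rough plain-Python illustration of what Spark SQL's sequence(start, stop, step) evaluates to (the helper below is a hypothetical stand-in, not Spark's API):

```python
from datetime import datetime, timedelta

def sequence(start, stop, step):
    """Hypothetical stand-in for Spark SQL's sequence(): builds an
    inclusive list of timestamps from start to stop, advancing by step."""
    out = []
    t = start
    while t <= stop:
        out.append(t)
        t += step
    return out

start = datetime(2019, 1, 1, 8, 20)
# 10MinInterval = 5, so the stop bound is 50 minutes after the start
stamps = sequence(start, start + timedelta(minutes=50), timedelta(minutes=10))
# 6 timestamps: 08:20, 08:30, 08:40, 08:50, 09:00, 09:10
```

explode then turns each element of that array into its own row, copying the other columns.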

Spark 3.2

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('01-01-2019 08:20:00', 'xxx', '00:50:00', 5),
     ('01-01-2019 09:10:00', 'xx', '00:10:00', 1),
     ('01-01-2019 09:20:00', 'xxxxx', '01:10:00', 7),
     ('01-01-2019 10:30:00', 'xxxxxx', '00:40:00', 4),
     ('01-01-2019 11:10:00', 'xx', '01:00:00', 6),
     ('01-01-2019 12:10:00', 'xxxx', None, None)],
    ['Timestamp', 'Values', 'Duration', '10MinInterval'])

df = (df
    # parse the string column into a real timestamp for date arithmetic
    .withColumn('time', F.to_timestamp('Timestamp', 'MM-dd-yyyy HH:mm:ss'))
    # sequence() builds an inclusive array from `time` to
    # `time + 10MinInterval * 10 minutes`, stepping by 10 minutes;
    # explode() turns each array element into its own row
    .withColumn(
        'Timestamp',
        F.expr("explode(sequence(time, time + make_dt_interval(0, 0, 10 * `10MinInterval`), interval 10 minute))"))
    .drop('time')
)

Result:

df.show()
# +-------------------+------+--------+-------------+
# |          Timestamp|Values|Duration|10MinInterval|
# +-------------------+------+--------+-------------+
# |2019-01-01 08:20:00|   xxx|00:50:00|            5|
# |2019-01-01 08:30:00|   xxx|00:50:00|            5|
# |2019-01-01 08:40:00|   xxx|00:50:00|            5|
# |2019-01-01 08:50:00|   xxx|00:50:00|            5|
# |2019-01-01 09:00:00|   xxx|00:50:00|            5|
# |2019-01-01 09:10:00|   xxx|00:50:00|            5|
# |2019-01-01 09:10:00|    xx|00:10:00|            1|
# |2019-01-01 09:20:00|    xx|00:10:00|            1|
# |2019-01-01 09:20:00| xxxxx|01:10:00|            7|
# |2019-01-01 09:30:00| xxxxx|01:10:00|            7|
# |2019-01-01 09:40:00| xxxxx|01:10:00|            7|
# |2019-01-01 09:50:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:00:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:10:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:20:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:30:00| xxxxx|01:10:00|            7|
# |2019-01-01 10:30:00|xxxxxx|00:40:00|            4|
# |2019-01-01 10:40:00|xxxxxx|00:40:00|            4|
# |2019-01-01 10:50:00|xxxxxx|00:40:00|            4|
# |2019-01-01 11:00:00|xxxxxx|00:40:00|            4|
# +-------------------+------+--------+-------------+
# only showing top 20 rows

Spark 3.0

df = (df
    .withColumn('time', F.to_timestamp('Timestamp', 'MM-dd-yyyy HH:mm:ss'))
    # make_interval(years, months, weeks, days, hours, mins) replaces
    # make_dt_interval, which is only available from Spark 3.2
    .withColumn(
        'Timestamp',
        F.expr("explode(sequence(time, time + make_interval(0, 0, 0, 0, 0, 10 * `10MinInterval`), interval 10 minute))"))
    .drop('time')
)

Note that in both versions explode drops the 12:10:00 row, because its 10MinInterval is null and sequence then returns null; use explode_outer instead if that row should be kept (with a null Timestamp).
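Not part of the original answer, but if you are working in pandas rather than Spark, the same expansion can be sketched with pd.date_range plus DataFrame.explode (column names as in the question; the two-row frame is just sample data):

```python
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["01-01-2019 08:20:00", "01-01-2019 09:10:00"],
        format="%m-%d-%Y %H:%M:%S"),
    "Values": ["xxx", "xx"],
    "Duration": ["00:50:00", "00:10:00"],
    "10MinInterval": [5, 1],
})

# Replace each timestamp with the inclusive list of 10-minute steps it
# spans (original stamp + N new ones), then explode so every timestamp
# gets its own row; the other columns are repeated unchanged.
df["Timestamp"] = df.apply(
    lambda r: list(pd.date_range(r["Timestamp"],
                                 periods=r["10MinInterval"] + 1,
                                 freq="10min")),
    axis=1,
)
out = df.explode("Timestamp").reset_index(drop=True)
# 8 rows: six for the first record (08:20 .. 09:10), two for the second
```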