Home > Mobile >  Python df duplicate row by timestamp
Python df duplicate row by timestamp

Time:01-31

After several hours looking for help, I cant find anything useful for me. Im facing with this problem: I have a DataFrame with the structure (little example)

    site    host
0   PO4102  hostname1
1   PO4102  hostname2

and a second Panas Series with differents timestamps

ini_time = '2022-11-09 00:00:00'
final_time = '2023-01-09 00:00:00'
df1 = pd.date_range(ini_time, final_time, freq="H")

DatetimeIndex(['2022-11-09 00:00:00', '2022-11-09 01:00:00',
               '2022-11-09 02:00:00', '2022-11-09 03:00:00',
               '2022-11-09 04:00:00', '2022-11-09 05:00:00',
               '2022-11-09 06:00:00', '2022-11-09 07:00:00',
               '2022-11-09 08:00:00', '2022-11-09 09:00:00',
               ...
               '2023-01-08 15:00:00', '2023-01-08 16:00:00',
               '2023-01-08 17:00:00', '2023-01-08 18:00:00',
               '2023-01-08 19:00:00', '2023-01-08 20:00:00',
               '2023-01-08 21:00:00', '2023-01-08 22:00:00',
               '2023-01-08 23:00:00', '2023-01-09 00:00:00'],
              dtype='datetime64[ns]', length=1465, freq='H')

My objetive is insert each date (1465 in this example) per items in DataFrame (duplicate df items). The final result will be a DataFrame with 2930 lines (2 lines per timestamp) Should look like

    site    host        timestamp
0   PO4102  hostname1   2022-11-09 00:00:00
1   PO4102  hostname2   2022-11-09 00:00:00
2   PO4102  hostname1   2022-11-09 01:00:00
3   PO4102  hostname2   2022-11-09 01:00:00
4   PO4102  hostname1   2022-11-09 02:00:00
5   PO4102  hostname2   2022-11-09 02:00:00
.....
2929    PO4102  hostname1   2022-12-10 00:00:00
2930    PO4102  hostname2   2022-12-10 00:00:00

Can you help me?

Thanks in advance

CodePudding user response:

Use how='cross' as parameter of merge function to get the cartesian product of both dataframes:

df1 = pd.DataFrame({'timestamp': pd.date_range(ini_time, final_time, freq="H")})
out = df.merge(df1, how='cross').sort_values('timestamp', ignore_index=True)
print(out)

# Output
        site       host           timestamp
0     PO4102  hostname1 2022-11-09 00:00:00
1     PO4102  hostname2 2022-11-09 00:00:00
2     PO4102  hostname2 2022-11-09 01:00:00
3     PO4102  hostname1 2022-11-09 01:00:00
4     PO4102  hostname1 2022-11-09 02:00:00
...      ...        ...                 ...
2925  PO4102  hostname1 2023-01-08 22:00:00
2926  PO4102  hostname2 2023-01-08 23:00:00
2927  PO4102  hostname1 2023-01-08 23:00:00
2928  PO4102  hostname1 2023-01-09 00:00:00
2929  PO4102  hostname2 2023-01-09 00:00:00

[2930 rows x 3 columns]
  • Related