I have the following Pandas datetime-indexed dataframe:
date_time | category | num_files | num_lines | worst_index |
---|---|---|---|---|
2022-07-15 23:50:00 | black | 2 | 868 | 0.01 |
2022-07-15 23:50:00 | red | 5 | 5631 | 0.01 |
2022-07-15 23:50:00 | green | 1 | 1891 | 0.00 |
2022-07-15 23:50:00 | all | 8 | 8390 | 0.01 |
2022-07-16 00:00:00 | all | 0 | 0 | 0.00 |
2022-07-16 00:10:00 | all | 0 | 0 | 0.00 |
2022-07-16 00:20:00 | black | 1 | 656 | 0.00 |
2022-07-16 00:20:00 | red | 2 | 4922 | 0.00 |
2022-07-16 00:20:00 | green | 1 | 1847 | 0.00 |
2022-07-16 00:20:00 | all | 4 | 7425 | 0.00 |
2022-07-16 00:30:00 | all | 0 | 0 | 0.00 |
The data is collected every 10 minutes for the categories "black", "red" and "green" there is a summary category "all" with respectively cumulated values for "num_files", "num_lines" and "worst_index".
In case, that num_files, num_lines or worst_index for the "all" category of a measurement point is 0 (zero), I would like to set those values for the three categories "black", "red" and "green" also to 0 (zero) in the dataframe. So, either insert a corresponding row if there is none for that timestamp so far.
Background is that I found the subsequently generated matplotlib graphs indicating wrongly for the three categories: e.g. for category "black" there should not be a direct line between timestamp "2022-07-15 23:50:00" "num_files"-value 2 and "num_files"-value 1 at timestamp "2022-07-16 00:20:00" as actually "num_files" for category black was 0 (zero) for the timestamps "2022-07-16 00:00:00" and "2022-07-16 00:10:00" in between but unfortunately the data is collected like this which I cannot change.
I tried to iterate through the datetime indexed dataframe using iterrows and to select / filter with loc but did not manage it with my too junior Python and Pandas knowledge and experience.
CodePudding user response:
You can do this with a reindexing operation, treating date_time
and category
as a multi-index. First, construct the final desired index (i.e., 10 minute separated dates with an entry for every category). The MultiIndex.from_product
method does this neatly:
drange = pd.date_range(df['date_time'].min(), df['date_time'].max(), freq='10T')
cats = ['black', 'green', 'red', 'all']
new_idx = pd.MultiIndex.from_product([drange, cats], names=['date_time', 'category'])
Then, reindex your data with the new_idx
(after temporarily turning the date/category columns to the index). Fill any
NAs created with 0:
df = df.set_index(['date_time', 'category']).reindex(new_idx).reset_index().fillna(0)
Result:
date_time category num_files num_lines worst_index
0 2022-07-15 23:50:00 black 2.0 868.0 0.01
1 2022-07-15 23:50:00 green 1.0 1891.0 0.00
2 2022-07-15 23:50:00 red 5.0 5631.0 0.01
3 2022-07-15 23:50:00 all 8.0 8390.0 0.01
4 2022-07-16 00:00:00 black 0.0 0.0 0.00
5 2022-07-16 00:00:00 green 0.0 0.0 0.00
6 2022-07-16 00:00:00 red 0.0 0.0 0.00
7 2022-07-16 00:00:00 all 0.0 0.0 0.00
8 2022-07-16 00:10:00 black 0.0 0.0 0.00
9 2022-07-16 00:10:00 green 0.0 0.0 0.00
10 2022-07-16 00:10:00 red 0.0 0.0 0.00
11 2022-07-16 00:10:00 all 0.0 0.0 0.00
12 2022-07-16 00:20:00 black 1.0 656.0 0.00
13 2022-07-16 00:20:00 green 1.0 1847.0 0.00
14 2022-07-16 00:20:00 red 2.0 4922.0 0.00
15 2022-07-16 00:20:00 all 4.0 7425.0 0.00
16 2022-07-16 00:30:00 black 0.0 0.0 0.00
17 2022-07-16 00:30:00 green 0.0 0.0 0.00
18 2022-07-16 00:30:00 red 0.0 0.0 0.00
19 2022-07-16 00:30:00 all 0.0 0.0 0.00