I have about 19 data frames with datetime indexes and I want to iterate over all of them in parallel. I start with one df, slice it to a given time range, and do the same with the others; that completes one iteration of the while loop. In the next iteration I want to create a new slice, starting at the end of the old slice and running to the next closest time stamp across all data frames. I came up with the code below and it works, but because of the high number of iterations it is quite slow. Is there a faster way to do this?
import datetime
import numpy as np  # needed for np.searchsorted in the main loop
import pandas as pd
# creating test data frames
df1 = pd.DataFrame({'A': range(9)})
df1.index = [pd.Timestamp('20130101 09:00:00'),
             pd.Timestamp('20130101 09:01:00'),
             pd.Timestamp('20130101 09:30:00'),
             pd.Timestamp('20130101 09:44:00'),
             pd.Timestamp('20130101 09:50:00'),
             pd.Timestamp('20130101 10:16:00'),
             pd.Timestamp('20130101 10:47:00'),
             pd.Timestamp('20130101 10:53:00'),
             pd.Timestamp('20130101 11:22:00')]
df2 = pd.DataFrame({'B': range(9)})
df2.index = [pd.Timestamp('20130101 09:00:00'),
             pd.Timestamp('20130101 09:01:00'),
             pd.Timestamp('20130101 09:04:00'),
             pd.Timestamp('20130101 09:05:00'),
             pd.Timestamp('20130101 09:09:00'),
             pd.Timestamp('20130101 10:10:00'),
             pd.Timestamp('20130101 10:15:00'),
             pd.Timestamp('20130101 10:16:00'),
             pd.Timestamp('20130101 11:18:00')]
db_dict = {"a": df1, "b": df2}
time_dict_start = {}
time_dict_end = {}
complete_list = []
start_time = datetime.datetime.now()
# starting the main loop
while True:
    # check if all data has been processed
    if len(complete_list) == len(db_dict):
        print(datetime.datetime.now() - start_time)
        break
    # iterate over every data frame
    for name in db_dict:
        # skip completed data frames
        if name in complete_list:
            continue
        db = db_dict[name]
        # first iteration
        if name not in time_dict_start:
            start = db.index[0]
            end = start + datetime.timedelta(seconds=10)
        # all other iterations
        else:
            start = time_dict_start[name]
            # get smallest time stamp
            time_list = [v for k, v in time_dict_end.items()]
            time_list.sort()
            end = time_list[0]
        time_dict_start[name] = end + datetime.timedelta(seconds=1)
        split = db.loc[start: end]
        try:
            # find next closest index
            next_idx = db.index[np.searchsorted(db.index, end + datetime.timedelta(seconds=1))]
            time_dict_end[name] = next_idx
        except IndexError:
            # no later time stamp left: this data frame is done
            del time_dict_end[name]
            complete_list.append(name)
        # do something with the sliced data frame
Answer:
Would it help to combine the data frames into a single long table sorted by time stamp? Here is one way to do that:
df1.index.name = 'time_stamp'
df1.columns.name = 'group'
df2.index.name = 'time_stamp'
df2.columns.name = 'group'
print(
    pd.concat((df1, df2), axis=1)
    .unstack()
    .loc[lambda x: x.notna()]
    .astype(int)
    .reset_index()
    .sort_values(['time_stamp', 'group'])
    .rename(columns={0: 'value'})
)
The first 5 rows are:
   group          time_stamp  value
0      A 2013-01-01 09:00:00      0
9      B 2013-01-01 09:00:00      0
1      A 2013-01-01 09:01:00      1
10     B 2013-01-01 09:01:00      1
11     B 2013-01-01 09:04:00      2
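If you would rather keep the data frames separate, the same idea (one shared, sorted list of time stamps) can probably replace the while loop. The following is only a rough sketch, not a drop-in replacement: `iter_time_slices` is a name I made up, it assumes every index is already sorted, and it treats each slice as "all rows after the previous shared time stamp, up to and including the current one", which is my reading of what the loop in the question does. The cut positions for each frame are computed once with `np.searchsorted`, so the loop body only does cheap `iloc` slicing.

```python
from functools import reduce

import numpy as np
import pandas as pd


def iter_time_slices(frames):
    """Yield (stamp, {name: slice}) for every time stamp found in any frame.

    Hypothetical helper: assumes each frame has a sorted DatetimeIndex.
    """
    # Sorted union of all time stamps across all frames.
    boundaries = reduce(lambda a, b: a.union(b),
                        (df.index for df in frames.values()))
    # For each frame, count the rows with index <= each boundary (one pass).
    cuts = {
        name: np.searchsorted(df.index.to_numpy(), boundaries.to_numpy(),
                              side="right")
        for name, df in frames.items()
    }
    for i, stamp in enumerate(boundaries):
        slices = {}
        for name, df in frames.items():
            lo = cuts[name][i - 1] if i > 0 else 0
            hi = cuts[name][i]
            # Rows in (previous boundary, stamp]; may be empty.
            slices[name] = df.iloc[lo:hi]
        yield stamp, slices


# Small demo with shortened versions of the test frames from the question.
frames = {
    "a": pd.DataFrame({"A": range(3)},
                      index=pd.to_datetime(["2013-01-01 09:00:00",
                                            "2013-01-01 09:01:00",
                                            "2013-01-01 09:30:00"])),
    "b": pd.DataFrame({"B": range(3)},
                      index=pd.to_datetime(["2013-01-01 09:00:00",
                                            "2013-01-01 09:01:00",
                                            "2013-01-01 09:04:00"])),
}

for stamp, slices in iter_time_slices(frames):
    # do something with the sliced data frames
    print(stamp, {name: len(part) for name, part in slices.items()})
```

A frame that has no row at a given time stamp simply yields an empty slice for that step, so you may want to skip empty slices inside the loop.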