Iterating over multiple data frames row by row - any ways to increase speed?

I have 19 data frames with datetime indexes and I want to iterate over all of them in parallel. So I start with one df, slice it to a given time range, and do the same with the others. This completes one iteration of the while loop. During the next iteration, I want to create a new slice, starting at the end of the old slice and running up to the next-closest timestamp across all data frames. I came up with this code, and it works, but due to the high number of iterations it is quite time-consuming, and I wonder if there is a faster way to do this.

import pandas as pd
import numpy as np
import datetime

# creating test data frames
df1 = pd.DataFrame({'A': range(9)})
df1.index = [pd.Timestamp('20130101 09:00:00'),
             pd.Timestamp('20130101 09:01:00'),
             pd.Timestamp('20130101 09:30:00'),
             pd.Timestamp('20130101 09:44:00'),
             pd.Timestamp('20130101 09:50:00'),
             pd.Timestamp('20130101 10:16:00'),
             pd.Timestamp('20130101 10:47:00'),
             pd.Timestamp('20130101 10:53:00'),
             pd.Timestamp('20130101 11:22:00')]

df2 = pd.DataFrame({'B': range(9)})
df2.index = [pd.Timestamp('20130101 09:00:00'),
             pd.Timestamp('20130101 09:01:00'),
             pd.Timestamp('20130101 09:04:00'),
             pd.Timestamp('20130101 09:05:00'),
             pd.Timestamp('20130101 09:09:00'),
             pd.Timestamp('20130101 10:10:00'),
             pd.Timestamp('20130101 10:15:00'),
             pd.Timestamp('20130101 10:16:00'),
             pd.Timestamp('20130101 11:18:00')]

db_dict = {"a": df1, "b": df2}


time_dict_start = {}
time_dict_end = {}
complete_list = []
start_time = datetime.datetime.now()

# starting the main loop
while True:
    # check if all data has been processed
    if len(complete_list) == len(db_dict):
        print(datetime.datetime.now() - start_time)
        break
    
    # iterate over every data frame
    for name in db_dict:
        
        # skip completed data frames
        if name in complete_list:
            continue

        db = db_dict[name]
        
        # first iteration
        if name not in time_dict_start:
            start = db.index[0]
            end = start + datetime.timedelta(seconds=10)
        # all other iterations
        else:
            start = time_dict_start[name]
            # next slice ends at the smallest pending timestamp across all data frames
            end = min(time_dict_end.values())

        time_dict_start[name] = end + datetime.timedelta(seconds=1)

        split = db.loc[start: end]

        try:
            # find next closest index
            next_idx = db.index[np.searchsorted(db.index, end + datetime.timedelta(seconds=1))]
            time_dict_end[name] = next_idx
        except IndexError:
            del time_dict_end[name]
            complete_list.append(name)

        # do something with the sliced data frame

CodePudding user response:

Would it help to combine the data frames? For example, here is one way to combine them into a single long frame sorted by timestamp:

df1.index.name = 'time_stamp'
df1.columns.name = 'group'

df2.index.name = 'time_stamp'
df2.columns.name = 'group'

print(
    pd.concat((df1, df2), axis=1)
    .unstack()
    .loc[lambda x: x.notna()]
    .astype(int)
    .reset_index()
    .sort_values(['time_stamp', 'group'])
    .rename(columns={0: 'value'})
)

The first 5 rows are:

   group          time_stamp  value
0      A 2013-01-01 09:00:00      0
9      B 2013-01-01 09:00:00      0
1      A 2013-01-01 09:01:00      1
10     B 2013-01-01 09:01:00      1
11     B 2013-01-01 09:04:00      2
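Once the data is in this long format, a single pass in timestamp order can replace the manual while loop. A minimal self-contained sketch, using a shortened two-frame setup (three rows each) and the same reshaping as above; the `groupby` step is an assumption about what "do something with the sliced data frame" needs:

```python
import pandas as pd

# shortened versions of df1/df2 from the question (three rows each)
df1 = pd.DataFrame({'A': range(3)},
                   index=pd.to_datetime(['20130101 09:00:00',
                                         '20130101 09:01:00',
                                         '20130101 09:30:00']))
df2 = pd.DataFrame({'B': range(3)},
                   index=pd.to_datetime(['20130101 09:00:00',
                                         '20130101 09:04:00',
                                         '20130101 09:05:00']))

for df in (df1, df2):
    df.index.name = 'time_stamp'
    df.columns.name = 'group'

# one long frame, one row per (group, timestamp), sorted by timestamp
combined = (
    pd.concat((df1, df2), axis=1)
    .unstack()
    .loc[lambda x: x.notna()]
    .astype(int)
    .reset_index()
    .sort_values(['time_stamp', 'group'])
    .rename(columns={0: 'value'})
)

# each group holds every row that shares one timestamp,
# visited in chronological order
for ts, rows in combined.groupby('time_stamp'):
    pass  # process the rows for this timestamp
```

This avoids re-sorting the pending end-timestamps on every iteration, since pandas sorts the combined frame once up front.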