Home > Enterprise >  How to compare two dataframes with timestamps and create a dictionary of dataframes ? Python Pandas
How to compare two dataframes with timestamps and create a dictionary of dataframes ? Python Pandas

Time:12-05

I would like to retrieve each row of dataframe 2 that are included in the timestamps of dataframe 1. The timestamps of dataframe 1 go in pairs (11 being the beginning and 12 the end).

The goal is to create a dictionary of dataframes to plot each curve.

df1
Timestamp               Num
2021-01-01 08:00:00     11
2021-01-01 09:00:00     12
2021-01-01 10:00:00     11
2021-01-01 11:00:00     12
2021-01-01 12:00:00     11
2021-01-01 13:00:00     12

And

df2                     Value
2021-01-01 07:30:00     66
2021-01-01 08:30:00     67
2021-01-01 08:45:00     67
2021-01-01 09:15:00     64
2021-01-01 10:30:00     65
2021-01-01 10:30:00     61
2021-01-01 10:45:00     68
2021-01-01 11:15:00     60
2021-01-01 12:30:00     66
2021-01-01 12:30:00     67
2021-01-01 12:45:00     67

I thought of making a mask, but it works only for one couple of timestamp of dataframe 1. I have the idea, but I can't write it in python:

start = df1.iloc[::2, :]["Timestamp"] #to have each 11
end = df2.iloc[1::2, :]["Timestamp"] ##to have each 12

for each (start, end) in df2 : 
   create a df

dict = dict.append(df)

The final result must be:

Dict:

final_df1               Value
2021-01-01 08:30:00     67
2021-01-01 08:45:00     67

final_df2               Value
2021-01-01 10:30:00     65
2021-01-01 10:30:00     61
2021-01-01 10:45:00     68

final_df3               Value
2021-01-01 12:30:00     66
2021-01-01 12:30:00     67
2021-01-01 12:45:00     67

I tried this :

df_12 = df_1.iloc[1::2, :]["Timestamp"]
df_11 = df_1.iloc[::2, :]["Timestamp"]

df_12 = pd.DataFrame(df_12)
df_11 = pd.DataFrame(df_11)

for row in df_11.iterrows():
    for row in df_12.iterrows():
        mask_Filtered = (df_2['Timestamp']>= df_11) & (df_2['Timestamp'<=df_12)      <-----
        df_2 = df_2.loc[mask_Filtered]
dict.append(df_2)

MemoryError: Unable to allocate 46.0 GiB for an array with shape (1090629, 5657) and data type float64

CodePudding user response:

Is it what you are looking for?

intervals = list(zip(df1[::2]['Timestamp'], df1[1::2]['Timestamp']))
bins = pd.IntervalIndex.from_tuples(intervals)
groups = df2.groupby(pd.cut(df2['Timestamp'], bins=bins))

dfs = {}
for idx, (_, df) in enumerate(groups, 1):
    dfs[f"final_df{idx}"] = df
    # or process individually here to avoid MemoryError

Output:

>>> dfs
{'final_df1':             
            Timestamp  Num
1 2021-01-01 08:30:00   67
2 2021-01-01 08:45:00   67,

 'final_df2':
            Timestamp  Num
4 2021-01-01 10:30:00   65
5 2021-01-01 10:30:00   61
6 2021-01-01 10:45:00   68,

'final_df3':
             Timestamp  Num
8  2021-01-01 12:30:00   66
9  2021-01-01 12:30:00   67
10 2021-01-01 12:45:00   67}
  • Related