Efficiently identify an event that occurs between a beginning and ending time stamp


I have two data frames:

Data frame one has a timestamp, a factor (the amount of power generated), and a location.

Data frame two has an event (the amount of rain), a timestamp for the start of the event, a timestamp for the end of the event, and a location.

I want to add a column to the first data frame containing the amount of rain that was falling (at the same location) when each amount of power was generated.

I was able to create small data frames and run a test with the following code:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'factor': ['2','3','4','5','6','7'],
                    'timestamp':['2022-12-01 10:00:00','2022-12-01 10:05:00',
                                 '2022-12-01 10:15:00','2022-12-01 10:20:00',
                                 '2022-12-15 13:00:00','2022-12-20 06:00:00'],
                    'location':['a','b','c','d','a','d']
                    })

df2 = pd.DataFrame({'event': ['2','3','4','5','6','7'],
                    'time_start':['2022-12-01 9:00:00','2022-12-02 10:05:00',
                                  '2022-12-01 8:15:00','2022-12-01 9:20:00',
                                  '2022-12-25 10:00:00','2022-12-20 05:00:00'],
                    'time_end':['2022-12-01 16:00:00','2022-12-02 10:15:00',
                                '2022-12-01 20:15:00','2022-12-01 20:20:00',
                                '2022-12-25 13:00:00','2022-12-20 06:30:00'],
                    'location':['a','b','c','d','b','c']
                    })

df1['timestamp'] = pd.to_datetime(df1['timestamp'])

df2['time_start'] = pd.to_datetime(df2['time_start'])
df2['time_end'] = pd.to_datetime(df2['time_end'])

df3 = df1.merge(df2, how='outer', on="location")

df3['quantity_rain'] = df3['event'].where(df3['timestamp'].between(df3['time_start'], df3['time_end']))
df3 = df3.replace(np.nan, 0)

but when I run the code with my larger data frames, the kernel restarts because I am using too much RAM.

This occurs when I try to merge the two data frames with df3 = df1.merge(df2, how='outer', on="location").

While trying to find a way around this, I read that I should try SQL. I figured I could do the merge in SQL, convert the merged table back to a pandas data frame, and then proceed as usual, but I am not sure how to do that (or even whether that is the best way to go about things); a sketch of the read-back step I have in mind is shown after my code below. When I run my code I get the error: sqlite:// (sqlite3.OperationalError) no such table: df1

My code is below:

%load_ext sql
%sql sqlite://

import sqlite3

conn = sqlite3.connect('test_database')
c = conn.cursor()

# Converting dataframes to SQL tables
df1.to_sql('df1_SQL', conn, if_exists='replace', index = False)
df2.to_sql('df2_SQL', conn, if_exists='replace', index = False)

# Merging tables
%sql SELECT * FROM df1 JOIN df2 USING (location)
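
For reference, this is a minimal sketch of the read-back step I mean, assuming both frames are written to the same sqlite3 connection under the table names the query actually references (df1_sql and df2_sql are illustrative names) and the join result is loaded back with pd.read_sql_query:

import sqlite3
import pandas as pd

conn = sqlite3.connect('test_database')

# Write each frame to its own table (table names are illustrative)
df1.to_sql('df1_sql', conn, if_exists='replace', index=False)
df2.to_sql('df2_sql', conn, if_exists='replace', index=False)

# Run the join in SQLite and load the result back into a pandas data frame
merged = pd.read_sql_query(
    'SELECT * FROM df1_sql JOIN df2_sql USING (location)', conn
)

Note that an unrestricted join loaded back into pandas is the same size as the pandas merge, so this alone may not reduce memory usage.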

Is there a way to do this with less RAM in Python? If not, is SQL the way to go, and how can I fix my code?

CodePudding user response:

To overcome the memory overflow problem, here is one approach.

Create an interval index from the start and end columns of df2, then create a mapping series by setting the index of df2 to location together with the interval index. Now group df1 by location and, for each location, map the timestamp values to the corresponding events using the mapping series; finally, concat all the groups together.

# Interval index over each event's [time_start, time_end] window
ix = pd.IntervalIndex.from_arrays(df2['time_start'], df2['time_end'], closed='both')
# Mapping series: (location, interval) -> event
event = df2.set_index(['location', ix])['event']

# For each location, map timestamps to the event whose interval contains them
pd.concat([
    g.assign(event=g['timestamp'].map(event.loc[k]))
    for k, g in df1.groupby('location', sort=False)
])

After the concat, the result will be:

  factor           timestamp location event
0      2 2022-12-01 10:00:00        a     2
4      6 2022-12-15 13:00:00        a   NaN
1      3 2022-12-01 10:05:00        b   NaN
2      4 2022-12-01 10:15:00        c     4
3      5 2022-12-01 10:20:00        d     5
5      7 2022-12-20 06:00:00        d   NaN
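
As in the original attempt, the unmatched rows (where no rain event covers the timestamp) come out as NaN and can be filled with 0 afterwards, for example (the result variable name is illustrative):

result = pd.concat([
    g.assign(event=g['timestamp'].map(event.loc[k]))
    for k, g in df1.groupby('location', sort=False)
])
result['event'] = result['event'].fillna(0)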