I have two data frames:
Data frame one has a timestamp, a factor (the amount of power generated), and a location.
Data frame two has an event (the amount of rain), a timestamp for the start of the event, a timestamp for the end of the event, and a location.
I want to include, in the first data frame, a column for the amount of rain that was falling when a certain amount of power was generated.
I was able to create small data frames and run a test with the following code:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'factor': ['2', '3', '4', '5', '6', '7'],
                    'timestamp': ['2022-12-01 10:00:00', '2022-12-01 10:05:00',
                                  '2022-12-01 10:15:00', '2022-12-01 10:20:00',
                                  '2022-12-15 13:00:00', '2022-12-20 06:00:00'],
                    'location': ['a', 'b', 'c', 'd', 'a', 'd']
                    })
df2 = pd.DataFrame({'event': ['2', '3', '4', '5', '6', '7'],
                    'time_start': ['2022-12-01 9:00:00', '2022-12-02 10:05:00',
                                   '2022-12-01 8:15:00', '2022-12-01 9:20:00',
                                   '2022-12-25 10:00:00', '2022-12-20 05:00:00'],
                    'time_end': ['2022-12-01 16:00:00', '2022-12-02 10:15:00',
                                 '2022-12-01 20:15:00', '2022-12-01 20:20:00',
                                 '2022-12-25 13:00:00', '2022-12-20 06:30:00'],
                    'location': ['a', 'b', 'c', 'd', 'b', 'c']
                    })

# Parse the string columns as datetimes
df1['timestamp'] = pd.to_datetime(df1['timestamp'])
df2['time_start'] = pd.to_datetime(df2['time_start'])
df2['time_end'] = pd.to_datetime(df2['time_end'])

# Pair every df1 row with every df2 row for the same location, then keep the
# event only where the timestamp falls inside the rain window
df3 = df1.merge(df2, how='outer', on='location')
df3['quantity_rain'] = df3['event'].where(df3['timestamp'].between(df3['time_start'], df3['time_end']))
df3 = df3.replace(np.nan, 0)
but when I run the code with my larger data frames, the kernel restarts because I am using too much RAM.
This occurs when I try to merge the two data frames with df3 = df1.merge(df2, how='outer', on="location").
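(As far as I understand, the merge pairs every df1 row with every df2 row that shares a location, so the result can be far larger than either input; this is a rough way to estimate its size before merging:)
# Rough size estimate for df1.merge(df2, on='location'):
# each location contributes (its rows in df1) * (its rows in df2) pairs.
# An outer merge adds one more row for each unmatched row on either side.
pair_counts = df1['location'].value_counts() * df2['location'].value_counts()
print(int(pair_counts.dropna().sum()))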
I was trying to find a way around this and read that I should try SQL. I figured I could do the merge in SQL, convert the merged table back to a pandas data frame, and then proceed as usual, but I am not sure how to do that (or even whether that's the best way to go about it). When I run my code I get the error
* sqlite://(sqlite3.OperationalError) no such table: df1
My code is below:
%load_ext sql
%sql sqlite://
import sqlite3
conn = sqlite3.connect('test_database')
c = conn.cursor()
# Converting dataframes to SQL tables
df1.to_sql('df1_SQL', conn, if_exists='replace', index = False)
df2.to_sql('df1_SQL', conn, if_exists='replace', index = False)
# Merging tables
%sql SELECT * FROM df1 JOIN df2 USING (location)
Is there a way to do this with less RAM in Python? If not, is SQL the way to go, and how can I fix my code?
CodePudding user response:
To overcome the memory overflow problem, here is one approach. Create an interval index from the start and end columns of df2, then create a mapping series by setting the index of df2 to location along with the interval index. Now group df1 by location and, for each location, map the timestamp values to the corresponding events using the mapping series; finally, concat all the groups back together.
# Interval index over the rain windows, inclusive of both endpoints
ix = pd.IntervalIndex.from_arrays(df2['time_start'], df2['time_end'], closed='both')

# Mapping series: (location, interval) -> event
event = df2.set_index(['location', ix])['event']

# Per location, look up each timestamp in that location's intervals,
# then stitch the groups back together
pd.concat([
    g.assign(event=g['timestamp'].map(event.loc[k]))
    for k, g in df1.groupby('location', sort=False)
])
After concat, the result will be:
factor timestamp location event
0 2 2022-12-01 10:00:00 a 2
4 6 2022-12-15 13:00:00 a NaN
1 3 2022-12-01 10:05:00 b NaN
2 4 2022-12-01 10:15:00 c 4
3 5 2022-12-01 10:20:00 d 5
5 7 2022-12-20 06:00:00 d NaN
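If you want 0 instead of NaN in the no-rain rows, as in your original attempt, assign the concatenated result to a variable (say result, a name chosen here) and finish with result['event'] = result['event'].fillna(0).
As a side note on the SQL attempt in the question: the error happens because %sql sqlite:// opens an in-memory database that is separate from the sqlite3.connect('test_database') connection, both to_sql calls write to the same table name df1_SQL (the second overwrites the first), and the query then asks for tables df1 and df2, which were never created. If you still want to try the SQL route, a minimal sketch that keeps everything on one connection (the table names here are illustrative) would be:
import sqlite3
import pandas as pd

conn = sqlite3.connect('test_database')  # one connection for everything

# Write each frame to its own table
df1.to_sql('df1', conn, if_exists='replace', index=False)
df2.to_sql('df2', conn, if_exists='replace', index=False)

# Join in SQL and read the result straight back into a pandas data frame
merged = pd.read_sql_query('SELECT * FROM df1 JOIN df2 USING (location)', conn)
Keep in mind, though, that the joined table still has to fit in memory once it is read back into pandas, which is exactly what the interval-index approach above avoids by never materializing the full join.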