So I have two data frames. The first data frame contains numerical data that is used to "score" the second data frame which contains simulation data.
df1 = base records
df2 = simulation records
Part 1: What I am trying to accomplish is to query df1 'base records' to find the row that has the most recent timestamp to that in the df2 'simulation records' where the "Name" & "Time" columns match exactly.
Part 2: Then I want to use an if then function to determine whether a value in the simulation record row fall between a range created using two values from the base record row and return a boolean.
low range = df1['Po']-df1['Ref']
high range = df1['Po'] df1['Ref']
if df2['Sim'] falls in between the low range & high range of its most recent df1 base record then I want to return true in the new column "Sim Score" otherwise return false
Part 3: I want to repeat Part 1 & Part 2 for each row in the simulation records.
helpful information:
- df1 (base records) have more or less rows than df2 (simulation records)
- df1 has more columns than df2
- some columns in df1 have the same name but different values in df2
- ideally want to be able to slice both dataframes where the if then function only sees the two rows used in the comparison
- previously accomplished this in google sheets with if then & query combination formula dragged down entire sheet (want to replace with python & pandas)
df1 base records example (columns that matter)
Timestamp Name Time Po Ref
7/11/2022 11:30:00 trial 20 mins 5 2
7/10/2022 04:00:00 trial 20 mins 4 4
7/09/2022 02:45:00 trial 20 mins 2 2
6/28/2022 03:45:00 trial 20 mins 3 6
df2 simulation records example (columns that matter)
Timestamp Name Time Sim
7/10/2022 05:15:00 trial 20 mins 7
7/11/2022 12:45:00 trial 20 mins 4
7/12/2022 03:30:00 trial 20 mins 8
desired result of new column added to df2
Timestamp Name Time Sim Sim Score
7/10/2022 05:15:00 trial 20 mins 7 True
7/11/2022 12:45:00 trial 20 mins 4 True
7/12/2022 03:30:00 trial 20 mins 8 False
CodePudding user response:
Because you don't provide code to construct the dataframe, I will sketch a potential solution:
First, I will assume that you have only one timestamp per day (which it looks like in your examples). Accordingly, I would truncate or split the timestamp to only have the date in one column. This is done so we can join the dataframes based on the date, i.e. use set_index("date_column")
for both dataframes (use an inner-join to only keep the rows where the date was present in both dataframes). Finally, you can use apply()
to check your condition:
df_joined['Sim Score'] = df_joined.apply(lambda row: (row['Po']-row['Ref'] <= row['Sim']) and (row['Po'] row['Ref'] >= row['Sim']), axis = 1)
CodePudding user response:
Use pandas.DataFrame.reindex
, its method
offer nearest to find the computable index(e.g., string can not calculate distance)
Or use merge_asof
, its direction
offer nearest.
Method 1:
reindex()
with method='nearest'
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] df1['Ref']
print(df1)
###
Name Time Po Ref l_r h_r
Timestamp
2022-07-11 11:30:00 trial 20 mins 5 2 3 7
2022-07-10 04:00:00 trial 20 mins 4 4 0 8
2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
df2.set_index('Timestamp', inplace=True)
print(df2)
###
Name Time Sim
Timestamp
2022-07-10 05:15:00 trial 20 mins 7
2022-07-11 12:45:00 trial 20 mins 4
2022-07-12 03:30:00 trial 20 mins 8
temp = pd.merge(df2, df1.reindex(df2.index, method='nearest').reset_index(), on='Timestamp')
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
df2.reset_index(inplace=True)
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Method 2:
merge_asof()
with direction='nearest'
this way is not executed with indexed value, therefore we don't have to set column Timestamp
as index. But it needs binding objects(in this case we merge on column Timestamp
)sorted.
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
# df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
3 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
2 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
1 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
# df2.set_index('Timestamp', inplace=True)
df2.sort_values(by='Timestamp', inplace=True)
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
temp = pd.merge_asof(df2 ,df1[['Timestamp', 'l_r', 'h_r']], on='Timestamp', direction='nearest')
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Frankly speaking, working on indexed things would be faster if you have a large dataset.
CodePudding user response:
You can do it via padasql
:
But note that you better add a unique constraint to one of the columns (e.g. a number of trial)
from pandasql import sqldf
df3 = sqldf('''
select df2.Timestamp, df2.Name, df2.Time, df2.Sim,
case
when Sim >= (df1.Po - df1.Ref) and Sim <= (df1.Po df1.Ref) then 'True'
when Sim < (df1.Po - df1.Ref) or Sim > (df1.Po df1.Ref) then 'False'
end as 'Sim Score'
from df1, df2
where df2.Name == df1.Name and df2.Time == df1.Time
''')