Home > Software engineering >  Query for one dataframe row based on row in another dataframe & compare values
Query for one dataframe row based on row in another dataframe & compare values

Time:08-14

So I have two data frames. The first data frame contains numerical data that is used to "score" the second data frame which contains simulation data.

df1 = base records

df2 = simulation records

Part 1: What I am trying to accomplish is to query df1 'base records' to find the row that has the most recent timestamp to that in the df2 'simulation records' where the "Name" & "Time" columns match exactly.

Part 2: Then I want to use an if then function to determine whether a value in the simulation record row fall between a range created using two values from the base record row and return a boolean.

low range = df1['Po']-df1['Ref']

high range = df1['Po'] df1['Ref']

if df2['Sim'] falls in between the low range & high range of its most recent df1 base record then I want to return true in the new column "Sim Score" otherwise return false

Part 3: I want to repeat Part 1 & Part 2 for each row in the simulation records.

helpful information:

  • df1 (base records) have more or less rows than df2 (simulation records)
  • df1 has more columns than df2
  • some columns in df1 have the same name but different values in df2
  • ideally want to be able to slice both dataframes where the if then function only sees the two rows used in the comparison
  • previously accomplished this in google sheets with if then & query combination formula dragged down entire sheet (want to replace with python & pandas)
df1 base records example (columns that matter)

Timestamp            Name     Time     Po  Ref

7/11/2022 11:30:00   trial   20 mins   5   2

7/10/2022 04:00:00   trial   20 mins   4   4

7/09/2022 02:45:00   trial   20 mins   2   2

6/28/2022 03:45:00   trial   20 mins   3   6
df2 simulation records example (columns that matter)

Timestamp             Name     Time     Sim

7/10/2022 05:15:00    trial   20 mins   7

7/11/2022 12:45:00    trial   20 mins   4

7/12/2022 03:30:00    trial   20 mins   8
desired result of new column added to df2

Timestamp             Name     Time     Sim  Sim Score

7/10/2022 05:15:00    trial   20 mins   7    True

7/11/2022 12:45:00    trial   20 mins   4    True

7/12/2022 03:30:00    trial   20 mins   8    False

CodePudding user response:

Because you don't provide code to construct the dataframe, I will sketch a potential solution:

First, I will assume that you have only one timestamp per day (which it looks like in your examples). Accordingly, I would truncate or split the timestamp to only have the date in one column. This is done so we can join the dataframes based on the date, i.e. use set_index("date_column") for both dataframes (use an inner-join to only keep the rows where the date was present in both dataframes). Finally, you can use apply() to check your condition:

df_joined['Sim Score'] = df_joined.apply(lambda row: (row['Po']-row['Ref'] <= row['Sim']) and (row['Po'] row['Ref'] >= row['Sim']), axis = 1)

CodePudding user response:

Use pandas.DataFrame.reindex, its method offer nearest to find the computable index(e.g., string can not calculate distance)

Or use merge_asof, its direction offer nearest.





Method 1:

reindex() with method='nearest'

df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po']   df1['Ref']
print(df1)
###
                      Name     Time  Po  Ref  l_r  h_r
Timestamp                                             
2022-07-11 11:30:00  trial  20 mins   5    2    3    7
2022-07-10 04:00:00  trial  20 mins   4    4    0    8
2022-07-09 02:45:00  trial  20 mins   2    2    0    4
2022-06-28 03:45:00  trial  20 mins   3    6   -3    9
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
df2.set_index('Timestamp', inplace=True)
print(df2)
###
                      Name     Time  Sim
Timestamp                               
2022-07-10 05:15:00  trial  20 mins    7
2022-07-11 12:45:00  trial  20 mins    4
2022-07-12 03:30:00  trial  20 mins    8
temp = pd.merge(df2, df1.reindex(df2.index, method='nearest').reset_index(), on='Timestamp')
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
df2.reset_index(inplace=True)
print(df2)
###
            Timestamp   Name     Time  Sim  Sim Score
0 2022-07-10 05:15:00  trial  20 mins    7       True
1 2022-07-11 12:45:00  trial  20 mins    4       True
2 2022-07-12 03:30:00  trial  20 mins    8      False




Method 2:

merge_asof() with direction='nearest' this way is not executed with indexed value, therefore we don't have to set column Timestamp as index. But it needs binding objects(in this case we merge on column Timestamp)sorted.

df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
# df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po']   df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
            Timestamp   Name     Time  Po  Ref  l_r  h_r
3 2022-06-28 03:45:00  trial  20 mins   3    6   -3    9
2 2022-07-09 02:45:00  trial  20 mins   2    2    0    4
1 2022-07-10 04:00:00  trial  20 mins   4    4    0    8
0 2022-07-11 11:30:00  trial  20 mins   5    2    3    7
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
# df2.set_index('Timestamp', inplace=True)
df2.sort_values(by='Timestamp', inplace=True)
print(df2)
###
            Timestamp   Name     Time  Sim
0 2022-07-10 05:15:00  trial  20 mins    7
1 2022-07-11 12:45:00  trial  20 mins    4
2 2022-07-12 03:30:00  trial  20 mins    8

temp = pd.merge_asof(df2 ,df1[['Timestamp', 'l_r', 'h_r']], on='Timestamp', direction='nearest')
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
            Timestamp   Name     Time  Sim  Sim Score
0 2022-07-10 05:15:00  trial  20 mins    7       True
1 2022-07-11 12:45:00  trial  20 mins    4       True
2 2022-07-12 03:30:00  trial  20 mins    8      False

Frankly speaking, working on indexed things would be faster if you have a large dataset.

CodePudding user response:

You can do it via padasql: But note that you better add a unique constraint to one of the columns (e.g. a number of trial)

from pandasql import sqldf

df3 = sqldf('''
    select df2.Timestamp, df2.Name, df2.Time, df2.Sim,
    
    case
        when Sim >= (df1.Po - df1.Ref) and Sim <= (df1.Po   df1.Ref) then 'True'
        when Sim < (df1.Po - df1.Ref) or Sim > (df1.Po   df1.Ref) then 'False'
    end as 'Sim Score'
    
    from df1, df2
    
    where df2.Name == df1.Name and df2.Time == df1.Time        
''')
  • Related