Home > Mobile >  How can I add a new column based on two dataframes and conditions
How can I add a new column based on two dataframes and conditions

Time:03-03

How can I add a new column based on two dataframes and conditions? For example, if df2['x'] is between df1['x']±2.5 and df2['y'] is between df1['y']±2.5, give 1 otherwise 0.

import pandas as pd
data = {'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1 ], 'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1], 'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004]}   
df = pd.DataFrame(data)
df              

     x        y     year
0   40.1    100.1   2000
1   50.1    110.1   2000
2   60.1    120.1   2001
3   70.1    130.1   2001
4   80.1    140.1   2003
5   90.1    150.1   2003
6   0.0     160.1   2003
7   300.1   400.1   2004

df2

data2 = {'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1], 'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5], 'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004]}  
df2 = pd.DataFrame(data2)
df2

     x        y     year
0   92.2    149.3   1950
1   30.1    100.1   1951
2   82.6    139.4   1952
3   51.1    111.1   2000
4   39.4    100.8   2000
5   10.1    180.1   1954
6   0.0     0.0     1955
7   299.1   402.5   2004

Output: df

new_col = []
for i in df.index:
if ((df['x'].iloc[i] - 2.5) < df2['x'].iloc[i] < (df['x'].iloc[i]   2.5) and 
    (df['y'].iloc[i] - 2.5) < df2['y'].iloc[i] < (df['y'].iloc[i]   2.5) and 
    df['year'].iloc[i] == df2['year'].iloc[i]):
    out = 1
else:
    out = 0
       
if out == 1:
    new_coll.append(1)
else: 
    new_col.append(0)
df['Result'] = new_col
df
            
      x       y     year   Result
0   40.1    100.1   2000    0
1   50.1    110.1   2000    0
2   60.1    120.1   2001    0
3   70.1    130.1   2001    0
4   80.1    140.1   2003    0
5   90.1    150.1   2003    0
6   0.0     160.1   2003    0
7   300.1   400.1   2004    1

But the output is not correct in terms of what i want. It just compare row by row. I want to find: Is the first row in df inside df2 according to conditions? It means check all rows in df2 for each row in df. So the expected output should be as below:

Expected output: df

As you can see, 3 rows satisfy the conditions:
0 in df --> 4 in df2
1 in df --> 3 in df2
7 in df --> 7 in df2
    
So expected output:

     x        y     year   Result
0   40.1    100.1   2000    1
1   50.1    110.1   2000    1
2   60.1    120.1   2001    0
3   70.1    130.1   2001    0
4   80.1    140.1   2003    0
5   90.1    150.1   2003    0
6   0.0     160.1   2003    0
7   300.1   400.1   2004    1

CodePudding user response:

You can loop through each DataFrame and check for all combinations.

for index, row in df.iterrows():
    for index2, row2 in df2.iterrows():
        if  (row['x']-2.5 < row2['x']  < row['x'] 2.5) and (row['y']-2.5 < row2['y']  < row['y'] 2.5):
            print(index,index2)
            df.loc[index, 'Result'] = 1

CodePudding user response:

This is the alternative solution with Pandas vectorization. If your dataframe is small, you won't get much performance burden from for loop, however, for scalability and for Pandas best practice perspective, you can take a look at the vectorization in Pandas.

You can first merge the 2 dataframes with cross so that the we can check the conditions in df1 per df2's row.

dfa = df.merge(df2, how='cross', suffixes=('1', '2'))

Then, apply the conditions.

dfa['Result'] = (dfa.x2 > dfa.x1 - 2.5) & 
                (dfa.x2 < dfa.x1   2.5) & 
                (dfa.y2 > dfa.y1 - 2.5) & 
                (dfa.y2 < dfa.y1   2.5) & 
                (dfa.year1 == dfa.year2)

Finally, you group by the df's x, y, year (x1, y1, year1) and return True if any row's Result is True.

# any() returns True if there is at least 1 True in Result per group.
dfa = dfa.groupby(['x1', 'y1', 'year1']).Result.any().astype(int).reset_index()

Result

      x1     y1  year1  Result
0    0.0  160.1   2003       0
1   40.1  100.1   2000       1
2   50.1  110.1   2000       1
3   60.1  120.1   2001       0
4   70.1  130.1   2001       0
5   80.1  140.1   2003       0
6   90.1  150.1   2003       0
7  300.1  400.1   2004       1
  • Related