How can I add a new column based on two dataframes and conditions? For example, if df2['x'] is between df['x'] ± 2.5 and df2['y'] is between df['y'] ± 2.5, give 1, otherwise 0.
import pandas as pd
data = {'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1 ], 'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1], 'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004]}
df = pd.DataFrame(data)
df
x y year
0 40.1 100.1 2000
1 50.1 110.1 2000
2 60.1 120.1 2001
3 70.1 130.1 2001
4 80.1 140.1 2003
5 90.1 150.1 2003
6 0.0 160.1 2003
7 300.1 400.1 2004
data2 = {'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1], 'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5], 'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004]}
df2 = pd.DataFrame(data2)
df2
x y year
0 92.2 149.3 1950
1 30.1 100.1 1951
2 82.6 139.4 1952
3 51.1 111.1 2000
4 39.4 100.8 2000
5 10.1 180.1 1954
6 0.0 0.0 1955
7 299.1 402.5 2004
My attempt and its output:
new_col = []
for i in df.index:
    # row-by-row comparison of df and df2
    if ((df['x'].iloc[i] - 2.5) < df2['x'].iloc[i] < (df['x'].iloc[i] + 2.5) and
            (df['y'].iloc[i] - 2.5) < df2['y'].iloc[i] < (df['y'].iloc[i] + 2.5) and
            df['year'].iloc[i] == df2['year'].iloc[i]):
        new_col.append(1)
    else:
        new_col.append(0)
df['Result'] = new_col
df
x y year Result
0 40.1 100.1 2000 0
1 50.1 110.1 2000 0
2 60.1 120.1 2001 0
3 70.1 130.1 2001 0
4 80.1 140.1 2003 0
5 90.1 150.1 2003 0
6 0.0 160.1 2003 0
7 300.1 400.1 2004 1
But the output is not what I want: it just compares row by row. What I want to know is whether a given row in df is inside df2 according to the conditions, i.e. for each row in df, check all rows in df2. So the expected output should be as below:
As you can see from the data, three pairs of rows satisfy the conditions:
0 in df --> 4 in df2
1 in df --> 3 in df2
7 in df --> 7 in df2
So the expected output is:
x y year Result
0 40.1 100.1 2000 1
1 50.1 110.1 2000 1
2 60.1 120.1 2001 0
3 70.1 130.1 2001 0
4 80.1 140.1 2003 0
5 90.1 150.1 2003 0
6 0.0 160.1 2003 0
7 300.1 400.1 2004 1
CodePudding user response:
You can loop through each DataFrame and check for all combinations.
df['Result'] = 0  # rows of df with no match in df2 keep 0
for index, row in df.iterrows():
    for index2, row2 in df2.iterrows():
        if ((row['x'] - 2.5 < row2['x'] < row['x'] + 2.5) and
                (row['y'] - 2.5 < row2['y'] < row['y'] + 2.5) and
                row['year'] == row2['year']):
            print(index, index2)
            df.loc[index, 'Result'] = 1
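Note that Result is initialized to 0 before the loop so that rows without any match keep 0, and that the year comparison is part of the condition: without it, row 5 of df (90.1, 150.1, year 2003) would be flagged because it is within ±2.5 of row 0 of df2 (92.2, 149.3) even though the years differ. This nested iterrows approach checks every (df, df2) pair, which is fine for small frames but grows as len(df) × len(df2).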
CodePudding user response:
This is an alternative solution using pandas vectorization. If your dataframes are small, a for loop is not much of a performance burden, but for scalability, and from a pandas best-practice perspective, it is worth looking at vectorization.
You can first merge the two dataframes with how='cross' so that the conditions can be checked for every row of df against every row of df2.
dfa = df.merge(df2, how='cross', suffixes=('1', '2'))
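After the cross merge (how='cross' requires pandas 1.2+), dfa has len(df) * len(df2) = 64 rows, with df's columns suffixed as x1, y1, year1 and df2's as x2, y2, year2, so every pair of rows can be compared at once.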
Then, apply the conditions.
dfa['Result'] = ((dfa.x2 > dfa.x1 - 2.5) &
                 (dfa.x2 < dfa.x1 + 2.5) &
                 (dfa.y2 > dfa.y1 - 2.5) &
                 (dfa.y2 < dfa.y1 + 2.5) &
                 (dfa.year1 == dfa.year2))
Finally, group by df's x, y, year columns (x1, y1, year1) and return True if any row's Result within the group is True.
# any() returns True if there is at least 1 True in Result per group.
dfa = dfa.groupby(['x1', 'y1', 'year1']).Result.any().astype(int).reset_index()
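Note that groupby sorts by the grouping keys by default, which is why the rows below come out ordered by x1 rather than by df's original index.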
Result
x1 y1 year1 Result
0 0.0 160.1 2003 0
1 40.1 100.1 2000 1
2 50.1 110.1 2000 1
3 60.1 120.1 2001 0
4 70.1 130.1 2001 0
5 80.1 140.1 2003 0
6 90.1 150.1 2003 0
7 300.1 400.1 2004 1
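If you would rather have the 0/1 flags attached to df in its original row order, one option (a minimal sketch built on the same cross-merged dfa and Result column as above, run in place of the final groupby) is to aggregate over blocks of the merged index:
# Each block of len(df2) consecutive rows in the cross-merged frame corresponds
# to one row of df, because how='cross' pairs df row i with every row of df2
# before moving on to df row i + 1.
df['Result'] = dfa.groupby(dfa.index // len(df2))['Result'].any().astype(int)
The grouped flags come back indexed 0..len(df)-1, so the assignment aligns each flag with the df row it came from.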