I have two dataframes, df
and df2
. I want to include on df
an extra column on df
called names
. This column should contain the name of the row of df
that is defined on df2
based on the conjunction of the group names of df and df2 and when this condition is met: (df[val2] >= df2[val] >= df[val2])
.
Therefore, given this inputs:
df
val1 val2 group
0 1 2 a1
1 3 4 a1
2 10 12 a1
3 23 25 a1
4 1 2 a2
5 3 4 a2
6 10 12 a2
7 23 25 a2
df2
val name group
0 1 abc a1
1 4 def a1
2 12 ghi a1
3 26 jkl a1
4 2 p1 a2
5 4 p2 a2
6 12 p3 a2
7 26 p4 a2
The desired output would be:
df
val1 val2 group name
0 1 2 a1 abc
1 3 4 a1 def
2 10 12 a1 ghi
3 23 25 a1
4 1 2 a2 p1
5 3 4 a2 p2
6 10 12 a2 p3
7 23 25 a2
As said, I manage to obtain the correct result, but my code is not efficient at all (in this small example, it does not really matter, but when running this snippet on an example with thousands of registers, you can tell that it is quite slow). Is there any way to boost this operation? PS: bellow is my current code with a working example:
d = {'val1': [1, 3, 10, 23, 1, 3, 10, 23], 'val2': [2, 4, 12, 25, 2, 4, 12, 25], 'group': ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"]}
d2 = {'val': [1, 4, 12, 26, 2, 4, 12, 26], 'name': ['abc', 'def', 'ghi', 'jkl', 'p1', 'p2', 'p3', 'p4'], 'group': ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d2)
names = set(df["group"])
df["name"] = ""
for name in names:
df_it = df[df["group"] == name]
df2_it = df2[df2["group"] == name]
for it in df_it.index:
df2_it_2 = df2_it[(df2["val"] >= \
df_it["val1"][it]) &
(df2["val"] <=
df_it["val2"][it])]
if not df2_it_2.empty:
df["name"][it] = df2_it_2["name"].iloc[0]
CodePudding user response:
You can perform a merge_asof
to have val ≥ val1, then mask the "name"s for which the merged val is greater than val2:
(pd.merge_asof(df.reset_index().sort_values(by='val1'),
df2.sort_values(by='val'),
direction='forward',
by='group', left_on='val1', right_on='val')
.set_index('index').sort_index()
.assign(name=lambda d: d['name'].mask(d['val'].gt(d['val2'])))
.drop(columns='val')
)
Another option using a second join
, this might be more flexible for many columns to add:
cols = ['name']
df.join(
pd.merge_asof(df.reset_index().sort_values(by='val1'),
df2.sort_values(by='val'),
direction='forward',
by='group', left_on='val1', right_on='val')
.loc[lambda d: d['val'].le(d['val2']), ['index'] cols]
.set_index('index')
)
output:
val1 val2 group name
index
0 1 2 a1 abc
1 3 4 a1 def
2 10 12 a1 ghi
3 23 25 a1 NaN
4 1 2 a2 p1
5 3 4 a2 p2
6 10 12 a2 p3
7 23 25 a2 NaN