I would like to add a new columns in the first dataframe called string_id, where I take the string and distance from this first dataframe and check
- where the string matches in the second dataframe(columns = string_oldname')
- the distance is between which range of the second dataframe and then take the string_id of the second dataframe.
first df:
Timestamp Distance String
2021-09-01 00:11:37 0.414 1
2021-09-01 00:11:37 0.884 1
2021-09-01 00:11:37 1.354 1
2021-09-01 00:11:37 1.824 1
......
2nd dataframe:
String_Id from to String_old_name
0 1.1 8323.54 9272.67 1
1 1.2 7353.54 8323.54 1
2 1.3 6393.97 7353.54 1
3 1.4 5444.51 6393.97 1
4 1.5 4505.14 5444.51 1
5 1.6 3545.57 4505.14 1
6 1.7 2596.11 3545.57 1
7 1.8 1687.00 2596.11 1
8 1.9 848.68 1687.00 1
9 1.10 0.00 848.68 1
.............
For example: for the first row in the first dataframe, String = 1 and Distance = 0.414, so I check in the second dataframe for string_old_name = 1and the distance 0.414 is between which range (0.00 and 848.68) so I take string_id 1.10!
Is it possible to add a column in the first dataframe with this complicated condition?
Thank you so much in advance for your help!
CodePudding user response:
Thank you so much @vtasca! This is exactly what I was looking for! I have one question do you know why the following code doesnt work:
def find_string(row):
cond_1 = (row["String"] == df2["String_old_name"])
cond_2 = (row["Distance"] > df2["from"])
cond_3 = (row["Distance"] < df2["to"])
return df2[cond_1 & cond_2 & cond_3]["String_Id"]
and then df1['string component'] = df1.apply(find_string, axis=1)
CodePudding user response:
Could be expressed more elegantly, but this will get you where you need to go. Assuming the first dataframe is df1
and the second is df2
- First we merge the two dataframes on the String
identifier, then we isolate the specific String_Id
according to your conditions:
df3 = df1.merge(df2, left_on='String', right_on='String_old_name')
df3.loc[(df3.Distance > df3['from']) & (df3.Distance < df3['to'])][['Distance', 'String_Id']].merge(df1, on='Distance')
This returns:
Distance String_Id Timestamp String
0 0.414 1.10 2021-09-01 1
1 0.884 1.10 2021-09-01 1
2 1954.120 1.8 2021-09-01 1
Edit
For the function you wrote in your comment, simply make sure to get the values out of the Series you're returning. The following code will work:
def find_string(row):
cond_1 = (row["String"] == df2["String_old_name"])
cond_2 = (row["Distance"] > df2["from"])
cond_3 = (row["Distance"] < df2["to"])
return df2[cond_1 & cond_2 & cond_3]["String_Id"].to_numpy()[0]
df1['string component'] = df1.apply(find_string, axis=1)