Home > Enterprise >  Add a column in a dataframe which is conditioned on a multiple dataframes columns
Add a column in a dataframe which is conditioned on a multiple dataframes columns

Time:09-28

I would like to add a new columns in the first dataframe called string_id, where I take the string and distance from this first dataframe and check

  1. where the string matches in the second dataframe(columns = string_oldname')
  2. the distance is between which range of the second dataframe and then take the string_id of the second dataframe.

first df:

Timestamp Distance String

2021-09-01 00:11:37 0.414 1

2021-09-01 00:11:37 0.884 1

2021-09-01 00:11:37 1.354 1

2021-09-01 00:11:37 1.824 1

......

2nd dataframe:

String_Id from to String_old_name

0 1.1 8323.54 9272.67 1

1 1.2 7353.54 8323.54 1

2 1.3 6393.97 7353.54 1

3 1.4 5444.51 6393.97 1

4 1.5 4505.14 5444.51 1

5 1.6 3545.57 4505.14 1

6 1.7 2596.11 3545.57 1

7 1.8 1687.00 2596.11 1

8 1.9 848.68 1687.00 1

9 1.10 0.00 848.68 1

.............

For example: for the first row in the first dataframe, String = 1 and Distance = 0.414, so I check in the second dataframe for string_old_name = 1and the distance 0.414 is between which range (0.00 and 848.68) so I take string_id 1.10!

Is it possible to add a column in the first dataframe with this complicated condition?

Thank you so much in advance for your help!

CodePudding user response:

Thank you so much @vtasca! This is exactly what I was looking for! I have one question do you know why the following code doesnt work:

def find_string(row):

cond_1 = (row["String"] == df2["String_old_name"])

cond_2 = (row["Distance"] > df2["from"])
cond_3 = (row["Distance"] < df2["to"])

return df2[cond_1 & cond_2 & cond_3]["String_Id"] 

and then df1['string component'] = df1.apply(find_string, axis=1)

CodePudding user response:

Could be expressed more elegantly, but this will get you where you need to go. Assuming the first dataframe is df1 and the second is df2 - First we merge the two dataframes on the String identifier, then we isolate the specific String_Id according to your conditions:

df3 = df1.merge(df2, left_on='String', right_on='String_old_name')

df3.loc[(df3.Distance > df3['from']) & (df3.Distance < df3['to'])][['Distance', 'String_Id']].merge(df1, on='Distance')

This returns:

Distance        String_Id   Timestamp   String
0   0.414       1.10    2021-09-01      1
1   0.884       1.10    2021-09-01      1
2   1954.120    1.8     2021-09-01      1

Edit
For the function you wrote in your comment, simply make sure to get the values out of the Series you're returning. The following code will work:

def find_string(row):
    cond_1 = (row["String"] == df2["String_old_name"])
    cond_2 = (row["Distance"] > df2["from"])
    cond_3 = (row["Distance"] < df2["to"])
    return df2[cond_1 & cond_2 & cond_3]["String_Id"].to_numpy()[0]

df1['string component'] = df1.apply(find_string, axis=1)
  • Related