How to join if key column only partially matches


We have two tables with quality and topology information from a large network (>1 million elements).

The first table lists all elements with quality issues:

|    | cpe_sum | element |
|----|---------|---------|
|  0 |       1 |      10 |
|  1 |       2 |      20 |
|  2 |       3 |      30 |
|  3 |       4 |      40 |
|  4 |       5 |      50 |

The second table contains full topology paths:

|    | topo            |
|----|-----------------|
|  0 | 8,9,10,11,12,13 |
|  1 | 19,20,21        |
|  2 | 18,19,20,22     |
|  3 | 90,91,92        |
|  4 | 30,31,100,200   |
|  5 | 7,8,9,10        |
|  6 | 50              |

Now I want to add an affected_device column to the second table:

|    | topo            | affected_device |
|----|-----------------|-----------------|
|  0 | 8,9,10,11,12,13 |              10 | topo contains 10 -> take 10
|  1 | 19,20,21        |              20 | topo contains 20 -> take 20
|  2 | 18,19,20,22     |              20 | topo contains 20 -> take 20
|  3 | 90,91,92        |             NaN | no match -> np.nan
|  4 | 30,31,100,200   |              30 | topo contains 30 -> take 30 (attention: 100 != 10!)
|  5 | 7,8,9,10        |              10 | topo contains 10 -> take 10
|  6 | 50              |              50 | topo contains 50 -> take 50

Logic:

  • If df2["topo"] contains a value from df1["element"], take that value.
  • Two matches cannot occur (by design).
  • In many cases there is no match; then take np.nan.
  • 100 must not match 10 (just as 95624698 must not match 24698).

The only approach I know would be hacky and slow: loop through each row, do something like str.split(",", expand=True), and check whether a value occurs in the resulting cells; if so, take that value. But I'm sure there must be a better and faster solution without a loop. Can somebody help?


MWE

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "cpe":[1,2,3,4,5],
    "element":[10,20,30,40,50]
})
df2 = pd.DataFrame({"topo":["8,9,10,11,12,13","19,20,21","18,19,20,22","90,91,92","30,31,100,200","7,8,9,10","50"]})

# Target column
df2["affected_device"] = [10,20,20,np.nan,30,10,50]

CodePudding user response:

explode preserves the original index, and join aligns on the index, so you can explode, then filter and join.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "cpe":[1,2,3,4,5],
    "element":[10,20,30,40,50]
})
df2 = pd.DataFrame({"topo":["8,9,10,11,12,13","19,20,21","18,19,20,22","90,91,92","30,31,100,200","7,8,9,10","50"]})


m = df2.topo.str.split(',').explode()
df2.join(m.loc[m.isin(df1.element.astype(str))].to_frame('affected_device'))

Output

              topo affected_device
0  8,9,10,11,12,13              10
1         19,20,21              20
2      18,19,20,22              20
3         90,91,92             NaN
4    30,31,100,200              30
5         7,8,9,10              10
6               50              50
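A self-contained sketch of this explode-and-join approach, using the example frames from the question. Note that str.split produces strings, so the joined affected_device column has object dtype; casting to float afterwards makes the missing rows proper NaN:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({"cpe": [1, 2, 3, 4, 5],
                    "element": [10, 20, 30, 40, 50]})
df2 = pd.DataFrame({"topo": ["8,9,10,11,12,13", "19,20,21", "18,19,20,22",
                             "90,91,92", "30,31,100,200", "7,8,9,10", "50"]})

# Explode the comma-separated topo strings; each fragment keeps the index
# of its source row, which is what makes the later join line up.
m = df2["topo"].str.split(",").explode()

# Keep only fragments that are known elements, then join back on the index.
out = df2.join(m.loc[m.isin(df1["element"].astype(str))]
                .to_frame("affected_device"))

# The matched values are still strings; cast so no-match rows become NaN floats.
out["affected_device"] = out["affected_device"].astype(float)
```

This relies on the question's guarantee that a row never matches twice; a double match would duplicate that row in the join.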

CodePudding user response:

You can use regex with a word boundary:

df2["new"] = df2["topo"].str.extract(fr'\b({"|".join(df1["element"].astype(str))})\b')
print(df2)

              topo  affected_device  new
0  8,9,10,11,12,13             10.0   10
1         19,20,21             20.0   20
2      18,19,20,22             20.0   20
3         90,91,92              NaN  NaN
4    30,31,100,200             30.0   30
5         7,8,9,10             10.0   10
6               50             50.0   50
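A runnable sketch of the same regex idea, again assuming the question's example frames. The word boundary \b works here because the separators are commas (non-word characters), which is exactly why 10 cannot match inside 100:

```python
import pandas as pd

df1 = pd.DataFrame({"cpe": [1, 2, 3, 4, 5],
                    "element": [10, 20, 30, 40, 50]})
df2 = pd.DataFrame({"topo": ["8,9,10,11,12,13", "19,20,21", "18,19,20,22",
                             "90,91,92", "30,31,100,200", "7,8,9,10", "50"]})

# Build one alternation of all element values: \b(10|20|30|40|50)\b
pattern = fr'\b({"|".join(df1["element"].astype(str))})\b'

# str.extract returns the first match per row and NaN where nothing matches;
# casting to float gives the numeric target column directly.
df2["affected_device"] = df2["topo"].str.extract(pattern).astype(float)
```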

CodePudding user response:

Here is a way using map() and a list comprehension.

(df2['topo'].str.split(',')
    .map(lambda x: [i for i in x if i in df1['element'].astype(str).tolist()])
    .str.join(','))
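Two caveats with this variant: rows without a match come out as an empty string rather than NaN, and the element list is rebuilt inside the lambda for every row. A sketch (assuming the question's example frames) that precomputes the lookup set once and converts the empties to NaN:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({"cpe": [1, 2, 3, 4, 5],
                    "element": [10, 20, 30, 40, 50]})
df2 = pd.DataFrame({"topo": ["8,9,10,11,12,13", "19,20,21", "18,19,20,22",
                             "90,91,92", "30,31,100,200", "7,8,9,10", "50"]})

# Build the lookup set once; set membership is O(1) per fragment.
elements = set(df1["element"].astype(str))

matched = (df2["topo"].str.split(",")
           .map(lambda parts: [p for p in parts if p in elements])
           .str.join(","))

# Rows with no match are "", so replace them with NaN before casting.
df2["affected_device"] = matched.replace("", np.nan).astype(float)
```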