We have two tables with quality and topology information from a large network (>1 million elements).
The first table lists all elements with quality issues and looks like:
|   | cpe_sum | element |
|---|---------|---------|
| 0 | 1       | 10      |
| 1 | 2       | 20      |
| 2 | 3       | 30      |
| 3 | 4       | 40      |
| 4 | 5       | 50      |
The second table contains full topology paths:
|   | topo            |
|---|-----------------|
| 0 | 8,9,10,11,12,13 |
| 1 | 19,20,21        |
| 2 | 18,19,20,22     |
| 3 | 90,91,92        |
| 4 | 30,31,100,200   |
| 5 | 7,8,9,10        |
| 6 | 50              |
Now I want to add an affected_device column to the second table:
|   | topo            | affected_device | comment                                              |
|---|-----------------|-----------------|------------------------------------------------------|
| 0 | 8,9,10,11,12,13 | 10              | topo contains 10 -> take 10                          |
| 1 | 19,20,21        | 20              | topo contains 20 -> take 20                          |
| 2 | 18,19,20,22     | 20              | topo contains 20 -> take 20                          |
| 3 | 90,91,92        | NaN             | no match -> np.nan                                   |
| 4 | 30,31,100,200   | 30              | topo contains 30 -> take 30 (attention: 100 != 10!)  |
| 5 | 7,8,9,10        | 10              | topo contains 10 -> take 10                          |
| 6 | 50              | 50              | topo contains 50 -> take 50                          |
Logic:
- If df2["topo"] contains a value from df1["element"], take that value (two matches are impossible by default).
- In many cases there is no match; then take np.nan.
- Matches must be against whole entries, not substrings: 100 must not match 10 (just as e.g. 95624698 must not match 24698).
The only way I know would be clumsy and slow: loop through each row, do something like str.split(",", expand=True), and then check whether an element value occurs in the resulting cells; if so, take that value (a sketch of this loop follows the MWE below). But I'm sure there must be a better and faster solution without a loop. Can somebody help?
MWE
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "cpe": [1, 2, 3, 4, 5],
    "element": [10, 20, 30, 40, 50]
})
df2 = pd.DataFrame({
    "topo": ["8,9,10,11,12,13", "19,20,21", "18,19,20,22",
             "90,91,92", "30,31,100,200", "7,8,9,10", "50"]
})

# Target column
df2["affected_device"] = [10, 20, 20, np.nan, 30, 10, 50]
CodePudding user response:
Explode preserves indexes, and join uses indexes - so you could explode, then filter and join.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "cpe": [1, 2, 3, 4, 5],
    "element": [10, 20, 30, 40, 50]
})
df2 = pd.DataFrame({
    "topo": ["8,9,10,11,12,13", "19,20,21", "18,19,20,22",
             "90,91,92", "30,31,100,200", "7,8,9,10", "50"]
})

# one row per topo token, original index preserved
m = df2.topo.str.split(',').explode()
# keep only matching tokens, then join back on the preserved index
df2.join(m.loc[m.isin(df1.element.astype(str))].to_frame('affected_device'))
Output
topo affected_device
0 8,9,10,11,12,13 10
1 19,20,21 20
2 18,19,20,22 20
3 90,91,92 NaN
4 30,31,100,200 30
5 7,8,9,10 10
6 50 50
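Note that the joined affected_device values are strings here, since str.split produces strings. If the numeric target column from the MWE is wanted, one option (my addition, not part of the original answer) is:

out = df2.join(m.loc[m.isin(df1.element.astype(str))].to_frame('affected_device'))
# strings like '10' become 10.0; unmatched rows stay NaN
out['affected_device'] = pd.to_numeric(out['affected_device'])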
CodePudding user response:
You can use regex with a word boundary:
df2["new"] = df2["topo"].str.extract(fr'\b({"|".join(df1["element"].astype(str))})\b')
print(df2)
topo affected_device new
0 8,9,10,11,12,13 10.0 10
1 19,20,21 20.0 20
2 18,19,20,22 20.0 20
3 90,91,92 NaN NaN
4 30,31,100,200 30.0 30
5 7,8,9,10 10.0 10
6 50 50.0 50
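A small caveat (my addition, not from the original answer): the pattern is one big alternation built from all element values, so if those values could ever contain regex metacharacters, escaping them first is safer:

import re

# escape each value before joining it into the alternation
pattern = fr'\b({"|".join(re.escape(str(e)) for e in df1["element"])})\b'
df2["new"] = df2["topo"].str.extract(pattern)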
CodePudding user response:
Here is a way using map() and a list comprehension:
# precompute the lookup set once instead of rebuilding it per row
elements = set(df1['element'].astype(str))

(df2['topo'].str.split(',')
 .map(lambda x: [i for i in x if i in elements])
 .str.join(','))
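The result is a comma-joined string per row, empty where nothing matched. To get the numeric target column from the MWE, one option (my addition, assuming at most one match per row as stated in the question) is:

matched = (df2['topo'].str.split(',')
           .map(lambda x: [i for i in x if i in elements])
           .str.join(','))
# errors='coerce' turns the empty strings (no match) into NaN
df2['affected_device'] = pd.to_numeric(matched, errors='coerce')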