We have two tables with quality and topology information from a large network (>1 million elements).
The first table lists all elements with quality issues and looks like:
|   | cpe_sum | element |
|---|---------|---------|
| 0 | 1       | 10      |
| 1 | 2       | 20      |
| 2 | 3       | 30      |
| 3 | 4       | 40      |
| 4 | 5       | 50      |
The second table contains full topology paths:
|   | topo            |
|---|-----------------|
| 0 | 8,9,10,11,12,13 |
| 1 | 19,20,21        |
| 2 | 18,19,20,22     |
| 3 | 90,91,92        |
| 4 | 30,31,100,200   |
| 5 | 7,8,9,10        |
| 6 | 50              |
Now I want to add an affected_device column to the second table:
|   | topo            | affected_device | comment                                              |
|---|-----------------|-----------------|------------------------------------------------------|
| 0 | 8,9,10,11,12,13 | 10              | topo contains 10 -> take 10                          |
| 1 | 19,20,21        | 20              | topo contains 20 -> take 20                          |
| 2 | 18,19,20,22     | 20              | topo contains 20 -> take 20                          |
| 3 | 90,91,92        | NaN             | no match -> np.nan                                   |
| 4 | 30,31,100,200   | 30              | topo contains 30 -> take 30 (attention: 100 != 10!)  |
| 5 | 7,8,9,10        | 10              | topo contains 10 -> take 10                          |
| 6 | 50              | 50              | topo contains 50 -> take 50                          |
Logic:
- If df2["topo"] contains a value from df1["element"], take that value (two matches are impossible by default).
- In many cases there is no match; then take np.nan.
- Matches must be against whole entries, not substrings: 100 must not match 10 (just as e.g. 95624698 must not match 24698).
The only way I know would be clumsy and slow: loop through each row, do something like str.split(",", expand=True), and then check whether an element value occurs in the resulting cells; if so, take that value (a sketch of this loop follows the MWE below). But I'm sure there must be a better and faster solution without a loop. Can somebody help?
MWE
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "cpe": [1, 2, 3, 4, 5],
    "element": [10, 20, 30, 40, 50]
})
df2 = pd.DataFrame({
    "topo": ["8,9,10,11,12,13", "19,20,21", "18,19,20,22",
             "90,91,92", "30,31,100,200", "7,8,9,10", "50"]
})

# Target column
df2["affected_device"] = [10, 20, 20, np.nan, 30, 10, 50]
CodePudding user response:
Explode preserves indexes, and join uses indexes - so you could explode, then filter and join.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "cpe": [1, 2, 3, 4, 5],
    "element": [10, 20, 30, 40, 50]
})
df2 = pd.DataFrame({
    "topo": ["8,9,10,11,12,13", "19,20,21", "18,19,20,22",
             "90,91,92", "30,31,100,200", "7,8,9,10", "50"]
})

# one row per topo token, original index preserved
m = df2.topo.str.split(',').explode()
# keep only matching tokens, then join back on the preserved index
df2.join(m.loc[m.isin(df1.element.astype(str))].to_frame('affected_device'))
Output
topo affected_device
0 8,9,10,11,12,13 10
1 19,20,21 20
2 18,19,20,22 20
3 90,91,92 NaN
4 30,31,100,200 30
5 7,8,9,10 10
6 50 50
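Note that the joined affected_device values are strings here, since str.split produces strings. If the numeric target column from the MWE is wanted, one option (my addition, not part of the original answer) is:

out = df2.join(m.loc[m.isin(df1.element.astype(str))].to_frame('affected_device'))
# strings like '10' become 10.0; unmatched rows stay NaN
out['affected_device'] = pd.to_numeric(out['affected_device'])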
CodePudding user response:
You can use regex with a word boundary:
df2["new"] = df2["topo"].str.extract(fr'\b({"|".join(df1["element"].astype(str))})\b')
print(df2)
topo affected_device new
0 8,9,10,11,12,13 10.0 10
1 19,20,21 20.0 20
2 18,19,20,22 20.0 20
3 90,91,92 NaN NaN
4 30,31,100,200 30.0 30
5 7,8,9,10 10.0 10
6 50 50.0 50
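A small caveat (my addition, not from the original answer): the pattern is one big alternation built from all element values, so if those values could ever contain regex metacharacters, escaping them first is safer:

import re

# escape each value before joining it into the alternation
pattern = fr'\b({"|".join(re.escape(str(e)) for e in df1["element"])})\b'
df2["new"] = df2["topo"].str.extract(pattern)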
CodePudding user response:
Here is a way using map() and a list comprehension:
# precompute the lookup set once instead of rebuilding it per row
elements = set(df1['element'].astype(str))

(df2['topo'].str.split(',')
 .map(lambda x: [i for i in x if i in elements])
 .str.join(','))
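The result is a comma-joined string per row, empty where nothing matched. To get the numeric target column from the MWE, one option (my addition, assuming at most one match per row as stated in the question) is:

matched = (df2['topo'].str.split(',')
           .map(lambda x: [i for i in x if i in elements])
           .str.join(','))
# errors='coerce' turns the empty strings (no match) into NaN
df2['affected_device'] = pd.to_numeric(matched, errors='coerce')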