At a group level, how does one add missing string to a column (col_2) if a string in another column (col_1) is present anywhere else in the original column (col_2)? Problem is shown in sample data below.
import pandas as pd
import numpy as np
dict = {
'ID': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 2},
'col_1': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', 5: 'B', 6: 'C', 7: 'D'},
'col_2': {0: 'A,B,C', 1: 'A,B', 2: 'C', 3: 'D', 4: 'A,D', 5: 'B', 6: 'C', 7: 'D'}
}
df = pd.DataFrame.from_dict(dict)
Output:
ID col_1 col_2
0 1 A A,B,C
1 1 B A,B
2 1 C C
3 1 D D
4 2 A A,D
5 2 B B
6 2 C C
7 2 D A
For ID 1, Since 'A,B,C' is together in col_2 at index 0 and col_1 has 'B' and 'C' at index 1 and 2, col_2 need to have 'A,B,C' at index 1 and 2.
For ID 2, since 'A,D' is together in col_2 at index 4 and col_1 has 'D' at index 7, col_2 need to have 'A,D' at index 7.
Desired output:
ID col_1 col_2
0 1 A A,B,C
1 1 B A,B,C
2 1 C A,B,C
3 1 D D
4 2 A A,D
5 2 B B
6 2 C C
7 2 D A,D
I would appreciate of any help.
CodePudding user response:
This is hard to solve just with pandas, you can approach this as a graph problem with help from
import networkx as nx
from itertools import pairwise
def get_connected(s):
l = [x.split(',') for x in s]
G = nx.from_edgelist([(a,b) for x in l for a,b in pairwise(x)])
G.add_nodes_from(n for x in l for n in x)
d = {}
for c in nx.connected_components(G):
path = ','.join(sorted(c))
for n in c:
d[n] = path
return d
df['col_3'] = (df.groupby('ID')
.apply(lambda g: g['col_1'].map(get_connected(g['col_2'])))
.droplevel(0)
)
Output (as a new col_3
):
ID col_1 col_2 col_3
0 1 A A,B,C A,B,C
1 1 B A,B A,B,C
2 1 C C A,B,C
3 1 D D D
4 2 A A,D A,D
5 2 B B B
6 2 C C C
7 2 D D A,D