Home > Software engineering >  Pandas - For each group, if string in one column is in another column, add to column
Pandas - For each group, if string in one column is in another column, add to column

Time:11-18

At a group level, how does one add missing string to a column (col_2) if a string in another column (col_1) is present anywhere else in the original column (col_2)? Problem is shown in sample data below.

import pandas as pd
import numpy as np

dict = {
'ID': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 2},
'col_1': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', 5: 'B', 6: 'C', 7: 'D'},
'col_2': {0: 'A,B,C', 1: 'A,B', 2: 'C', 3: 'D', 4: 'A,D', 5: 'B', 6: 'C', 7: 'D'}
}

df = pd.DataFrame.from_dict(dict)

Output:

   ID col_1  col_2
0   1     A  A,B,C
1   1     B    A,B
2   1     C      C
3   1     D      D
4   2     A    A,D
5   2     B      B
6   2     C      C
7   2     D      A

For ID 1, Since 'A,B,C' is together in col_2 at index 0 and col_1 has 'B' and 'C' at index 1 and 2, col_2 need to have 'A,B,C' at index 1 and 2.

For ID 2, since 'A,D' is together in col_2 at index 4 and col_1 has 'D' at index 7, col_2 need to have 'A,D' at index 7.

Desired output:

   ID col_1  col_2
0   1     A  A,B,C
1   1     B  A,B,C
2   1     C  A,B,C
3   1     D      D
4   2     A    A,D
5   2     B      B
6   2     C      C
7   2     D    A,D

I would appreciate of any help.

CodePudding user response:

This is hard to solve just with pandas, you can approach this as a graph problem with help from enter image description here

import networkx as nx
from itertools import pairwise

def get_connected(s):
    l = [x.split(',') for x in s]
    G = nx.from_edgelist([(a,b) for x in l for a,b in pairwise(x)])
    G.add_nodes_from(n for x in l for n in x)
    
    d = {}
    for c in nx.connected_components(G):
        path = ','.join(sorted(c)) 
        for n in c:
            d[n] = path
       
    return d

df['col_3'] = (df.groupby('ID')
                 .apply(lambda g: g['col_1'].map(get_connected(g['col_2'])))
                 .droplevel(0)
              )

Output (as a new col_3):

   ID col_1  col_2  col_3
0   1     A  A,B,C  A,B,C
1   1     B    A,B  A,B,C
2   1     C      C  A,B,C
3   1     D      D      D
4   2     A    A,D    A,D
5   2     B      B      B
6   2     C      C      C
7   2     D      D    A,D
  • Related