Merging Pandas DataFrame rows by common values with any degree of separation


I've got a DataFrame containing 3 columns (and ~1100 rows, though the count may vary). Each column contains IDs made of long integers, which can appear in more than one row and column. I need to merge every row linked by at least one common ID, with any degree of separation (I've posted a simple example of the DataFrame below, which should clarify this part). The only approach I've come up with is taking ages to compute.

#Dataframe:
A   B   C
X   Y   Z
D   E   F
T   U   V
C   D   E
E   N   Z
AA  BB  CC
HH  CC  U

#Final needed result (the order is irrelevant):
A B C D E F N Z X Y
T U V HH CC AA BB
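
For reference, here is a minimal way to build this example frame (an assumption: the columns are unnamed, so they get pandas' default integer labels 0, 1, 2, which the answers below rely on):

import pandas as pd

df = pd.DataFrame([
    ["A", "B", "C"],
    ["X", "Y", "Z"],
    ["D", "E", "F"],
    ["T", "U", "V"],
    ["C", "D", "E"],
    ["E", "N", "Z"],
    ["AA", "BB", "CC"],
    ["HH", "CC", "U"],
])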

CodePudding user response:

This builds a graph from your example, adding an edge for every pair of IDs that share a row.

import networkx as nx

g = nx.Graph()
for _, row in df.iterrows():
    a, b, c = row
    # The given requirement is "linked by at least one common ID",
    # so each row induces three edges.
    g.add_edge(a, b)
    g.add_edge(b, c)
    g.add_edge(a, c)

Now nx.connected_components(g) will recover the desired groups, even if it turns out there are far more than two.
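
A minimal sketch of that last step, reusing the g built above:

groups = list(nx.connected_components(g))
# For the example data this yields the two expected groups:
# [{'A', 'B', 'C', 'D', 'E', 'F', 'N', 'X', 'Y', 'Z'},
#  {'AA', 'BB', 'CC', 'HH', 'T', 'U', 'V'}]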

CodePudding user response:

You can use networkx and find connected components like this:

import pandas as pd
import networkx as nx
from itertools import tee

# Using an itertools recipe, pairwise
def pairwise(iterable):
    # pairwise('ABCDEFG') --> AB BC CD DE EF FG
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


df_p = pd.concat([
    df[[i[0], i[1]]].set_axis(['s', 'e'], axis=1)
    for i in pairwise(df)  # pairs of adjacent column labels
])
G = nx.from_pandas_edgelist(df_p, 's', 'e')
list(nx.connected_components(G))

Output:

[{'A', 'B', 'C', 'D', 'E', 'F', 'N', 'X', 'Y', 'Z'},
 {'AA', 'BB', 'CC', 'HH', 'T', 'U', 'V'}]
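
Note that on Python 3.10+ the recipe is unnecessary: itertools.pairwise can be imported directly.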

CodePudding user response:

This uses plain Python and should be fairly efficient, though perhaps not as efficient as the networkx-based solutions.

import pandas as pd
df = pd.DataFrame(
    r.strip().split()
    for r in """
        A   B   C
        X   Y   Z
        D   E   F
        T   U   V
        C   D   E
        E   N   Z
        AA  BB  CC
        HH  CC  U
    """.strip().splitlines()
)

# dict that will point to the matching cluster for each ID
id_cluster = dict()

for _, row in df.iterrows():
    # merge all the matching clusters into the first one
    # note: we reuse the first one to avoid rebuilding the whole cluster
    # each time if all the elements are already in the first one
    new_cluster = id_cluster.get(row[0], {row[0]})
    for id_ in row[1:]:
        new_cluster.update(id_cluster.get(id_, {id_}))
    # update all the IDs in this cluster to point to the merged cluster
    # (there is only one copy of new_cluster in memory; every ID points
    # to it, and Python discards the superseded clusters)
    for id_ in new_cluster:
        id_cluster[id_] = new_cluster

# show the unique clusters
# note: using frozenset() is an easy way to make the sets hashable
# so we can select only the unique ones via set()
result = [sorted(row) for row in set(map(frozenset, id_cluster.values()))]
print(result)
# [['AA', 'BB', 'CC', 'HH', 'T', 'U', 'V'],
#  ['A', 'B', 'C', 'D', 'E', 'F', 'N', 'X', 'Y', 'Z']]
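
Design note: this is essentially a disjoint-set (union-find) merge built from plain dicts and sets. Repointing every ID at the merged set keeps each lookup O(1), at the cost of re-walking the whole cluster on every merge.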