how to get rows with the same value in a column?-CodePudding

I have 2 columns, I want to get the values of the second column that have the same value in the first one including the value in the first column. For example in the following the values of first, third and fourth row have the same value A. so the output is [A,B,C]

For example:

col1     col2
A          B
B          A
A          C
A          C
D          G
D          F

Desired output:

[[A,B,C], [D,G,F]]

I did this by using list operation:

L1=df['col1'].tolist()
L2=df['col2'].tolist()

Then I used for loop to get the result. I want to know if it can be done by pandas functions!

CodePudding user response：

This looks to be a graph and you're trying to find connected components. We could use networkx for this job. Basically, build a graph object and group nodes depending on which component they belong to:

import networkx as nx
G = nx.from_pandas_edgelist(df, 'col1', 'col2')
out = list(map(list, nx.connected_components(G)))

Output:

[['C', 'B', 'A'], ['F', 'D', 'G']]

The above code is simplified a lot thanks to @ScottBoston

A (sort of) pandas solution:

Concatenate df back to df with columns reversed; then groupby unique will get us the components each value belongs to. This will have duplicate groups, we could eliminate them using a set.issubset in a list comprehension:

tmp = pd.concat((df, df[['col2','col1']])).groupby('col1')['col2'].unique()
tmp = [set([i]   v.tolist()) for i, v in zip(tmp.index, tmp.tolist())]
out = [list(x) for x in tmp if not any(x.issubset(y) for y in tmp if x!=y)]