I have 2 columns, I want to get the values of the second column that have the same value in the first one including the value in the first column. For example in the following the values of first, third and fourth row have the same value A
. so the output is [A,B,C]
For example:
col1 col2
A B
B A
A C
A C
D G
D F
Desired output:
[[A,B,C], [D,G,F]]
I did this by using list operation:
L1=df['col1'].tolist()
L2=df['col2'].tolist()
Then I used for loop to get the result. I want to know if it can be done by pandas functions!
CodePudding user response:
This looks to be a graph and you're trying to find connected components. We could use networkx
for this job. Basically, build a graph object and group nodes depending on which component they belong to:
import networkx as nx
G = nx.from_pandas_edgelist(df, 'col1', 'col2')
out = list(map(list, nx.connected_components(G)))
Output:
[['C', 'B', 'A'], ['F', 'D', 'G']]
The above code is simplified a lot thanks to @ScottBoston
A (sort of) pandas solution:
Concatenate df
back to df
with columns reversed; then groupby
unique
will get us the components each value belongs to. This will have duplicate groups, we could eliminate them using a set.issubset
in a list comprehension:
tmp = pd.concat((df, df[['col2','col1']])).groupby('col1')['col2'].unique()
tmp = [set([i] v.tolist()) for i, v in zip(tmp.index, tmp.tolist())]
out = [list(x) for x in tmp if not any(x.issubset(y) for y in tmp if x!=y)]