I am new to Python and am struggling to format the data frame below:
Col1 Col2
Type1 Type2
Type3 Type4
Type8 Type13
Type3 Type15
Type2 Type6
Type4 Type9
Type6 Type11
Type9 Type18
Type13 Type20
I want to identify chains using Col1 and Col2. For example, Type1-->Type2-->Type6-->Type11 forms a chain, so the final result should look like this:
Col1 Col2 Chain
Type1 Type2 Chain1
Type3 Type4 Chain2
Type8 Type13 Chain3
Type3 Type15
Type2 Type6 Chain1
Type4 Type9 Chain2
Type6 Type11 Chain1
Type9 Type18 Chain2
Type13 Type20 Chain3
CodePudding user response:
You might want to do something like this (you need to install networkx). Note that df is your DataFrame containing all your data:
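import networkx as nx
# Keep only the first row for each Col1 value, so the repeated Type3 row adds no edge
edges = df.drop_duplicates(['Col1'])
G = nx.Graph()
G.add_edges_from(edges[['Col1', 'Col2']].itertuples(index=False, name=None))
# Each connected component of the graph is one chain
ccs = list(nx.connected_components(G))
# Label a row with the component that contains both of its endpoints, otherwise ''
df['Chain'] = df.apply(lambda row: next((f'Chain{i}' for i, cc in enumerate(ccs) if row['Col1'] in cc and row['Col2'] in cc), ''), axis=1)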
Output:
Col1 Col2 Chain
0 Type1 Type2 Chain0
1 Type3 Type4 Chain1
2 Type8 Type13 Chain2
3 Type3 Type15
4 Type2 Type6 Chain0
5 Type4 Type9 Chain1
6 Type6 Type11 Chain0
7 Type9 Type18 Chain1
8 Type13 Type20 Chain2
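If installing networkx is not an option, the same grouping can probably be done with a small union-find in plain Python. This is only a sketch, not a drop-in replacement: label_chains and the pairs list are names made up for illustration. It assumes, like the drop_duplicates call above, that only the first row for each Col1 value links two types, and it numbers chains from 1 in order of first appearance, as in the question.

```python
def label_chains(pairs):
    """Return one chain label per (col1, col2) pair; '' if the pair is in no chain.

    Hypothetical helper for illustration, not part of pandas or networkx.
    """
    parent = {}

    def find(x):
        # Walk up to the representative of x's group, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Assumption: only the first row per Col1 value creates a link,
    # mirroring drop_duplicates(['Col1']) in the networkx version.
    seen_first = set()
    for a, b in pairs:
        if a in seen_first:
            continue
        seen_first.add(a)
        parent[find(a)] = find(b)  # merge the two groups

    # Number the groups 1, 2, 3, ... in the order their first row appears.
    chain_ids = {}
    labels = []
    for a, b in pairs:
        if a in parent and b in parent and find(a) == find(b):
            root = find(a)
            chain_ids.setdefault(root, len(chain_ids) + 1)
            labels.append(f"Chain{chain_ids[root]}")
        else:
            labels.append("")
    return labels


pairs = [
    ("Type1", "Type2"), ("Type3", "Type4"), ("Type8", "Type13"),
    ("Type3", "Type15"), ("Type2", "Type6"), ("Type4", "Type9"),
    ("Type6", "Type11"), ("Type9", "Type18"), ("Type13", "Type20"),
]
labels = label_chains(pairs)
for (a, b), label in zip(pairs, labels):
    print(a, b, label)
```

With a DataFrame, the result could be attached with something like df['Chain'] = label_chains(list(zip(df['Col1'], df['Col2']))).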