Say I have a DataFrame like below
UUID domains
0 asd [foo.com, foo.ca]
1 jkl [foo.ca, foo.fr]
2 xyz [foo.fr]
3 iek [bar.com, bar.org]
4 qkr [bar.org]
5 kij [buzz.net]
How can I turn it into something like this?
UUID
0 [asd, jkl, xyz]
1 [iek, qkr]
2 [kij]
I want to group all the UUIDs where any domain is present in any other row's domains column. For example, rows 0 and 1 both contain foo.ca, and rows 1 and 2 both contain foo.fr, so all three should be grouped together.
My data set has millions of rows, so I can't brute-force it.
CodePudding user response:
We can use explode first, then build a graph with networkx and take its connected components:
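import networkx as nx
import pandas as pd
# one row per (UUID, domain) pair
s = df.explode('domains')
G = nx.from_pandas_edgelist(s, 'UUID', 'domains')
# build the domain set once so the membership test is O(1)
domains = set(s['domains'])
out = pd.Series([[n for n in comp if n not in domains] for comp in nx.connected_components(G)])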
Out[209]:
0 [xyz, jkl, asd]
1 [iek, qkr]
2 [kij]
dtype: object
CodePudding user response:
Assuming the following input with domains as lists:
import pandas as pd
df = pd.DataFrame({
    'UUID': ['asd', 'jkl', 'xyz', 'iek', 'qkr', 'kij'],
    'domains': [['foo.com', 'foo.ca'], ['foo.ca', 'foo.fr'], ['foo.fr'], ['bar.com', 'bar.org'], ['bar.org'], ['buzz.net']],
})
Your problem is a graph problem: you want to find the roots (here, the UUID nodes) of each disconnected subgraph.
This is easily achieved with networkx:
# transform the dataframe into a directed graph (UUID -> domain)
import networkx as nx
G = nx.from_pandas_edgelist(df.explode('domains'),
                            source='UUID', target='domains',
                            create_using=nx.DiGraph)
# split into weakly connected subgraphs and keep the roots (in-degree 0),
# which are the UUIDs; the output is a generator
groups = ([n for n, g in G.subgraph(c).in_degree if g == 0]
          for c in nx.weakly_connected_components(G))
# transform the generator into a Series
s = pd.Series(groups)
output:
0 [asd, jkl, xyz]
1 [iek, qkr]
2 [kij]
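If building a networkx graph is too heavy for millions of rows, the same connected-components idea can be sketched with a union-find (disjoint-set) structure in plain Python. This is a different technique from the networkx code above, not a drop-in copy of it, and it is only a minimal sketch: the function name group_by_shared_domain and its (uuid, domains) input shape are my own assumptions, and I have not benchmarked it against networkx.

```python
from collections import defaultdict


def group_by_shared_domain(rows):
    """Group UUIDs that share a domain, directly or through a chain of rows.

    `rows` is an iterable of (uuid, domains) pairs (hypothetical input shape).
    """
    parent = {}  # uuid -> parent uuid in the disjoint-set forest

    def find(x):
        # Walk up to the representative of x's set...
        root = x
        while parent[root] != root:
            root = parent[root]
        # ...then point every node on the path straight at it.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    first_seen = {}  # domain -> first UUID seen with that domain
    for uuid, domains in rows:
        parent.setdefault(uuid, uuid)
        for domain in domains:
            if domain in first_seen:
                # This domain links the current UUID to an earlier one.
                a, b = find(uuid), find(first_seen[domain])
                if a != b:
                    parent[a] = b
            else:
                first_seen[domain] = uuid

    # Collect the members of each set, in order of first appearance.
    groups = defaultdict(list)
    for uuid in parent:
        groups[find(uuid)].append(uuid)
    return list(groups.values())


rows = [
    ('asd', ['foo.com', 'foo.ca']),
    ('jkl', ['foo.ca', 'foo.fr']),
    ('xyz', ['foo.fr']),
    ('iek', ['bar.com', 'bar.org']),
    ('qkr', ['bar.org']),
    ('kij', ['buzz.net']),
]
print(group_by_shared_domain(rows))
# [['asd', 'jkl', 'xyz'], ['iek', 'qkr'], ['kij']]
```

With the DataFrame from the question, the input could be built as zip(df['UUID'], df['domains']) and the result wrapped in pd.Series(...) to get the same shape as the output above. It makes a single pass over the rows and never materialises the exploded frame, which may help with memory on large inputs.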