Home > OS >  Pandas dataframe groupby text value that occurs in two columns
Pandas dataframe groupby text value that occurs in two columns

Time:02-11

My dataframe looks like this:

     v1           v2        distance
0   be          belong      0.666667
4   increase    decrease    0.666667
9   analyze     assay       0.666667
11  bespeak     circulate   0.769231
21  induce      generate    0.800000
24  decrease    delay       0.750000
26  cause       trip        0.666667
27  isolate     distinguish 0.750000
28  give        infect      0.666667
29  result      prove       0.800000
31  describe    explain     0.714286
33  report      circulate   0.666667
36  affect      expose      0.666667
40  explain     intercede   0.705882
41  suppress    restrict    0.833333

With v1 and v2 being verbs and distance is their similarity. I want to create clusters of similar words, based on their appearance in the dataframe.

For example, the word circulate appears be similar with both bespeak and report. So I would like to have a cluster of these 3 words. Groupby doesn't help since they are string values. Can someone help?

CodePudding user response:

This seems like a graph problem.

You could try to use graph

Small function to plot the graph in jupyter:

def nxplot(G):
    from networkx.drawing.nx_agraph import to_agraph
    A = to_agraph(G)
    A.layout('dot')
    A.draw('/tmp/graph.png')
    from IPython.display import Image
    return Image(filename='/tmp/graph.png')

CodePudding user response:

The following line would select only the rows containing the string target_string:

rows = df[df.applymap(lambda element: element ==  target_string).any(axis = 1)]

Concatenate them and find the unique elements:

cluster = pd.concat([rows[['v1', 'v2']]], axis = 1).unique()

If you want to find clusters with all the words, repeat this for all the unique elements. An inefficient example:

clusters = pd.DataFrame()
for target_string in df.v1.unique():
    rows = df[df.applymap(lambda element: element ==  target_string).any(axis = 1)]
    clusters.append(pd.concat([rows[['v1', 'v2']]], axis = 1).unique())
  • Related