Home > OS >  Combine elements within a group and get number of occurences across groups using pandas
Combine elements within a group and get number of occurences across groups using pandas

Time:10-15

The data I'm analyzing has a structure similar to this one:

df = pd.DataFrame(
    {
        "group": ["group1", "group1", "group2", "group2", "group2", "group3", "group3", "group3", "group4", "group4", "group4", "group4", "group5", "group5"],
        "letter": ["B1", "B2", "B1", "B2", "B3", "B1", "B2", "B4", "B2", "B1", "B3", "B4", "B3", "B4"]
    })

I want to get the possible combination of elements within each group, where the order is not important. For example, for group2 I would like to obtain something like this:

group   letter_x  letter_y
group2     B1       B2
group2     B1       B3
group2     B2       B3

I've merged the df with itself to obtain the combinations and then got rid of the equal values (e.g. B1B1) this way:

df_merge = df.merge(df, left_on='group', right_on='group', how="outer")
df_merge = df_merge[df_merge['letter_x'] != df_merge['letter_y']]

But I haven't been able to get rid of the symmetric pairs, meaning that, for example, for group2 I obtain:

group   letter_x  letter_y
group2     B1       B2
group2     B1       B3
group2     B2       B1
group2     B2       B3
group2     B3       B1
group2     B3       B2

Any ideas?

On the other hand, I want to get the occurrences of each pair across groups in a different dataframe. For instance:

letter_x  letter_y  count  groups
  B1        B2        4    (group1,group2,group3,group4)
  B1        B3        2    (group2,group4)
  B1        B4        2    (group3,group4)
  ...

My intention was to apply something like this:

df_overlap = df_merge.groupby(['letter_x', 'letter_y']).agg(lambda x: tuple(x)).reset_index()

And then obtain the counts with the length of the "groups" list in an old-fashioned way using a "for".

For this part, I worry that some elements in the "letter" column will have the mirror pair (as B2B1 in group4) and thus they will have two different counts, whilst I need them to be accounted for as the same.

Any solution for that? I am open to more efficient ways to do this as well. Thanks!

CodePudding user response:

IIUC you are looking for combinations:

from itertools import combinations

out = (df.sort_values(["group", "letter"])
         .groupby("group")["letter"]
         .apply(lambda d: pd.DataFrame(combinations(d, 2),  columns=["letter_x", "letter_y"]))
         .droplevel(1).reset_index())

print (out.groupby(["letter_x","letter_y"])["group"].agg(["count", tuple]).reset_index())

  letter_x letter_y  count                             tuple
0       B1       B2      4  (group1, group2, group3, group4)
1       B1       B3      2                  (group2, group4)
2       B1       B4      2                  (group3, group4)
3       B2       B3      2                  (group2, group4)
4       B2       B4      2                  (group3, group4)
5       B3       B4      2                  (group4, group5)
  • Related