The data I'm analyzing has a structure similar to this one:
df = pd.DataFrame(
{
"group": ["group1", "group1", "group2", "group2", "group2", "group3", "group3", "group3", "group4", "group4", "group4", "group4", "group5", "group5"],
"letter": ["B1", "B2", "B1", "B2", "B3", "B1", "B2", "B4", "B2", "B1", "B3", "B4", "B3", "B4"]
})
I want to get the possible combination of elements within each group, where the order is not important. For example, for group2 I would like to obtain something like this:
group letter_x letter_y
group2 B1 B2
group2 B1 B3
group2 B2 B3
I've merged the df with itself to obtain the combinations and then got rid of the equal values (e.g. B1B1) this way:
df_merge = df.merge(df, left_on='group', right_on='group', how="outer")
df_merge = df_merge[df_merge['letter_x'] != df_merge['letter_y']]
But I haven't been able to get rid of the symmetric pairs, meaning that, for example, for group2 I obtain:
group letter_x letter_y
group2 B1 B2
group2 B1 B3
group2 B2 B1
group2 B2 B3
group2 B3 B1
group2 B3 B2
Any ideas?
On the other hand, I want to get the occurrences of each pair across groups in a different dataframe. For instance:
letter_x letter_y count groups
B1 B2 4 (group1,group2,group3,group4)
B1 B3 2 (group2,group4)
B1 B4 2 (group3,group4)
...
My intention was to apply something like this:
df_overlap = df_merge.groupby(['letter_x', 'letter_y']).agg(lambda x: tuple(x)).reset_index()
And then obtain the counts with the length of the "groups" list in an old-fashioned way using a "for".
For this part, I worry that some elements in the "letter" column will have the mirror pair (as B2B1 in group4) and thus they will have two different counts, whilst I need them to be accounted for as the same.
Any solution for that? I am open to more efficient ways to do this as well. Thanks!
CodePudding user response:
IIUC you are looking for combinations
:
from itertools import combinations
out = (df.sort_values(["group", "letter"])
.groupby("group")["letter"]
.apply(lambda d: pd.DataFrame(combinations(d, 2), columns=["letter_x", "letter_y"]))
.droplevel(1).reset_index())
print (out.groupby(["letter_x","letter_y"])["group"].agg(["count", tuple]).reset_index())
letter_x letter_y count tuple
0 B1 B2 4 (group1, group2, group3, group4)
1 B1 B3 2 (group2, group4)
2 B1 B4 2 (group3, group4)
3 B2 B3 2 (group2, group4)
4 B2 B4 2 (group3, group4)
5 B3 B4 2 (group4, group5)