Home > front end >  Find the categories that frequently occur together based on another column
Find the categories that frequently occur together based on another column

Time:01-04

Consider that I have the following data in a Pandas dataframe:

Paper ID Author ID
Paper_1 Author_1
Paper_1 Author_2
Paper_2 Author_2
Paper_3 Author_1
Paper_3 Author_2
Paper_3 Author_3
Paper_4 Author_1
Paper_4 Author_3

I need to find the number of non-zero collaborations. So, the output should be:
(Author_1,Author_2) --> 2
(Author_1,Author_3) --> 1

Any help or advice will be greatly appreciated.

CodePudding user response:

If the data is fairly small, then merging on the Paper ID will generate pairs that can be collapsed/aggregated:

# assume df has columns Paper ID, Author ID
df_merged = df.merge(df, on="Paper ID")

# keep only one instance of a collaboration
mask = df_merged["Author ID_x"] > df_merged["Author ID_y"]

# aggregate (note the use of the mask to avoid double-
# counting and self-collaborations as noted in the
# comment by Riccardo Bucco)
counts = (
    df_merged[mask]
    .groupby(["Author ID_x", "Author ID_y"])
    .agg(collaboration_count=("Paper ID", "count"))
)
  • Related