Consider that I have the following data in a Pandas dataframe:
Paper ID | Author ID |
---|---|
Paper_1 | Author_1 |
Paper_1 | Author_2 |
Paper_2 | Author_2 |
Paper_3 | Author_1 |
Paper_3 | Author_2 |
Paper_3 | Author_3 |
Paper_4 | Author_1 |
Paper_4 | Author_3 |
I need to find the number of non-zero collaborations. So, the output should be:
(Author_1,Author_2) --> 2
(Author_1,Author_3) --> 1
Any help or advice will be greatly appreciated.
CodePudding user response:
If the data is fairly small, then merging on the Paper ID
will generate pairs that can be collapsed/aggregated:
# assume df has columns Paper ID, Author ID
df_merged = df.merge(df, on="Paper ID")
# keep only one instance of a collaboration
mask = df_merged["Author ID_x"] > df_merged["Author ID_y"]
# aggregate (note the use of the mask to avoid double-
# counting and self-collaborations as noted in the
# comment by Riccardo Bucco)
counts = (
df_merged[mask]
.groupby(["Author ID_x", "Author ID_y"])
.agg(collaboration_count=("Paper ID", "count"))
)