Home > other >  Python pandas complex group by
Python pandas complex group by

Time:10-17

I have a Pandas data frame (df) with 3 columns: "Drug" , "Affect" , "Journal" Every row is a scientific paper and each paper states what effect the drug has on the disease. Example:

       Drug      Affect    Journal
       aspirin  downregulate  paper1
       aspririn downregulate  paper2
       aspirin  upregulate    paper3
       aspirin  neutral       paper4
       aspirin  downregulate  paper5
       aspririn upregulate    paper6
       iboruprofen  upregulate   paper7
       iboruprofen  upregulate   paper8
       iboruprofen  downregulate paper9
       other_drug     ...            ...

The output should look like this:

Drug
aspirin

I want to make a filter of the df to get only the drugs that in the majority of the papers show to downregulate the disease, so if aspirin for example has 3 papers that tell that it downregulates and 2 papers that say that it upregulates and 1 that says that it has a neutral effect then we assume that the majority of the evidence show that the effect is to downregulate. So I want to have aspirin in my filtered df. Therefore, iboruprofen should not make it to my filtered df.

How do I make the group by? Thanks for your help!

CodePudding user response:

Group by drug and the affect to count the number of records.

df_grouped = df.groupby(["Drug","Affect"], as_index=False).count()

The output:

    Drug         Affect        Journal
0   aspirin      downregulate    3
1   aspirin      neutral         1 
2   aspirin      upregulate      2
3   iboruprofen  downregulate    1
4   iboruprofen  upregulate      2

Loop over the unique drug list and check the numbers. If downregulate > upregulate, add this drug to result list.

result_drugs = []
for drug in np.unique(df["Drug"]:
    numberOfdownregulate = df_grouped[(df_grouped["Drug"] == drug) & (df_grouped["Affect"] == "downregulate")].Journal.values[0]
    numberOfupregulate = df_grouped[(df_grouped["Drug"] == drug) & (df_grouped["Affect"] == "upregulate")].Journal.values[0]
    result = numberOfdownregulate - numberOfupregulate
    
    if result > 0:
        result_drugs.append(drug)

Finally, filter your original dataframe.

df[df["Drug"].isin(result_drugs)]

If you only want to have dataframe with the desired drug values.

pd.DataFrame(result_drugs, columns=["Drug"])

Final output:

    Drug
0   aspirin
  • Related