Finding non-unique rows in Pandas Dataframe


Say I have a pandas dataframe like this:

Doctor  Patient  Days
Aaron   Jeff     23
Aaron   Josh     46
Aaron   Josh     71
Jess    Manny    55
Jess    Manny    85
Jess    Manny    46
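
For reference, this example dataframe can be built directly (named df_1 to match the code further down):

    import pandas as pd

    # the example data shown above
    df_1 = pd.DataFrame({
        'Doctor':  ['Aaron', 'Aaron', 'Aaron', 'Jess', 'Jess', 'Jess'],
        'Patient': ['Jeff', 'Josh', 'Josh', 'Manny', 'Manny', 'Manny'],
        'Days':    [23, 46, 71, 55, 85, 46],
    })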

I want to extract the sub-dataframes in which a combination of a doctor and a patient occurs more than once, since I will be doing further work on each extracted dataframe.

So, for instance, in this example, dataframe

Doctor  Patient  Days
Aaron   Josh     46
Aaron   Josh     71

would be extracted AND dataframe

Doctor  Patient  Days
Jess    Manny    55
Jess    Manny    85
Jess    Manny    46

would be extracted.

In accordance with my condition, dataframe

Doctor  Patient  Days
Aaron   Jeff     23

will not be extracted because the combination of Aaron and Jeff occurs only once.

Now, my actual dataframe has 400,000 rows, and the code I have written so far is, I think, inefficient at extracting the dataframes I want. Here is the code:

    doctors = list(df_1.Doctor.unique()) # df_1 being the dataframe with 400K rows
    for doctor in doctors:
        df_2 = df_1[df_1['Doctor'] == doctor] # extract one sub-dataframe per doctor
        patients = list(df_2.Patient.unique())
        for patient in patients:
            df_3 = df_2[df_2['Patient'] == patient] # extract one sub-sub-dataframe per doctor and patient
            if len(df_3) >= 2:
                pass # do something with df_3

As you can see, this is already verging on O(n^2) runtime (I say verging because there are not 400K unique values in each column). Is there a way to minimize the runtime? If so, how can my code be improved?

Thanks!

Umesh

CodePudding user response:

You can do this with groupby:

d = {x: y for x, y in df.groupby(['Doctor', 'Patient']) if len(y) > 1}
d
Out[36]: 
{('Aaron', 'Josh'):   Doctor Patient  Days
 1  Aaron    Josh    46
 2  Aaron    Josh    71, ('Jess', 'Manny'):   Doctor Patient  Days
 3   Jess   Manny    55
 4   Jess   Manny    85
 5   Jess   Manny    46}
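
Each value in d is one of the sub-dataframes you asked for, so the follow-up work can be a plain loop over it (a minimal sketch; the print is only a placeholder for your real processing):

for (doctor, patient), group in d.items():
    # 'group' is the sub-dataframe for one repeated (doctor, patient) pair
    print(doctor, patient, len(group))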

CodePudding user response:

You can use pd.DataFrame.duplicated, like so: df.loc[df.duplicated()].

This selects rows where all values are duplicated. To check only specific columns, set the subset parameter. Note that by default the first occurrence of each duplicate is not marked, so pass keep=False to keep every row of a repeated pair:

rows = df.loc[df.duplicated(subset=['Doctor', 'Patient'], keep=False)]
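
If you then need one sub-dataframe per pair for the further work, you can group the filtered rows (a sketch; the print stands in for your real processing):

for (doctor, patient), group in rows.groupby(['Doctor', 'Patient']):
    # every group here has at least two rows by construction
    print(doctor, patient, len(group))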

CodePudding user response:

Here is one way to do it:

# count rows per (Doctor, Patient) pair and flag pairs that occur more than once
df2 = (df.groupby(['Doctor','Patient'])['Days'].count() > 1).reset_index()
# keep only the flagged pairs
df2 = df2.drop(df2[df2['Days']==False].index)
# inner-merge back onto the original rows, then drop the helper boolean column
df.merge(df2, on=['Doctor','Patient'], suffixes=('','_y')).drop(columns='Days_y')
    Doctor  Patient     Days
0   Aaron   Josh        46
1   Aaron   Josh        71
2   Jess    Manny       55
3   Jess    Manny       85
4   Jess    Manny       46
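
A more compact variant of the same idea is to filter on the group size with transform, which skips the intermediate merge (a sketch that yields the same rows):

# group size broadcast back to every row; keep rows whose pair occurs more than once
mask = df.groupby(['Doctor', 'Patient'])['Days'].transform('size') > 1
df[mask]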