Home > Blockchain >  How to find duplicates from a full data frame using pandas?
How to find duplicates from a full data frame using pandas?

Time:10-07

I have a data frame with 3 columns of classes and 5 rows of students in each class. Some of these students are duplicates. I want to list the most common student names from all the classes and list them in descending order, the number of times they exist, and which the classes they exist in.

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

   biology statistics ecology
0     ryan      sarah  austin
1    sarah         ed    ryan
2      tom      jacob     tom
3       ed       ryan     sam
4  jackson         de   sarah

I want the output look something like this:

ryan, 3 classes, (biology, statistics, ecology)
sarah, 3 classes, (biology, statistics, ecology)
tom, 2 classes, (biology, ecology)
ed, 2 classes, (biology, statistics)
jackson, 1 class, (biology)
jacob, 1 class, (statistics)
de, 1 class, (statistics)
austin, 1 class, (ecology)

...and so on

Any help would be appreciated, I'm a beginner so I have been at this for a several hours. Brain is getting killed. Thanks!

CodePudding user response:

We can melt the DataFrame to get to long form, then groupby aggregate with Named Aggregation to get both the number of classes, and the names of the classes, lastly we can sort_values to get the highest frequency students first:

output_df = (
    df.melt(var_name='class name', value_name='student name')
        .groupby('student name', as_index=False)
        .agg(class_count=('class name', 'count'),
             classes=('class name', tuple))
        .sort_values('class_count', ascending=False, ignore_index=True)
)

output_df:

  student name  class_count                         classes
0         ryan            3  (biology, statistics, ecology)
1        sarah            3  (biology, statistics, ecology)
2           ed            2           (biology, statistics)
3          tom            2              (biology, ecology)
4       austin            1                      (ecology,)
5           de            1                   (statistics,)
6      jackson            1                      (biology,)
7        jacob            1                   (statistics,)
8          sam            1                      (ecology,)

We can further conditionally add classes/class to class_count and write to_csv:

# Conditionally Add Classes/Class
output_df['class_count'] = output_df['class_count'].astype(str)   np.where(
    output_df['class_count'].eq(1),
    ' class',
    ' classes'
)
# Write to CSV
output_df.to_csv('output.csv', index=False, header=None)

output.csv:

ryan,3 classes,"('biology', 'statistics', 'ecology')"
sarah,3 classes,"('biology', 'statistics', 'ecology')"
ed,2 classes,"('biology', 'statistics')"
tom,2 classes,"('biology', 'ecology')"
austin,1 class,"('ecology',)"
de,1 class,"('statistics',)"
jackson,1 class,"('biology',)"
jacob,1 class,"('statistics',)"
sam,1 class,"('ecology',)"

Setup and imports:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

CodePudding user response:

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

results = {}
for h in df:
    for k,v in df[h].value_counts().items():
        print(k,v)
        if k in results:
            results[k]['value']  = v
            results[k]['class'].append(h)
        else:
            results[k] = {'value':v,'class':[h]}
results = {h:results[h] for h in sorted(results, key=lambda x:results[x]['value'],reverse=True)}
  • Related