Python - Count duplicates and sort by occurrence


I have a df like this:

df = pd.DataFrame({
    'project_id': ['61477', '61477', '3432', '3432', '3432'],
    'conf.type': ['k-vison-trend', 'k-vison-trend', 'k-vison-trend', 'k-vison-trend', 'k-vison-trend']
})

It looks like this (there are 15 columns in total, I only show 2):

        project_id     conf.type
0       61477          k-vison-trend
1       61477          k-vison-trend
2       3432           k-vison-trend
3       3432           k-vison-trend
4       3432           k-vison-trend

I want to count the number of duplicate 'project_id's but I don't want to drop them. I want to create different dataframes based on the count of duplicate 'project_id's. If the 'project_id' occurs:

  • 1 or 2 times, place the row in df1
  • 3 or 4 times, place the row in df2
  • 5 or 6 times, place the row in df3
  • 7 or 8 times, place the row in df4
  • 9 times or more, place the row in df5

I already tried to group on the 'project_id' and use the size function to count the number of times a 'project_id' occurred, like this:

df.groupby('project_id').size().reset_index(name='count')

This gave me the wrong output. For example, I expected an output of '2' for the project_id '61477' but got 3268, which is unlikely.

What am I doing wrong?

CodePudding user response:

I'm not entirely sure I'm understanding your question correctly, but try the following. The dfs object will contain one dataframe with the project_ids for each count interval.

from collections import Counter

# One bucket of project_ids per count interval
d = {'1-2': [],
     '3-4': [],
     '5-6': [],
     '7-8': [],
     '9+' : []}

counts = Counter(df['project_id'])
print(counts)

for project_id, count in counts.items():
    if count <= 2:
        d['1-2'].append(project_id)
    elif count <= 4:
        d['3-4'].append(project_id)
    elif count <= 6:
        d['5-6'].append(project_id)
    elif count <= 8:
        d['7-8'].append(project_id)
    else:
        d['9+'].append(project_id)

print(d)

dfs = []
for interval in d:
    dfs.append(df[df['project_id'].isin(d[interval])])

dfs[0]

Output:

    project_id  conf.type
0   61477   k-vison-trend
1   61477   k-vison-trend
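A more pandas-native sketch of the same bucketing, using groupby().transform('size') to attach each row's count and pd.cut to bin the counts (the bin edges here assume the last bucket means nine or more occurrences):

```python
import pandas as pd

# Sample frame mirroring the question's data
df = pd.DataFrame({
    'project_id': ['61477', '61477', '3432', '3432', '3432'],
    'conf.type': ['k-vison-trend'] * 5,
})

# How often each row's project_id occurs, without dropping any rows
counts = df.groupby('project_id')['project_id'].transform('size')

# Bin the counts into the five intervals; labels name the target frames
bins = [0, 2, 4, 6, 8, float('inf')]
labels = ['df1', 'df2', 'df3', 'df4', 'df5']
bucket = pd.cut(counts, bins=bins, labels=labels)

# One dataframe per bucket (some may be empty)
frames = {name: df[bucket == name] for name in labels}
```

With the sample data, frames['df1'] holds the two '61477' rows and frames['df2'] the three '3432' rows.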

CodePudding user response:

Use groupby() and count() to get the occurrences, then merge the result with the original dataframe to keep the initial number of rows.

import pandas as pd

df = pd.DataFrame({
    'project_id': ['61477', '61477', '3432', '3432', '3432'],
    'conf.type': ['k-vison-trend', 'k-vison-trend', 'k-vison-trend', 'k-vison-trend', 'k-vison-trend']
})

# Occurrences per project_id
counts = df.groupby('project_id').count()

# Merge back onto the original frame so no rows are dropped
output_df = df.merge(counts, on="project_id")

If you like, you can filter on the merged count column (the rightmost column in output_df) and assign the results to new dataframes.
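To illustrate that filtering step, here is a sketch that uses size() instead of count() so the merge adds a single, clearly named 'count' column (df1 and df2 follow the question's naming):

```python
import pandas as pd

df = pd.DataFrame({
    'project_id': ['61477', '61477', '3432', '3432', '3432'],
    'conf.type': ['k-vison-trend'] * 5,
})

# Occurrences per project_id as a single named column
counts = df.groupby('project_id').size().reset_index(name='count')

# Merge back so every original row keeps its count
output_df = df.merge(counts, on='project_id')

# Boolean filters on the count column give the per-interval frames
df1 = output_df[output_df['count'].between(1, 2)]
df2 = output_df[output_df['count'].between(3, 4)]
```

between() is inclusive on both ends by default, matching the "1 or 2 times" / "3 or 4 times" intervals.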
