I have a df like this:
df = pd.DataFrame({
'project_id': ['61477', '61477', '3432', '3432', '3432'],
'conf.type': ['k-vison-trend', 'k-vison-trend', 'k-vison-trend', 'k-vison-trend', 'k-vison-trend']
})
It looks like this (there are 15 columns in total, I only show 2):
project_id conf.type
0 61477 k-vison-trend
1 61477 k-vison-trend
2 3432 k-vison-trend
3 3432 k-vison-trend
4 3432 k-vison-trend
I want to count the number of duplicate 'project_id's but I don't want to drop them. I want to create different dataframes based on the count of duplicate 'project_id's. If the 'project_id' occurs:
- 1 or 2 times, place the row in df1
- 3 or 4 times, place the row in df2
- 5 or 6 times, place the row in df3
- 7 or 8 times, place the row in df4
- 9 times or more, place the row in df5
I already tried to group on 'project_id' and use the size function to count the number of times each 'project_id' occurred, like this:
df.groupby('project_id').size().reset_index(name='count')
This gave me the wrong output. For example: I expected an output of '2' for the project_id '61477' but got 3268, which is unlikely.
What am I doing wrong?
CodePudding user response:
I'm not entirely sure I'm understanding your question correctly, but try the following. The dfs
object will contain a dataframe with the project_ids for each count interval.
from collections import Counter

# One bucket per occurrence interval
d = {'1-2': [],
     '3-4': [],
     '5-6': [],
     '7-8': [],
     '9+': []}

counts = Counter(df['project_id'])
print(counts)

# Sort each project_id into its interval based on how often it occurs
for project_id in counts:
    count = counts[project_id]
    if count < 3:
        d['1-2'].append(project_id)
    elif count < 5:
        d['3-4'].append(project_id)
    elif count < 7:
        d['5-6'].append(project_id)
    elif count < 9:
        d['7-8'].append(project_id)
    else:
        d['9+'].append(project_id)

print(d)

# Build one dataframe per interval, keeping all rows for the matching ids
dfs = []
for interval in d:
    dfs.append(df[df['project_id'].isin(d[interval])])

dfs[0]
Output:
project_id conf.type
0 61477 k-vison-trend
1 61477 k-vison-trend
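A variant of the same idea that keys the result by interval label instead of list position, so each frame can be looked up by name. This is a sketch with the same intervals as above; the bins dict is just one way to express them:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'project_id': ['61477', '61477', '3432', '3432', '3432'],
    'conf.type': ['k-vison-trend'] * 5,
})

counts = Counter(df['project_id'])

# Map each interval label to the counts it covers
bins = {'1-2': range(1, 3), '3-4': range(3, 5),
        '5-6': range(5, 7), '7-8': range(7, 9)}
ids_per_bin = {label: [p for p, c in counts.items() if c in rng]
               for label, rng in bins.items()}
ids_per_bin['9+'] = [p for p, c in counts.items() if c >= 9]

# One dataframe per interval, keyed by label
dfs = {label: df[df['project_id'].isin(ids)]
       for label, ids in ids_per_bin.items()}
print(dfs['1-2'])
```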
CodePudding user response:
Use groupby() and count() to get the occurrences, then merge this with the original dataframe to keep the initial number of rows.
import pandas as pd
df = pd.DataFrame({
'project_id': ['61477', '61477', '3432', '3432', '3432'],
'conf.type': ['k-vison-trend', 'k-vison-trend', 'k-vison-trend', 'k-vison-trend', 'k-vison-trend']
})
counts = df.groupby('project_id').count()     # non-null count per column, per project_id
output_df = df.merge(counts, on="project_id") # every original row keeps its count
If you like, you can filter on the merged count column (the right-most column in output_df) and assign the results to new dataframes.
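That filtering step can be sketched with transform('size') and pd.cut, so every row gets its group's count without a merge. The bin edges and the names df1..df5 come from the question; this is one possible approach, not the only one:

```python
import pandas as pd

df = pd.DataFrame({
    'project_id': ['61477', '61477', '3432', '3432', '3432'],
    'conf.type': ['k-vison-trend'] * 5,
})

# Per-row occurrence count of its project_id (same length as df)
counts = df.groupby('project_id')['project_id'].transform('size')

# Bucket the counts into the intervals 1-2, 3-4, 5-6, 7-8, 9+
labels = ['df1', 'df2', 'df3', 'df4', 'df5']
bucket = pd.cut(counts, bins=[0, 2, 4, 6, 8, float('inf')], labels=labels)

# One dataframe per bucket, keyed by label
dfs = {label: df[bucket == label] for label in labels}
```

Here '61477' occurs twice, so its rows land in dfs['df1']; '3432' occurs three times, so its rows land in dfs['df2'].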