I have a table:
country | ID | type |
---|---|---|
US | a1 | upload |
US | a1 | delete |
US | a1 | delete |
US | a2 | upload |
US | a2 | upload |
US | a2 | delete |
IT | k2 | upload |
IT | k2 | upload |
IT | k2 | delete |
I need to get a data frame with the unique countries, the count of unique IDs in each country, and the average percentage of the "delete" value in `type` among each ID in the country. So the desired table must look like:
country | ID_cnt | delete_perc |
---|---|---|
US | 2 | 0.5 |
IT | 1 | 0.33 |
US's value is 0.5 since for ID a1 it's 0.66 and for a2 it's 0.33, so (0.66 + 0.33) / 2 = 0.5.
How to do that with Python?
CodePudding user response:
Use `groupby` with `agg`:

```python
>>> df.assign(delete_perc=df['type'].eq('delete')).groupby('country', as_index=False).agg({'ID': 'nunique', 'delete_perc': 'mean'})
  country  ID  delete_perc
0      IT   1     0.333333
1      US   2     0.500000
```
Or with `value_counts` and `normalize=True`:

```python
>>> df.groupby('country', as_index=False).agg(ID=('ID', 'nunique'), delete_perc=('type', lambda x: x.value_counts(normalize=True)['delete']))
  country  ID  delete_perc
0      IT   1     0.333333
1      US   2     0.500000
```

(Note that the lambda raises a `KeyError` for any country with no `delete` rows.)
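Both snippets above compute the delete fraction over all of a country's rows at once; that equals the per-ID average the question describes only because every ID here happens to have the same number of rows (3). A sketch of the literal per-ID computation, assuming the same `df` (the intermediate column name `is_delete` is mine):

```python
import pandas as pd

# Rebuild the example table from the question.
df = pd.DataFrame({
    'country': ['US', 'US', 'US', 'US', 'US', 'US', 'IT', 'IT', 'IT'],
    'ID':      ['a1', 'a1', 'a1', 'a2', 'a2', 'a2', 'k2', 'k2', 'k2'],
    'type':    ['upload', 'delete', 'delete', 'upload', 'upload', 'delete',
                'upload', 'upload', 'delete'],
})

# Step 1: delete fraction per (country, ID) pair.
per_id = (
    df.assign(is_delete=df['type'].eq('delete'))
      .groupby(['country', 'ID'], as_index=False)['is_delete']
      .mean()
)

# Step 2: count the IDs and average the per-ID fractions within each country.
result = per_id.groupby('country', as_index=False).agg(
    ID_cnt=('ID', 'nunique'),
    delete_perc=('is_delete', 'mean'),
)
```

With unequal group sizes per ID, this two-step version and the country-level mean would diverge.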
CodePudding user response:
For table manipulation in Python, we often use the pandas package. You first need to create a `pandas.DataFrame` object and then, for your use case, apply the `groupby` and `agg` operations. Here is a complete example:
```python
from io import StringIO

import pandas as pd

data = """country ID type
US a1 upload
US a1 delete
US a1 delete
US a2 upload
US a2 upload
US a2 delete
IT k2 upload
IT k2 upload
IT k2 delete
"""
df = pd.read_csv(StringIO(data), sep=" ")

country_stats = (
    df
    .groupby('country', as_index=False)
    .agg(
        ID=('ID', 'nunique'),
        delete_perc=('type', lambda x: x.value_counts(normalize=True)['delete']))
)
```
The `country_stats` DataFrame now contains the desired output:

```
  country  ID  delete_perc
0      IT   1     0.333333
1      US   2     0.500000
```
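The desired table in the question labels the count column `ID_cnt` rather than `ID`; the same pipeline can produce that name directly by changing the keyword passed to `agg` (a small variation on the code above):

```python
from io import StringIO

import pandas as pd

data = """country ID type
US a1 upload
US a1 delete
US a1 delete
US a2 upload
US a2 upload
US a2 delete
IT k2 upload
IT k2 upload
IT k2 delete
"""
df = pd.read_csv(StringIO(data), sep=" ")

# Name the aggregated columns exactly as in the desired table.
country_stats = (
    df
    .groupby('country', as_index=False)
    .agg(
        ID_cnt=('ID', 'nunique'),
        delete_perc=('type', lambda x: x.value_counts(normalize=True)['delete']))
)
```

In named aggregation, each keyword becomes the output column name, so no separate `rename` call is needed.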