I have Pandas DataFrame like below:
data types:
ID - int
TIME - int
TG - int
ID TIME TG 111 20210101 0 111 20210201 0 111 20210301 1 222 20210101 0 222 20210201 1 333 20210201 1
And I need to aggregate above DataFrame so as to know:
- how many IDs are per each value in TIME
- how many "1" from TG are per each value in TIME
- how many "0" from TG are per each value in TIME
So I need to something like below:
TIME | num_ID | num_1 | num_0
---------|--------|-------|--------
20210101 | 2 | 0 | 2
20210201 | 3 | 2 | 1
20210301 | 1 | 1 | 0
How can I do that in Python Padas ?
CodePudding user response:
Use GroupBy.size
for counts TIME
values with crosstab
for count number of 0
and 1
values:
df1 = (df.groupby('TIME').size().to_frame('num_ID')
.join(pd.crosstab(df['TIME'], df['TG']).add_prefix('num_'))
.reset_index())
print (df1)
TIME num_ID num_0 num_1
0 20210101 2 2 0
1 20210201 3 1 2
2 20210301 1 0 1
Another idea if need count only 0
and 1
values in GroupBy.agg
:
df1 = (df.assign(num_0 = df['TG'].eq(0),
num_1 = df['TG'].eq(1))
.groupby('TIME').agg(num_ID = ('TG','size'),
num_1=('num_1','sum'),
num_0=('num_0','sum'),
)
.reset_index()
)
print (df1)
TIME num_ID num_1 num_0
0 20210101 2 0 2
1 20210201 3 2 1
2 20210301 1 1 0
CodePudding user response:
import pandas as pd
# Create the DataFrame
df = pd.DataFrame({
'ID': [111, 111, 111, 222, 222, 333],
'TIME': [20210101, 20210201, 20210301, 20210101, 20210201, 20210201],
'TG': [0, 0, 1, 0, 1, 1]
})
# Group the DataFrame by the 'TIME' column
grouped_df = df.groupby('TIME')
# Aggregate the grouped DataFrame and create a new DataFrame
# that counts the number of IDs, number of 1s and number of 0s
# for each value in the 'TIME' column
result_df = grouped_df.agg({
'ID': 'nunique', # Count the number of unique IDs
'TG':'sum'
}).rename(columns={'ID': 'num_ID', 'TG': 'num_1'})
# Calculate the number of 0s in the 'TG' column
# by subtracting the number of 1s from the total number of entries
result_df['num_0'] = grouped_df['TG'].count() - result_df['num_1']
# Reorder the columns in the result DataFrame
result_df = result_df[['num_ID', 'num_1', 'num_0']]
# Print the result DataFrame
print(result_df)
CodePudding user response:
dict1 = {'ID':pd.Series.nunique, 'TG': [lambda x: x.eq(1).sum(), lambda x: x.eq(0).sum()]}
col1 = ['num_id', 'num_1', 'num_0']
df.groupby('TIME').agg(dict1).set_axis(col1, axis=1).reset_index()
result:
TIME num_id num_1 num_0
0 20210101 2 0 2
1 20210201 3 2 1
2 20210301 1 1 0