How to aggregate 3 columns in DataFrame to have count and distribution of values in separated column


I have a Pandas DataFrame like below:

data types:

  • ID - int

  • TIME - int

  • TG - int

    ID TIME TG
    111 20210101 0
    111 20210201 0
    111 20210301 1
    222 20210101 0
    222 20210201 1
    333 20210201 1

And I need to aggregate the above DataFrame so as to know:

  1. how many IDs are per each value in TIME
  2. how many "1" from TG are per each value in TIME
  3. how many "0" from TG are per each value in TIME

So I need something like below:

TIME     | num_ID | num_1 | num_0
---------|--------|-------|--------
20210101 | 2      | 0     | 2
20210201 | 3      | 2     | 1
20210301 | 1      | 1     | 0

How can I do that in Python Pandas?

CodePudding user response:

Use GroupBy.size to count rows per TIME value, joined with crosstab to count the number of 0 and 1 values:

df1 = (df.groupby('TIME').size().to_frame('num_ID')
         .join(pd.crosstab(df['TIME'], df['TG']).add_prefix('num_'))
         .reset_index())
print(df1)
       TIME  num_ID  num_0  num_1
0  20210101       2      2      0
1  20210201       3      1      2
2  20210301       1      0      1

Another idea, if you only need to count the 0 and 1 values, is GroupBy.agg:

df1 = (df.assign(num_0=df['TG'].eq(0),
                 num_1=df['TG'].eq(1))
         .groupby('TIME').agg(num_ID=('TG', 'size'),
                              num_1=('num_1', 'sum'),
                              num_0=('num_0', 'sum'))
         .reset_index())
print(df1)
       TIME  num_ID  num_1  num_0
0  20210101       2      0      2
1  20210201       3      2      1
2  20210301       1      1      0
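As a compact variant of the crosstab idea (my own sketch, not part of the original answer): `margins=True` appends a totals row and column, and the column of row totals is exactly the per-TIME row count, so it can serve as `num_ID` directly since each ID appears at most once per TIME here.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [111, 111, 111, 222, 222, 333],
    'TIME': [20210101, 20210201, 20210301, 20210101, 20210201, 20210201],
    'TG': [0, 0, 1, 0, 1, 1]
})

# margins_name labels both the totals row and the totals column;
# drop the totals row, keep the totals column as num_ID
out = (pd.crosstab(df['TIME'], df['TG'], margins=True, margins_name='num_ID')
         .drop('num_ID')
         .rename(columns={0: 'num_0', 1: 'num_1'})
         .reset_index()[['TIME', 'num_ID', 'num_1', 'num_0']])
print(out)
```

Note that the margins row mixes a string label into the index, so after `reset_index` the TIME column has object dtype; cast it back with `astype(int)` if that matters downstream.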

CodePudding user response:

import pandas as pd

# Create the DataFrame
df = pd.DataFrame({
    'ID': [111, 111, 111, 222, 222, 333],
    'TIME': [20210101, 20210201, 20210301, 20210101, 20210201, 20210201],
    'TG': [0, 0, 1, 0, 1, 1]
})

# Group the DataFrame by the 'TIME' column
grouped_df = df.groupby('TIME')

# Aggregate the grouped DataFrame and create a new DataFrame
# that counts the number of IDs, number of 1s and number of 0s
# for each value in the 'TIME' column
result_df = grouped_df.agg({
    'ID': 'nunique',  # Count the number of unique IDs
    'TG': 'sum'       # Summing TG counts the 1s
}).rename(columns={'ID': 'num_ID', 'TG': 'num_1'})

# Calculate the number of 0s in the 'TG' column
# by subtracting the number of 1s from the total number of entries
result_df['num_0'] = grouped_df['TG'].count() - result_df['num_1']

# Reorder the columns in the result DataFrame
result_df = result_df[['num_ID', 'num_1', 'num_0']]

# Print the result DataFrame
print(result_df)
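The subtraction step can also be avoided by counting both TG values at once. A sketch of that idea (my own variant, not from the answer above) using `value_counts` and `unstack`:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [111, 111, 111, 222, 222, 333],
    'TIME': [20210101, 20210201, 20210301, 20210101, 20210201, 20210201],
    'TG': [0, 0, 1, 0, 1, 1]
})

# Count occurrences of each TG value per TIME, then pivot the TG values
# into columns; fill_value=0 covers TIME groups missing a 0 or a 1
result_df = (df.groupby('TIME')['TG'].value_counts()
               .unstack(fill_value=0)
               .rename(columns={0: 'num_0', 1: 'num_1'}))

# Add the count of unique IDs per TIME as the first column (aligned by index)
result_df.insert(0, 'num_ID', df.groupby('TIME')['ID'].nunique())
print(result_df)
```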

CodePudding user response:

dict1 = {'ID':pd.Series.nunique, 'TG': [lambda x: x.eq(1).sum(), lambda x: x.eq(0).sum()]}
col1 = ['num_id', 'num_1', 'num_0']
df.groupby('TIME').agg(dict1).set_axis(col1, axis=1).reset_index()

result:

    TIME        num_id  num_1   num_0
0   20210101    2       0       2
1   20210201    3       2       1
2   20210301    1       1       0
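The same aggregation can also be written with named aggregation (pandas >= 0.25), which gives the output columns their final names directly and avoids the `set_axis` renaming step; a sketch, assuming the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [111, 111, 111, 222, 222, 333],
    'TIME': [20210101, 20210201, 20210301, 20210101, 20210201, 20210201],
    'TG': [0, 0, 1, 0, 1, 1]
})

# Named aggregation: each keyword becomes an output column,
# defined by a (source column, aggregation function) pair
out = (df.groupby('TIME')
         .agg(num_id=('ID', 'nunique'),
              num_1=('TG', lambda s: s.eq(1).sum()),
              num_0=('TG', lambda s: s.eq(0).sum()))
         .reset_index())
print(out)
```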