pandas dataframe groupby get count by multi column condition-CodePudding

I have a excel like this:

year	month	day	time	a	b	c
2020	1	1	00：00	12	23	12
2020	1	1	01：00	31	0	13
2020	1	1	...	...	...	14
2020	1	1	23:00	...	...	23
2020	...	...	...	...	...	24
2020	12	31	23:00	32	0	35
2021	1	1	01：00	31	0	46
2021	1	1	...	...	...	56
2021	1	1	23:00	...	...	45
2021	...	...	...	...	...	34
2021	12	31	23:00	32	0	45

I wanna get data: the sum of c for the row that a >= 30 and b==0 group by year notice that as long as there's time that fulfill a >= 30 and b==0，we can count this day

I wanna use pandas dataframe to implement this, can anyone help? I'm quite new to python

CodePudding user response：

First filter rows by condition in boolean indexing and then aggregate sum for column c:

df['datetime'] = pd.to_datetime(df[['year','month','day']])
df1 = df[(df.a>=30) & (df.b==0)].groupby('datetime', as_index=False)['c'].sum()

CodePudding user response：

First of all, one will need to read the data.

If by "excel" you mean a .xlsx file, considering that the file is named data.xlsx, read the file into a dataframe with

df1 = pd.read_excel('data.xlsx')

If by "excel" you mean a .csv file, assuming the file is named data.csv, read the file into a dataframe with

df1 = pd.read_csv('data.csv')

Once you have the Excel data into a dataframe, now it is time to run queries on top of that.

In order to get a dataframe with only the rows where a>= 30 and b== 0, do the following

df2 = df1[(df1['a'] >= 30) & (df1['b'] == 0)]

Now, considering that you want to order by year, month and day (as you mentioned), one can generate a new column in the dataframe

df3['date'] = pd.to_datetime(df2[['year', 'month', 'day']])

Then, using the new dataframe, that was filtered above and to which we added a new column, one can group by variable/column date and sum the values of the column c with

df4['sum_c'] = df3.groupby('date')['c'].sum()

The final dataframe would be df4, and to print it

print(df4)

Note: To make it easier for you to understand, anytime one is doing a major change, one is creating a new dataframe.