I have a excel like this:
year | month | day | time | a | b | c |
---|---|---|---|---|---|---|
2020 | 1 | 1 | 00:00 | 12 | 23 | 12 |
2020 | 1 | 1 | 01:00 | 31 | 0 | 13 |
2020 | 1 | 1 | ... | ... | ... | 14 |
2020 | 1 | 1 | 23:00 | ... | ... | 23 |
2020 | ... | ... | ... | ... | ... | 24 |
2020 | 12 | 31 | 23:00 | 32 | 0 | 35 |
2021 | 1 | 1 | 01:00 | 31 | 0 | 46 |
2021 | 1 | 1 | ... | ... | ... | 56 |
2021 | 1 | 1 | 23:00 | ... | ... | 45 |
2021 | ... | ... | ... | ... | ... | 34 |
2021 | 12 | 31 | 23:00 | 32 | 0 | 45 |
I wanna get data: the sum of c for the row that a >= 30 and b==0 group by year notice that as long as there's time that fulfill a >= 30 and b==0,we can count this day
I wanna use pandas dataframe to implement this, can anyone help? I'm quite new to python
CodePudding user response:
First filter rows by condition in boolean indexing
and then aggregate sum
for column c
:
df['datetime'] = pd.to_datetime(df[['year','month','day']])
df1 = df[(df.a>=30) & (df.b==0)].groupby('datetime', as_index=False)['c'].sum()
CodePudding user response:
First of all, one will need to read the data.
If by "excel" you mean a .xlsx
file, considering that the file is named data.xlsx
, read the file into a dataframe with
df1 = pd.read_excel('data.xlsx')
If by "excel" you mean a .csv
file, assuming the file is named data.csv
, read the file into a dataframe with
df1 = pd.read_csv('data.csv')
Once you have the Excel data into a dataframe, now it is time to run queries on top of that.
In order to get a dataframe with only the rows where a>= 30 and b== 0, do the following
df2 = df1[(df1['a'] >= 30) & (df1['b'] == 0)]
Now, considering that you want to order by year, month and day (as you mentioned), one can generate a new column in the dataframe
df3['date'] = pd.to_datetime(df2[['year', 'month', 'day']])
Then, using the new dataframe, that was filtered above and to which we added a new column, one can group by variable/column date
and sum the values of the column c with
df4['sum_c'] = df3.groupby('date')['c'].sum()
The final dataframe would be df4
, and to print it
print(df4)
Note: To make it easier for you to understand, anytime one is doing a major change, one is creating a new dataframe.