I have a dataframe:
import pandas as pd
d = {
'Country': ["Austria", "Austria", "Belgium", "USA", "USA", "USA", "USA"],
'Number2020': [15, None, 18, 20, 22, None, 30],
'Number2021': [20, 25, 18, None, None, None, 32],
}
df = pd.DataFrame(data=d)
df
Country Number2020 Number2021
0 Austria 15.0 20.0
1 Austria NaN 25.0
2 Belgium 18.0 18.0
3 USA 20.0 NaN
4 USA 22.0 NaN
5 USA NaN NaN
6 USA 30.0 32.0
and I want to sum up the nan values per each country. E.g.
Country Count_nans
Austria 1
USA 4
I have filtered the dataframe to leave only the rows with nans .
df_nan = df[df.Number2021.isna() | df.Number2020.isna()]
Country Number2020 Number2021
1 Austria NaN 25.0
3 USA 20.0 NaN
4 USA 22.0 NaN
5 USA NaN NaN
So it looks like a groupby operation? I have tried this.
nasum2021 = df_nan['Number2021'].isna().sum()
df_nan['countNames2021'] = df_nan.groupby(['Number2021'])['Number2021'].transform('count').fillna(nasum2021)
df_nan
It gives me 1 nan for Austria but 3 for the United States while it should be 4. so that is not right. In my real dataframe, I have some 10 years and around 30 countries. thank you!
CodePudding user response:
Solution for processing all columns without Country
- first convert it to index, test missing values and aggregate sum
, last sum columns:
s = df.set_index('Country').isna().groupby('Country').sum().sum(axis=1)
print (s)
Country
Austria 1
Belgium 0
USA 4
dtype: int64