So have I some code that creates a new dataframe that counts the number of unique states with at least 1 non null value and the total number of non null values grouped by year. Code works fine, but I want to modify it to include a new rows that lists the unique states.
This is my data:
year state var1 var2
0 2018 CA NaN 2
1 2018 TX 1 NaN
2 2018 FL NaN NaN
3 2018 AL 1 2
4 2018 AL NaN 1
6 2019 CA NaN NaN
7 2019 TX 1 1
8 2019 FL NaN NaN
9 2019 AL 2 1
10 2019 AK 2 NaN
This my current output:
2018 2019
var1
Number of unique states with at least 1 non-null: 2 3
Number of respondents with non-null var: 2 3
Average: 1 1
var2
Number of unique states with at least 1 non-null: 2 2
Number of respondents with non-null var: 3 2
Average: 1.5 1
This is the code that I'm working with:
c = df.groupby(['year', 'state']).count()
res = c.groupby('YEAR').agg([np.count_nonzero, sum]).T
res.index = res.index.set_levels(['Number of unique states with at least 1 non-null:',
'Number of respondents with non-null var:'], level=1)
z = res.swaplevel().T
res4 = pd.concat([z, pd.concat([z['Number of respondents with non-null var:'] /
z['Number of unique states with at least 1 non-null:']],
keys=['Average:'], axis=1),],
axis=1).T.swaplevel().sort_index()
And this is what I want the new output to look like:
2018 2019
var1
Number of unique states with at least 1 non-null: 2 3
Unique states with at least 1 non-null: [TX, AL] [TX, AL, AK]
Number of respondents with non-null var: 2 3
Average: 1 1
var2
Number of unique states with at least 1 non-null: 2 2
Unique states with at least 1 non-null: [AL, CA] [TX, AL]
Number of respondents with non-null var: 3 2
Average: 1.5 1
Basically I want this row included "Unique states with at least 1 non-null:", listing the names of the states. How can I do this?
CodePudding user response:
I create new function f
for states
and also aggregate mean
for label in MultiIndex, values are set by division selected rows by DataFrame.xs
, last rename
for new second level of MultiIndex
:
c = df.groupby(['year', 'state']).count()
def f(x):
return x.index[x.ne(0)].droplevel(0).tolist()
df = c.groupby(['year']).agg([np.count_nonzero,f,'sum', 'mean']).T
df11 = df.xs('sum', level=1, drop_level=False).div(df.xs('count_nonzero', level=1), level=0)
df.loc[pd.IndexSlice[:,'mean'],:] = df11.rename({'sum':'mean'}).astype(np.float64).round(1)
d = {'count_nonzero':'Number of unique states with at least 1 non-null:',
'sum':'Number of respondents with non-null var:',
'f':'Unique states with at least 1 non-null',
'mean':'Average:'}
df = df.rename(d)
print (df)
year 2018 2019
var1 Number of unique states with at least 1 non-null: 2 3
Unique states with at least 1 non-null [AL, TX] [AK, AL, TX]
Number of respondents with non-null var: 2 3
Average: 1.0 1.0
var2 Number of unique states with at least 1 non-null: 2 2
Unique states with at least 1 non-null [AL, CA] [AL, TX]
Number of respondents with non-null var: 3 2
Average: 1.5 1.0