Home > Enterprise >  Clean alternative to pandas' broken crosstab for NA values
Clean alternative to pandas' broken crosstab for NA values

Time:12-02

I'm trying to get a function that works similarly to R's table function, for which the parameter useNA allows me to include NA values in the cross table.

Here's a small example:

df = pd.DataFrame({"a": [0, 1, pd.NA, pd.NA], "b":[2, pd.NA, 3, pd.NA]})
print(pd.crosstab(df["a"], df["b"], dropna=False)

What I get from that is

b  2  3
a      
0  1  0

But I'd want it to be something like

b   2  3  NA
a      
0   1  0  0
1   0  0  1
NA  0  1  1

Not only does this ignore 3 quarters of the lines in the dataframe, the result is also dependant on the order the two series where inserted in, here's pd.crosstab(df["b"], df["a"], dropna=False):

a  0  1
b      
2  1  0

A workaround I can think of would be to look at the unique values in the two series and create a new value that is in neither of them and use it to temporarily replace NA values with it using fillna, but this feels very botchy and I'd be surprised if there weren't already something that cleanly does exactly what I'm looking for.

Plus, that solution wouldn't work as intended in situations where one of the two series had no missing data.

Edit: Adding an example to illustrate the last part.

df = pd.DataFrame({"a": [0, 1, 2, 3], "b":[2, pd.NA, 3, pd.NA]})
print(pd.crosstab(df["a"].fillna("NA"), df["b"].fillna("NA"), dropna=False)

Outputs:

b  2  3  NA
a          
0  1  0   0
1  0  0   1
2  0  1   0
3  0  0   1

Expected:

b   2  3  NA
a          
0   1  0   0
1   0  0   1
2   0  1   0
3   0  0   1
NA  0  0   0

CodePudding user response:

You can replace missing values to NA:

print(pd.crosstab(t["a"].fillna('NA'), t["b"].fillna('NA')))
b   2  3  NA
a           
0   1  0   0
1   0  0   1
NA  0  1   1

EDIT: Add new row filled by NA and then subtract 1 from intersection NA, NA:

t = pd.DataFrame({"a": [0, 1, 2, 3], "b":[2, pd.NA, 3, pd.NA]})

df = t.append(pd.DataFrame('NA', index=[-1], columns=t.columns)).fillna('NA')
df = pd.crosstab(df["a"], df["b"])
df.loc['NA','NA'] -= 1
print(df)
b   2  3  NA
a           
0   1  0   0
1   0  0   1
2   0  1   0
3   0  0   1
NA  0  0   0

With groupby.size stack is possible use:

t = pd.DataFrame({"a": [0, 1, 2, 3], "b":[2, pd.NA, 3, pd.NA]})

df = t.append(pd.DataFrame(np.nan, index=[-1], columns=t.columns))
df = df.groupby(['a', 'b'], dropna = False).size().unstack(fill_value=0)
df.loc[np.nan,np.nan] -= 1
print(df)
b    2.0  3.0  NaN
a                 
0.0    1    0    0
1.0    0    0    1
2.0    0    1    0
3.0    0    0    1
NaN    0    0    0

CodePudding user response:

crosstab is a convenience option, wrapped around pd.pivot_table; you could go straight to groupby (pd.pivot_table is a wrapper around groupby) and replicate your output:

df.groupby(['a', 'b'], dropna = False).size().unstack(fill_value=0)

b    2.0  3.0  NaN
a                 
0.0    1    0    0
1.0    0    0    1
NaN    0    1    1

It would also be helpful, if you could maybe make a PR on pandas to improve the crosstab functionality

  • Related