I am puzzled by how pivot_table
works with categorical variables used as values.
I did search the web for pandas pivot_table categorical variables and I do find some information, but nothing that really explains to my why I see what I see.
Test dataframe:
test_df = pd.DataFrame.from_dict({'val': ['pass','pass','fail','pass'], "col_a": ['a','b','a','b'], "col_b": ['x','x','y','y']})
test_df
val col_a col_b
0 pass a x
1 pass b x
2 fail a y
3 pass b y
Then I proceed to reshape it. I come from R/data.tables where this would be a cast.
test_df.pivot_table(index = "col_a", columns = "col_b", values = 'val')
and I get this:
/tmp/ipykernel_153608/3910840210.py:1: FutureWarning:
Dropping invalid columns in DataFrameGroupBy.mean is deprecated. In a future version, a TypeError will be raised. Before calling .mean, select only columns which should be valid for the function.
col_b
col_a
a
b
Empty result, but with indices. After doing a million tests (on my real object, such as testing if there were duplicated values, NAs, etc), this seems to work:
test_df.pivot_table(index = "col_a", columns = "col_b", values = 'val', aggfunc=lambda x: x)
col_b x y
col_a
a pass fail
b pass pass
which is what I want. Besides the "dude, if it works, take it and be happy", does anyone know why I have to put the aggregation function?
CodePudding user response:
A pivot_table
exists in order to aggregate data. If you just want to pivot
the data and not aggregate it, then use pivot
:
test_df.pivot(index='col_a', columns='col_b')
val
col_b x y
col_a
a pass fail
b pass pass
In your actual data, IF there are duplicates, then you would need to aggregate data and use pivot_table
with first()
or something.