Pandas pivot_table and categorical variables as values-CodePudding

I am puzzled by how pivot_table works with categorical variables used as values.

I did search the web for pandas pivot_table categorical variables and I do find some information, but nothing that really explains to my why I see what I see.

Test dataframe:

test_df = pd.DataFrame.from_dict({'val': ['pass','pass','fail','pass'], "col_a": ['a','b','a','b'], "col_b": ['x','x','y','y']})
test_df
    val col_a   col_b
0   pass    a   x
1   pass    b   x
2   fail    a   y
3   pass    b   y

Then I proceed to reshape it. I come from R/data.tables where this would be a cast.

test_df.pivot_table(index = "col_a", columns = "col_b", values = 'val')

and I get this:

/tmp/ipykernel_153608/3910840210.py:1: FutureWarning:

Dropping invalid columns in DataFrameGroupBy.mean is deprecated. In a future version, a TypeError will be raised. Before calling .mean, select only columns which should be valid for the function.

col_b
col_a
a
b

Empty result, but with indices. After doing a million tests (on my real object, such as testing if there were duplicated values, NAs, etc), this seems to work:

test_df.pivot_table(index = "col_a", columns = "col_b", values = 'val', aggfunc=lambda x: x)
col_b   x   y
col_a       
a   pass    fail
b   pass    pass

which is what I want. Besides the "dude, if it works, take it and be happy", does anyone know why I have to put the aggregation function?

CodePudding user response：

A pivot_table exists in order to aggregate data. If you just want to pivot the data and not aggregate it, then use pivot:

test_df.pivot(index='col_a', columns='col_b')

        val      
col_b     x     y
col_a            
a      pass  fail
b      pass  pass

In your actual data, IF there are duplicates, then you would need to aggregate data and use pivot_table with first() or something.