How do you Pivot a Large DataFrame in Pandas?-CodePudding

So, I'm having serious issues pivoting a very large dataframe (9,296,848 rows with ~100,000 articles) using pandas .pivot(). I'm unsure how to proceed from here. I've had a look at similar queries, but the suggestions either didn't work or seem very convoluted. Is there a way to do this by .groupby() or using some much quicker alternative?

I have a dataframe that looks like this:

   customer article quantity
0  A        p       3
1  B        q       5
2  C        r       2

I'd like it to end up as:

   customer p    q    r
0  A        3    NaN  NaN
1  B        NaN  5    NaN
2  C        NaN  NaN  2

One thing I tried already was using:

df = df.set_index(['customer_id', 'article_id'])['quantity'].unstack('article_id')

But it didn't work. It throws up the same error as with .pivot(): Unstacked DataFrame is too big, causing int32 overflow.

Any suggestions very much welcome!

Thanks!

CodePudding user response：

I don't know if this will run, but you can try this alternative with pandas.get_dummie:

df[['customer']].join(pd.get_dummies(df['article']).mul(df['quantity'], axis=0))

NB. You can try to add sparse=True as parameter to get_dummies.

Output:

  customer  p  q  r
0        A  3  0  0
1        B  0  5  0
2        C  0  0  2

CodePudding user response：

df.pivot(index='customer', columns='article', values='quantity').rename_axis(columns='').reset_index()

Output:

  customer    p    q    r
0        A  3.0  NaN  NaN
1        B  NaN  5.0  NaN
2        C  NaN  NaN  2.0

And for the Get Dummies method:

pd.concat([df.customer, pd.get_dummies(df.article).replace([0,1], [np.nan, df.quantity])], axis=1)

Output:

  customer    p    q    r
0        A  3.0  NaN  NaN
1        B  NaN  5.0  NaN
2        C  NaN  NaN  2.0