So, I'm having serious trouble pivoting a very large dataframe (9,296,848 rows, ~100,000 distinct articles) with pandas .pivot(), and I'm unsure how to proceed. I've looked at similar questions, but the suggestions either didn't work or seemed very convoluted. Is there a way to do this with .groupby(), or some much quicker alternative?
I have a dataframe that looks like this:
  customer article  quantity
0        A       p         3
1        B       q         5
2        C       r         2
I'd like it to end up as:
  customer    p    q    r
0        A    3  NaN  NaN
1        B  NaN    5  NaN
2        C  NaN  NaN    2
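For reference, the small example above can be reproduced with:

import pandas as pd

# toy stand-in for the real 9,296,848-row frame
df = pd.DataFrame({
    'customer': ['A', 'B', 'C'],
    'article':  ['p', 'q', 'r'],
    'quantity': [3, 5, 2],
})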
One thing I tried already was using:
df = df.set_index(['customer', 'article'])['quantity'].unstack('article')
But it didn't work. It throws up the same error as with .pivot(): Unstacked DataFrame is too big, causing int32 overflow.
Any suggestions very much welcome!
Thanks!
CodePudding user response:
I don't know if this will run, but you can try this alternative with pandas.get_dummies:
df[['customer']].join(pd.get_dummies(df['article']).mul(df['quantity'], axis=0))
NB: you can try adding sparse=True as a parameter to get_dummies (see the sketch after the output below).
Output:
  customer  p  q  r
0        A  3  0  0
1        B  0  5  0
2        C  0  0  2
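As a rough sketch of that sparse variant (untested at your scale; depending on the pandas version the multiplication may densify the result, and the dtype argument is only there to keep the arithmetic simple):

import pandas as pd

# sample frame from the question
df = pd.DataFrame({'customer': list('ABC'), 'article': list('pqr'), 'quantity': [3, 5, 2]})

# sparse one-hot columns avoid materialising ~100,000 mostly-zero article columns
dummies = pd.get_dummies(df['article'], sparse=True, dtype='int8')
out = df[['customer']].join(dummies.mul(df['quantity'], axis=0))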
CodePudding user response:
df.pivot(index='customer', columns='article', values='quantity').rename_axis(columns='').reset_index()
Output:
  customer    p    q    r
0        A  3.0  NaN  NaN
1        B  NaN  5.0  NaN
2        C  NaN  NaN  2.0
And for the get_dummies method:
pd.concat([df.customer, pd.get_dummies(df.article).replace([0,1], [np.nan, df.quantity])], axis=1)
Output:
  customer    p    q    r
0        A  3.0  NaN  NaN
1        B  NaN  5.0  NaN
2        C  NaN  NaN  2.0
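If the replace call gives trouble (it passes a Series as a replacement value, which may behave differently across pandas versions), here is a sketch of an equivalent that blanks out the zeros with where instead:

import pandas as pd

# sample frame from the question
df = pd.DataFrame({'customer': list('ABC'), 'article': list('pqr'), 'quantity': [3, 5, 2]})

dummies = pd.get_dummies(df['article'], dtype=float)
# keep the 1s, turn the 0s into NaN, then scale each row by its quantity
out = pd.concat(
    [df['customer'], dummies.where(dummies.eq(1)).mul(df['quantity'], axis=0)],
    axis=1,
)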