I have a data frame with 60 columns, which most of them are just informations about other sold products, it looks like this:
store_id|seller_id | produtc_0_code |product_0_value |product_1_code |product_1_value
----------------------------------------------------------------------------------
12 | 1234 | a2356 | 52,20 | ae822, | 50,20
in order to reduce the amount of columns, I've been trying to put the values from the colomns in rows, something like this:
store_id |seller_id | product_ code |product_value
12 | 1234 | a2356 | 52,20
12 | 1234 | ae822 | 50,20
that means, keeping the values of store_id and seller_id, but taking the values from the products_code and product_value columns and inserting them in new rows.
CodePudding user response:
pivot_longer
from pyjanitor
can also help in this transformation; since your columns have an order to them (some end with code, others with value):
# pip install pyjanitor
import pandas as pd
import janitor as jn
df.pivot_longer(index = ['store_id', 'seller_id'],
names_to = ['product_code', 'product_value'],
names_pattern = ["code$", "value$"])
store_id seller_id product_code product_value
0 12 1234 a2356 52,20
1 12 1234 ae822, 50,20
It is also possible to reshape using pd.wide_to_long
.
First, reorder the columns with str.replace
:
pat = "(?P<start>. )_(?P<middle>. )_(?P<end>. )"
repl = lambda m : f"{m.group('start')}_{m.group('end')}_{m.group('middle')}"
df.columns = df.columns.str.replace(pat, repl, regex = True)
# or just skip the steps above and use this option from @HenryEcker
# df.columns = df.columns.str.replace(r'(. )_(. )_(. )', r'\1_\3_\2', regex=True)
Apply pd.wide_to_long
:
pd.wide_to_long(df,
stubnames = ['product_code', 'product_value'],
i = ['store_id', 'seller_id'],
j = 'num',
sep='_').droplevel('num').reset_index()
store_id seller_id product_code product_value
0 12 1234 a2356 52,20
1 12 1234 ae822, 50,20
CodePudding user response:
We can use set_index
stack
, but we first need to create a MultiIndex using str.split
:
# Set index to the columns to be unaffected
new_df = df.set_index(['store_id', 'seller_id'])
# Split remaining column into a MultiIndex
new_df.columns = new_df.columns.str.split('_', expand=True)
# Stack Middle level (the numbers) into rows
new_df = new_df.stack(1).droplevel(-1)
# Join MultiIndex Columns together
new_df.columns = new_df.columns.map('_'.join)
# Reset to default range index
new_df = new_df.reset_index()
*Note this assumes that the remaining columns have values separated by _
with a number in the second split position.
new_df
:
store_id seller_id product_code product_value
0 12 1234 a2356 52,20
1 12 1234 ae822, 50,20
Setup Used:
import pandas as pd
df = pd.DataFrame({
'store_id': [12], 'seller_id': [1234], 'product_0_code': ['a2356'],
'product_0_value': ['52,20'], 'product_1_code': ['ae822,'],
'product_1_value': ['50,20']
})