I have this extremely messed up dataframe:
id letter_1 letter_2 letter_3 number_1 number_2
1 abc 123
2 def ghi jkl 456
3 mno pqr 789
and basically the dataframe I expect would be:
id letter/number data
1 letter abc
1 number 123
2 letter def
2 letter ghi
2 letter jkl
2 number 456
3 letter mno
3 letter pqr
4 number 789
I thought I'd go with the letters first and then the numbers. So I have my dataframe 'data':
data = pd.DataFrame({'id':['1','2','3'],letter_1':['abc','def',''],'letter_2':['','ghi','mno'],'letter_3':['','jkl','pqr'],'number_1':['123','','789'],'number_2':['','456','']})
1- Create a column in 'Category' format by concatenating the columns 'letter_1', 'letter_2' and 'letter_3' *here I'm having difficulties with null values not being part of the category but I'm using:
data['new_col_category'] = data.apply(lambda row: row['letter_1'] "," row['letter_2'] "," row['letter_3'], axis=1).astype('category')
2- explode that column turning each combination into a new row:
import numpy as np
from itertools import chain
# return list from series of comma-separated strings
def chainer(s):
return list(chain.from_iterable(s.str.split(',')))
# calculate lengths of splits
lens = data['new_col_category'].str.split(',').map(len)
# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'id': np.repeat(data['id'], lens),
'number_1': np.repeat(data['number_1'], lens),
'number_2': np.repeat(data['number_2'], lens),
'new_col_category': chainer(data['new_col_category'])})
After that, I thought of creating the 'Letter/Number' column and assigning everything as 'Letter'. Then repeat the whole process using the columns of numbers and at the end assign data['Letter/Number'] = 'Number'
Is that make sense? I think I'm missing something. Any help?
CodePudding user response:
Here is a way using stack
. First remove the _n from the columns names, then set_index
the column id, mask
the cell with empty string that would be remove when stack
ing the data. Then use reset_index
and rename
to fit the expected output.
# to keep original data if needed
res = data.copy()
# remove the _n from columns names
res.columns = [c.split('_')[0] for c in res.columns]
res = (
res.set_index('id')
.mask(lambda x: x=='')
.stack()
.reset_index(name='data')
.rename(columns={'level_1':'letter/number'})
)
print(res)
# id letter/number data
# 0 1 letter abc
# 1 1 number 123
# 2 2 letter def
# 3 2 letter ghi
# 4 2 letter jkl
# 5 2 number 456
# 6 3 letter mno
# 7 3 letter pqr
# 8 3 number 789