Home > front end >  How to create a category column and explode that into new rows
How to create a category column and explode that into new rows

Time:03-10

I have this extremely messed up dataframe:

id     letter_1     letter_2    letter_3     number_1    number_2
1      abc                                   123         
2      def           ghi          jkl                    456
3                    mno          pqr        789         

and basically the dataframe I expect would be:

id     letter/number    data
1      letter           abc
1      number           123
2      letter           def
2      letter           ghi
2      letter           jkl
2      number           456
3      letter           mno
3      letter           pqr
4      number           789

I thought I'd go with the letters first and then the numbers. So I have my dataframe 'data':

data = pd.DataFrame({'id':['1','2','3'],letter_1':['abc','def',''],'letter_2':['','ghi','mno'],'letter_3':['','jkl','pqr'],'number_1':['123','','789'],'number_2':['','456','']})

1- Create a column in 'Category' format by concatenating the columns 'letter_1', 'letter_2' and 'letter_3' *here I'm having difficulties with null values not being part of the category but I'm using:

data['new_col_category'] = data.apply(lambda row: row['letter_1']   ","   row['letter_2']   ","   row['letter_3'], axis=1).astype('category')

2- explode that column turning each combination into a new row:

import numpy as np
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

# calculate lengths of splits
lens = data['new_col_category'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'id': np.repeat(data['id'], lens),
                    'number_1': np.repeat(data['number_1'], lens),
                    'number_2': np.repeat(data['number_2'], lens),
                    'new_col_category': chainer(data['new_col_category'])})

After that, I thought of creating the 'Letter/Number' column and assigning everything as 'Letter'. Then repeat the whole process using the columns of numbers and at the end assign data['Letter/Number'] = 'Number'

Is that make sense? I think I'm missing something. Any help?

CodePudding user response:

Here is a way using stack. First remove the _n from the columns names, then set_index the column id, mask the cell with empty string that would be remove when stacking the data. Then use reset_index and rename to fit the expected output.

# to keep original data if needed
res = data.copy()
# remove the _n from columns names
res.columns = [c.split('_')[0] for c in res.columns]
res = (
    res.set_index('id')
       .mask(lambda x: x=='')
       .stack()
       .reset_index(name='data')
       .rename(columns={'level_1':'letter/number'})
)
print(res)
#   id letter/number data
# 0  1        letter  abc
# 1  1        number  123
# 2  2        letter  def
# 3  2        letter  ghi
# 4  2        letter  jkl
# 5  2        number  456
# 6  3        letter  mno
# 7  3        letter  pqr
# 8  3        number  789
  • Related