cast data frame to long, preserving index columns-CodePudding

I have a data frame to pivot to longer:

import io

raw = '''
       date        speechnumber  result1 result2 p_a  p_b  p_c 
  0    2009-11-10             2        0       0   1    0    1
  1    2009-11-10             2        0       0   0    0    0
  2    2009-11-10             2        0       0   0    1    0
  3    2009-11-10             2        0       0   0    1    0
  4    2009-11-10             2        1       1   0    0    1
  5    2009-11-10             2        0       0   0    0    0
  7    2009-11-10             2        0       0   1    0    1
  7    2009-11-10             2        0       0   1    0    0
  8    2009-11-10             2        1       1   1    0    1
  9    2009-11-10             4        0       0   1    1    0
'''

_ = pd.read_table(io.StringIO(raw), delim_whitespace=True)

I would like to presetve date, speechnumber, result1 and result2 and combine the p_x columns (which are 0 or 1) into one (p). Where multiple p values are one, a new row should be created:

expected_out = '''
old_id       date  speechnumber     p       result1  result2
  0    2009-11-10             2     a            0         0
  0    2009-11-10             2     c            0         0
  2    2009-11-10             2     b            0         0
  3    2009-11-10             2     b            0         0
  4    2009-11-10             2     c            1         1
  6    2009-11-10             2     a            0         0
  6    2009-11-10             2     b            0         0
  7    2009-11-10             2     a            0         0
  8    2009-11-10             2     a            1         1
  8    2009-11-10             2     c            1         1
  9    2009-11-10             4     a            0         0
  9    2009-11-10             4     b            0         0
'''

I have tried pd.wide_to_long(_, stubnames='p', i=['index'], j=['result1', 'result2']) to no avail; can someone point out what I am doing wrong?

CodePudding user response：

Set the index to first four columns, then split and expand the remaining columns around delimiter _, then mask the 0 values and stack to reshape

s = df.set_index([*df.columns[:4]])
s.columns = s.columns.str.split('_', expand=True)
s[s == 1].stack().drop('p', axis=1).reset_index().rename(columns={'level_4': 'p'})

Alternative approach with melt

s = df.melt(df.columns[:4], var_name='p', ignore_index=False)\
           .query('value == 1').drop('value', axis=1).sort_index()
s['p'] = s['p'].str.split('_').str[1]

          date  speechnumber  result1  result2  p
0   2009-11-10             2        0        0  a
1   2009-11-10             2        0        0  c
2   2009-11-10             2        0        0  b
3   2009-11-10             2        0        0  b
4   2009-11-10             2        1        1  c
5   2009-11-10             2        0        0  a
6   2009-11-10             2        0        0  c
7   2009-11-10             2        0        0  a
8   2009-11-10             2        1        1  a
9   2009-11-10             2        1        1  c
10  2009-11-10             4        0        0  a
11  2009-11-10             4        0        0  b

CodePudding user response：

Another option:

#pip install git https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
(_.pivot_longer(index = slice('date', 'result2'),  
                # new column names
                names_to = ('extra', 'p'), 
                # splitter
                names_sep='_', 
                ignore_index = False, 
                sort_by_appearance=True)
  .query('value == 1')
  .drop(columns=['extra', 'value'])
)
         date  speechnumber  result1  result2  p
0  2009-11-10             2        0        0  a
0  2009-11-10             2        0        0  c
2  2009-11-10             2        0        0  b
3  2009-11-10             2        0        0  b
4  2009-11-10             2        1        1  c
7  2009-11-10             2        0        0  a
7  2009-11-10             2        0        0  c
7  2009-11-10             2        0        0  a
8  2009-11-10             2        1        1  a
8  2009-11-10             2        1        1  c
9  2009-11-10             4        0        0  a
9  2009-11-10             4        0        0  b

This uses pivot_longer function from pyjanitor