I have a data frame to pivot to longer:
import io
raw = '''
date speechnumber result1 result2 p_a p_b p_c
0 2009-11-10 2 0 0 1 0 1
1 2009-11-10 2 0 0 0 0 0
2 2009-11-10 2 0 0 0 1 0
3 2009-11-10 2 0 0 0 1 0
4 2009-11-10 2 1 1 0 0 1
5 2009-11-10 2 0 0 0 0 0
7 2009-11-10 2 0 0 1 0 1
7 2009-11-10 2 0 0 1 0 0
8 2009-11-10 2 1 1 1 0 1
9 2009-11-10 4 0 0 1 1 0
'''
_ = pd.read_table(io.StringIO(raw), delim_whitespace=True)
I would like to presetve date
, speechnumber
, result1
and result2
and combine the p_x
columns (which are 0 or 1) into one (p
). Where multiple p values are one, a new row should be created:
expected_out = '''
old_id date speechnumber p result1 result2
0 2009-11-10 2 a 0 0
0 2009-11-10 2 c 0 0
2 2009-11-10 2 b 0 0
3 2009-11-10 2 b 0 0
4 2009-11-10 2 c 1 1
6 2009-11-10 2 a 0 0
6 2009-11-10 2 b 0 0
7 2009-11-10 2 a 0 0
8 2009-11-10 2 a 1 1
8 2009-11-10 2 c 1 1
9 2009-11-10 4 a 0 0
9 2009-11-10 4 b 0 0
'''
I have tried pd.wide_to_long(_, stubnames='p', i=['index'], j=['result1', 'result2'])
to no avail; can someone point out what I am doing wrong?
CodePudding user response:
Set the index to first four columns, then split
and expand the remaining columns around delimiter _
, then mask the 0
values and stack
to reshape
s = df.set_index([*df.columns[:4]])
s.columns = s.columns.str.split('_', expand=True)
s[s == 1].stack().drop('p', axis=1).reset_index().rename(columns={'level_4': 'p'})
Alternative approach with melt
s = df.melt(df.columns[:4], var_name='p', ignore_index=False)\
.query('value == 1').drop('value', axis=1).sort_index()
s['p'] = s['p'].str.split('_').str[1]
date speechnumber result1 result2 p
0 2009-11-10 2 0 0 a
1 2009-11-10 2 0 0 c
2 2009-11-10 2 0 0 b
3 2009-11-10 2 0 0 b
4 2009-11-10 2 1 1 c
5 2009-11-10 2 0 0 a
6 2009-11-10 2 0 0 c
7 2009-11-10 2 0 0 a
8 2009-11-10 2 1 1 a
9 2009-11-10 2 1 1 c
10 2009-11-10 4 0 0 a
11 2009-11-10 4 0 0 b
CodePudding user response:
Another option:
#pip install git https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
(_.pivot_longer(index = slice('date', 'result2'),
# new column names
names_to = ('extra', 'p'),
# splitter
names_sep='_',
ignore_index = False,
sort_by_appearance=True)
.query('value == 1')
.drop(columns=['extra', 'value'])
)
date speechnumber result1 result2 p
0 2009-11-10 2 0 0 a
0 2009-11-10 2 0 0 c
2 2009-11-10 2 0 0 b
3 2009-11-10 2 0 0 b
4 2009-11-10 2 1 1 c
7 2009-11-10 2 0 0 a
7 2009-11-10 2 0 0 c
7 2009-11-10 2 0 0 a
8 2009-11-10 2 1 1 a
8 2009-11-10 2 1 1 c
9 2009-11-10 4 0 0 a
9 2009-11-10 4 0 0 b
This uses pivot_longer function from pyjanitor