I have a dataset that I need to parse and reshape from wide to long. Each row represents a single person, and there are multiple columns representing instances of a measure (UK Biobank formatted):
import pandas as pd

# initialize data of lists
data = {'id': ['1', '2', '3', '4'],
        '3-0.0': [20, 21, 19, 18],
        '3-1.0': [10, 11, 29, 12],
        '3-2.0': [5, 6, 7, 8]}

# create DataFrame and set id as the index
df = pd.DataFrame(data)
df = df.set_index('id')   # set_index returns a new frame, so assign it back
3-0.0, 3-1.0, and 3-2.0 are three measurements of the same field for a given person, one per instance. What I want is multiple rows per person, with a column indicating the instance (0, 1 or 2) and a column for the associated value.
My inefficient approach is below, and I know it can be much better. I am new to Python, so I am looking for ways to write this more efficiently:
# parse out each instance
i0 = df.filter(regex=r"-0\.")
i1 = df.filter(regex=r"-1\.")
i2 = df.filter(regex=r"-2\.")

# move id back to a column and melt each DataFrame
i0 = pd.melt(i0.reset_index(), id_vars='id').dropna().drop(columns=['variable']).assign(instance='0')
i1 = pd.melt(i1.reset_index(), id_vars='id').dropna().drop(columns=['variable']).assign(instance='1')
i2 = pd.melt(i2.reset_index(), id_vars='id').dropna().drop(columns=['variable']).assign(instance='2')

# concatenate back together
fin = pd.concat([i0, i1, i2])
# final dataset looks like this
id  measure  instance
1   20       0
1   10       1
1   5        2
2   21       0
2   11       1
2   6        2
3   19       0
3   29       1
3   7        2
4   18       0
4   12       1
4   8        2
Bonus if you can incorporate the fact that there are several measurement columns formatted like this: '3-0.0', '3-1.0', '3-2.0', '4-0.0', '4-1.0', '4-2.0', ...
Thanks in advance!
CodePudding user response:
Given:
id 3-0.0 3-1.0 3-2.0 4-0.0 4-1.0 4-2.0
0 1 20 10 5 10 5 20
1 2 21 11 6 11 6 21
2 3 19 29 7 29 7 19
3 4 18 12 8 12 8 18
Doing:
# the field prefixes before '-' (here '3' and '4') become the stub names
measure_stubs = df.filter(regex=r'\d-').columns.str.split('-').str[0].unique()

out = pd.wide_to_long(df, stubnames=measure_stubs, i='id', j='instance', sep='-', suffix='.*')
out = out.rename(int, level=1)   # the suffixes 0.0, 1.0, 2.0 are parsed as floats; cast to int
print(out.sort_index())
Output:
              3   4
id instance
1  0         20  10
   1         10   5
   2          5  20
2  0         21  11
   1         11   6
   2          6  21
3  0         19  29
   1         29   7
   2          7  19
4  0         18  12
   1         12   8
   2          8  18
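If you want the fully long three-column layout shown in the question (one value column plus a field identifier), you could additionally stack the remaining measure columns; a minimal sketch, assuming out is the frame printed above:
# one row per (id, instance, field) combination
long_df = (out.stack()
              .rename_axis(['id', 'instance', 'field'])
              .rename('measure')
              .reset_index())
print(long_df.head())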
CodePudding user response:
Use str.split by '-' on the column names (with id set as the index), then reshape with DataFrame.stack; the instance suffixes ('0.0', '1.0', '2.0') are converted to integers, column 3 is renamed to measure, and a three-column DataFrame is created:
df = df.set_index('id')
df.columns = df.columns.str.split('-', expand=True)   # MultiIndex columns: ('3', '0.0'), ...
df = (df.stack()
        .rename_axis(('id', 'instance'))
        .rename(lambda x: int(float(x)), level=1)      # '0.0' -> 0, '1.0' -> 1, '2.0' -> 2
        .rename(columns={'3': 'measure'})
        .reset_index()[['id', 'measure', 'instance']]
      )
print (df)
id measure instance
0 1 20 0
1 1 10 1
2 1 5 2
3 2 21 0
4 2 11 1
5 2 6 2
6 3 19 0
7 3 29 1
8 3 7 2
9 4 18 0
10 4 12 1
11 4 8 2
Solution with wide_to_long, specifying the stub name 3 (the part before '-') and the regex r'\d+\.\d+' to match the float suffixes:
df = (pd.wide_to_long(df,
                      stubnames=['3'],
                      i='id',                 # id must be a regular column here
                      j='instance',
                      sep='-',
                      suffix=r'\d+\.\d+')     # matches 0.0, 1.0, 2.0
        .rename(int, level=1)
        .sort_index()
        .rename(columns={'3': 'measure'})
        .reset_index()[['id', 'measure', 'instance']])
print (df)
id measure instance
0 1 20 0
1 1 10 1
2 1 5 2
3 2 21 0
4 2 11 1
5 2 6 2
6 3 19 0
7 3 29 1
8 3 7 2
9 4 18 0
10 4 12 1
11 4 8 2
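Both variants should produce the same frame; as a quick sanity check (a sketch, using res_stack and res_w2l as hypothetical names for the two results instead of overwriting df both times):
pd.testing.assert_frame_equal(
    res_stack.sort_values(['id', 'instance']).reset_index(drop=True),
    res_w2l.sort_values(['id', 'instance']).reset_index(drop=True),
)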
Bonus: both solutions can be modified to keep one column per measurement, taken from the part of the column names before '-':
df = df.set_index('id')   # starting again from the wide frame, now including the 4-* columns
df.columns = df.columns.str.split('-', expand=True)
df = (df.stack()
        .rename_axis(('id', 'instance'))
        .rename(lambda x: int(float(x)), level=1)   # '0.0' -> 0
        .reset_index()
      )
print (df)
id instance 3 4
0 1 0 20 10
1 1 1 10 5
2 1 2 5 20
3 2 0 21 11
4 2 1 11 6
5 2 2 6 21
6 3 0 19 29
7 3 1 29 7
8 3 2 7 19
9 4 0 18 12
10 4 1 12 8
11 4 2 8 18
df = (pd.wide_to_long(df,
                      stubnames=['3', '4'],
                      i='id',
                      j='instance',
                      sep='-',
                      suffix=r'\d+\.\d+')
        .rename(int, level=1)
        .sort_index()
        .reset_index())
print (df)
id instance 3 4
0 1 0 20 10
1 1 1 10 5
2 1 2 5 20
3 2 0 21 11
4 2 1 11 6
5 2 2 6 21
6 3 0 19 29
7 3 1 29 7
8 3 2 7 19
9 4 0 18 12
10 4 1 12 8
11 4 2 8 18
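If there are more fields than 3 and 4, the stub names can be derived from the column names instead of being listed by hand (the same idea as in the first answer); a minimal sketch, assuming df still has id as a plain column:
# every prefix before '-' among the measurement columns
stubs = df.columns.drop('id').str.split('-').str[0].unique().tolist()
df_long = (pd.wide_to_long(df, stubnames=stubs, i='id', j='instance',
                           sep='-', suffix=r'\d+\.\d+')
             .rename(int, level=1)
             .sort_index()
             .reset_index())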
CodePudding user response:
Convert the columns into a MultiIndex, then unstack and reset_index (this assumes id has already been set as the index, as in the question):
df.columns = df.columns.str.split('-').map(tuple)   # columns become a MultiIndex: ('3', '0.0'), ...
df = df.unstack(level=1).reset_index([0, 1]).sort_index(level='id')
df.columns = ['major_instance', 'instance', 'measure']
Output
major_instance instance measure
id
1 3 0.0 20
1 3 1.0 10
1 3 2.0 5
2 3 0.0 21
2 3 1.0 11
2 3 2.0 6
3 3 0.0 19
3 3 1.0 29
3 3 2.0 7
4 3 0.0 18
4 3 1.0 12
4 3 2.0 8
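To match the three-column layout from the question exactly, the instance strings can be cast to integers and id brought back as a column; a minimal sketch, assuming df is the frame printed above:
df['instance'] = df['instance'].astype(float).astype(int)   # '0.0' -> 0
out = df.reset_index()[['id', 'measure', 'instance']]
print(out)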