More efficient way of pandas dataframe manipulation: filtering and melting


I have a dataset that I need to parse and reshape from wide to long. Each row represents a single person, and there are multiple columns representing instances of a measure (UK Biobank format):

import pandas as pd
# initialize data of lists.
data = {'id': ['1', '2', '3', '4'],
        '3-0.0': [20, 21, 19, 18],
        '3-1.0': [10, 11, 29, 12],
        '3-2.0': [5, 6, 7, 8]}
# Create DataFrame
df = pd.DataFrame(data)
df = df.set_index('id')

3-0.0, 3-1.0, and 3-2.0 are three different measures of the same event for a given person. What I want is multiple rows for a given person and a column to indicate the event instance (0,1 or 2) and then a column for the associated value.

My inefficient approach is as follows, and I know it can be done much better. I am new to Python, so I am looking for more efficient ways of coding:

# parse out each instance
i0 = df.filter(regex=r"-0\.")
i1 = df.filter(regex=r"-1\.")
i2 = df.filter(regex=r"-2\.")

# move the index into a column and melt each frame
i0.reset_index(inplace=True)
i0 = pd.melt(i0, id_vars="index").dropna().drop(columns=['variable']).assign(instance='0')

i1.reset_index(inplace=True)
i1 = pd.melt(i1, id_vars="index").dropna().drop(columns=['variable']).assign(instance='1')

i2.reset_index(inplace=True)
i2 = pd.melt(i2, id_vars="index").dropna().drop(columns=['variable']).assign(instance='2')

# concatenate back together
fin = pd.concat([i0,i1,i2])


# final dataset looks like this
id  measure  instance
1   20       0
1   10       1
1   5        2
2   21       0
2   11       1
2   6        2
3   19       0
3   29       1
3   7        2
4   18       0
4   12       1
4   8        2

Bonus if you can incorporate the fact that there are several measurement columns formatted like this: '3-0.0', '3-1.0', '3-2.0', '4-0.0', '4-1.0', '4-2.0', ...

Thanks in advance!

CodePudding user response:

Given:

  id  3-0.0  3-1.0  3-2.0  4-0.0  4-1.0  4-2.0
0  1     20     10      5     10      5     20
1  2     21     11      6     11      6     21
2  3     19     29      7     29      7     19
3  4     18     12      8     12      8     18

Doing:

unique_people = df.filter(regex=r'\d-').columns.str.split('-').str[0].unique()
out = pd.wide_to_long(df, stubnames=unique_people, i='id', j='instance', sep='-', suffix='.*')
out = out.rename(int, level=1)
print(out.sort_index())

Output:

              3   4
id instance
1  0         20  10
   1         10   5
   2          5  20
2  0         21  11
   1         11   6
   2          6  21
3  0         19  29
   1         29   7
   2          7  19
4  0         18  12
   1         12   8
   2          8  18
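If a fully long frame with a single value column is wanted instead, the `wide_to_long` result above can be stacked one step further. A minimal sketch that rebuilds this answer's example frame; the `field` column name for the measure identifier is my own choice, not from the original:

```python
import pandas as pd

# rebuild the example frame from this answer (two measures, 3 and 4)
df = pd.DataFrame({'id': ['1', '2', '3', '4'],
                   '3-0.0': [20, 21, 19, 18],
                   '3-1.0': [10, 11, 29, 12],
                   '3-2.0': [5, 6, 7, 8],
                   '4-0.0': [10, 11, 29, 12],
                   '4-1.0': [5, 6, 7, 8],
                   '4-2.0': [20, 21, 19, 18]})

# derive the stub names ('3', '4') from the part before '-'
stubs = list(df.filter(regex=r'\d-').columns.str.split('-').str[0].unique())
out = pd.wide_to_long(df, stubnames=stubs, i='id', j='instance',
                      sep='-', suffix=r'\d+\.\d+')

# stack the measure columns ('3', '4') into one long 'measure' column
long_df = (out.stack()
              .rename_axis(['id', 'instance', 'field'])
              .reset_index(name='measure'))
long_df['instance'] = long_df['instance'].astype(float).astype(int)
print(long_df.sort_values(['id', 'field', 'instance']))
```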

CodePudding user response:

Split the column names by `-` (after converting `id` to the index), reshape with `DataFrame.stack`, convert the float instance labels to integers, rename column `3` to `measure`, and select the three final columns:

df = df.set_index('id')

df.columns = df.columns.str.split('-', expand=True)
df = (df.stack()
        .rename_axis(('id','instance'))
        .rename(int, level=1)
        .rename(columns={'3':'measure'})
        .reset_index()[['id','measure','instance']]
        )
print (df)

   id  measure  instance
0   1       20         0
1   1       10         1
2   1        5         2
3   2       21         0
4   2       11         1
5   2        6         2
6   3       19         0
7   3       29         1
8   3        7         2
9   4       18         0
10  4       12         1
11  4        8         2

Alternative solution with `wide_to_long`, specifying the stub `3` before the `-` separator; the regex `r'\d+\.\d+'` matches the float suffixes:

df = (pd.wide_to_long(df, 
                      stubnames=['3'], 
                      i='id', 
                      j='instance', 
                      sep='-', 
                      suffix=r'\d+\.\d+')
         .rename(int, level=1)
         .sort_index()
         .rename(columns={'3':'measure'})
         .reset_index()[['id','measure','instance']])
print (df)

   id  measure  instance
0   1       20         0
1   1       10         1
2   1        5         2
3   2       21         0
4   2       11         1
5   2        6         2
6   3       19         0
7   3       29         1
8   3        7         2
9   4       18         0
10  4       12         1
11  4        8         2

Bonus: both solutions can be modified to create one column per measure, using the part of each column name before `-`:

df = df.set_index('id')

df.columns = df.columns.str.split('-', expand=True)
df = (df.stack()
        .rename_axis(('id','instance'))
        .rename(int, level=1)
        .reset_index()
        )
print (df)
    id  instance   3   4
0    1         0  20  10
1    1         1  10   5
2    1         2   5  20
3    2         0  21  11
4    2         1  11   6
5    2         2   6  21
6    3         0  19  29
7    3         1  29   7
8    3         2   7  19
9    4         0  18  12
10   4         1  12   8
11   4         2   8  18

df = (pd.wide_to_long(df, 
                      stubnames=['3','4'], 
                      i='id', 
                      j='instance', 
                      sep='-', 
                      suffix=r'\d+\.\d+')
         .rename(int, level=1)
         .sort_index()
         .reset_index())
print (df)
    id  instance   3   4
0    1         0  20  10
1    1         1  10   5
2    1         2   5  20
3    2         0  21  11
4    2         1  11   6
5    2         2   6  21
6    3         0  19  29
7    3         1  29   7
8    3         2   7  19
9    4         0  18  12
10   4         1  12   8
11   4         2   8  18

CodePudding user response:

Convert the columns into a MultiIndex, then unstack and reset the index:

df = df.set_index('id')
df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('-').map(tuple))
df = df.unstack().reset_index([0, 1]).sort_index(level='id')
df.columns = ['major_instance', 'instance', 'measure']

Output

   major_instance instance  measure
id                                 
1               3      0.0       20
1               3      1.0       10
1               3      2.0        5
2               3      0.0       21
2               3      1.0       11
2               3      2.0        6
3               3      0.0       19
3               3      1.0       29
3               3      2.0        7
4               3      0.0       18
4               3      1.0       12
4               3      2.0        8
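To reach the exact three-column shape requested in the question (integer instance codes, `id` back as a regular column), the unstacked result can be tidied further. A sketch assuming the single-measure frame from the question; the `major_instance` level name is kept from this answer:

```python
import pandas as pd

df = pd.DataFrame({'id': ['1', '2', '3', '4'],
                   '3-0.0': [20, 21, 19, 18],
                   '3-1.0': [10, 11, 29, 12],
                   '3-2.0': [5, 6, 7, 8]}).set_index('id')

# split each '3-0.0' style name into a two-level column index ('3', '0.0')
df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('-').map(tuple))

# unstacking a flat-indexed frame yields a Series indexed by
# (measure name, instance suffix, id)
s = df.unstack()
out = (s.rename_axis(['major_instance', 'instance', 'id'])
        .reset_index(name='measure'))

# tidy: integer instance codes, deterministic order, three columns
out['instance'] = out['instance'].astype(float).astype(int)
out = (out.sort_values(['id', 'instance'])
          .reset_index(drop=True)[['id', 'measure', 'instance']])
print(out)
```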