I'm trying to transform a pandas df that looks like:
import numpy as np
import pandas as pd
test_data = {'run': ['X', 'Y', 'A', 'B', 'C', 'D'],
             'time': [np.nan, np.nan, 10, 20, 40, 60],
             'temp': [np.nan, np.nan, 100, 120, 140, 160],
             'press': [np.nan, np.nan, 200, 220, 240, 260],
             'twist': [np.nan, np.nan, 300, 320, 340, 360],
             'output_site_1': [0, 0, 11, 12, 13, 14],
             'output_site_2': [0, 1, 15, 16, 17, 18],
             'output_site_3': [1, 0, 19, 20, 21, 22],
             'output_site_4': [1, 1, 23, 24, 25, 26]
             }
pd.DataFrame(test_data)
run | time | temp | press | twist | output_site_1 | output_site_2 | output_site_3 | output_site_4 |
---|---|---|---|---|---|---|---|---|
X | NaN | NaN | NaN | NaN | 0 | 0 | 1 | 1 |
Y | NaN | NaN | NaN | NaN | 0 | 1 | 0 | 1 |
A | 10.0 | 100.0 | 200.0 | 300.0 | 11 | 15 | 19 | 23 |
B | 20.0 | 120.0 | 220.0 | 320.0 | 12 | 16 | 20 | 24 |
C | 40.0 | 140.0 | 240.0 | 340.0 | 13 | 17 | 21 | 25 |
D | 60.0 | 160.0 | 260.0 | 360.0 | 14 | 18 | 22 | 26 |
into a df that looks like the one below:
transform_test_data = {'run': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D'],
                       'X': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
                       'Y': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                       'time': [10, 10, 10, 10, 20, 20, 20, 20, 40, 40, 40, 40, 60, 60, 60, 60],
                       'temp': [100, 100, 100, 100, 120, 120, 120, 120, 140, 140, 140, 140, 160, 160, 160, 160],
                       'press': [200, 200, 200, 200, 220, 220, 220, 220, 240, 240, 240, 240, 260, 260, 260, 260],
                       'twist': [300, 300, 300, 300, 320, 320, 320, 320, 340, 340, 340, 340, 360, 360, 360, 360],
                       'output': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
                       }
pd.DataFrame(transform_test_data)
run | X | Y | time | temp | press | twist | output |
---|---|---|---|---|---|---|---|
A | 0 | 0 | 10 | 100 | 200 | 300 | 11 |
A | 0 | 1 | 10 | 100 | 200 | 300 | 12 |
A | 1 | 0 | 10 | 100 | 200 | 300 | 13 |
A | 1 | 1 | 10 | 100 | 200 | 300 | 14 |
B | 0 | 0 | 20 | 120 | 220 | 320 | 15 |
B | 0 | 1 | 20 | 120 | 220 | 320 | 16 |
B | 1 | 0 | 20 | 120 | 220 | 320 | 17 |
B | 1 | 1 | 20 | 120 | 220 | 320 | 18 |
C | 0 | 0 | 40 | 140 | 240 | 340 | 19 |
C | 0 | 1 | 40 | 140 | 240 | 340 | 20 |
C | 1 | 0 | 40 | 140 | 240 | 340 | 21 |
C | 1 | 1 | 40 | 140 | 240 | 340 | 22 |
D | 0 | 0 | 60 | 160 | 260 | 360 | 23 |
D | 0 | 1 | 60 | 160 | 260 | 360 | 24 |
D | 1 | 0 | 60 | 160 | 260 | 360 | 25 |
D | 1 | 1 | 60 | 160 | 260 | 360 | 26 |
I would like to take the rows 'X' and 'Y', turn them into two separate columns of the df, and then remove them as rows. Doing so adds two new columns ('X' and 'Y') and multiplies the number of rows for each run by the number of output sites.
I need the method to be generalized: it should take in a dataframe (like the first one) and return a df (like the last one), handling an arbitrary number of inputs and sites. There will always be a clear delineation between inputs and outputs, since the first output column will always be labeled 'output_site_1'.
I supplied 'test_data' and 'transform_test_data' as dictionaries so that if anyone feels kind enough to help out a stuck human (that's me!), it is easy to create the two dataframes.
CodePudding user response:
As with many situations, this is easier to handle before moving into pandas. You really should have two different tables there (the input rows and the output rows), but we can handle that in code.
import numpy as np
import pandas as pd

test_data = {'run': ['X', 'Y', 'A', 'B', 'C', 'D'],
             'time': [np.nan, np.nan, 10, 20, 40, 60],
             'temp': [np.nan, np.nan, 100, 120, 140, 160],
             'press': [np.nan, np.nan, 200, 220, 240, 260],
             'twist': [np.nan, np.nan, 300, 320, 340, 360],
             'output_site_1': [0, 0, 11, 12, 13, 14],
             'output_site_2': [0, 1, 15, 16, 17, 18],
             'output_site_3': [1, 0, 19, 20, 21, 22],
             'output_site_4': [1, 1, 23, 24, 25, 26]
             }
rows = []
store = {}   # condition columns ('time', 'temp', ...), keyed by run

for key, row in test_data.items():
    if key == 'run':
        runs = row
    elif row[0] is np.nan:
        # a condition column: NaN for the input rows
        store[key] = {a: b for a, b in zip(runs, row)}
    else:
        # an output_site_* column: emit one result row per real run
        row = {a: b for a, b in zip(runs, row)}
        for r in runs[2:]:
            new = [r, row['X'], row['Y']]
            new.extend([v[r] for v in store.values()])
            new.append(row[r])
            rows.append(new)

columns = ['run', 'X', 'Y'] + list(store.keys()) + ['output']
df = pd.DataFrame(rows, columns=columns)
print(df)
Output:
run X Y time temp press twist output
0 A 0 0 10 100 200 300 11
1 B 0 0 20 120 220 320 12
2 C 0 0 40 140 240 340 13
3 D 0 0 60 160 260 360 14
4 A 0 1 10 100 200 300 15
5 B 0 1 20 120 220 320 16
6 C 0 1 40 140 240 340 17
7 D 0 1 60 160 260 360 18
8 A 1 0 10 100 200 300 19
9 B 1 0 20 120 220 320 20
10 C 1 0 40 140 240 340 21
11 D 1 0 60 160 260 360 22
12 A 1 1 10 100 200 300 23
13 B 1 1 20 120 220 320 24
14 C 1 1 40 140 240 340 25
15 D 1 1 60 160 260 360 26
If you're bothered that the order doesn't match yours, add rows.sort() just before making the DataFrame.
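If you need the generalization the question asks for (arbitrary condition columns and input rows), the same loop-based idea extends naturally. A minimal sketch, assuming the input rows come first and are flagged by NaN in the condition columns (the function name transform_runs is my own):

import numpy as np
import pandas as pd

def transform_runs(data):
    # condition columns are everything except 'run' and the output sites
    cond_keys = [k for k in data
                 if k != 'run' and not k.startswith('output_site_')]
    site_keys = [k for k in data if k.startswith('output_site_')]
    # input rows are flagged by NaN in the first condition column
    n_inputs = sum(np.isnan(v) for v in data[cond_keys[0]])
    runs = data['run']
    input_runs, real_runs = runs[:n_inputs], runs[n_inputs:]

    rows = []
    for key in site_keys:
        for ri, r in enumerate(real_runs, start=n_inputs):
            new = [r]
            new += [data[key][i] for i in range(n_inputs)]   # input labels
            new += [data[k][ri] for k in cond_keys]          # condition values
            new.append(data[key][ri])                        # output value
            rows.append(new)
    rows.sort()
    columns = ['run'] + input_runs + cond_keys + ['output']
    return pd.DataFrame(rows, columns=columns)

print(transform_runs(test_data))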
CodePudding user response:
def setter(df):
    tmp = pd.concat([df.run, df.filter(like="output_")], axis=1).set_index("run")
    df2 = tmp[2:].stack().reset_index()[["run", 0]].rename(columns={0: "output"})
    df3 = tmp[:2].T.reset_index(drop=True)
    conditions = pd.merge(df[2:][['run', 'time', 'temp', 'press', 'twist']], df2)
    return pd.concat([pd.concat([df3] * (df2.size // df3.size), ignore_index=True),
                      conditions], axis=1)
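A quick usage sketch (assuming the test_data dict from the question has already been defined):

import numpy as np
import pandas as pd

df = pd.DataFrame(test_data)
print(setter(df))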
CodePudding user response:
One option is to transform into long form, split into different tables, munge, then recombine. As suggested in Tim's solution, if you can do this before dumping into pandas, that is preferable and should be more efficient.
This solution uses pivot_longer from pyjanitor for the wide-to-long reshape; you can do the same with pd.wide_to_long, but pivot_longer offers an easier abstraction and is comparably more efficient in a number of cases.
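For comparison, here is roughly what that first reshape looks like using only pd.wide_to_long (a sketch; extract_alt is my own name, and test_data here is the question's dict):

import pandas as pd

df = pd.DataFrame(test_data)
extract_alt = (pd.wide_to_long(df, stubnames='output_site',
                               i='run', j='num', sep='_')
               .rename(columns={'output_site': 'output'})
               .reset_index())
# same long-form content as 'extract' below, modulo column order and dtypes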
# pip install pyjanitor
import pandas as pd
import janitor as jn

test_data = pd.DataFrame(test_data)   # the question supplies a dict

# reshape to long form
extract = (test_data
           .pivot_longer(index=slice('run', 'twist'),
                         names_to=('.value', 'site', 'num'),
                         names_sep='_')
           .drop(columns='site'))
# extract the X and Y rows
runs = (extract
        .loc[extract.run.isin(['X', 'Y'])]
        .pivot(index='num', columns='run', values='output'))

# extract the non X and Y rows
others = extract.loc[~extract.run.isin(['X', 'Y'])]

# merge the two dataframes
(runs
 .merge(others, on='num')
 .drop(columns='num')
)
X Y run time temp press twist output
0 0 0 A 10.0 100.0 200.0 300.0 11
1 0 0 B 20.0 120.0 220.0 320.0 12
2 0 0 C 40.0 140.0 240.0 340.0 13
3 0 0 D 60.0 160.0 260.0 360.0 14
4 0 1 A 10.0 100.0 200.0 300.0 15
5 0 1 B 20.0 120.0 220.0 320.0 16
6 0 1 C 40.0 140.0 240.0 340.0 17
7 0 1 D 60.0 160.0 260.0 360.0 18
8 1 0 A 10.0 100.0 200.0 300.0 19
9 1 0 B 20.0 120.0 220.0 320.0 20
10 1 0 C 40.0 140.0 240.0 340.0 21
11 1 0 D 60.0 160.0 260.0 360.0 22
12 1 1 A 10.0 100.0 200.0 300.0 23
13 1 1 B 20.0 120.0 220.0 320.0 24
14 1 1 C 40.0 140.0 240.0 340.0 25
15 1 1 D 60.0 160.0 260.0 360.0 26
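If you want the exact column order from the question, reordering afterwards is purely cosmetic:

(runs
 .merge(others, on='num')
 .drop(columns='num')
 .loc[:, ['run', 'X', 'Y', 'time', 'temp', 'press', 'twist', 'output']]
)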
If your data has just two rows for X and Y, you can avoid flipping back to wide form (the pivot) with some preprocessing; this way you only flip once, and you avoid the merge as well (and you don't need another library :)):
index = ['run', 'time', 'temp', 'press', 'twist']
temp = test_data.set_index(index)
# preprocessing begins here
cols = temp.loc[['X','Y']]
# get the output header name
output_only = cols.columns.str.split('_').str[0].tolist()
# combine and generate a MultiIndex
cols = list(cols.to_numpy()) + [output_only]
cols = pd.MultiIndex.from_arrays(cols, names = ['X','Y', None])
# get rid of X and Y rows
# sorting is to avoid performance drop on MultiIndex
temp = temp.sort_index().drop(['X', 'Y'])
temp.columns = cols
# flip to long form
temp.stack(['X','Y']).reset_index()
run time temp press twist X Y output
0 A 10.0 100.0 200.0 300.0 0 0 11
1 A 10.0 100.0 200.0 300.0 0 1 15
2 A 10.0 100.0 200.0 300.0 1 0 19
3 A 10.0 100.0 200.0 300.0 1 1 23
4 B 20.0 120.0 220.0 320.0 0 0 12
5 B 20.0 120.0 220.0 320.0 0 1 16
6 B 20.0 120.0 220.0 320.0 1 0 20
7 B 20.0 120.0 220.0 320.0 1 1 24
8 C 40.0 140.0 240.0 340.0 0 0 13
9 C 40.0 140.0 240.0 340.0 0 1 17
10 C 40.0 140.0 240.0 340.0 1 0 21
11 C 40.0 140.0 240.0 340.0 1 1 25
12 D 60.0 160.0 260.0 360.0 0 0 14
13 D 60.0 160.0 260.0 360.0 0 1 18
14 D 60.0 160.0 260.0 360.0 1 0 22
15 D 60.0 160.0 260.0 360.0 1 1 26
Again, only a wide-to-long reshape was used, so there is no need to flip back to wide form (potentially expensive); more importantly, merging is avoided (merging on duplicate indexes can be expensive as well).
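For completeness, this second approach can be packaged into the generalized function the question asks for. A sketch (widen_input_runs is a hypothetical name; it assumes the input rows are exactly those whose condition columns are all NaN):

import pandas as pd

def widen_input_runs(df, prefix='output_site_'):
    out_cols = [c for c in df.columns if c.startswith(prefix)]
    cond_cols = [c for c in df.columns if c != 'run' and c not in out_cols]
    is_input = df[cond_cols].isna().all(axis=1)
    inputs = df.loc[is_input]                      # e.g. the X and Y rows
    outputs = df.loc[~is_input].set_index(['run'] + cond_cols)
    # one MultiIndex level per input run, plus a final unnamed level
    # holding the label 'output'
    arrays = list(inputs[out_cols].to_numpy()) + [['output'] * len(out_cols)]
    outputs.columns = pd.MultiIndex.from_arrays(
        arrays, names=list(inputs['run']) + [None])
    # stack the input-run levels into rows and flatten
    return outputs.stack(list(inputs['run'])).reset_index()

widen_input_runs(pd.DataFrame(test_data))   # reproduces the table above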