I'm trying to transform a pandas df that looks like:
import numpy as np
import pandas as pd
test_data = {'run': ['X', 'Y', 'A', 'B', 'C', 'D'],
             'time': [np.nan, np.nan, 10, 20, 40, 60],
             'temp': [np.nan, np.nan, 100, 120, 140, 160],
             'press': [np.nan, np.nan, 200, 220, 240, 260],
             'twist': [np.nan, np.nan, 300, 320, 340, 360],
             'output_site_1': [0, 0, 11, 12, 13, 14],
             'output_site_2': [0, 1, 15, 16, 17, 18],
             'output_site_3': [1, 0, 19, 20, 21, 22],
             'output_site_4': [1, 1, 23, 24, 25, 26]
             }
pd.DataFrame(test_data)
run | time | temp | press | twist | output_site_1 | output_site_2 | output_site_3 | output_site_4 |
---|---|---|---|---|---|---|---|---|
X | NaN | NaN | NaN | NaN | 0 | 0 | 1 | 1 |
Y | NaN | NaN | NaN | NaN | 0 | 1 | 0 | 1 |
A | 10.0 | 100.0 | 200.0 | 300.0 | 11 | 15 | 19 | 23 |
B | 20.0 | 120.0 | 220.0 | 320.0 | 12 | 16 | 20 | 24 |
C | 40.0 | 140.0 | 240.0 | 340.0 | 13 | 17 | 21 | 25 |
D | 60.0 | 160.0 | 260.0 | 360.0 | 14 | 18 | 22 | 26 |
into a df that looks like the one below:
transform_test_data = {'run': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D'],
                       'X': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
                       'Y': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                       'time': [10, 10, 10, 10, 20, 20, 20, 20, 40, 40, 40, 40, 60, 60, 60, 60],
                       'temp': [100, 100, 100, 100, 120, 120, 120, 120, 140, 140, 140, 140, 160, 160, 160, 160],
                       'press': [200, 200, 200, 200, 220, 220, 220, 220, 240, 240, 240, 240, 260, 260, 260, 260],
                       'twist': [300, 300, 300, 300, 320, 320, 320, 320, 340, 340, 340, 340, 360, 360, 360, 360],
                       'output': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
                       }
pd.DataFrame(transform_test_data)
run | X | Y | time | temp | press | twist | output |
---|---|---|---|---|---|---|---|
A | 0 | 0 | 10 | 100 | 200 | 300 | 11 |
A | 0 | 1 | 10 | 100 | 200 | 300 | 12 |
A | 1 | 0 | 10 | 100 | 200 | 300 | 13 |
A | 1 | 1 | 10 | 100 | 200 | 300 | 14 |
B | 0 | 0 | 20 | 120 | 220 | 320 | 15 |
B | 0 | 1 | 20 | 120 | 220 | 320 | 16 |
B | 1 | 0 | 20 | 120 | 220 | 320 | 17 |
B | 1 | 1 | 20 | 120 | 220 | 320 | 18 |
C | 0 | 0 | 40 | 140 | 240 | 340 | 19 |
C | 0 | 1 | 40 | 140 | 240 | 340 | 20 |
C | 1 | 0 | 40 | 140 | 240 | 340 | 21 |
C | 1 | 1 | 40 | 140 | 240 | 340 | 22 |
D | 0 | 0 | 60 | 160 | 260 | 360 | 23 |
D | 0 | 1 | 60 | 160 | 260 | 360 | 24 |
D | 1 | 0 | 60 | 160 | 260 | 360 | 25 |
D | 1 | 1 | 60 | 160 | 260 | 360 | 26 |
I would like to take the rows 'X' and 'Y', turn them into two separate columns of the df, and then remove them as rows. Doing so adds two new columns ('X' and 'Y') and multiplies the number of rows for each run by the number of output sites.
I need the method to be generalized: it should take in a dataframe (like the first one) and return a df (like the last one), handling an arbitrary number of inputs and sites. There will always be a clear delineation between inputs and outputs, since the first output column will always be labeled 'output_site_1'.
I supplied 'test_data' and 'transform_test_data' as dictionaries so that if anyone feels kind enough to help out a stuck human (that's me!), it is easy to create the two dataframes.
CodePudding user response:
As with many situations, this is easier to handle before moving into pandas. You really should have two different tables there (the input rows and the output rows), but we can handle that in code.
import numpy as np
import pandas as pd

test_data = {'run': ['X', 'Y', 'A', 'B', 'C', 'D'],
             'time': [np.nan, np.nan, 10, 20, 40, 60],
             'temp': [np.nan, np.nan, 100, 120, 140, 160],
             'press': [np.nan, np.nan, 200, 220, 240, 260],
             'twist': [np.nan, np.nan, 300, 320, 340, 360],
             'output_site_1': [0, 0, 11, 12, 13, 14],
             'output_site_2': [0, 1, 15, 16, 17, 18],
             'output_site_3': [1, 0, 19, 20, 21, 22],
             'output_site_4': [1, 1, 23, 24, 25, 26]
             }
rows = []
store = {}   # condition columns ('time', 'temp', ...), keyed by run

for key, row in test_data.items():
    if key == 'run':
        runs = row
    elif row[0] is np.nan:
        # a condition column: NaN for the input rows
        store[key] = {a: b for a, b in zip(runs, row)}
    else:
        # an output_site_* column: emit one result row per real run
        row = {a: b for a, b in zip(runs, row)}
        for r in runs[2:]:
            new = [r, row['X'], row['Y']]
            new.extend([v[r] for v in store.values()])
            new.append(row[r])
            rows.append(new)

columns = ['run', 'X', 'Y'] + list(store.keys()) + ['output']
df = pd.DataFrame(rows, columns=columns)
print(df)
Output:
run X Y time temp press twist output
0 A 0 0 10 100 200 300 11
1 B 0 0 20 120 220 320 12
2 C 0 0 40 140 240 340 13
3 D 0 0 60 160 260 360 14
4 A 0 1 10 100 200 300 15
5 B 0 1 20 120 220 320 16
6 C 0 1 40 140 240 340 17
7 D 0 1 60 160 260 360 18
8 A 1 0 10 100 200 300 19
9 B 1 0 20 120 220 320 20
10 C 1 0 40 140 240 340 21
11 D 1 0 60 160 260 360 22
12 A 1 1 10 100 200 300 23
13 B 1 1 20 120 220 320 24
14 C 1 1 40 140 240 340 25
15 D 1 1 60 160 260 360 26
If you're bothered that the order doesn't match yours, add rows.sort() just before making the DataFrame.
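If you need the generalization the question asks for (arbitrary condition columns and input rows), the same loop-based idea extends naturally. A minimal sketch, assuming the input rows come first and are flagged by NaN in the condition columns (the function name transform_runs is my own):

import numpy as np
import pandas as pd

def transform_runs(data):
    # condition columns are everything except 'run' and the output sites
    cond_keys = [k for k in data
                 if k != 'run' and not k.startswith('output_site_')]
    site_keys = [k for k in data if k.startswith('output_site_')]
    # input rows are flagged by NaN in the first condition column
    n_inputs = sum(np.isnan(v) for v in data[cond_keys[0]])
    runs = data['run']
    input_runs, real_runs = runs[:n_inputs], runs[n_inputs:]

    rows = []
    for key in site_keys:
        for ri, r in enumerate(real_runs, start=n_inputs):
            new = [r]
            new += [data[key][i] for i in range(n_inputs)]   # input labels
            new += [data[k][ri] for k in cond_keys]          # condition values
            new.append(data[key][ri])                        # output value
            rows.append(new)
    rows.sort()
    columns = ['run'] + input_runs + cond_keys + ['output']
    return pd.DataFrame(rows, columns=columns)

print(transform_runs(test_data))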
CodePudding user response:
def setter(df):
    tmp = pd.concat([df.run, df.filter(like="output_")], axis=1).set_index("run")
    df2 = tmp[2:].stack().reset_index()[["run", 0]].rename(columns={0: "output"})
    df3 = tmp[:2].T.reset_index(drop=True)
    conditions = pd.merge(df[2:][['run', 'time', 'temp', 'press', 'twist']], df2)
    return pd.concat([pd.concat([df3] * (df2.size // df3.size), ignore_index=True),
                      conditions], axis=1)
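A quick usage sketch (assuming the test_data dict from the question has already been defined):

import numpy as np
import pandas as pd

df = pd.DataFrame(test_data)
print(setter(df))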
CodePudding user response:
One option is to transform into long form, split into different tables, munge, then recombine. As suggested in Tim's solution, if you can do this before dumping into pandas, that is preferable and should be more efficient.
This solution uses pivot_longer from pyjanitor for the wide-to-long reshape; you can do the same with pd.wide_to_long, but pivot_longer offers an easier abstraction and is comparably more efficient in a number of cases.
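For comparison, here is roughly what that first reshape looks like using only pd.wide_to_long (a sketch; extract_alt is my own name, and test_data here is the question's dict):

import pandas as pd

df = pd.DataFrame(test_data)
extract_alt = (pd.wide_to_long(df, stubnames='output_site',
                               i='run', j='num', sep='_')
               .rename(columns={'output_site': 'output'})
               .reset_index())
# same long-form content as 'extract' below, modulo column order and dtypes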
# pip install pyjanitor
import pandas as pd
import janitor as jn

test_data = pd.DataFrame(test_data)   # the question supplies a dict

# reshape to long form
extract = (test_data
           .pivot_longer(index=slice('run', 'twist'),
                         names_to=('.value', 'site', 'num'),
                         names_sep='_')
           .drop(columns='site'))
# extract the X and Y rows
runs = (extract
        .loc[extract.run.isin(['X', 'Y'])]
        .pivot(index='num', columns='run', values='output'))

# extract the non X and Y rows
others = extract.loc[~extract.run.isin(['X', 'Y'])]

# merge the two dataframes
(runs
 .merge(others, on='num')
 .drop(columns='num')
)
X Y run time temp press twist output
0 0 0 A 10.0 100.0 200.0 300.0 11
1 0 0 B 20.0 120.0 220.0 320.0 12
2 0 0 C 40.0 140.0 240.0 340.0 13
3 0 0 D 60.0 160.0 260.0 360.0 14
4 0 1 A 10.0 100.0 200.0 300.0 15
5 0 1 B 20.0 120.0 220.0 320.0 16
6 0 1 C 40.0 140.0 240.0 340.0 17
7 0 1 D 60.0 160.0 260.0 360.0 18
8 1 0 A 10.0 100.0 200.0 300.0 19
9 1 0 B 20.0 120.0 220.0 320.0 20
10 1 0 C 40.0 140.0 240.0 340.0 21
11 1 0 D 60.0 160.0 260.0 360.0 22
12 1 1 A 10.0 100.0 200.0 300.0 23
13 1 1 B 20.0 120.0 220.0 320.0 24
14 1 1 C 40.0 140.0 240.0 340.0 25
15 1 1 D 60.0 160.0 260.0 360.0 26
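If you want the exact column order from the question, reordering afterwards is purely cosmetic:

(runs
 .merge(others, on='num')
 .drop(columns='num')
 .loc[:, ['run', 'X', 'Y', 'time', 'temp', 'press', 'twist', 'output']]
)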
If your data has just two rows for X and Y, you can avoid flipping back to wide form (the pivot) with some preprocessing; this way you only flip once, and you avoid the merge as well (and you don't need another library :)):
index = ['run', 'time', 'temp', 'press', 'twist']
temp = test_data.set_index(index)
# preprocessing begins here
cols = temp.loc[['X','Y']]
# get the output header name
output_only = cols.columns.str.split('_').str[0].tolist()
# combine and generate a MultiIndex
cols = list(cols.to_numpy()) + [output_only]
cols = pd.MultiIndex.from_arrays(cols, names = ['X','Y', None])
# get rid of X and Y rows
# sorting is to avoid performance drop on MultiIndex
temp = temp.sort_index().drop(['X', 'Y'])
temp.columns = cols
# flip to long form
temp.stack(['X','Y']).reset_index()
run time temp press twist X Y output
0 A 10.0 100.0 200.0 300.0 0 0 11
1 A 10.0 100.0 200.0 300.0 0 1 15
2 A 10.0 100.0 200.0 300.0 1 0 19
3 A 10.0 100.0 200.0 300.0 1 1 23
4 B 20.0 120.0 220.0 320.0 0 0 12
5 B 20.0 120.0 220.0 320.0 0 1 16
6 B 20.0 120.0 220.0 320.0 1 0 20
7 B 20.0 120.0 220.0 320.0 1 1 24
8 C 40.0 140.0 240.0 340.0 0 0 13
9 C 40.0 140.0 240.0 340.0 0 1 17
10 C 40.0 140.0 240.0 340.0 1 0 21
11 C 40.0 140.0 240.0 340.0 1 1 25
12 D 60.0 160.0 260.0 360.0 0 0 14
13 D 60.0 160.0 260.0 360.0 0 1 18
14 D 60.0 160.0 260.0 360.0 1 0 22
15 D 60.0 160.0 260.0 360.0 1 1 26
Again, only a wide-to-long reshape was used, so there is no need to flip back to wide form (potentially expensive); more importantly, merging is avoided (merging on duplicate indexes can be expensive as well).
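For completeness, this second approach can be packaged into the generalized function the question asks for. A sketch (widen_input_runs is a hypothetical name; it assumes the input rows are exactly those whose condition columns are all NaN):

import pandas as pd

def widen_input_runs(df, prefix='output_site_'):
    out_cols = [c for c in df.columns if c.startswith(prefix)]
    cond_cols = [c for c in df.columns if c != 'run' and c not in out_cols]
    is_input = df[cond_cols].isna().all(axis=1)
    inputs = df.loc[is_input]                      # e.g. the X and Y rows
    outputs = df.loc[~is_input].set_index(['run'] + cond_cols)
    # one MultiIndex level per input run, plus a final unnamed level
    # holding the label 'output'
    arrays = list(inputs[out_cols].to_numpy()) + [['output'] * len(out_cols)]
    outputs.columns = pd.MultiIndex.from_arrays(
        arrays, names=list(inputs['run']) + [None])
    # stack the input-run levels into rows and flatten
    return outputs.stack(list(inputs['run'])).reset_index()

widen_input_runs(pd.DataFrame(test_data))   # reproduces the table above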