I'm trying to use pivot_table() to convert test analyses from rows to columns. I have a test analysis, a test result and a test date, and I need to turn the test date into a column and drop the column used in the conversion. This is the actual result of a SQL query, in long format:
lot_nbr | batch_nbr | sku | variety | factory_nbr | code_nbr | date_test | test | result | inspector |
---|---|---|---|---|---|---|---|---|---|
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | physical_purity | 0.87 | jhon_martens |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 11/02/22 | genetic_contamination | 0.98 | julian_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | chemical_treatment_1 | 0.88 | martha_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 09/02/22 | chemical_treatment_2 | 0.91 | jhon_martens |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | chemical_treatment_3 | 0.93 | julian_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 11/02/22 | total_x_content | 0.77 | martha_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | total_y_content | 0.75 | jhon_martens |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 09/02/22 | total_z_content | 0.79 | julian_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | oc_count | 1 | martha_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 11/02/22 | tx_count | 0.1 | jhon_martens |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | optic_total | 0.57 | julian_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 09/02/22 | optic_total_1 | 0.68 | martha_doe |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 10/02/22 | physical_purity | 0.87 | jhon_martens |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 11/02/22 | genetic_contamination | 0.98 | julian_doe |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 10/02/22 | oc_count | 1 | martha_doe |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 11/02/22 | tx_count | 0.1 | jhon_martens |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 10/02/22 | optic_total | 0.57 | julian_doe |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 09/02/22 | optic_total_1 | 0.68 | martha_doe |
Desired output:
lot_nbr | batch_nbr | sku | variety | factory_nbr | code_nbr | date_test | physical_purity | date_test | genetic_contamination | date_test | chemical_treatment_1 | date_test | chemical_treatment_2 | date_test | chemical_treatment_3 | date_test | total_x_content | date_test | total_y_content | date_test | total_z_content | date_test | oc_count | date_test | tx_count | date_test | optic_total | date_test | optic_total_1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | 0.87 | 11/02/22 | 0.98 | 10/02/22 | 0.88 | 09/02/22 | 0.91 | 10/02/22 | 0.93 | 11/02/22 | 0.77 | 10/02/22 | 0.75 | 09/02/22 | 0.79 | 10/02/22 | 1 | 11/02/22 | 0.1 | 10/02/22 | 0.57 | 09/02/22 | 0.68 |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 11/02/22 | 0.87 | 12/02/22 | 0.98 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 11/02/22 | 0.1 | 10/02/22 | 0.57 | 09/02/22 | 0.68 |
What is the most efficient way to get this? I was thinking of applying a lambda after pivot_table. This is the pivot_table code:
# convert the date column to datetime
df['date_test'] = pd.to_datetime(df['date_test'])
df1 = df.pivot_table(
    index=['lot_nbr', 'batch_nbr', 'sku', 'variety', 'factory_nbr', 'code_nbr', 'date_test'],
    columns=['test'],
    values=['result'],
).reset_index()
# flatten the MultiIndex columns
df1.columns = df1.columns.map('_'.join)
df1
I got this table:
lot_nbr_ | batch_nbr_ | sku_ | variety_ | factory_nbr_ | code_nbr_ | date_test_ | result_chemical_treatment_1 | result_chemical_treatment_2 | result_chemical_treatment_3 | result_genetic_contamination | result_oc_count | result_optic_total | result_optic_total_1 | result_physical_purity | result_total_x_content | result_total_y_content | result_total_z_content | result_tx_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1983244324 | 45A6545 | 764H | max | 3456 | 45099 | 09/02/22 | 0.91 | 0.68 | 0.79 | |||||||||
1983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | 0.88 | 0.93 | 1 | 0.57 | 0.87 | 0.75 | ||||||
1983244324 | 45A6545 | 764H | max | 3456 | 45099 | 11/02/22 | 0.98 | 0.77 | 0.1 | |||||||||
1983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 09/02/22 | 0.68 | |||||||||||
1983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 10/02/22 | 1 | 0.57 | 0.87 | |||||||||
1983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 11/02/22 | 0.98 | 0.1 |
Would a lambda function be the best option after this transformation, to append the date column for each test? Or should I pivot only test, test result and date with lot_nbr as the index, create a date column containing these dates for each result, then group by lot_nbr, batch_nbr, sku, variety, factory_nbr and code_nbr, and merge these tables on lot_nbr? I know I can subset the columns like:
columns_dates= df1[['physical_purity','genetic_contamination','chemical_treatment_1', 'chemical_treatment_2','chemical_treatment_3','total_x_content','total_y_content', 'total_z_content','oc_count','tx_count','optic_total','optic_total_1']]
Then apply a lambda or function to these columns.
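A rough sketch of that second idea (pivot result and date separately, then join back on lot_nbr), on made-up toy data, in case it helps clarify what I mean:

```python
import pandas as pd

# toy frame with the same shape as my data (values are made up)
df = pd.DataFrame({
    'lot_nbr': [1, 1, 2],
    'test': ['physical_purity', 'oc_count', 'physical_purity'],
    'date_test': pd.to_datetime(['2022-02-10', '2022-02-11', '2022-02-10']),
    'result': [0.87, 1.0, 0.87],
})

# one pivot for the results, one for the dates, both keyed on lot_nbr
res = df.pivot(index='lot_nbr', columns='test', values='result')
dat = df.pivot(index='lot_nbr', columns='test', values='date_test')
dat = dat.add_prefix('date_test_')

# join the two wide frames back together on the shared index
wide = res.join(dat).reset_index()
```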
I attach the initial DataFrame in dict format:
{'lot_nbr': {0: 1983244324,
1: 1983244324,
2: 1983244324,
3: 1983244324,
4: 1983244324,
5: 1983244324,
6: 1983244324,
7: 1983244324,
8: 1983244324,
9: 1983244324,
10: 1983244324,
11: 1983244324,
12: 1983244325,
13: 1983244325,
14: 1983244325,
15: 1983244325,
16: 1983244325,
17: 1983244325},
'batch_nbr': {0: '45A6545',
1: '45A6545',
2: '45A6545',
3: '45A6545',
4: '45A6545',
5: '45A6545',
6: '45A6545',
7: '45A6545',
8: '45A6545',
9: '45A6545',
10: '45A6545',
11: '45A6545',
12: '45A6546',
13: '45A6546',
14: '45A6546',
15: '45A6546',
16: '45A6546',
17: '45A6546'},
'sku': {0: '764H',
1: '764H',
2: '764H',
3: '764H',
4: '764H',
5: '764H',
6: '764H',
7: '764H',
8: '764H',
9: '764H',
10: '764H',
11: '764H',
12: '56H',
13: '56H',
14: '56H',
15: '56H',
16: '56H',
17: '56H'},
'variety': {0: 'max',
1: 'max',
2: 'max',
3: 'max',
4: 'max',
5: 'max',
6: 'max',
7: 'max',
8: 'max',
9: 'max',
10: 'max',
11: 'max',
12: 'optimum',
13: 'optimum',
14: 'optimum',
15: 'optimum',
16: 'optimum',
17: 'optimum'},
'factory_nbr': {0: 3456,
1: 3456,
2: 3456,
3: 3456,
4: 3456,
5: 3456,
6: 3456,
7: 3456,
8: 3456,
9: 3456,
10: 3456,
11: 3456,
12: 3456,
13: 3456,
14: 3456,
15: 3456,
16: 3456,
17: 3456},
'code_nbr': {0: 45099,
1: 45099,
2: 45099,
3: 45099,
4: 45099,
5: 45099,
6: 45099,
7: 45099,
8: 45099,
9: 45099,
10: 45099,
11: 45099,
12: 45099,
13: 45099,
14: 45099,
15: 45099,
16: 45099,
17: 45099},
'date_test': {0: Timestamp('2022-02-10 00:00:00'),
1: Timestamp('2022-02-11 00:00:00'),
2: Timestamp('2022-02-10 00:00:00'),
3: Timestamp('2022-02-09 00:00:00'),
4: Timestamp('2022-02-10 00:00:00'),
5: Timestamp('2022-02-11 00:00:00'),
6: Timestamp('2022-02-10 00:00:00'),
7: Timestamp('2022-02-09 00:00:00'),
8: Timestamp('2022-02-10 00:00:00'),
9: Timestamp('2022-02-11 00:00:00'),
10: Timestamp('2022-02-10 00:00:00'),
11: Timestamp('2022-02-09 00:00:00'),
12: Timestamp('2022-02-10 00:00:00'),
13: Timestamp('2022-02-11 00:00:00'),
14: Timestamp('2022-02-10 00:00:00'),
15: Timestamp('2022-02-11 00:00:00'),
16: Timestamp('2022-02-10 00:00:00'),
17: Timestamp('2022-02-09 00:00:00')},
'test': {0: 'physical_purity',
1: 'genetic_contamination',
2: 'chemical_treatment_1',
3: 'chemical_treatment_2',
4: 'chemical_treatment_3',
5: 'total_x_content',
6: 'total_y_content',
7: 'total_z_content',
8: 'oc_count',
9: 'tx_count',
10: 'optic_total',
11: 'optic_total_1',
12: 'physical_purity',
13: 'genetic_contamination',
14: 'oc_count',
15: 'tx_count',
16: 'optic_total',
17: 'optic_total_1'},
'result': {0: 0.87,
1: 0.98,
2: 0.88,
3: 0.91,
4: 0.93,
5: 0.77,
6: 0.75,
7: 0.79,
8: 1.0,
9: 0.1,
10: 0.57,
11: 0.68,
12: 0.87,
13: 0.98,
14: 1.0,
15: 0.1,
16: 0.57,
17: 0.68},
'inspector': {0: 'jhon_martens',
1: 'julian_doe',
2: 'martha_doe',
3: 'jhon_martens',
4: 'julian_doe',
5: 'martha_doe',
6: 'jhon_martens',
7: 'julian_doe',
8: 'martha_doe',
9: 'jhon_martens',
10: 'julian_doe',
11: 'martha_doe',
12: 'jhon_martens',
13: 'julian_doe',
14: 'martha_doe',
15: 'jhon_martens',
16: 'julian_doe',
17: 'martha_doe'}}
CodePudding user response:
So, I think I've managed to achieve the desired result. You can use df.pivot. This is in fact the easy part:
import pandas as pd
from pandas import Timestamp
import re
# importing your dict
df = pd.DataFrame(data)
df_pivot = df.pivot(
index=['lot_nbr','batch_nbr','sku','variety','factory_nbr','code_nbr'],
columns='test',
values=['date_test','result']
)
df_pivot.reset_index(drop=False, inplace=True)
cols = [re.sub(r'_$','',x) for x in list(df_pivot.columns.map('_'.join))]
df_pivot.columns = cols
Unfortunately, pivot does not maintain the column order you want, so we need to do a little reordering:
# the first six id columns stay the same
first_cols = cols[:6]
# order the 'date_test_'/'result_' columns based on the order in df['test']
date_cols = [f'date_test_{i}' for i in df.test.unique()]
test_cols = [f'result_{i}' for i in df.test.unique()]
# now interleave the 'date_test_'/'result_' pairs
comb_cols = list()
for i, j in zip(date_cols, test_cols):
    comb_cols.extend([i, j])
# rebuild df_pivot.columns in the correct order
all_cols = first_cols + comb_cols
# assign df_pivot to a .loc-selection on those cols
df_pivot = df_pivot.loc[:, all_cols]
The column names are rather long now. You can of course make them a bit shorter, e.g. by replacing 'date_test_' with 'dt_' or something, but I trust you get the idea there. Result:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
print(df_pivot)
lot_nbr batch_nbr sku variety factory_nbr code_nbr \
0 1983244324 45A6545 764H max 3456 45099
1 1983244325 45A6546 56H optimum 3456 45099
date_test_physical_purity result_physical_purity \
0 2022-02-10 0.87
1 2022-02-10 0.87
date_test_genetic_contamination result_genetic_contamination \
0 2022-02-11 0.98
1 2022-02-11 0.98
date_test_chemical_treatment_1 result_chemical_treatment_1 \
0 2022-02-10 0.88
1 NaT NaN
date_test_chemical_treatment_2 result_chemical_treatment_2 \
0 2022-02-09 0.91
1 NaT NaN
date_test_chemical_treatment_3 result_chemical_treatment_3 \
0 2022-02-10 0.93
1 NaT NaN
date_test_total_x_content result_total_x_content date_test_total_y_content \
0 2022-02-11 0.77 2022-02-10
1 NaT NaN NaT
result_total_y_content date_test_total_z_content result_total_z_content \
0 0.75 2022-02-09 0.79
1 NaN NaT NaN
date_test_oc_count result_oc_count date_test_tx_count result_tx_count \
0 2022-02-10 1.0 2022-02-11 0.1
1 2022-02-10 1.0 2022-02-11 0.1
date_test_optic_total result_optic_total date_test_optic_total_1 \
0 2022-02-10 0.57 2022-02-09
1 2022-02-10 0.57 2022-02-09
result_optic_total_1
0 0.68
1 0.68
N.B. While comparing with your desired output, I noted a couple of differences that seem to be mistakes in your version. E.g. your second date_test column has 12/02/22 in the second row, a date that does not actually occur in the original df.
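As for shortening the long column names mentioned above, a minimal sketch (toy columns, just an illustration):

```python
import pandas as pd

# toy frame standing in for df_pivot after the reordering step
df_pivot = pd.DataFrame(
    columns=['lot_nbr', 'date_test_physical_purity', 'result_physical_purity']
)

# shorten the 'date_test_' prefix to 'dt_' in every column name
df_pivot.columns = [c.replace('date_test_', 'dt_', 1) for c in df_pivot.columns]
```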