I'm trying to use pivot_table() to convert test analyses from rows to columns. I have a test analysis, a test result and a test date, and I need to turn the test date into a column and drop the column used in the conversion. This is the actual result of a SQL query, in long format:
lot_nbr | batch_nbr | sku | variety | factory_nbr | code_nbr | date_test | test | result | inspector |
---|---|---|---|---|---|---|---|---|---|
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | physical_purity | 0.87 | jhon_martens |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 11/02/22 | genetic_contamination | 0.98 | julian_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | chemical_treatment_1 | 0.88 | martha_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 09/02/22 | chemical_treatment_2 | 0.91 | jhon_martens |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | chemical_treatment_3 | 0.93 | julian_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 11/02/22 | total_x_content | 0.77 | martha_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | total_y_content | 0.75 | jhon_martens |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 09/02/22 | total_z_content | 0.79 | julian_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | oc_count | 1 | martha_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 11/02/22 | tx_count | 0.1 | jhon_martens |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | optic_total | 0.57 | julian_doe |
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 09/02/22 | optic_total_1 | 0.68 | martha_doe |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 10/02/22 | physical_purity | 0.87 | jhon_martens |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 11/02/22 | genetic_contamination | 0.98 | julian_doe |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 10/02/22 | oc_count | 1 | martha_doe |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 11/02/22 | tx_count | 0.1 | jhon_martens |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 10/02/22 | optic_total | 0.57 | julian_doe |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 09/02/22 | optic_total_1 | 0.68 | martha_doe |
Desired output:
lot_nbr | batch_nbr | sku | variety | factory_nbr | code_nbr | date_test | physical_purity | date_test | genetic_contamination | date_test | chemical_treatment_1 | date_test | chemical_treatment_2 | date_test | chemical_treatment_3 | date_test | total_x_content | date_test | total_y_content | date_test | total_z_content | date_test | oc_count | date_test | tx_count | date_test | optic_total | date_test | optic_total_1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
01983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | 0.87 | 11/02/22 | 0.98 | 10/02/22 | 0.88 | 09/02/22 | 0.91 | 10/02/22 | 0.93 | 11/02/22 | 0.77 | 10/02/22 | 0.75 | 09/02/22 | 0.79 | 10/02/22 | 1 | 11/02/22 | 0.1 | 10/02/22 | 0.57 | 09/02/22 | 0.68 |
01983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 11/02/22 | 0.87 | 12/02/22 | 0.98 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 11/02/22 | 0.1 | 10/02/22 | 0.57 | 09/02/22 | 0.68 |
What is the most efficient way to get this? I was thinking of applying a lambda after pivot_table. This is the pivot_table code:
# convert the date column to datetime
df['date_test'] = pd.to_datetime(df['date_test'])
df1 = df.pivot_table(
    index=['lot_nbr', 'batch_nbr', 'sku', 'variety', 'factory_nbr', 'code_nbr', 'date_test'],
    columns=['test'],
    values=['result'],
).reset_index()
# flatten the MultiIndex columns
df1.columns = df1.columns.map('_'.join)
df1
I got this table:
lot_nbr_ | batch_nbr_ | sku_ | variety_ | factory_nbr_ | code_nbr_ | date_test_ | result_chemical_treatment_1 | result_chemical_treatment_2 | result_chemical_treatment_3 | result_genetic_contamination | result_oc_count | result_optic_total | result_optic_total_1 | result_physical_purity | result_total_x_content | result_total_y_content | result_total_z_content | result_tx_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1983244324 | 45A6545 | 764H | max | 3456 | 45099 | 09/02/22 | 0.91 | 0.68 | 0.79 | |||||||||
1983244324 | 45A6545 | 764H | max | 3456 | 45099 | 10/02/22 | 0.88 | 0.93 | 1 | 0.57 | 0.87 | 0.75 | ||||||
1983244324 | 45A6545 | 764H | max | 3456 | 45099 | 11/02/22 | 0.98 | 0.77 | 0.1 | |||||||||
1983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 09/02/22 | 0.68 | |||||||||||
1983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 10/02/22 | 1 | 0.57 | 0.87 | |||||||||
1983244325 | 45A6546 | 56H | optimum | 3456 | 45099 | 11/02/22 | 0.98 | 0.1 |
Would a lambda function be the best option after this transformation, to append the date column for each test? Or should I pivot only test, test result and date with lot_nbr as the index, create a date column containing these dates for each result, then group by lot_nbr, batch_nbr, sku, variety, factory_nbr and code_nbr, and merge these tables on lot_nbr? I know I can subset the columns like:
columns_dates= df1[['physical_purity','genetic_contamination','chemical_treatment_1', 'chemical_treatment_2','chemical_treatment_3','total_x_content','total_y_content', 'total_z_content','oc_count','tx_count','optic_total','optic_total_1']]
Then apply a lambda or function to these columns.
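A rough sketch of that second idea (pivot result and date separately, then join back on lot_nbr), on made-up toy data, in case it helps clarify what I mean:

```python
import pandas as pd

# toy frame with the same shape as my data (values are made up)
df = pd.DataFrame({
    'lot_nbr': [1, 1, 2],
    'test': ['physical_purity', 'oc_count', 'physical_purity'],
    'date_test': pd.to_datetime(['2022-02-10', '2022-02-11', '2022-02-10']),
    'result': [0.87, 1.0, 0.87],
})

# one pivot for the results, one for the dates, both keyed on lot_nbr
res = df.pivot(index='lot_nbr', columns='test', values='result')
dat = df.pivot(index='lot_nbr', columns='test', values='date_test')
dat = dat.add_prefix('date_test_')

# join the two wide frames back together on the shared index
wide = res.join(dat).reset_index()
```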
I attach the initial DataFrame in dict format:
{'lot_nbr': {0: 1983244324,
1: 1983244324,
2: 1983244324,
3: 1983244324,
4: 1983244324,
5: 1983244324,
6: 1983244324,
7: 1983244324,
8: 1983244324,
9: 1983244324,
10: 1983244324,
11: 1983244324,
12: 1983244325,
13: 1983244325,
14: 1983244325,
15: 1983244325,
16: 1983244325,
17: 1983244325},
'batch_nbr': {0: '45A6545',
1: '45A6545',
2: '45A6545',
3: '45A6545',
4: '45A6545',
5: '45A6545',
6: '45A6545',
7: '45A6545',
8: '45A6545',
9: '45A6545',
10: '45A6545',
11: '45A6545',
12: '45A6546',
13: '45A6546',
14: '45A6546',
15: '45A6546',
16: '45A6546',
17: '45A6546'},
'sku': {0: '764H',
1: '764H',
2: '764H',
3: '764H',
4: '764H',
5: '764H',
6: '764H',
7: '764H',
8: '764H',
9: '764H',
10: '764H',
11: '764H',
12: '56H',
13: '56H',
14: '56H',
15: '56H',
16: '56H',
17: '56H'},
'variety': {0: 'max',
1: 'max',
2: 'max',
3: 'max',
4: 'max',
5: 'max',
6: 'max',
7: 'max',
8: 'max',
9: 'max',
10: 'max',
11: 'max',
12: 'optimum',
13: 'optimum',
14: 'optimum',
15: 'optimum',
16: 'optimum',
17: 'optimum'},
'factory_nbr': {0: 3456,
1: 3456,
2: 3456,
3: 3456,
4: 3456,
5: 3456,
6: 3456,
7: 3456,
8: 3456,
9: 3456,
10: 3456,
11: 3456,
12: 3456,
13: 3456,
14: 3456,
15: 3456,
16: 3456,
17: 3456},
'code_nbr': {0: 45099,
1: 45099,
2: 45099,
3: 45099,
4: 45099,
5: 45099,
6: 45099,
7: 45099,
8: 45099,
9: 45099,
10: 45099,
11: 45099,
12: 45099,
13: 45099,
14: 45099,
15: 45099,
16: 45099,
17: 45099},
'date_test': {0: Timestamp('2022-02-10 00:00:00'),
1: Timestamp('2022-02-11 00:00:00'),
2: Timestamp('2022-02-10 00:00:00'),
3: Timestamp('2022-02-09 00:00:00'),
4: Timestamp('2022-02-10 00:00:00'),
5: Timestamp('2022-02-11 00:00:00'),
6: Timestamp('2022-02-10 00:00:00'),
7: Timestamp('2022-02-09 00:00:00'),
8: Timestamp('2022-02-10 00:00:00'),
9: Timestamp('2022-02-11 00:00:00'),
10: Timestamp('2022-02-10 00:00:00'),
11: Timestamp('2022-02-09 00:00:00'),
12: Timestamp('2022-02-10 00:00:00'),
13: Timestamp('2022-02-11 00:00:00'),
14: Timestamp('2022-02-10 00:00:00'),
15: Timestamp('2022-02-11 00:00:00'),
16: Timestamp('2022-02-10 00:00:00'),
17: Timestamp('2022-02-09 00:00:00')},
'test': {0: 'physical_purity',
1: 'genetic_contamination',
2: 'chemical_treatment_1',
3: 'chemical_treatment_2',
4: 'chemical_treatment_3',
5: 'total_x_content',
6: 'total_y_content',
7: 'total_z_content',
8: 'oc_count',
9: 'tx_count',
10: 'optic_total',
11: 'optic_total_1',
12: 'physical_purity',
13: 'genetic_contamination',
14: 'oc_count',
15: 'tx_count',
16: 'optic_total',
17: 'optic_total_1'},
'result': {0: 0.87,
1: 0.98,
2: 0.88,
3: 0.91,
4: 0.93,
5: 0.77,
6: 0.75,
7: 0.79,
8: 1.0,
9: 0.1,
10: 0.57,
11: 0.68,
12: 0.87,
13: 0.98,
14: 1.0,
15: 0.1,
16: 0.57,
17: 0.68},
'inspector': {0: 'jhon_martens',
1: 'julian_doe',
2: 'martha_doe',
3: 'jhon_martens',
4: 'julian_doe',
5: 'martha_doe',
6: 'jhon_martens',
7: 'julian_doe',
8: 'martha_doe',
9: 'jhon_martens',
10: 'julian_doe',
11: 'martha_doe',
12: 'jhon_martens',
13: 'julian_doe',
14: 'martha_doe',
15: 'jhon_martens',
16: 'julian_doe',
17: 'martha_doe'}}
CodePudding user response:
So, I think I've managed to achieve the desired result. You can use df.pivot. This is in fact the easy part:
import pandas as pd
from pandas import Timestamp
import re
# importing your dict
df = pd.DataFrame(data)
df_pivot = df.pivot(
index=['lot_nbr','batch_nbr','sku','variety','factory_nbr','code_nbr'],
columns='test',
values=['date_test','result']
)
df_pivot.reset_index(drop=False, inplace=True)
cols = [re.sub(r'_$','',x) for x in list(df_pivot.columns.map('_'.join))]
df_pivot.columns = cols
Unfortunately, pivot does not maintain the column order you want, so we need to do a little reordering:
# the first six id columns stay the same
first_cols = cols[:6]
# order the 'date_test_'/'result_' columns based on the order in df['test']
date_cols = [f'date_test_{i}' for i in df.test.unique()]
test_cols = [f'result_{i}' for i in df.test.unique()]
# now interleave the 'date_test_'/'result_' pairs
comb_cols = list()
for i, j in zip(date_cols, test_cols):
    comb_cols.extend([i, j])
# rebuild df_pivot.columns in the correct order
all_cols = first_cols + comb_cols
# assign df_pivot to a .loc-selection on those cols
df_pivot = df_pivot.loc[:, all_cols]
The column names are rather long now. You can of course make them a bit shorter, e.g. by replacing 'date_test_' with 'dt_' or something, but I trust you get the idea there. Result:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
print(df_pivot)
lot_nbr batch_nbr sku variety factory_nbr code_nbr \
0 1983244324 45A6545 764H max 3456 45099
1 1983244325 45A6546 56H optimum 3456 45099
date_test_physical_purity result_physical_purity \
0 2022-02-10 0.87
1 2022-02-10 0.87
date_test_genetic_contamination result_genetic_contamination \
0 2022-02-11 0.98
1 2022-02-11 0.98
date_test_chemical_treatment_1 result_chemical_treatment_1 \
0 2022-02-10 0.88
1 NaT NaN
date_test_chemical_treatment_2 result_chemical_treatment_2 \
0 2022-02-09 0.91
1 NaT NaN
date_test_chemical_treatment_3 result_chemical_treatment_3 \
0 2022-02-10 0.93
1 NaT NaN
date_test_total_x_content result_total_x_content date_test_total_y_content \
0 2022-02-11 0.77 2022-02-10
1 NaT NaN NaT
result_total_y_content date_test_total_z_content result_total_z_content \
0 0.75 2022-02-09 0.79
1 NaN NaT NaN
date_test_oc_count result_oc_count date_test_tx_count result_tx_count \
0 2022-02-10 1.0 2022-02-11 0.1
1 2022-02-10 1.0 2022-02-11 0.1
date_test_optic_total result_optic_total date_test_optic_total_1 \
0 2022-02-10 0.57 2022-02-09
1 2022-02-10 0.57 2022-02-09
result_optic_total_1
0 0.68
1 0.68
N.B. While comparing with your desired output, I noted a couple of differences that seem to be mistakes in your version. E.g. your second date_test column has 12/02/22 in the second row, a date that does not actually occur in the original df.
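As for shortening the long column names mentioned above, a minimal sketch (toy columns, just an illustration):

```python
import pandas as pd

# toy frame standing in for df_pivot after the reordering step
df_pivot = pd.DataFrame(
    columns=['lot_nbr', 'date_test_physical_purity', 'result_physical_purity']
)

# shorten the 'date_test_' prefix to 'dt_' in every column name
df_pivot.columns = [c.replace('date_test_', 'dt_', 1) for c in df_pivot.columns]
```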