Pandas- Combine multiple rows into a single row and create new columns-CodePudding

I am working with order data using pandas. Each order can contain up to two rows (one row for each component of the order, which has a max of two components).

My goal is to take two rows and turn them into one.

example input:

Order_Number	INVENTORY CODE	description1
4304	STDROFENHBSM608.511WH	8-1/2 x 11, 60# Soporset Digital HiBright Smooth
4304	STNDEN695WOL	6 x 9.5 DBL WDW ENVELOPE

example output:

Order_Number	INVENTORY CODE	description1	INVENTORY CODE2	description2
4304	STDROFENHBSM608.511WH	8-1/2 x 11, 60# Soporset Digital HiBright Smooth	STNDEN695WOL	6 x 9.5 DBL WDW ENVELOPE

Here is my current code:

#name of infile to be the name of the file being ingested
infile= ('file1.csv')
infile2= ('file2.csv')
infile3= ('file3.csv')
#length of the infile for naming purposes
size = len(infile)
size2 = len(infile2)
#name outfile to be the name of the desired output file
outfile = (infile[:size -4] "_" infile2[:size2 -4] "_output.csv")


#Data read in as dataframe
df = pd.read_csv(infile, encoding = "ISO-8859-1", engine= 'python')
df2 = pd.read_csv(infile2, encoding = "ISO-8859-1", engine= 'python')
df3 = pd.read_csv(infile3, encoding = "ISO-8859-1", engine= 'python')

df_merge =df.merge(df2, on='Order_Number', how='left')
df_final = pd.merge(df_merge,df3[['PkgID?','Printer', 'Inserter']],on='PkgID?', how='left')

The above code combines the various data I am working with into a single dataframe, but has duplicate rows based off order number as mentioned above.

Any help is appreciated, TIA.

CodePudding user response：

You need to reshape your dataframe. This should work

res = pd.DataFrame(df.drop(columns='Order_Number').values.reshape(len(df)//2, 4), # reshape Inventory code and description columns
                   columns=['INVENTORY CODE', 'description1', 'INVENTORY CODE2', 'description2'], # set new column names
                   index=df['Order_Number'].drop_duplicates()).reset_index() # set index by Order_Number and reset_index for a new column
res

CodePudding user response：

LIMITATIONS: the separator used here is | which mustn't be in your data

import pandas as pd

df = pd.read_csv('df.csv')

# combine based on Order_Number
df_ = df.groupby('Order_Number').agg({'INVENTORY CODE':'|'.join,'description':'|'.join}).reset_index()

# split and expand
df_1 = df_['INVENTORY CODE'].str.split('|', expand=True).add_prefix('INVENTORY CODE_')
df_2 = df_['description'].str.split('|', expand=True).add_prefix('description_')

# combine again and remove the ['INVENTORY CODE', 'description'] columns
df_ = pd.concat([df_, df_1, df_2], axis=1).drop(['INVENTORY CODE', 'description'], axis=1)

# sort the columns to match the output
base_cols = [col for col in df_.columns if col.split('_', 1)[0] not in ['INVENTORY CODE', 'description']]
sort_columns = [col for col in df_.columns if col not in base_cols]

df_ = df_[base_cols   sorted(sort_columns, key=lambda x: (x=='Order_Number', int(x.split('_')[-1])))]

output:

Order_Number	INVENTORY CODE_0	description_0	INVENTORY CODE_1	description_1	INVENTORY CODE_2	description_2	INVENTORY CODE_3	description_3
4304	STDROFENHBSM608.511WH	8-1/2 x 11, 60# Soporset Digital HiBright Smooth	STNDEN695WOL	6 x 9.5 DBL WDW ENVELOPE	STNDEN695WOL	6 x 9.5 DBL WDW ENVELOPE
4305	STNDEN695WOL	6 x 9.5 DBL WDW ENVELOPE	STNDEN695WOL	6 x 9.5 DBL WDW ENVELOPE	STNDEN695WOL	6 x 9.5 DBL WDW ENVELOPE	STNDEN695WOL	6 x 9.5 DBL WDW ENVELOPE