Home > Mobile >  Python - How to remove duplicate columns and add as rows in csv
Python - How to remove duplicate columns and add as rows in csv

Time:02-02

I have following csv

Name     Date     Qty   Date     Qty   Date     Qty
---------------------------------------------------
ABC       Jan 2023   10    Feb 2023    11    Mar 2023    12
XYZ       Jan 2023   20    Feb 2023    21    Mar 2023    22

I want output as follows in csv/dataframe

Name     Date     Qty
---------------------
ABC       Jan 2023   10
ABC       Feb 2023   11
ABC       Mar 2023   12
XYZ       Jan 2023   20
XYZ       Feb 2023   21
XYZ       Mar 2023   22

How I achieve this result?

CodePudding user response:

A bit complicated but does the job. You can execute step by step to view the transformation:

>>> (df.melt('Name').assign(row=lambda x: x.groupby('variable').cumcount())
       .pivot(['row', 'Name'], 'variable', 'value')
       .reset_index('Name').rename_axis(index=None, columns=None))

  Name      Date Qty
0  ABC  Jan 2023  10
1  XYZ  Jan 2023  20
2  ABC  Feb 2023  11
3  XYZ  Feb 2023  21
4  ABC  Mar 2023  12
5  XYZ  Mar 2023  22

CodePudding user response:

Less streamlined solution compared to @Corralien's. Also uses melt and pivot.

import pandas as pd
import io

#-----------------------------------------------#
#Recreate OP's table with duplicate column names#
#-----------------------------------------------#
df = pd.read_csv(io.StringIO("""
ABC       Jan-2023   10    Feb-2023    11    Mar-2023    12
XYZ       Jan-2023   20    Feb-2023    21    Mar-2023    22
"""),header=None,delim_whitespace=True)

df.columns = ['Name','Date','Qty','Date','Qty','Date','Qty']


#-----------------#
#Start of solution#
#-----------------#
#melt from wide to long (maintains order)
melted_df = df.melt(
    id_vars='Name',
    var_name='col',
    value_name='val',
)

#add a number for Date1/Date2/Date3 to keep track of Qty1/Qty2/Qty3 etc
melted_df['col_number'] = melted_df.groupby(['Name','col']).cumcount()

#pivot back to wide form
wide_df = melted_df.pivot(
    index=['Name','col_number'],
    columns='col',
    values='val',
).reset_index().drop(columns=['col_number'])

wide_df.columns.name = None #remove column index name

#Final output
print(wide_df)

Output

  Name      Date Qty
0  ABC  Jan-2023  10
1  ABC  Feb-2023  11
2  ABC  Mar-2023  12
3  XYZ  Jan-2023  20
4  XYZ  Feb-2023  21
5  XYZ  Mar-2023  22

CodePudding user response:

import pandas as pd

# create a dataframe from your original csv
df = pd.read_csv("original.csv")

# use pandas' melt function to transform the dataframe
df = df.melt(id_vars=["Name"], var_name="Date", value_name="Qty")

# split the "Date" column into two separate columns for "Date" and "Qty"
df[['Date', 'Qty']] = df['Date'].str.split(expand=True)

# change the "Date" and "Qty" columns to appropriate data types
df["Date"] = pd.to_datetime(df["Date"])
df["Qty"] = df["Qty"].astype(int)

# sort the values by Name and Date
df = df.sort_values(by=["Name", "Date"])

# write the transformed dataframe to a new csv
df.to_csv("transformed.csv", index=False)
  • Related