Home > Back-end >  Extract data from pandas dataframe columns with duplicate names
Extract data from pandas dataframe columns with duplicate names

Time:03-15

I have a dataframe which has duplicate column names:

Accepted    Accepted    Accepted    Reject    Accepted    Reject
ABC         IJK         JKL         XYJ       LMN         UIO
BCD         PQR         EFG         YVG       GHIJ        PLK

...and want to convert it into two dataframes; one only of "Accepted" columns and other for "Reject" Columns:

df1:

Accepted    Accepted    Accepted    Accepted
ABC         IJK         JKL         LMN     
BCD         PQR         EFG         GHIJ    

df2:

Reject    Reject
XYJ       UIO
YVG       PLK 

Tried:

df1=df["Accepted"]
df2=df["Reject"]

... but this only gives the first column matching this name.

CodePudding user response:

If select one column with same name are selected all columns with same name in DataFrame:

df1 = df['Accepted']
df2 = df['Reject']

Then is possible deduplicate columns:

df1.columns = [f'{x}_{i}' for i, x in enumerate(df1.columns, 1)]
df2.columns = [f'{x}_{i}' for i, x in enumerate(df2.columns, 1)]

EDIT: If get only first column name it means there are not duplicated columns names, so is possible use DataFrame.filter:

df1 = df.filter(like='Accepted')
df2 = df.filter(like='Reject')

CodePudding user response:

You can also use this

df1 = df.loc[:,df.columns.isin(['accepted'])]

OR

df2 = df.loc[:,df.columns.isin(['rejected'])]

CodePudding user response:

It's not recommended to have duplicate column names, but anyway:

df1 = df.loc[:, df.columns.get_loc('Accepted')]

  Accepted Accepted Accepted Accepted
0      ABC      IJK      JKL      LMN
1      BCD      PQR      EFG     GHIJ

df2 = df.loc[:, df.columns.get_loc('Reject')]

CodePudding user response:

IMO give the same name to columns is not a good practice. use unique names for every column. it doesn't matter if you put a number or add a unique character into the name. just make it unique

Solution from me, rename the column first

df.columns = ['Accepted1','Accepted2','Accepted3','Reject1','Accepted4','Reject2']
df1 = df[['Accepted1','Accepted2','Accepted3','Accepted4']]
df2 = df[['Reject1','Reject2']]
  • Related