Home > Software design >  Distinguish repeating column names by adding an integer using pandas
Distinguish repeating column names by adding an integer using pandas

Time:04-01

I have some columns that have the same names. I would like to add a 1 to the repeating column names

Data

Date        Type    hi  hello   stat    hi  hello   
1/1/2022    a       0   0       1       1   0

Desired

Date        Type    hi  hello   stat    hi1     hello1  
1/1/2022    a       0   0       1       1       0

Doing

mask = df['col2'].duplicated(keep=False)

I believe I can utilize mask, but not sure how to efficiently achieve this without calling out the actual column. I would like to call the full dataset and allow the algorithm to update the dupe.

Any suggestion is appreciated

CodePudding user response:

Use the built-in method _maybe_dedup_names() to deduplicate column headers:

from pandas.io.parsers.base_parser import ParserBase

df.columns = ParserBase({'usecols': None})._maybe_dedup_names(df.columns)
#        Date  Type  hi  hello  stat  hi.1  hello.1
# 0  1/1/2022     a   0      0     1     1        0

This is what pandas uses internally to fix duplicate headers from read_csv().

CodePudding user response:

You need to apply the duplicated operation to the column names. And then map the duplication information to a string, which you can then add to the original column names.

df.columns = df.columns [{False:'',True:'1'}[x] for x in df.columns.duplicated()]

CodePudding user response:

We can do

s = df.columns.to_series().groupby(df.columns).cumcount().replace({0:''}).astype(str).radd('.')
df.columns = (df.columns   s).str.strip('.')
df
Out[153]: 
       Date Type  hi  hello  stat  hi.1  hello.1
0  1/1/2022    a   0      0     1     1        0
  • Related