Rename pandas column iteratively

Time:10-25

I have several columns with the same name in a data frame. How can I rename the duplicated normal and KIRC columns below to normal_1, normal_2, KIRC_1, KIRC_2?

import pandas as pd

gene_exp.columns = gene_exp.iloc[-1]
gene_exp = gene_exp.iloc[:-1]
gene_exp

# Append "_[number]"
c = pd.Series(gene_exp.columns)
for dup in gene_exp.columns[gene_exp.columns.duplicated(keep=False)]:
    c[df.columns.get_loc(dup)] = ([dup + '_' + str(d_idx)
                                     if d_idx != 0
                                     else dup
                                     for d_idx in range(gene_exp.columns.get_loc(dup).sum())]
                                    )
gene_exp

Traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

/opt/conda/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/opt/conda/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'KIRC'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_27/3403075751.py in <module>
      5                                      if d_idx != 0
      6                                      else dup
----> 7                                      for d_idx in range(gene_exp.columns.get_loc(dup).sum())]
      8                                     )
      9 gene_exp

/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'KIRC'

Sample data

  Gene NAME KIRC normal normal KIRC
0  ABC  DEF  GHI    JKL    MNO  PQR
1  STU  VWX   YZ    ABC    DEF  GHI

Desired output:

  Gene NAME KIRC_1 normal_1 normal_2 KIRC_2
0  ABC  DEF    GHI      JKL      MNO    PQR
1  STU  VWX     YZ      ABC      DEF    GHI

CodePudding user response:

# set Gene and Name as Index, as we don't need these renamed
df.set_index(['Gene','NAME'], inplace=True)

# create a dataframe from the columns
df2=pd.DataFrame(df.columns.values, columns=['col'])

# create new columns by counting repeated names and adding 1 to count
# assign columns to the dataframe
df.columns = df2['col'] + '_' + (df2.groupby('col').cumcount() + 1).astype(str)

# reset index
out=df.reset_index()

  Gene NAME KIRC_1 normal_1 normal_2 KIRC_2
0  ABC  DEF    GHI      JKL      MNO    PQR
1  STU  VWX     YZ      ABC      DEF    GHI
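
For anyone who wants to try the counting idea end to end, here is a minimal sketch built on the sample data from the question. It is a variation, not the exact code above: it skips the set_index step and instead uses `duplicated(keep=False)` so that only names occurring more than once get a suffix. The variable names `cols`, `counts`, and `is_dup` are just illustrative.

```python
import pandas as pd

# Sample data from the question, built with duplicate column names.
df = pd.DataFrame(
    [["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"],
     ["STU", "VWX", "YZ", "ABC", "DEF", "GHI"]],
    columns=["Gene", "NAME", "KIRC", "normal", "normal", "KIRC"],
)

cols = pd.Series(df.columns)

# 1-based running count of each name, in column order.
counts = cols.groupby(cols).cumcount() + 1

# True for every name that occurs more than once.
is_dup = cols.duplicated(keep=False)

# Keep unique names as they are; suffix only the duplicated ones.
df.columns = cols.where(~is_dup, cols + "_" + counts.astype(str))

print(df.columns.tolist())
```

Because unique names are left alone, Gene and NAME do not need to be moved into the index first.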

CodePudding user response:

I can't see your starting dataset, but this should do what you want. Your code never assigns the renamed columns back to the dataframe, and it leaves the first occurrence of each duplicate unnumbered (when d_idx is 0, dup is kept without a suffix).

import pandas as pd

# one record; KIRC and normal are duplicated by the concat below
data = {"Gene": "ABC", "NAME": "DEF", "KIRC": "GHI", "normal": "MNO"}

df = pd.DataFrame.from_records([data])
df = pd.concat([df, df[["KIRC", "normal"]]], axis=1)
cols = pd.Series(df.columns)
for dup in df.columns[df.columns.duplicated(keep=False)]:
    cols[df.columns.get_loc(dup)] = ([dup + '_' + str(d_idx + 1)
                                     for d_idx in range(df.columns.get_loc(dup).sum())]
                                    )
df.columns = cols
print(df)
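
The loop relies on what `get_loc` returns when a label is repeated. As a small sketch, assuming a non-unique, unsorted index like the one in the question: for a label that appears more than once in non-adjacent positions, `Index.get_loc` returns a boolean mask rather than a single position, so `.sum()` gives the number of occurrences and the mask can be used to assign all of them at once. The names `idx` and `mask` are illustrative.

```python
import pandas as pd

# Column labels from the question's sample data.
idx = pd.Index(["Gene", "NAME", "KIRC", "normal", "normal", "KIRC"])

# For a repeated label in an unsorted index, get_loc returns a boolean mask.
mask = idx.get_loc("KIRC")
print(mask)

# Summing the mask counts how many columns carry that label.
print(int(mask.sum()))
```

This is also why the loop only iterates over labels picked out by `duplicated(keep=False)`: the `.sum()` call assumes a mask, which is not what `get_loc` returns for a label that appears once.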