I have several columns with the same name in a data frame. How can I rename the normal and KIRC columns below to normal_1, normal_2, KIRC_1 and KIRC_2?
import pandas as pd
gene_exp.columns = gene_exp.iloc[-1]
gene_exp = gene_exp.iloc[:-1]
gene_exp
# Append "_[number]"
c = pd.Series(gene_exp.columns)
for dup in gene_exp.columns[gene_exp.columns.duplicated(keep=False)]:
    c[df.columns.get_loc(dup)] = ([dup + '_' + str(d_idx)
                                   if d_idx != 0
                                   else dup
                                   for d_idx in range(gene_exp.columns.get_loc(dup).sum())]
                                  )
gene_exp
Traceback:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
/opt/conda/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/opt/conda/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'KIRC'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/tmp/ipykernel_27/3403075751.py in <module>
5 if d_idx != 0
6 else dup
----> 7 for d_idx in range(gene_exp.columns.get_loc(dup).sum())]
8 )
9 gene_exp
/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'KIRC'
Sample data
| | Gene | NAME | KIRC | normal | normal | KIRC |
|---|---|---|---|---|---|---|
| 0 | ABC | DEF | GHI | JKL | MNO | PQR |
| 1 | STU | VWX | YZ | ABC | DEF | GHI |
Desired output:
| | Gene | NAME | KIRC_1 | normal_1 | normal_2 | KIRC_2 |
|---|---|---|---|---|---|---|
| 0 | ABC | DEF | GHI | JKL | MNO | PQR |
| 1 | STU | VWX | YZ | ABC | DEF | GHI |
CodePudding user response:
# set Gene and Name as Index, as we don't need these renamed
df.set_index(['Gene','NAME'], inplace=True)
# create a dataframe from the columns
df2=pd.DataFrame(df.columns.values, columns=['col'])
# create new columns by counting repeated names and adding 1 to count
# assign columns to the dataframe
df.columns = df2['col'] + '_' + (df2.groupby('col').cumcount() + 1).astype(str)
# reset index
out=df.reset_index()
Gene NAME KIRC_1 normal_1 normal_2 KIRC_2
0 ABC DEF GHI JKL MNO PQR
1 STU VWX YZ ABC DEF GHI
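The same cumcount idea can also be applied straight to the column labels, without moving Gene and NAME into the index first. The following is only a minimal sketch under a few assumptions: the frame is rebuilt here from the sample data in the question, and suffix_duplicates is a hypothetical helper name, not part of pandas. Labels that occur only once are left unchanged.

```python
import pandas as pd

# Hypothetical frame mirroring the sample data in the question.
df = pd.DataFrame(
    [["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"],
     ["STU", "VWX", "YZ", "ABC", "DEF", "GHI"]],
    columns=["Gene", "NAME", "KIRC", "normal", "normal", "KIRC"],
)


def suffix_duplicates(columns):
    """Return new labels where only repeated names get a _1, _2, ... suffix."""
    names = pd.Series(list(columns))
    # 1-based running count of each name, in order of appearance
    counts = names.groupby(names).cumcount() + 1
    # True for every name that occurs more than once
    is_dup = names.duplicated(keep=False)
    # keep unique names as they are, replace repeated ones with name_count
    return names.where(~is_dup, names + "_" + counts.astype(str)).tolist()


df.columns = suffix_duplicates(df.columns)
print(df.columns.tolist())
```

Because the helper returns a plain list, it can be assigned to df.columns directly, and the original frame keeps its default integer index.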
CodePudding user response:
I can't see your starting dataset, but this should do what you want. Your code doesn't appear to assign the new column names back to the dataframe, and it leaves the first occurrence of each duplicate unnumbered (when d_idx is 0, dup is kept as is).
import pandas as pd

# build a small frame and duplicate two of its columns
data = {"Gene": "ABC", "NAME": "DEF", "KIRC": "GHI", "normal": "MNO"}
df = pd.DataFrame.from_records([data])
df = pd.concat([df, df[["KIRC", "normal"]]], axis=1)
cols = pd.Series(df.columns)
for dup in df.columns[df.columns.duplicated(keep=False)]:
    # get_loc returns a boolean mask for a duplicated label here
    cols[df.columns.get_loc(dup)] = ([dup + '_' + str(d_idx + 1)
                                      for d_idx in range(df.columns.get_loc(dup).sum())]
                                     )
# assign the new names back to the dataframe
df.columns = cols
print(df)
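The loop above relies on what get_loc returns for a repeated label, so it may help to look at that directly. This is a small sketch on a hypothetical Index with the same labels as the example: for a label that is repeated and not stored in sorted order, get_loc returns a boolean mask, which is why .sum() gives the number of occurrences. As far as I know, when the labels are sorted and the duplicates sit next to each other, get_loc returns a slice instead, so .sum() would not work in that case.

```python
import pandas as pd

# Hypothetical column labels, same layout as the example above.
cols = pd.Index(["Gene", "NAME", "KIRC", "normal", "KIRC", "normal"])

# Unique label: get_loc returns a plain integer position.
gene_pos = cols.get_loc("Gene")

# Repeated label in an unsorted index: get_loc returns a boolean mask.
kirc_mask = cols.get_loc("KIRC")
kirc_count = int(kirc_mask.sum())  # number of columns named KIRC

# Sorted index with adjacent duplicates: get_loc returns a slice instead.
adjacent = pd.Index(list("abbc")).get_loc("b")

print(gene_pos, kirc_mask, kirc_count, adjacent)
```

If the column order is not known in advance, counting occurrences with a groupby on the labels avoids depending on which of these three return types comes back.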