Remove part of the column name of a dataframe using a regular expression in Python

Time:07-27

I have a dataframe "counts" and I would like to rename its second column using a regular expression, because I have multiple files whose column names carry this "extra information". The dataframe looks like this:

| GeneID |  /home/rmachado/Biotec/ARJNA231684/mapa_fin_starterar/SRR1212121_mapped.bamAligned.sortedByCoord.out.bam   |
| -------- | -------------- |
|  Ciclev10010164m.g.v1.0    | 2            |
|  Ciclev10007306m.g.v1.0    | 647            |
|  Ciclev10009318m.g.v1.0   | 39            |
|  Ciclev...   | ...           |
|  Ciclev10007306m.g.v1.0    | 112            |

I tried the following code, with no success:

for col in counts1:
  counts1.rename(columns={col:col.upper().replace("/home/rmachado/Biotec/ARJNA231684/mapa_fin_starterar/SRR1212121_mapped.bamAligned.sortedByCoord.out.bam","SRR[\d]{6}")},inplace=True)

How can I obtain a df with the following format?

| GeneID |  SRR1212121   |
| -------- | -------------- |
|  Ciclev10010164m.g.v1.0    | 2            |
|  Ciclev10007306m.g.v1.0    | 647            |
|  Ciclev10009318m.g.v1.0   | 39            |
|  Ciclev...   | ...           |
|  Ciclev10007306m.g.v1.0    | 112            |

CodePudding user response:

You could try:

df.columns = df.columns.str.extract(r'((?<=/)SRR\d+|^[^/]+$)', expand=False)

regex:

(?<=/)SRR\d+  # match "SRR" followed by digits if preceded by "/"
^[^/]+$       # else match the full string if it doesn't contain "/"
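
To see the pattern in action, here is a minimal sketch, assuming pandas is installed. The two-row `counts` frame is made-up sample data that mirrors the table in the question; only the column names matter here.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's table (values are illustrative).
long_name = (
    "/home/rmachado/Biotec/ARJNA231684/mapa_fin_starterar/"
    "SRR1212121_mapped.bamAligned.sortedByCoord.out.bam"
)
counts = pd.DataFrame({
    "GeneID": ["Ciclev10010164m.g.v1.0", "Ciclev10007306m.g.v1.0"],
    long_name: [2, 647],
})

# First alternative: "SRR" plus digits, only when directly preceded by "/".
# Second alternative: the whole name, when it contains no "/" at all.
pattern = r"((?<=/)SRR\d+|^[^/]+$)"

# expand=False with a single capture group returns an Index, not a DataFrame.
counts.columns = counts.columns.str.extract(pattern, expand=False)

print(list(counts.columns))
```

Note that `str.replace` on a plain Python string, as used in the attempt in the question, treats both arguments literally, so a pattern such as `SRR[\d]{6}` is never interpreted as a regular expression there. A column name that matches neither alternative would come back as NaN, so it is worth checking the result before assigning it.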