Remove unwanted characters from Dataframe values in Pandas

Time:02-13

I have the following DataFrame full of locus/gene names from a multiple genome alignment.

However, I only want the locus names themselves, without the leading index or the coordinates.

   Tuberculosis_locus  Smagmatis_locus                  H37RA_locus              Bovis_locus
0  0:Rv0001:1-1524     1:MSMEG_RS33460:6986600-6988114  2:MRA_RS00005:1-1524     3:BQ2027_RS00005:1-1524
1  0:Rv0002:2052-3260  1:MSMEG_RS00005:499-1692         2:MRA_RS00010:2052-3260  3:BQ2027_RS00010:2052-3260
2  0:Rv0003:3280-4437  1:MSMEG_RS00015:2624-3778        2:MRA_RS00015:3280-4437  3:BQ2027_RS00015:3280-4437

To avoid issues with empty cells, I am filling them with 'N/A' and then stripping the unwanted characters. But the result is exactly the same; nothing seems to happen.

for value in orthologs['Tuberculosis_locus']:
    orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].fillna("N/A")
    orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].map(lambda x: x.lstrip('\d:').rstrip(':\d '))

Any idea on what I am doing wrong? I'd like the following output:

   Tuberculosis_locus  Smagmatis_locus  H37RA_locus  Bovis_locus
0  Rv0001              MSMEG_RS33460    MRA_RS00005  BQ2027_RS00005
1  Rv0002              MSMEG_RS00005    MRA_RS00010  BQ2027_RS00010
2  Rv0003              MSMEG_RS00015    MRA_RS00015  BQ2027_RS00015

CodePudding user response:

Split on ':' with at most two splits, then take the second element, e.g.:

df.applymap(lambda v: v.split(':', 2)[1])
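As for why the original attempt changes nothing: lstrip and rstrip treat their argument as a set of literal characters, not as a regular expression, so '\d:' means "backslash, d, colon" and the leading digit is never removed. Here is a minimal sketch of the split approach on two rows copied from the question, using the vectorized .str accessor column by column instead of applymap. The names unchanged and cleaned are only for illustration.

```python
import pandas as pd

# Two sample rows copied from the question.
orthologs = pd.DataFrame({
    "Tuberculosis_locus": ["0:Rv0001:1-1524", "0:Rv0002:2052-3260"],
    "Smagmatis_locus": ["1:MSMEG_RS33460:6986600-6988114", "1:MSMEG_RS00005:499-1692"],
})

# lstrip takes a set of characters to remove, not a regex. The string starts
# with '0', which is not in the set {backslash, 'd', ':'}, so nothing is stripped.
unchanged = "0:Rv0001:1-1524".lstrip(r"\d:")

# Split every column on ':' and keep the middle field (index 1).
cleaned = orthologs.apply(lambda col: col.str.split(":").str[1])
print(cleaned)
```

The .str methods pass missing values through as NaN, so the fillna step should not be needed with this variant.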

CodePudding user response:

def clean(x):
    x = x.split(':')[1].strip()
    return x

orthologs = orthologs.applymap(clean)

should work.
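One caveat: the question mentions empty cells, and calling split on a NaN (a float) raises an AttributeError. Below is a hedged sketch of a variant that leaves non-string cells alone. The sample frame is made up from rows in the question with one value deliberately missing. It applies Series.map per column, since applymap is deprecated as of pandas 2.1 in favour of DataFrame.map.

```python
import pandas as pd

def clean(x):
    # Leave missing values (NaN/None) and other non-strings untouched.
    if not isinstance(x, str):
        return x
    parts = x.split(":")
    # Keep the middle field; return the value as-is if it has no ':'.
    return parts[1].strip() if len(parts) > 1 else x

# Illustrative data: rows from the question, with one cell left empty.
orthologs = pd.DataFrame({
    "Tuberculosis_locus": ["0:Rv0001:1-1524", None],
    "Bovis_locus": ["3:BQ2027_RS00005:1-1524", "3:BQ2027_RS00010:2052-3260"],
})

# Series.map works element by element and exists in old and new pandas alike.
cleaned = orthologs.apply(lambda col: col.map(clean))
print(cleaned)
```

If you would rather show 'N/A' for the empty cells, call fillna("N/A") on the result; a value without a colon passes through clean unchanged.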
