Trouble writing function to remove prefixes in Python

Time:01-03

I'm trying to write a function that will remove a prefix from every element of a column in a pandas dataframe. I've made a few attempts but none have seemed to work:

prefixes = ['mm10---', 'GRCh38-']
def clean_genes(column):
    for gene in CTRL_data[f'{column}']:
        for prefix in prefixes:
            if row[f"{column}"].str.startswith(f"{prefix}"):
                gene = str.replace(f"{prefix}", '', gene)
    return column

def clean_genes(column):
    for gene in CTRL_data[f"{column}"]:
        gene = gene[7:]
    return column

clean_genes(gene)

Could someone point out where these attempts have gone wrong, or how I could better write this function? The error in both cases is:

NameError                                 Traceback (most recent call last)
/var/folders/pg/d3z5dn_x0f51tlwtj7391tjh0000gn/T/ipykernel_10029/2341573264.py in <module>
     16     return column
     17 
---> 18 clean_genes(gene)

NameError: name 'gene' is not defined

EDIT: I've also looked at some other questions on this site and others, including this one which I thought was helpful (Remove specific characters from a string in Python).

CodePudding user response:

If your question actually is "how to remove a number of prefixes from a Pandas dataframe series", then I'd maybe say

  1. create a regular expression to match those prefixes
  2. use .str.replace on those series

This will likely be a lot faster than a manual loop too.

import re
prefixes = ['mm10---', 'GRCh38-']

# Build a regexp that matches either of the given prefixes, anchored
# to the start of the string.
prefix_re = re.compile("^(" + "|".join(re.escape(prefix) for prefix in prefixes) + ")")

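# regex=True is needed on pandas 2.0+, where .str.replace defaults to literal matching.
df["my_series"] = df["my_series"].str.replace(prefix_re, "", regex=True)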
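
To tie this back to the function in the question, here is a minimal sketch of how the same idea could be wrapped up. The signature (passing the dataframe and the column name explicitly), the `ctrl_data` frame, and the sample gene names are assumptions made for illustration, not taken from your data. Note that the column name is passed as a string: the `NameError` in the question comes from calling `clean_genes(gene)` with a bare name `gene` that was never defined, and reassigning `gene` inside a `for` loop would not change the dataframe anyway.

```python
import re

import pandas as pd

PREFIXES = ["mm10---", "GRCh38-"]


def clean_genes(df, column, prefixes=PREFIXES):
    """Return a copy of df with any of the given prefixes stripped from df[column]."""
    # Escape each prefix so it is matched literally, then anchor the
    # alternation to the start of the string.
    pattern = "^(?:" + "|".join(re.escape(p) for p in prefixes) + ")"
    cleaned = df.copy()
    cleaned[column] = cleaned[column].str.replace(pattern, "", regex=True)
    return cleaned


# Hypothetical sample data standing in for CTRL_data.
ctrl_data = pd.DataFrame({"gene": ["mm10---Xkr4", "GRCh38-TP53", "ACTB"]})
result = clean_genes(ctrl_data, "gene")
print(result["gene"].tolist())  # ['Xkr4', 'TP53', 'ACTB']
```

Returning a copy rather than modifying the frame in place is a design choice here; assigning the result back (`ctrl_data = clean_genes(ctrl_data, "gene")`) gives the in-place effect if that is what you want.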

CodePudding user response:

You can remove the prefix by building a regular expression that matches either of your prefixes and then using it to replace them with an empty string, like this:

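# Named prefix_pattern rather than re, so it does not shadow the re module.
prefix_pattern = r'^(mm10\-\-\-|GRCh38\-)'
df["my_series"] = df["my_series"].str.replace(prefix_pattern, "", regex=True)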