Trouble writing function to remove prefixes in Python

I'm trying to write a function that will remove a prefix from every element of a column in a pandas dataframe. I've made a few attempts but none have seemed to work:

prefixes = ['mm10---', 'GRCh38-']
def clean_genes(column):
    for gene in CTRL_data[f'{column}']:
        for prefix in prefixes:
            if row[f"{column}"].str.startswith(f"{prefix}"):
                gene = str.replace(f"{prefix}", '', gene)
    return column

def clean_genes(column):
    for gene in CTRL_data[f"{column}"]:
        gene = gene[7:]
    return column

clean_genes(gene)

Could someone point out where these attempts have gone wrong, or how I could better write this function? The error in both cases is:

NameError                                 Traceback (most recent call last)
/var/folders/pg/d3z5dn_x0f51tlwtj7391tjh0000gn/T/ipykernel_10029/2341573264.py in <module>
     16     return column
     17 
---> 18 clean_genes(gene)

NameError: name 'gene' is not defined

EDIT: I've also looked at other questions on this site and elsewhere, including this one, which I found helpful (Remove specific characters from a string in Python).

CodePudding user response：

If your question is really "how do I remove a set of prefixes from a pandas Series", then I'd suggest:

1. Create a regular expression that matches those prefixes.
2. Use .str.replace on the series.

This will likely be a lot faster than a manual loop, too.

import re
prefixes = ['mm10---', 'GRCh38-']

# Build a regexp that matches either of the given prefixes, anchored
# to the start of the string.
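prefix_re = re.compile("^(" + "|".join(re.escape(prefix) for prefix in prefixes) + ")")

# Pass regex=True explicitly: newer pandas versions default to regex=False,
# which does not accept a compiled pattern.
df["my_series"] = df["my_series"].str.replace(prefix_re, "", regex=True)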
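
For completeness, here is a minimal sketch of how the approach above could be wrapped into a function like the one the question asks for. The function name strip_prefixes, the column name "gene", and the sample values are assumptions made up for illustration; they are not from the question's data. The function returns a new Series instead of reassigning a loop variable (which leaves the DataFrame unchanged), and the column is selected with a string, which avoids the NameError raised by clean_genes(gene).

```python
import re

import pandas as pd


def strip_prefixes(series, prefixes):
    """Return a copy of `series` with any one leading prefix removed."""
    # Escape each prefix so characters like "-" are matched literally,
    # and anchor the alternation to the start of the string.
    pattern = "^(?:" + "|".join(re.escape(p) for p in prefixes) + ")"
    return series.str.replace(pattern, "", regex=True)


# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"gene": ["mm10---Xkr4", "GRCh38-TP53", "ACTB"]})
df["gene"] = strip_prefixes(df["gene"], ["mm10---", "GRCh38-"])

print(df["gene"].tolist())  # ['Xkr4', 'TP53', 'ACTB']
```

Values without any of the prefixes pass through unchanged, so the function can be applied to a mixed column.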

CodePudding user response：

You can remove the prefix by building a regular expression that matches either of your prefixes and then using it to replace them with an empty string, like this:

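# Named `pattern` so it does not shadow the standard `re` module.
# The hyphens need no escaping outside a character class.
pattern = r'^(mm10---|GRCh38-)'
# regex=True makes pandas treat the pattern as a regex rather than a literal string.
df["my_series"] = df["my_series"].str.replace(pattern, "", regex=True)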
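
To see what the ^ anchor contributes, here is a small sketch using only the standard re module. The input strings are hypothetical and chosen just to show that a prefix is removed only when it appears at the very start of the value.

```python
import re

# Same alternation as above, anchored to the start of the string.
pattern = re.compile(r"^(?:mm10---|GRCh38-)")

# Hypothetical inputs, for illustration only.
leading = pattern.sub("", "GRCh38-TP53")       # prefix at the start: removed
embedded = pattern.sub("", "TP53-GRCh38-alt")  # prefix in the middle: kept

print(leading)   # TP53
print(embedded)  # TP53-GRCh38-alt
```

Without the anchor, the second string would lose its embedded "GRCh38-" as well, which is usually not what is wanted when cleaning gene names.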