I'm trying to write a function that will remove a prefix from every element of a column in a pandas dataframe. I've made a few attempts but none have seemed to work:
prefixes = ['mm10---', 'GRCh38-']

def clean_genes(column):
    for gene in CTRL_data[f'{column}']:
        for prefix in prefixes:
            if row[f"{column}"].str.startswith(f"{prefix}"):
                gene = str.replace(f"{prefix}", '', gene)
    return column

def clean_genes(column):
    for gene in CTRL_data[f"{column}"]:
        gene = gene[7:]
    return column

clean_genes(gene)
Could someone point out where these attempts have gone wrong, or how I could better write this function? The error in both cases is:
NameError Traceback (most recent call last)
/var/folders/pg/d3z5dn_x0f51tlwtj7391tjh0000gn/T/ipykernel_10029/2341573264.py in <module>
16 return column
17
---> 18 clean_genes(gene)
NameError: name 'gene' is not defined
EDIT: I've also looked at some other questions on this site and others, including this one which I thought was helpful (Remove specific characters from a string in Python).
CodePudding user response:
If your question actually is "how to remove a number of prefixes from a Pandas dataframe series", then I'd maybe say
- create a regular expression to match those prefixes
- use .str.replace on that series
This will likely be a lot faster than a manual loop too.
import re
prefixes = ['mm10---', 'GRCh38-']
# Build a regexp that matches either of the given prefixes, anchored
# to the start of the string.
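prefix_re = re.compile("^(" + "|".join(re.escape(prefix) for prefix in prefixes) + ")")
# regex=True is needed when passing a compiled pattern (the default is regex=False in pandas 2.x).
df["my_series"] = df["my_series"].str.replace(prefix_re, "", regex=True)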
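To make the idea concrete, here is a small self-contained sketch of the same approach on made-up data. The column name "gene" and the sample values are assumptions for illustration only; swap in your own dataframe and column.

```python
import re

import pandas as pd

prefixes = ["mm10---", "GRCh38-"]

# Hypothetical sample data; the column name "gene" is an assumption.
df = pd.DataFrame({"gene": ["mm10---Actb", "GRCh38-TP53", "GAPDH"]})

# Escape each prefix so it is matched literally, join the alternatives
# with "|", and anchor the group to the start of the string.
prefix_re = re.compile("^(?:" + "|".join(re.escape(p) for p in prefixes) + ")")

# Values that start with one of the prefixes lose it; others are unchanged.
df["gene_clean"] = df["gene"].str.replace(prefix_re, "", regex=True)

print(df)
```

Because the pattern is anchored with ^, a prefix that happens to appear in the middle of a value is left alone.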
CodePudding user response:
You can remove the prefix by building a regular expression that matches either of your prefixes and then using it to replace them with an empty string, like this:
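# Named "pattern" rather than "re" so it does not shadow the standard-library re module.
pattern = r'^(mm10\-\-\-|GRCh38\-)'
# regex=True makes pandas treat the string as a regular expression (the default is False in pandas 2.x).
df["my_series"] = df["my_series"].str.replace(pattern, "", regex=True)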
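If you want to keep the function shape from the question, one possible sketch is below. Note that the NameError in the question comes from calling clean_genes(gene) with a bare name that was never defined; the column name should be passed as a string instead. Also, reassigning the loop variable (gene = ...) inside the loop does not write anything back to the dataframe, so the result has to be assigned to the column. The signature here (frame, column, prefixes) and the sample data are assumptions, not the asker's actual setup.

```python
import re

import pandas as pd


def clean_genes(frame, column, prefixes):
    """Return frame[column] with any one of the leading prefixes removed."""
    # Build an anchored alternation of the literal prefixes.
    pattern = "^(?:" + "|".join(re.escape(p) for p in prefixes) + ")"
    return frame[column].str.replace(pattern, "", regex=True)


# Hypothetical stand-in for the CTRL_data dataframe from the question.
CTRL_data = pd.DataFrame({"gene": ["mm10---Actb", "GRCh38-TP53"]})

# Pass the column name as a string and assign the result back.
CTRL_data["gene"] = clean_genes(CTRL_data, "gene", ["mm10---", "GRCh38-"])
```

Returning a new series and assigning it back keeps the function free of hidden references to a global dataframe.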