How to explicitly find string using str.contains() in a loop?

I am searching for particular strings in the first column of a big file using str.contains(). Some rows are returned even when they only partially match the provided string. For example:

My file structure:

miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94

When I run my search code:

DE_miRNAs = ['31-5p', '150-3p'] #the actual list is much bigger
for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains(miRNA)]

I expect to get only the second row:

miR-17-5p/31-5p,Gnp,9606,0.92

but I get both the first and second rows; 331-5p shows up in the result too, which it should not:

miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92

Is there a way to make str.contains() more specific? There is a suggestion here, but how can I apply it in a for loop? str.contains(r"\bmiRNA\b") does not work.

Thank you.

CodePudding user response：

Use str.contains with a regex alternation which is surrounded by word boundaries on both sides:

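import pandas as pd

DE_miRNAs = ['31-5p', '150-3p']
# Join the names into one alternation wrapped in word boundaries.
# (?:...) is non-capturing, so pandas does not warn about match groups.
regex = r'\b(?:' + '|'.join(DE_miRNAs) + r')\b'

targets = pd.read_csv('my_file.csv')
new_df = targets.loc[targets['miRNA'].str.contains(regex)]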
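
To check the word-boundary idea against the rows from the question, here is a minimal self-contained sketch. It assumes the sample rows are representative of the real file; io.StringIO stands in for my_file.csv, and re.escape is an extra precaution (not part of the answer above) in case a name contains regex metacharacters.

```python
import io
import re

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for my_file.csv.
csv_text = """miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
"""
targets = pd.read_csv(io.StringIO(csv_text))

DE_miRNAs = ['31-5p', '150-3p']

# Escape each name, join them with "|", and wrap the group in \b on both sides.
pattern = r'\b(?:' + '|'.join(re.escape(m) for m in DE_miRNAs) + r')\b'

# Keep only rows whose miRNA field contains one of the names as a whole token.
new_df = targets.loc[targets['miRNA'].str.contains(pattern)]
print(new_df)
```

On this sample only the miR-17-5p/31-5p row is kept, because there is no word boundary between the two 3s in 331-5p. Note that \b also treats "-" as a boundary, so a name such as 31-5p would still match inside something like x-31-5p; whether that is acceptable depends on how the names in the real file are formed.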

CodePudding user response：

str.contains treats its argument as a regex pattern by default, so you need to be more explicit about the pattern you pass.

In your case, I suggest you use /31-5p instead of 31-5p:

DE_miRNAs = ['31-5p', '150-3p'] #the actual list is much bigger
targets = pd.read_csv('my_file.csv')  # read the file once, outside the loop
for miRNA in DE_miRNAs:
    new_df = targets.loc[targets['miRNA'].str.contains("/" + miRNA)]
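
The "/" prefix anchors only the left side of the name and assumes the name always follows a slash. If the loop should be kept, one possible variant is to build a word-boundary pattern from the loop variable on each pass. The sketch below assumes the sample rows from the question; the results dictionary is a hypothetical container added here so that each pass does not overwrite the previous one.

```python
import io
import re

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for my_file.csv.
csv_text = """miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
"""
targets = pd.read_csv(io.StringIO(csv_text))  # read once, outside the loop

DE_miRNAs = ['31-5p', '150-3p']

results = {}  # hypothetical: maps each name to its matching rows
for miRNA in DE_miRNAs:
    # Interpolate the variable's value into the pattern. Writing r"\bmiRNA\b"
    # would search for the literal text "miRNA", not the value of the variable.
    pattern = rf'\b{re.escape(miRNA)}\b'
    results[miRNA] = targets.loc[targets['miRNA'].str.contains(pattern)]

for name, df in results.items():
    print(name, len(df))
```

This also explains why str.contains(r"\bmiRNA\b") from the question finds nothing useful: the variable name inside the string is never replaced by its value.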