I am searching for particular strings in the first column of a big file using str.contains(). Some rows are reported even though they only partially match the provided string. For example:
My file structure:
miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
When I run my search code:
import pandas as pd

DE_miRNAs = ['31-5p', '150-3p']  # the actual list is much bigger
for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains(miRNA)]
I expect to get only the second row:
miR-17-5p/31-5p,Gnp,9606,0.92
but I get both the first and second rows; 331-5p also appears in the result, which it should not:
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
Is there a way to make str.contains() more specific? There is a suggestion here, but how can I implement it in a for loop? str.contains(r"\bmiRNA\b") does not work.
Thank you.
CodePudding user response:
Use str.contains with a regex alternation surrounded by word boundaries on both sides:
DE_miRNAs = ['31-5p', '150-3p']
regex = r'\b(' + '|'.join(DE_miRNAs) + r')\b'
targets = pd.read_csv('my_file.csv')
new_df = targets.loc[targets['miRNA'].str.contains(regex)]
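As a minimal, self-contained sketch of the same idea, the snippet below runs the word-boundary alternation against the sample rows from the question. Reading from an in-memory string instead of my_file.csv, escaping each name with re.escape, and switching to a non-capturing group (which avoids the pandas warning about match groups) are additions that go beyond the answer above, so treat them as assumptions rather than requirements.

```python
import io
import re

import pandas as pd

# Sample rows copied from the question; a real run would read 'my_file.csv'.
csv_text = """miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
"""
targets = pd.read_csv(io.StringIO(csv_text))

DE_miRNAs = ['31-5p', '150-3p']

# Escape each name so regex metacharacters are matched literally, and use a
# non-capturing group (?:...) so pandas does not warn about match groups.
regex = r'\b(?:' + '|'.join(re.escape(m) for m in DE_miRNAs) + r')\b'

# Keep only rows whose miRNA column contains one of the names as a whole token.
new_df = targets.loc[targets['miRNA'].str.contains(regex)]
print(new_df)
```

Here 331-5p is rejected because there is no word boundary between the two leading 3s, while /31-5p matches because / is not a word character.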
CodePudding user response:
str.contains treats its argument as a regex pattern by default, so you should be more explicit about the pattern you are using. In your case, I suggest you use /31-5p instead of 31-5p:
DE_miRNAs = ['31-5p', '150-3p']  # the actual list is much bigger
for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains("/" + miRNA)]
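If one result per miRNA is wanted, the loop from the question can be kept and the pattern built from the variable on each pass. The sketch below is one possible way to do that, not the only one: it uses word boundaries rather than the "/" prefix (a "/" prefix would miss a name that appears before the slash), reads the data once outside the loop, and stores the results in a dict called hits, which is a hypothetical name chosen for illustration. The sample rows are taken from the question.

```python
import io
import re

import pandas as pd

# Sample rows copied from the question; a real run would read 'my_file.csv'.
csv_text = """miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
"""
# Read the data once instead of on every loop iteration.
targets = pd.read_csv(io.StringIO(csv_text))

DE_miRNAs = ['31-5p', '150-3p']

hits = {}  # hypothetical name: maps each miRNA to its matching rows
for miRNA in DE_miRNAs:
    # The f-string inserts the value of the variable into the pattern.
    # r"\bmiRNA\b" fails because it searches for the literal text "miRNA".
    pattern = rf'\b{re.escape(miRNA)}\b'
    hits[miRNA] = targets.loc[targets['miRNA'].str.contains(pattern)]

for name, df in hits.items():
    print(name, len(df))
```

Keeping the per-miRNA frames in a dict also avoids overwriting new_df on each iteration, which the original loop does.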