How to search for multiple substrings using text.find

Time:10-13

I'm a Python beginner, so please forgive me if I'm not using the right lingo and if my code includes blatant errors.

I have text data (i.e., job descriptions from job postings) in one column of my data frame. I want to determine which job ads contain any of the following strings: bachelor, ba/bs, bs/ba.

The function I wrote doesn't work because it produces an empty column (i.e., all zeros). It works fine if I just search for one substring at a time. Here it is:

def requires_bachelor(text):    
    if text.find('bachelor|ba/bs|bs/ba')>-1:
        return True
    else:
        return False
df_jobs['bachelor']=df_jobs['description'].apply(requires_bachelor).map({True:1, False:0})  

 

Thanks so much to anyone who is willing to help!

CodePudding user response:

So, find treats its argument as one literal substring, which means you are checking whether the exact string bachelor|ba/bs|bs/ba occurs in the text. I don't believe that will ever be the case...

What I suggest you do is check each substring separately in the if, compare every result against -1, and join the checks with or, as follows:

def requires_bachelor(text):
    # find returns -1 when the substring is absent, so each call needs its own comparison
    if text.find('bachelor') > -1 or text.find('ba/bs') > -1 or text.find('bs/ba') > -1:
        return True
    else:
        return False
df_jobs['bachelor']=df_jobs['description'].apply(requires_bachelor).map({True:1, False:0})

 

CodePudding user response:

The | in the search string does not work like an or operator, because find matches its argument literally. You should split it into three calls like this:

if text.find('bachelor') > -1 or text.find('ba/bs') > -1 or text.find('bs/ba') > -1:
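
To make the behaviour of find concrete, here is a small sketch on a made-up sample string (the text is only an illustration, not taken from the question's data). It shows that the | is matched as an ordinary character and why each result should be compared against -1:

```python
# Made-up sample text for illustration
text = "requires a ba/bs degree"

# find treats its argument as a literal substring, so '|' is just a character.
literal = text.find("bachelor|ba/bs|bs/ba")   # -1: that exact string is absent

# Each substring has to be searched for separately.
separate = text.find("ba/bs")                 # 11: index where "ba/bs" starts

# Pitfall: -1 is truthy and 0 is falsy, so compare against -1
# instead of using the result of find directly in a condition.
at_start = "bachelor required".find("bachelor")  # 0: found at the very start
```

Because a match at position 0 is falsy and a miss (-1) is truthy, writing if text.find('bachelor'): gives the opposite of what you'd expect in both of those cases.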

CodePudding user response:

You could try doing:

def requires_bachelor(text):
    bachelors = ["bachelor", "ba/bs", "bs/ba"]
    # True as soon as any of the substrings occurs in text
    return any(bachelor in text for bachelor in bachelors)

CodePudding user response:

Instead of writing a custom function that requires .apply (which will be quite slow), you can use str.contains for this. Also, you don't need map to turn booleans into 1 and 0; try using astype(int) instead.

import pandas as pd

df_jobs = pd.DataFrame({'description': ['job ba/bs', 'job bachelor', 
                                        'job bs/ba', 'job ba']})

df_jobs['bachelor'] = df_jobs.description.str.contains(
    'bachelor|ba/bs|bs/ba', regex=True).astype(int)

print(df_jobs)

    description  bachelor
0     job ba/bs         1
1  job bachelor         1
2     job bs/ba         1
3        job ba         0 

# note that the pattern does not look for match on simply "ba"!
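
One thing to watch with real job-posting data: str.contains is case-sensitive by default and does not return a plain True/False for missing descriptions, which can make astype(int) fail. A minimal sketch, assuming the descriptions may contain capitalised words and missing values (the sample rows below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data: mixed case plus one missing description.
df_jobs = pd.DataFrame({'description': ['Bachelor degree required',
                                        'BA/BS preferred',
                                        None,
                                        'no degree needed']})

# case=False ignores capitalisation; na=False treats missing text as "no match",
# so the result is a clean boolean column that astype(int) can convert.
df_jobs['bachelor'] = df_jobs['description'].str.contains(
    'bachelor|ba/bs|bs/ba', case=False, na=False, regex=True).astype(int)
```

Whether a missing description should count as 0 is a judgment call; na=False is just one reasonable choice.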

CodePudding user response:

Here's my approach. You were pretty close, but you need to check for each of the items individually. If any of the available "bachelor tags" exists in the text, return True. Then, instead of using map({True:1, False:0}), you can use astype(int) to turn the booleans into 1 and 0, which is a bit nicer. Good luck!

import pandas as pd

df_jobs = pd.DataFrame({"name":["bob", "sally"], "description":["bachelor", "ms"]})
def requires_bachelor(text):
    return any(text.find(a) > -1 for a in ['bachelor', 'ba/bs','bs/ba']) # find returns -1 if not found

df_jobs['bachelor']=df_jobs['description'].apply(requires_bachelor).astype(int)

CodePudding user response:

It can all be done in one line in pandas:

df_jobs['bachelor'] = df_jobs['description'].str.contains(r'bachelor|ba/bs|bs/ba')
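
If the list of search terms grows, the pattern can also be built from a list instead of being typed by hand. A sketch, assuming a hypothetical terms list; re.escape is there so that a term containing regex metacharacters such as + or . is still matched literally:

```python
import re

# Hypothetical list of terms; extend as needed.
terms = ['bachelor', 'ba/bs', 'bs/ba']

# Escape each term so it is matched literally, then join with '|' (regex "or").
pattern = '|'.join(re.escape(term) for term in terms)

def requires_bachelor(text):
    # re.search returns a match object if any term occurs anywhere in text
    return re.search(pattern, text) is not None
```

The same pattern string can be passed to df_jobs['description'].str.contains(pattern) to keep the vectorised pandas version.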