I'm a Python beginner, so please forgive me if I'm not using the right lingo and if my code includes blatant errors.
I have text data (i.e., job descriptions from job postings) in one column of my data frame. I want to determine which job ads contain any of the following strings: bachelor, ba/bs, bs/ba.
The function I wrote doesn't work because it produces an empty column (i.e., all zeros). It works fine if I just search for one substring at a time. Here it is:
def requires_bachelor(text):
if text.find('bachelor|ba/bs|bs/ba')>-1:
return True
else:
return False
df_jobs['bachelor']=df_jobs['description'].apply(requires_bachelor).map({True:1, False:0})
Thanks so much to anyone who is willing to help!
CodePudding user response:
So, you are checking for a string bachelor|ba/bs|bs/ba
in the list, Which I don't believe will exist in any case...
What I suggest you do is to check for all possible combinations in the IF, and join them with a or statement, as follows:
def requires_bachelor(text):
if text.find('bachelor') or text.find('ba/bs') or text.find('bs/ba')>-1:
return True
else:
return False
df_jobs['bachelor']=df_jobs['description'].apply(requires_bachelor).map({True:1, False:0})
CodePudding user response:
The |
in search string does not work like or
operator. You should divide it into three calls like this:
if text.find('bachelor') > -1 or text.find('ba/bs') > -1 or text.find('bs/ba') > -1:
CodePudding user response:
You could try doing:
bachelors = ["bachelor", "ba/bs", "bs/ba"]
if any(bachelor in text for bachelor in bachelors):
return True
CodePudding user response:
Instead of writing a custom function that requires .apply
(which will be quite slow), you can use str.contains
for this. Also, you don't need map
to turn booleans into 1
and 0
; try using astype(int)
instead.
df_jobs = pd.DataFrame({'description': ['job ba/bs', 'job bachelor',
'job bs/ba', 'job ba']})
df_jobs['bachelor'] = df_jobs.description.str.contains(
'bachelor|ba/bs|bs/ba', regex=True).astype(int)
print(df_jobs)
description bachelor
0 job ba/bs 1
1 job bachelor 1
2 job bs/ba 1
3 job ba 0
# note that the pattern does not look for match on simply "ba"!
CodePudding user response:
Here's my approach. You were pretty close but you need to check for each of the items individually. If any of the available "Bachelor tags" exist, return true. Then instead of using map({true:1, false:0})
, you can use map(bool)
to make it a bit nicer. Good luck!
import pandas as pd
df_jobs = pd.DataFrame({"name":["bob", "sally"], "description":["bachelor", "ms"]})
def requires_bachelor(text):
return any(text.find(a) > -1 for a in ['bachelor', 'ba/bs','bs/ba']) # -1 if not found
df_jobs['bachelor']=df_jobs['description'].apply(requires_bachelor).map(bool)
CodePudding user response:
It can all be done simply in one line in Pandas
df_jobs['bachelor'] = df_jobs['description'].str.contains(r'bachelor|bs|ba')