I need to edit the below script to perform a action, but as i'm new to Python im struggling.
The below script classifies words within a data set:
import pandas as pd
d = {'men': ['men', 'boy'], 'women': ['women', 'girl', 'lady']}
def classify(text):
gender = 'None of any'
for i in d:
if any(j in text for j in d[i]):
gender = i
return gender
df = pd.DataFrame({'text':['this is a boy', 'a girl', 'two women', 'cat']})
df['cat'] = df['text'].apply(lambda x: classify(x))
print(df)
From my understanding, (please correct me if I'm wrong), the "pd.DataFrame" is "inline" with the script. I need the "pd.DataFrame" to pull from a CSV, running through the data and classifying, rather than it being "inline".
Does anyone know how to do this?
Thanks in advance
CodePudding user response:
Suppose your csv file is named data.csv
.
And it looks like this:
You can simply replace the line that contains pd.DataFrame(...)
as df = pd.read_csv('data.csv')
.
Here's the final result:
import pandas as pd
d = {'men': ['men', 'boy'], 'women': ['women', 'girl', 'lady']}
def classify(text):
gender = 'None of any'
for i in d:
if any(j in text for j in d[i]):
gender = i
return gender
df = pd.read_csv('data.csv') # I changed this line
df['cat'] = df['text'].apply(lambda x: classify(x))
print(df)
CodePudding user response:
First create DataFrame
by read_csv
:
df = pd.read_csv(file)
Then use Series.str.contains
:
for k, v in d.items():
df.loc[df['text'].str.contains('|'.join(v)), 'cat'] = k
df['cat'] = df['cat'].fillna('None of any')
Word boundaries alternative:
for k, v in d.items():
df.loc[df['text'].str.contains('|'.join(r"\b{}\b".format(x) for x in v)), 'cat'] = k
df['cat'] = df['cat'].fillna('None of any')
Or:
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
pat = '|'.join(r"\b{}\b".format(x) for x in d1)
df['cat'] = df['text'].str.extract(f'({pat})', expand=False).map(d1).fillna('None of any')
print(df)
text cat
0 this is a boy men
1 a girl women
2 two women women
3 cat None of any