Python Bucket Script - Help on editing-CodePudding

I need to edit the below script to perform a action, but as i'm new to Python im struggling.

The below script classifies words within a data set:

import pandas as pd

d = {'men': ['men', 'boy'], 'women': ['women', 'girl', 'lady']}

def classify(text):
    gender = 'None of any'
    for i in d:
        if any(j in text for j in d[i]):
            gender = i
    return gender

df = pd.DataFrame({'text':['this is a boy', 'a girl', 'two women', 'cat']})
df['cat'] = df['text'].apply(lambda x: classify(x))
print(df)

From my understanding, (please correct me if I'm wrong), the "pd.DataFrame" is "inline" with the script. I need the "pd.DataFrame" to pull from a CSV, running through the data and classifying, rather than it being "inline".

Does anyone know how to do this?

Thanks in advance

CodePudding user response：

Suppose your csv file is named data.csv.

And it looks like this:

You can simply replace the line that contains pd.DataFrame(...) as df = pd.read_csv('data.csv').

Here's the final result:

import pandas as pd

d = {'men': ['men', 'boy'], 'women': ['women', 'girl', 'lady']}

def classify(text):
    gender = 'None of any'
    for i in d:
        if any(j in text for j in d[i]):
            gender = i
    return gender

df = pd.read_csv('data.csv')  # I changed this line
df['cat'] = df['text'].apply(lambda x: classify(x))
print(df)

CodePudding user response：

First create DataFrame by read_csv:

df  = pd.read_csv(file)

Then use Series.str.contains:

for k, v in d.items():
    df.loc[df['text'].str.contains('|'.join(v)), 'cat'] = k
df['cat'] = df['cat'].fillna('None of any')

Word boundaries alternative:

for k, v in d.items():
    df.loc[df['text'].str.contains('|'.join(r"\b{}\b".format(x) for x in v)), 'cat'] = k

df['cat'] = df['cat'].fillna('None of any')

Or:

d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
pat = '|'.join(r"\b{}\b".format(x) for x in d1)

df['cat'] = df['text'].str.extract(f'({pat})', expand=False).map(d1).fillna('None of any')


print(df)
            text          cat
0  this is a boy          men
1         a girl        women
2      two women        women
3            cat  None of any