Home > Net >  Python Bucket Script - Help on editing
Python Bucket Script - Help on editing

Time:11-25

I need to edit the below script to perform a action, but as i'm new to Python im struggling.

The below script classifies words within a data set:

import pandas as pd

d = {'men': ['men', 'boy'], 'women': ['women', 'girl', 'lady']}

def classify(text):
    gender = 'None of any'
    for i in d:
        if any(j in text for j in d[i]):
            gender = i
    return gender

df = pd.DataFrame({'text':['this is a boy', 'a girl', 'two women', 'cat']})
df['cat'] = df['text'].apply(lambda x: classify(x))
print(df)

From my understanding, (please correct me if I'm wrong), the "pd.DataFrame" is "inline" with the script. I need the "pd.DataFrame" to pull from a CSV, running through the data and classifying, rather than it being "inline".

Does anyone know how to do this?

Thanks in advance

CodePudding user response:

Suppose your csv file is named data.csv.

And it looks like this:

enter image description here

You can simply replace the line that contains pd.DataFrame(...) as df = pd.read_csv('data.csv').

Here's the final result:

import pandas as pd

d = {'men': ['men', 'boy'], 'women': ['women', 'girl', 'lady']}

def classify(text):
    gender = 'None of any'
    for i in d:
        if any(j in text for j in d[i]):
            gender = i
    return gender

df = pd.read_csv('data.csv')  # I changed this line
df['cat'] = df['text'].apply(lambda x: classify(x))
print(df)

CodePudding user response:

First create DataFrame by read_csv:

df  = pd.read_csv(file)

Then use Series.str.contains:

for k, v in d.items():
    df.loc[df['text'].str.contains('|'.join(v)), 'cat'] = k
df['cat'] = df['cat'].fillna('None of any')

Word boundaries alternative:

for k, v in d.items():
    df.loc[df['text'].str.contains('|'.join(r"\b{}\b".format(x) for x in v)), 'cat'] = k

df['cat'] = df['cat'].fillna('None of any')

Or:

d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
pat = '|'.join(r"\b{}\b".format(x) for x in d1)

df['cat'] = df['text'].str.extract(f'({pat})', expand=False).map(d1).fillna('None of any')


print(df)
            text          cat
0  this is a boy          men
1         a girl        women
2      two women        women
3            cat  None of any
  • Related