I need to modularize the name_2_sex function, which receives a dataframe, so I call it from a file named test.py, but that raises the error below. The function takes a dataframe with data on people and returns it with two extra columns: one with the patient's first name and one with their gender.
NameError: free variable 'gender_list' referenced before assignment in enclosing scope
The algorithm worked before I modularized it.
name_2_sex code:
import pandas as pd
import operator
import re

def name_2_sex(df):
    def clean_text(txt):
        txt = re.sub("[^a-záéíóúñüäë]", " ", txt.lower())
        txt = re.sub(' +', ' ', txt)
        return txt.strip().split()

    def df_to_dict(df, key_column, val_column):
        """Convert two pandas Series into a dictionary."""
        xkey = df[key_column].tolist()
        xval = df[val_column].tolist()
        return dict(zip(xkey, xval))

    def get_gender2(names):
        names = clean_text(names)
        names = [x for x in names if gender_list.get(x, 'a') != 'a']
        gender = {'m': 0, 'f': 0, 'a': 0}
        for i, name in enumerate(names):
            g = gender_list.get(name, 'a')
            gender[g] += 1
            gender[g] += 2 if len(names) > 1 and i == 0 and g != 'a' else 0
        gender['a'] = 0 if (gender['f'] + gender['m']) > 0 else 1
        return max(gender.items(), key=operator.itemgetter(1))[0]

    if __name__ == '__main__':
        path = 'https://www.dropbox.com/s/edm5383iffurv4x/nombres.csv?dl=1'
        gender_list = pd.read_csv(path)
        gender_list = df_to_dict(gender_list, key_column='nombre', val_column='genero')

    df_nombre_completo = df["patient_full_name"]
    pacientes_primer_nombre = []
    for name in df_nombre_completo:
        if (isinstance(name, str)):
            pacientes_primer_nombre.append(name.split(" ")[0])
    for name in df["patient_full_name"]:
        if (isinstance(name, str)):
            df["first_name"] = name.split(" ")[0]
        else:
            df["first_name"] = 0
    df["first_name"] = [str(name).split(" ")[0] for name in df["patient_full_name"]]
    df["gender"] = df["first_name"]
    df["gender"] = [get_gender2(name) for name in df["first_name"]]
    return df
Code of the file where I want to run it (test.py):
from nombre_a_sexo import name_2_sex
import pandas as pd
df = pd.read_csv("nuevo_dataset.csv", index_col=0)
print(name_2_sex(df))
Both files are in the same folder. I did not write the algorithm that determines gender, so I would not know what to edit if the problem comes from there.
Answer:
You only assign gender_list in this block:
    if __name__ == '__main__':
        path = 'https://www.dropbox.com/s/edm5383iffurv4x/nombres.csv?dl=1'
        gender_list = pd.read_csv(path)
        gender_list = df_to_dict(gender_list, key_column='nombre', val_column='genero')
But this condition is only true when you execute nombre_a_sexo.py as a top-level script, not when you import from it. When the module is imported, __name__ is the module's name ('nombre_a_sexo'), so the block is skipped and gender_list is never assigned before get_gender2 tries to use it.
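The same failure can be reproduced in a few lines. This is only a minimal sketch with hypothetical names (outer, lookup, table), not your code; the flag run_setup stands in for the `__name__ == '__main__'` check:

```python
def outer(run_setup):
    # `table` is a local of outer() that is only bound when run_setup is true,
    # much like gender_list inside the `if __name__ == '__main__':` block.
    def lookup(key):
        return table.get(key, "a")  # `table` is a free variable here

    if run_setup:
        table = {"ana": "f"}
    return lookup("ana")


print(outer(True))  # prints f

try:
    outer(False)
except NameError as exc:
    # The exact wording of the message depends on the Python version.
    print("NameError:", exc)
```

The inner function only looks up the variable when it is called, which is why the error appears at the first call to get_gender2 rather than at import time.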
When the function is called from another file, I think you want to use the df parameter instead of reading from this file. So change it to:
    if __name__ == '__main__':
        path = 'https://www.dropbox.com/s/edm5383iffurv4x/nombres.csv?dl=1'
        gender_list = pd.read_csv(path)
        gender_list = df_to_dict(gender_list, key_column='nombre', val_column='genero')
    else:
        gender_list = df_to_dict(df, key_column='nombre', val_column='genero')
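If the lookup table should instead always come from nombres.csv (I can't tell from the question, so treat this as an assumption), another option is to drop the `__name__` check and build the dictionary on every call, or let the caller pass it in. A rough sketch, where the helper load_gender_list and the optional gender_list parameter are hypothetical additions, and a plain dictionary lookup stands in for get_gender2:

```python
import pandas as pd

# URL taken from the question.
NAMES_URL = "https://www.dropbox.com/s/edm5383iffurv4x/nombres.csv?dl=1"


def load_gender_list(path=NAMES_URL):
    """Read the names CSV and map each 'nombre' to its 'genero'."""
    names = pd.read_csv(path)
    return dict(zip(names["nombre"], names["genero"]))


def name_2_sex(df, gender_list=None):
    # Hypothetical parameter: callers may pass their own dict; otherwise the
    # table is loaded on every call, whether imported or run as a script.
    if gender_list is None:
        gender_list = load_gender_list()

    df = df.copy()  # work on a copy so the caller's frame is left untouched
    df["first_name"] = [str(n).split(" ")[0] for n in df["patient_full_name"]]
    # Simplified stand-in for get_gender2: plain lookup, 'a' when unknown.
    df["gender"] = [gender_list.get(n.lower(), "a") for n in df["first_name"]]
    return df


# Usage with a small in-memory table, so no download is needed.
patients = pd.DataFrame({"patient_full_name": ["Ana Perez", "Luis Gomez", None]})
result = name_2_sex(patients, gender_list={"ana": "f", "luis": "m"})
print(result)
```

Passing the table as an argument also makes the function easy to test without network access.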