NameError: free variable 'gender_list' referenced before assignment in enclosing scope

I need to modularize the name_2_sex function, so I call it from a separate file called test.py, but that raises the error below. The function receives a dataframe with data on people and returns the same dataframe with 2 extra columns: one with the patient's first name and the other with their gender.

NameError: free variable 'gender_list' referenced before assignment in enclosing scope

The algorithm worked before I modularized it.

name_2_sex code (nombre_a_sexo.py):

import pandas as pd
import operator
import re


def name_2_sex(df):

  def clean_text(txt):
    txt = re.sub("[^a-záéíóúñüäë]", " ", txt.lower())
    txt = re.sub('  ',' ', txt)
    return txt.strip().split()

  def df_to_dict(df, key_column, val_column):
    """Convert two pandas Series into a dictionary."""
    xkey = df[key_column].tolist()
    xval = df[val_column].tolist()
    return dict(zip(xkey,xval))

  def get_gender2(names):
    names = clean_text(names)
    names = [x for x in names if gender_list.get(x,'a') != 'a']
    gender ={'m':0, 'f':0, 'a':0}
    for i, name in enumerate(names):
      g = gender_list.get(name,'a')
      gender[g] += 1
      gender[g] += 2 if len(names) > 1 and i == 0 and g != 'a' else 0
      gender['a'] = 0 if (gender['f'] + gender['m']) > 0 else 1
    return max(gender.items(), key=operator.itemgetter(1))[0]

  if __name__ == '__main__':
    path = 'https://www.dropbox.com/s/edm5383iffurv4x/nombres.csv?dl=1'
    gender_list = pd.read_csv(path)
    gender_list = df_to_dict(gender_list, key_column='nombre', val_column='genero')


  df_nombre_completo= df["patient_full_name"]
  pacientes_primer_nombre = []


  for name in df_nombre_completo:
    if (isinstance(name, str)):
      pacientes_primer_nombre.append(name.split(" ")[0])


  for name in df["patient_full_name"]:
    if (isinstance(name, str)):
      df["first_name"] =  name.split(" ")[0]
    else:
      df["first_name"] = 0

  df["first_name"] = [str(name).split(" ")[0] for name in df["patient_full_name"]]
  df["gender"] = df["first_name"]


  df["gender"] = [get_gender2(name) for name in df["first_name"]]

  return df

Code of the file where I want to run it (test.py):

from nombre_a_sexo import name_2_sex
import pandas as pd


df = pd.read_csv("nuevo_dataset.csv", index_col=0)

print(name_2_sex(df))

Both files are in the same folder. I did not write the algorithm that classifies names by gender, so I would not know what to edit if the problem comes from there.

CodePudding user response:

You only assign gender_list in this block:

if __name__ == '__main__':
    path = 'https://www.dropbox.com/s/edm5383iffurv4x/nombres.csv?dl=1'
    gender_list = pd.read_csv(path)
    gender_list = df_to_dict(gender_list, key_column='nombre', val_column='genero')

But this condition will only be true if you execute nombre_a_sexo.py as a top-level script, not when you import from it.

So you never assign gender_list before the rest of the code tries to use it.
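
The same failure can be reproduced in isolation. This is only a minimal sketch: the names outer, lookup and load_table are made up for illustration and are not part of your module, and load_table stands in for the `__name__ == '__main__'` check. The exact wording of the NameError message depends on the Python version.

```python
def outer(load_table):
    def lookup(name):
        # gender_list is a free variable here: it lives in outer()'s scope
        return gender_list.get(name, 'a')

    if load_table:  # stands in for `if __name__ == '__main__':`
        gender_list = {'ana': 'f', 'juan': 'm'}

    return lookup('ana')


print(outer(True))  # the assignment ran, so the lookup works

try:
    outer(False)  # the assignment was skipped, as happens on import
except NameError as exc:
    print("NameError:", exc)
```

Because gender_list is assigned somewhere inside outer, Python treats it as a local of outer in every call, and the nested function fails when that local was never actually bound.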

When the function is called from another file, I think you want to build gender_list from the df parameter instead of reading the CSV at path. So change it to:

  if __name__ == '__main__':
    path = 'https://www.dropbox.com/s/edm5383iffurv4x/nombres.csv?dl=1'
    gender_list = pd.read_csv(path)
    gender_list = df_to_dict(gender_list, key_column='nombre', val_column='genero')
  else:
    gender_list = df_to_dict(df, key_column='nombre', val_column='genero')
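
Note that the else branch assumes the df you pass in has nombre and genero columns. If it only holds patient data such as patient_full_name, another option is to make the lookup an explicit argument so it is always bound. Below is a rough sketch of that idea, using a plain list and dict instead of pandas so it runs on its own. The function name first_name_gender, the gender_list parameter and the sample names are assumptions for illustration, and the classification is simplified to a single first-name lookup rather than your get_gender2 logic.

```python
def first_name_gender(full_names, gender_list):
    """Return (first_name, gender) pairs; 'a' means the name is not in the lookup."""
    def get_gender(name):
        # gender_list is a parameter of the enclosing function, so it is always bound
        return gender_list.get(name.lower(), 'a')

    result = []
    for full_name in full_names:
        first = str(full_name).split(" ")[0]
        result.append((first, get_gender(first)))
    return result


# Hypothetical sample data standing in for nombres.csv and the patient dataframe
lookup = {'ana': 'f', 'juan': 'm'}
print(first_name_gender(["Ana Lopez", "Juan Perez", "Xyz Abc"], lookup))
```

In the real module the dict would still come from df_to_dict(pd.read_csv(path), key_column='nombre', val_column='genero'), built either inside name_2_sex without the `__name__` guard or by the caller in test.py.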