How to replace values in a pandas column with values from a dictionary?

I have a DataFrame with two columns, job_title and job_category. I also have a dictionary, dictonary_of_jobs, that maps each job title to a list of every job category in which that title appears:

import random

import numpy as np
import pandas as pd

dictonary_of_jobs = {
    'Tax Accountant': ['Health', 'Financial Services', 'Property', 'IT'],
    'Office Assistant': ['Financial Services', 'Property', 'Manufacturing'],
}

data = [['Tax Accountant', np.nan], ['Office Assistant', np.nan], ['Tax Accountant', np.nan]]

df = pd.DataFrame(data, columns=['job_title', 'job_category'])

I would like to pick a random job category from the dictionary for each row and write it into the job_category column, which currently contains only NaN values. I tried both the .replace method and the .at method, but nothing changed. Why is that?

for w,z in zip(df['job_title'], df['job_category']):
     for x,y in dictonary_of_jobs.items():
            if w == x:
                #df.at[z,'job_category'] = random.choice(y)
                df['job_title'].replace(z, random.choice(y))

I expected each row to receive a value from the dictionary, but nothing happened.

CodePudding user response：

You can do this:

df['job_category'] = df['job_category'].astype(str)
for v, w, z in zip(df.index, df['job_title'], df['job_category']):
     for x, y in dictonary_of_jobs.items():
            if w == x:
                df.at[v,'job_category'] = random.choice(y)

print(df)

          job_title        job_category
0    Tax Accountant                  IT
1  Office Assistant            Property
2    Tax Accountant  Financial Services

df.at provides label-based lookups, so it needs a row label. Adding df.index to the zip gives the loop the label of each row.

Note that .at[] is not guaranteed to cast the column to a new type for you (the exact behavior depends on the pandas version). The job_category column holds only NaN, so its dtype is float64; converting it with .astype(str) beforehand lets it accept string values.
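As a small sketch of the casting point above, the column can also be converted to object dtype, which keeps the NaN entries as missing values while still allowing strings to be written with .at. The sample frame is rebuilt from the question; using astype(object) instead of astype(str) and the `title in dictonary_of_jobs` check are my assumptions, not part of the original answer.

```python
import random

import numpy as np
import pandas as pd

dictonary_of_jobs = {
    'Tax Accountant': ['Health', 'Financial Services', 'Property', 'IT'],
    'Office Assistant': ['Financial Services', 'Property', 'Manufacturing'],
}
df = pd.DataFrame({
    'job_title': ['Tax Accountant', 'Office Assistant', 'Tax Accountant'],
    'job_category': [np.nan, np.nan, np.nan],
})

# Assumption: object dtype is used here so the column can hold strings
# while the NaN entries remain missing values.
df['job_category'] = df['job_category'].astype(object)
still_missing = df['job_category'].isna().all()

# Look the title up directly instead of looping over the whole dictionary.
for label, title in zip(df.index, df['job_title']):
    if title in dictonary_of_jobs:
        df.at[label, 'job_category'] = random.choice(dictonary_of_jobs[title])

print(df)
```

The printed categories differ from run to run because random.choice is not seeded.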

CodePudding user response：

An alternative is apply(), which calls a function on each row of the DataFrame when used with axis=1:

def is_nan_job(row):
    # pd.isna also accepts strings; np.isnan raises a TypeError on them.
    if pd.isna(row['job_category']):
        return random.choice(dictonary_of_jobs[row['job_title']])
    else:
        return row['job_category']

df['job_category'] = df.apply(is_nan_job, axis=1)

This goes through each row and checks whether job_category is NaN. If it is, the function picks a random category from dictonary_of_jobs based on the job_title in that row; otherwise it keeps the existing value.
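One thing the function above does not cover is a job_title that is missing from the dictionary, which would raise a KeyError. A minimal sketch of one way to handle that is shown here; the name fill_missing_category, the 'Plumber' row, and the pre-filled 'IT' value are hypothetical and added only for illustration.

```python
import random

import numpy as np
import pandas as pd

dictonary_of_jobs = {
    'Tax Accountant': ['Health', 'Financial Services', 'Property', 'IT'],
    'Office Assistant': ['Financial Services', 'Property', 'Manufacturing'],
}
# Hypothetical sample: one row already has a category, one title is unknown.
df = pd.DataFrame({
    'job_title': ['Tax Accountant', 'Office Assistant', 'Plumber'],
    'job_category': ['IT', np.nan, np.nan],
})

def fill_missing_category(row):
    # Keep any category that is already present.
    if not pd.isna(row['job_category']):
        return row['job_category']
    # dict.get returns None for unknown titles instead of raising KeyError.
    options = dictonary_of_jobs.get(row['job_title'])
    # Unknown titles stay missing.
    return random.choice(options) if options else np.nan

df['job_category'] = df.apply(fill_missing_category, axis=1)
print(df)
```

Whether an unknown title should stay NaN or receive a default label is a design choice that depends on the data.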

Output:

          job_title job_category
0    Tax Accountant     Property
1  Office Assistant     Property
2    Tax Accountant       Health

CodePudding user response：

The replace is not working because np.nan does not equal np.nan.

In [1]: import numpy as np

In [2]: np.nan == np.nan
Out[2]: False

There are other issues. For instance, replace does not mutate the DataFrame column you are referencing: it returns a new Series, and the code discards that result. The call is also made on job_title rather than on the category column.
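The point about dropped results can be seen in a short sketch. The Series and the replacement value 'Accountant' are made up for illustration.

```python
import pandas as pd

titles = pd.Series(['Tax Accountant', 'Office Assistant'])

# The return value is discarded, so `titles` is left untouched.
titles.replace('Tax Accountant', 'Accountant')
after_discarded_call = titles.tolist()

# Assigning the result back is what makes the change stick.
titles = titles.replace('Tax Accountant', 'Accountant')
after_assignment = titles.tolist()

print(after_discarded_call)
print(after_assignment)
```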

Rather than identifying fixes to your code, I'm going to suggest an alternate approach. If I read your question correctly, what you are aiming to do is map each value in the job_title column to a random appropriate industry.

Write the code so that it matches that informal description, for instance:

In [1]: import pandas as pd
   ...: import numpy as np
   ...: import random

In [2]: job_titles = ['Tax Accountant',
   ...:               'Office Assistant',
   ...:               'Tax Accountant']
   ...: df = pd.DataFrame({'job_title': job_titles})
   ...: df
Out[2]:
          job_title
0    Tax Accountant
1  Office Assistant
2    Tax Accountant

In [3]: job_title_to_industries = {'Tax Accountant':
   ...:                               ['Health',
   ...:                                'Financial Services',
   ...:                                'Property',
   ...:                                'IT'],
   ...:                            'Office Assistant':
   ...:                                ['Financial Services',
   ...:                                 'Property',
   ...:                                 'Manufacturing']}

In [4]: def pick_an_industry(job_title):
   ...:     if job_title in job_title_to_industries:
   ...:         return random.choice(job_title_to_industries[job_title])
   ...:     else:
   ...:         return np.nan
   ...:

In [5]: df['job_industry'] = df['job_title'].map(pick_an_industry)
   ...: df
Out[5]:
          job_title        job_industry
0    Tax Accountant  Financial Services
1  Office Assistant            Property
2    Tax Accountant  Financial Services
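The session above builds a new job_industry column. If the goal is instead to fill only the missing entries of an existing column, as the question describes, the same mapping could be combined with fillna. This is a sketch under assumptions: the job_category column name comes from the question, the pre-filled 'IT' value is added for illustration, and pick_an_industry is rewritten with dict.get.

```python
import random

import numpy as np
import pandas as pd

job_title_to_industries = {
    'Tax Accountant': ['Health', 'Financial Services', 'Property', 'IT'],
    'Office Assistant': ['Financial Services', 'Property', 'Manufacturing'],
}

def pick_an_industry(job_title):
    # Titles without an entry in the mapping stay missing.
    options = job_title_to_industries.get(job_title)
    return random.choice(options) if options else np.nan

# Hypothetical sample: the first row already has a category.
df = pd.DataFrame({
    'job_title': ['Tax Accountant', 'Office Assistant', 'Tax Accountant'],
    'job_category': ['IT', np.nan, np.nan],
})

# fillna aligns on the index, so only rows that are NaN take a mapped value.
picked = df['job_title'].map(pick_an_industry)
df['job_category'] = df['job_category'].fillna(picked)
print(df)
```

Existing values are left alone because fillna only touches the positions that are missing.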