I have a dataframe with two columns, df['job_title'] and df['job_industry_category']. I also have a dictonary_of_jobs that maps each job_title to a list of every job category in which that job title appears, as follows:
import random
import numpy as np
import pandas as pd

dictonary_of_jobs = {
    'Tax Accountant': ['Health', 'Financial Services', 'Property', 'IT'],
    'Office Assistant': ['Financial Services', 'Property', 'Manufacturing']}
data = [['Tax Accountant', np.nan], ['Office Assistant', np.nan], ['Tax Accountant', np.nan]]
df = pd.DataFrame(data, columns=['job_title', 'job_category'])
I would like to randomly choose a job category from the dictionary and write it into my dataframe, where df['job_industry_category'] consists entirely of NaN values, but nothing changes. I tried the .replace method and the .at method. Why is that?
for w, z in zip(df['job_title'], df['job_industry_category']):
    for x, y in dictonary_of_jobs.items():
        if w == x:
            #df.at[z,'job_category'] = random.choice(y)
            df['job_title'].replace(z, random.choice(y))
I expected to get a value from the dictionary, but nothing happened.
CodePudding user response:
You can do
df['job_category'] = df['job_category'].astype(str)
for v, w, z in zip(df.index, df['job_title'], df['job_category']):
    for x, y in dictonary_of_jobs.items():
        if w == x:
            df.at[v,'job_category'] = random.choice(y)
print(df)
          job_title        job_category
0    Tax Accountant                  IT
1  Office Assistant            Property
2    Tax Accountant  Financial Services
df.at provides label-based lookups, so we need the row label. We can add df.index to the zip to iterate over the row labels.
Do note that .at[] does not automatically cast the column type. Hence, we should use .astype(str) to convert the column to string beforehand. (refer to this answer).
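To make the "label-based" point concrete, here is a minimal sketch on a made-up frame whose row labels are letters rather than positions. The index values and the placeholder strings are assumptions for illustration only, not part of the question's data.

import pandas as pd

# Hypothetical frame with non-positional row labels, for illustration only.
df_demo = pd.DataFrame(
    {"job_title": ["Tax Accountant", "Office Assistant"],
     "job_category": ["", ""]},
    index=["a", "b"],
)

# .at takes (row label, column label); iterating df_demo.index yields the labels.
for label in df_demo.index:
    df_demo.at[label, "job_category"] = "filled for " + label

print(df_demo)

Passing a cell value (such as NaN) as the first argument to .at, as in the commented-out line of the question, does not address an existing row.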
CodePudding user response:
An alternative is apply(), which applies a function to each row of the data frame (by using axis=1):
def is_nan_job(row):
    # pd.isna also accepts strings, whereas np.isnan raises TypeError on them
    if pd.isna(row['job_category']):
        return random.choice(dictonary_of_jobs[row['job_title']])
    else:
        return row['job_category']
df['job_category'] = df.apply(is_nan_job, axis=1)
This will go through each row and check whether the job_category is NaN. If so, it will choose a random category from dictonary_of_jobs based on the job_title in the row.
Output:
          job_title job_category
0    Tax Accountant     Property
1  Office Assistant     Property
2    Tax Accountant       Health
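The same "only fill the missing rows" idea can probably also be written with a boolean mask instead of a row-wise apply. This is a sketch under assumptions: the sample frame below already has one category filled in (that row is made up), the helper name pick_category is hypothetical, and the column is cast to object first so that strings can be written into it.

import random
import numpy as np
import pandas as pd

dictonary_of_jobs = {
    'Tax Accountant': ['Health', 'Financial Services', 'Property', 'IT'],
    'Office Assistant': ['Financial Services', 'Property', 'Manufacturing'],
}
# Assumed sample data: the middle row already has a category.
df = pd.DataFrame({
    'job_title': ['Tax Accountant', 'Office Assistant', 'Tax Accountant'],
    'job_category': [np.nan, 'Property', np.nan],
})

def pick_category(job_title):
    # Hypothetical helper: returns NaN for titles missing from the dictionary.
    options = dictonary_of_jobs.get(job_title)
    return random.choice(options) if options else np.nan

# Cast to object so string values can be stored regardless of the inferred dtype.
df['job_category'] = df['job_category'].astype(object)

# Select the rows that are still missing and fill only those.
mask = df['job_category'].isna()
df.loc[mask, 'job_category'] = df.loc[mask, 'job_title'].map(pick_category)

print(df)

Rows that already had a category are left untouched, and a fresh random choice is drawn for every missing row.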
CodePudding user response:
The replace is not working because np.nan does not equal np.nan.
In [1]: import numpy as np
In [2]: np.nan == np.nan
Out[2]: False
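Since equality never matches NaN, a dedicated check is needed to detect it. A small sketch of the usual options follows; the variable names are only for illustration.

import numpy as np
import pandas as pd

value = np.nan

# Equality never matches NaN, so this comparison is False.
equal_to_nan = (value == np.nan)

# pd.isna is a reliable test, and it also accepts non-numeric values.
nan_detected = pd.isna(value)
string_is_nan = pd.isna("IT")

# np.isnan only accepts numeric input; a string raises TypeError.
try:
    np.isnan("IT")
    isnan_failed = False
except TypeError:
    isnan_failed = True

print(equal_to_nan, nan_detected, string_is_nan, isnan_failed)
# prints: False True False True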
There are other issues. For instance, replace does not mutate the dataframe column you are referencing; it returns a new object, and the result of the replace call is being dropped.
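A short sketch of that behaviour, using a made-up Series and a made-up replacement value ("Auditor"):

import pandas as pd

s = pd.Series(["Tax Accountant", "Office Assistant"])

# The result is discarded here, so s is left unchanged.
s.replace("Tax Accountant", "Auditor")
after_discarded_call = s.tolist()

# Assigning the result back is what actually keeps the change.
s = s.replace("Tax Accountant", "Auditor")
after_assignment = s.tolist()

print(after_discarded_call)
print(after_assignment)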
Rather than identifying fixes to your code, I'm going to suggest an alternate approach. If I read your question correctly, what you are aiming to do is map the values of the column job_title to some random appropriate industry.
Write the code so it matches the informal description, for instance...
In [1]: import pandas as pd
   ...: import numpy as np
   ...: import random
In [2]: job_titles = ['Tax Accountant',
   ...:               'Office Assistant',
   ...:               'Tax Accountant']
   ...: df = pd.DataFrame({'job_title': job_titles})
   ...: df
Out[2]:
          job_title
0    Tax Accountant
1  Office Assistant
2    Tax Accountant
In [3]: job_title_to_industries = {'Tax Accountant':
   ...:                                ['Health',
   ...:                                 'Financial Services',
   ...:                                 'Property',
   ...:                                 'IT'],
   ...:                            'Office Assistant':
   ...:                                ['Financial Services',
   ...:                                 'Property',
   ...:                                 'Manufacturing']}
In [4]: def pick_an_industry(job_title):
   ...:     if job_title in job_title_to_industries:
   ...:         return random.choice(job_title_to_industries[job_title])
   ...:     else:
   ...:         return np.nan
   ...:
In [5]: df['job_industry'] = df['job_title'].map(pick_an_industry)
   ...: df
Out[5]:
          job_title        job_industry
0    Tax Accountant  Financial Services
1  Office Assistant            Property
2    Tax Accountant  Financial Services
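Because the picks are random, the output above will differ from run to run. If repeatable results matter, one option is to draw from a seeded random.Random instance, as in this sketch. The make_picker helper, the seed value, and the extra 'Plumber' title are assumptions added for illustration.

import random
import numpy as np
import pandas as pd

job_title_to_industries = {
    'Tax Accountant': ['Health', 'Financial Services', 'Property', 'IT'],
    'Office Assistant': ['Financial Services', 'Property', 'Manufacturing'],
}

def make_picker(seed):
    # Hypothetical helper: each picker owns its own seeded generator.
    rng = random.Random(seed)

    def pick(job_title):
        options = job_title_to_industries.get(job_title)
        # Titles that are not in the dictionary stay missing.
        return rng.choice(options) if options else np.nan

    return pick

titles = pd.Series(['Tax Accountant', 'Office Assistant', 'Plumber'])
first = titles.map(make_picker(0))
second = titles.map(make_picker(0))

print(first)
print(first.equals(second))

With the same seed both runs make the same picks, and the unknown title maps to NaN rather than raising a KeyError.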