Fill empty Pandas column based on condition on substring

Time:10-05

I have a dataset with a Job_Title column, and I added a Categories column that I want to use to categorize the job titles. For example, every job title that contains the word 'Analytics' should be categorized as Data, and the label Data should appear in the Categories column.

Dataset 1

I have created a dictionary whose keys are the words I want to find in the Job_Title column and whose values are the labels I want to put in the Categories column.

# Build a dictionary of categories from a "key;value" text file
cat_type_dic = {}
with open("categories.txt") as cat_type_file:
    for line in cat_type_file:
        # strip the trailing newline so it does not end up in the value
        key, value = line.strip().split(";")
        cat_type_dic[key] = value

print(cat_type_dic)

Then I tried to write a loop based on a condition: if a key in the dictionary is a substring of the value in the Job_Title column, fill the Categories column with the corresponding value. This is what I tried:

for i in range(len(df)):
   if df.loc["Job_Title"].str.contains(cat_type_dic[i]):
      df["Categories"] = df["Categories"].str.replace(cat_type_dic[i], cat_type_dic.get(i))

Of course, it's not working. I think I am not accessing the keys and values correctly. Any clue?

This is the error message that I am getting:

TypeError                                 Traceback (most recent call last)
 in
      1 for i in range(len(df)):
----> 2    if df.iloc["Job_Title"].str.contains(cat_type_dic[i]):
      3       df["Categories"] = df["Categories"].str.replace(cat_type_dic[i], cat_type_dic.get(i))

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    929
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932
    933     def _is_scalar_access(self, key: tuple):

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1561             key = item_from_zerodim(key)
   1562             if not is_integer(key):
-> 1563                 raise TypeError("Cannot index by location index with a non-integer key")
   1564
   1565             # validate the location

TypeError: Cannot index by location index with a non-integer key

Thanks a lot!

CodePudding user response:

Does the following code give you what you need?

import pandas as pd

df = pd.DataFrame()
df['Job_Title'] = ['Business Analyst', 'Data Scientist', 'Server Analyst']

cat_type_dic = {'Business': ['CatB1', 'CatB2'], 'Scientist': ['CatS1', 'CatS2', 'CatS3']}

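list_keys = list(cat_type_dic.keys())

def label_extracter(x):
    # keep only the dictionary keys that occur as substrings of this row's Job_Title
    list_matched_keys = list(filter(lambda y: y in x['Job_Title'], list_keys))
    # join each key's labels with spaces, then join the per-key strings into one label
    category_label = ' '.join([' '.join(cat_type_dic[key]) for key in list_matched_keys])
    return category_label

# apply the function row by row (axis=1)
df['Categories'] = df.apply(label_extracter, axis=1)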

print(df)

          Job_Title         Categories
0  Business Analyst        CatB1 CatB2
1    Data Scientist  CatS1 CatS2 CatS3
2    Server Analyst                   
EDIT: Explanation added. @SofyPond
  • apply helps when a row-by-row loop is necessary.
  • I defined a function that checks whether Job_Title contains any key of the dictionary defined earlier. I converted the keys to a list to make the check easier.
  • (list_label was renamed to category_label since it is no longer a list.) In label_extracter, cat_type_dic[key] returns the values assigned to a key as a list. The inner ' '.join converts that list to a string by putting ' ' (a space) between the values. When list_matched_keys is not empty, this produces a list of such strings, one per matched key, and the outer ' '.join combines them into a single string.
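If the dictionary maps each word to a single label, as the one read from categories.txt in the question appears to, a vectorized variant may be simpler than apply. The following is only a sketch under that assumption: the dictionary contents ('Management', 'Data') are made up for illustration, and when several keys match the same title, the last one in the dictionary wins.

```python
import pandas as pd

df = pd.DataFrame({"Job_Title": ["Business Analyst", "Data Scientist", "Server Analyst"]})

# Hypothetical mapping in the shape the question describes:
# one substring key -> one category label.
cat_type_dic = {"Business": "Management", "Scientist": "Data"}

df["Categories"] = ""  # start with an empty column
for key, value in cat_type_dic.items():
    # regex=False treats the key as a literal substring, not a pattern;
    # na=False makes missing job titles count as "no match"
    mask = df["Job_Title"].str.contains(key, regex=False, na=False)
    # label only the rows whose title contains the key
    df.loc[mask, "Categories"] = value

print(df)
```

Here the loop runs over the dictionary items rather than over the rows, so each str.contains call checks the whole column at once. Rows that match no key keep the empty string.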