I have a dataset with a Job_Title column, and I added a Categories column that I want to use to categorize the job titles. For example, all the job titles that contain the word 'Analytics' will be categorized as Data. This label, Data, will appear in the Categories column.
I have created a dictionary with the words I want to identify in the Job_Title column as keys and the labels I want to add to the Categories column as values.
# Create a dictionary mapping keywords to categories
cat_type_dic = {}
with open("categories.txt") as cat_type_file:
    for line in cat_type_file:
        # Each line is split on ";"; strip the trailing newline first
        key, value = line.strip().split(";")
        cat_type_dic[key] = value
print(cat_type_dic)
Then I tried to write a loop with a condition: if a key in the dictionary is a substring of the value in the Job_Title column, fill the Categories column with the corresponding value. This is what I tried:
for i in range(len(df)):
    if df.iloc["Job_Title"].str.contains(cat_type_dic[i]):
        df["Categories"] = df["Categories"].str.replace(cat_type_dic[i], cat_type_dic.get(i))
Of course, it's not working. I think I am not accessing the keys and values correctly. Any clue?
This is the error message I am getting:
TypeError                                 Traceback (most recent call last) in
      1 for i in range(len(df)):
----> 2     if df.iloc["Job_Title"].str.contains(cat_type_dic[i]):
      3         df["Categories"] = df["Categories"].str.replace(cat_type_dic[i], cat_type_dic.get(i))

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    929
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932
    933     def _is_scalar_access(self, key: tuple):

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1561             key = item_from_zerodim(key)
   1562             if not is_integer(key):
-> 1563                 raise TypeError("Cannot index by location index with a non-integer key")
   1564
   1565             # validate the location

TypeError: Cannot index by location index with a non-integer key
Thanks a lot!
CodePudding user response:
Does the following code give you what you need?
import pandas as pd

df = pd.DataFrame()
df['Job_Title'] = ['Business Analyst', 'Data Scientist', 'Server Analyst']
cat_type_dic = {'Business': ['CatB1', 'CatB2'], 'Scientist': ['CatS1', 'CatS2', 'CatS3']}
list_keys = list(cat_type_dic.keys())

def label_extracter(x):
    # Keep only the dictionary keys that occur in this row's job title
    list_matched_keys = list(filter(lambda y: y in x['Job_Title'], list_keys))
    # Join each key's labels, then join the results for all matched keys
    category_label = ' '.join([' '.join(cat_type_dic[key]) for key in list_matched_keys])
    return category_label

df['Categories'] = df.apply(label_extracter, axis=1)
print(df)
          Job_Title         Categories
0  Business Analyst        CatB1 CatB2
1    Data Scientist  CatS1 CatS2 CatS3
2    Server Analyst
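On the error in the question: .iloc only accepts integer positions, so df.iloc["Job_Title"] raises the TypeError shown, and cat_type_dic[i] with an integer i would look up a key that is probably not in the dictionary. If each keyword maps to a single label, as the categories.txt format in the question suggests, a loop over the dictionary items with a boolean mask may be enough. This is only a minimal sketch: the sample job titles and the keyword-to-label entries below are invented for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical sample data; the real DataFrame comes from the asker's dataset.
df = pd.DataFrame({"Job_Title": ["Analytics Manager", "Data Scientist", "Sales Lead"]})
df["Categories"] = ""

# Hypothetical keyword -> label mapping; the real one is read from categories.txt.
cat_type_dic = {"Analytics": "Data", "Scientist": "Data", "Sales": "Business"}

for keyword, label in cat_type_dic.items():
    # Boolean mask of the rows whose title contains the keyword as a plain substring
    mask = df["Job_Title"].str.contains(keyword, regex=False)
    # Write the label only into the matching rows
    df.loc[mask, "Categories"] = label

print(df)
```

Note that with this approach a title matching several keywords keeps the label of the last matching keyword, whereas label_extracter above concatenates all matches.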
EDIT: Explanation added. @SofyPond
- apply helps when a loop is necessary.
- I defined a function that checks whether Job_Title contains any key of the dictionary defined earlier. I converted the keys to a list to make the check easier.
- (list_label was renamed to category_label, since it is no longer a list.) In label_extracter, category_label collects the values assigned to each matched key, which are stored as lists. Each list is converted to a str by putting ' ' (a space) between its values with the inner ' '.join. When list_matched_keys is not empty, this yields a list of strings, which the outer ' '.join then combines into a single string.
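To make the two join steps concrete, here is a small sketch that runs them separately on a single title. The title 'Business Scientist' is a made-up example chosen so that both keys match; it is not part of the data above.

```python
cat_type_dic = {'Business': ['CatB1', 'CatB2'], 'Scientist': ['CatS1', 'CatS2', 'CatS3']}

# Hypothetical job title that contains both dictionary keys
title = 'Business Scientist'
matched = [key for key in cat_type_dic if key in title]

# Inner join: one string per matched key
inner = [' '.join(cat_type_dic[key]) for key in matched]
print(inner)   # ['CatB1 CatB2', 'CatS1 CatS2 CatS3']

# Outer join: a single string for the whole row
outer = ' '.join(inner)
print(outer)   # CatB1 CatB2 CatS1 CatS2 CatS3

# With no matched keys, the outer join of an empty list gives an empty string
no_match = ' '.join([])
print(repr(no_match))   # ''
```

The empty-string case is why the Server Analyst row ends up with a blank Categories value.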