I'm trying to remove all the unnecessary words and characters from the values in this column. I want the rows to contain just 'Entry level', 'Mid-Senior level', etc. Also, is there any way to translate the Arabic to English, or should I use the replace function?
df_africa.seniority_level.value_counts()
{'Seniority level': 'Entry level'} 1073
{'Seniority level': 'Mid-Senior level'} 695
{'Seniority level': 'Associate'} 481
{'Seniority level': 'Not Applicable'} 150
{'مستوى الأقدمية': 'مستوى متوسط الأقدمية'} 115
{'مستوى الأقدمية': 'مستوى المبتدئين'} 82
{'نوع التوظيف': 'دوام كامل'} 73
{'مستوى الأقدمية': 'مساعد'} 48
{'مستوى الأقدمية': 'غير مطبق'} 42
{'Seniority level': 'Internship'} 39
{'Employment type': 'Contract'} 21
{'Employment type': 'Full-time'} 1
I've tried the split function, but I couldn't get it to work properly.
CodePudding user response:
IIUC, use this:
import ast
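# Mask: True where the value has no uppercase Latin letter (i.e. the Arabic rows)
m = ~df_africa["seniority_level"].str.contains("[A-Z]")
# Parse each string into a real dict
s = df_africa["seniority_level"].apply(ast.literal_eval)
# Arabic rows read the Arabic key, the rest read the English key;
# rows keyed by anything else (e.g. 'Employment type') become NaN
df_africa["new_col"] = s.str["مستوى الأقدمية"].where(m, s.str["Seniority level"])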
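The lookup above reads only the two seniority keys, so rows keyed by 'Employment type' or 'نوع التوظيف' end up as NaN. A minimal sketch of an alternative, assuming every cell is a string holding a one-item dict: take the single value whatever its key is. The helper name `extract_value` is hypothetical, not part of pandas.

```python
import ast

def extract_value(cell):
    """Return the value of a one-item dict string such as
    "{'Seniority level': 'Entry level'}", regardless of its key."""
    # Parse strings; pass through anything already parsed (or NaN)
    parsed = ast.literal_eval(cell) if isinstance(cell, str) else cell
    if not isinstance(parsed, dict) or not parsed:
        return None  # missing or unexpected cell
    return next(iter(parsed.values()))

samples = [
    "{'Seniority level': 'Entry level'}",
    "{'Employment type': 'Contract'}",
    "{'مستوى الأقدمية': 'مساعد'}",
]
print([extract_value(s) for s in samples])
```

On the DataFrame this would be applied as df_africa["new_col"] = df_africa["seniority_level"].map(extract_value).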
If you need to translate the words extracted, use
CodePudding user response:
It would be useful to know the type of the 'seniority_level' column, but I'm going to assume it is made up of literal strings (e.g. "{'Seniority level': 'Entry level'}").
You can translate all the text with the googletrans package. It piggybacks off Google Translate, so use it while it lasts. Make sure to install version 4.0.0rc1.
$ pip install googletrans==4.0.0rc1
Translate:
from googletrans import Translator

translator = Translator()

def translate_to_english(words):
    # Only call the translator when the string contains a non-ASCII character
    for character in words:
        if ord(character) > 127:
            return translator.translate(words, dest="en").text
    return words

df_africa["new_seniority_level"] = df_africa["seniority_level"].map(translate_to_english)
print(df_africa)
seniority_level new_seniority_level
0 {'Seniority level': 'Entry level'} {'Seniority level': 'Entry level'}
1 {'Seniority level': 'Mid-Senior level'} {'Seniority level': 'Mid-Senior level'}
2 {'Seniority level': 'Associate'} {'Seniority level': 'Associate'}
3 {'Seniority level': 'Not Applicable'} {'Seniority level': 'Not Applicable'}
4 {'مستوى الأقدمية': 'مستوى متوسط الأقدمية'} {'Seniority level': 'average level of seniority'}
5 {'مستوى الأقدمية': 'مستوى المبتدئين'} {'Seniority level': 'beginners' level'}
6 {'نوع التوظيف': 'دوام كامل'} {'Recruitment type': 'full time'}
7 {'مستوى الأقدمية': 'مساعد'} {'Seniority level': 'assistant'}
8 {'مستوى الأقدمية': 'غير مطبق'} {'Senior level': 'unprecedented'}
9 {'Seniority level': 'Internship'} {'Seniority level': 'Internship'}
10 {'Employment type': 'Contract'} {'Employment type': 'Contract'}
11 {'Employment type': 'Full-time'} {'Employment type': 'Full-time'}
Then extract the text you want:
import re
# Capture whatever sits between the quotes after the colon
df_africa["new_seniority_level"] = df_africa["new_seniority_level"].map(lambda row: re.match(r".*: '(.*)'", row).group(1))
print(df_africa)
seniority_level new_seniority_level
0 {'Seniority level': 'Entry level'} Entry level
1 {'Seniority level': 'Mid-Senior level'} Mid-Senior level
2 {'Seniority level': 'Associate'} Associate
3 {'Seniority level': 'Not Applicable'} Not Applicable
4 {'مستوى الأقدمية': 'مستوى متوسط الأقدمية'} average level of seniority
5 {'مستوى الأقدمية': 'مستوى المبتدئين'} beginners' level
6 {'نوع التوظيف': 'دوام كامل'} full time
7 {'مستوى الأقدمية': 'مساعد'} assistant
8 {'مستوى الأقدمية': 'غير مطبق'} unprecedented
9 {'Seniority level': 'Internship'} Internship
10 {'Employment type': 'Contract'} Contract
11 {'Employment type': 'Full-time'} Full-time
Look into the official Google Translate API if googletrans eventually breaks.
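Since the column holds only a handful of distinct labels, and the machine translation above drifts from the English labels already in the data ('غير مطبق' came back as 'unprecedented' rather than 'Not Applicable'), a fixed lookup may be the more predictable route. This is a rough sketch: the pairings in `ARABIC_TO_ENGLISH` are assumptions about which English label each Arabic one corresponds to and should be checked before use, and `to_english` is a hypothetical helper.

```python
# Assumed Arabic -> English pairings; verify them against your data.
ARABIC_TO_ENGLISH = {
    "مستوى متوسط الأقدمية": "Mid-Senior level",
    "مستوى المبتدئين": "Entry level",
    "مساعد": "Associate",
    "غير مطبق": "Not Applicable",
    "دوام كامل": "Full-time",
}

def to_english(label):
    """Map a known Arabic label to English; leave anything else unchanged."""
    return ARABIC_TO_ENGLISH.get(label, label)

print(to_english("مساعد"))
print(to_english("Internship"))
```

Once the bare labels are extracted into a column, the same dictionary can be passed straight to pandas, e.g. df_africa["new_seniority_level"].replace(ARABIC_TO_ENGLISH), which answers the original question about using replace.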