I'm trying to split and remove unnecessary characters from a column using pandas

Time:12-05

I'm trying to remove all the unnecessary words and characters from the values in this column. I want the rows to contain only 'Entry level', 'Mid-Senior level', etc. Also, is there any way to translate the Arabic to English, or should I use the replace function?

df_africa.seniority_level.value_counts()

{'Seniority level': 'Entry level'}            1073
{'Seniority level': 'Mid-Senior level'}        695
{'Seniority level': 'Associate'}               481
{'Seniority level': 'Not Applicable'}          150
{'مستوى الأقدمية': 'مستوى متوسط الأقدمية'}     115
{'مستوى الأقدمية': 'مستوى المبتدئين'}           82
{'نوع التوظيف': 'دوام كامل'}                    73
{'مستوى الأقدمية': 'مساعد'}                     48
{'مستوى الأقدمية': 'غير مطبق'}                  42
{'Seniority level': 'Internship'}               39
{'Employment type': 'Contract'}                 21
{'Employment type': 'Full-time'}                 1



I've tried the split function, but I couldn't get it to work properly.

CodePudding user response:

IIUC, use this:

import ast

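# True for rows with no uppercase Latin letter, i.e. the Arabic rows
m = ~df_africa["seniority_level"].str.contains("[A-Z]")

# Parse each string into a dict
s = df_africa["seniority_level"].apply(ast.literal_eval)
# Take the Arabic key where m is True, the English key otherwise;
# rows with a different key (e.g. 'Employment type') come out as NaN
df_africa["new_col"] = s.str["مستوى الأقدمية"].where(m, s.str["Seniority level"])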

If you need to translate the extracted words, use a translation tool (the one linked in the original answer appeared only as an image).
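Since the column holds only a handful of distinct Arabic values, a fixed lookup table is one alternative to machine translation, and it also covers the rows whose key is not 'Seniority level'. A minimal sketch: the English equivalents in `AR_TO_EN` are my assumption about how the Arabic labels correspond to the English ones and should be checked, and `AR_TO_EN` and `extract_level` are hypothetical names.

```python
import ast

# Assumed English equivalents of the Arabic labels; verify before relying on them.
AR_TO_EN = {
    "مستوى متوسط الأقدمية": "Mid-Senior level",
    "مستوى المبتدئين": "Entry level",
    "مساعد": "Associate",
    "غير مطبق": "Not Applicable",
    "دوام كامل": "Full-time",
}

def extract_level(raw):
    """Parse a dict-like string and return its single value, mapped to English."""
    parsed = ast.literal_eval(raw)        # str -> dict
    value = next(iter(parsed.values()))   # the only value, whatever the key is
    return AR_TO_EN.get(value, value)     # unmapped values pass through unchanged

# With pandas this could be applied as:
# df_africa["new_col"] = df_africa["seniority_level"].map(extract_level)
print(extract_level("{'Seniority level': 'Entry level'}"))
print(extract_level("{'نوع التوظيف': 'دوام كامل'}"))
```

Taking the first value instead of looking up a specific key means the 'Employment type' rows are kept rather than turned into NaN.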

CodePudding user response:

Would be useful to know the type of the 'seniority_level' column, but I'm just gonna assume the column is made up of literal strings (e.g. "{'Seniority level': 'Entry level'}")
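One quick way to check is to count the Python types of the cells. A small sketch on made-up sample data (the two-row frame below is illustrative, not the asker's data):

```python
import pandas as pd

# Illustrative sample: one cell stored as a string, one as an actual dict.
df = pd.DataFrame({"seniority_level": [
    "{'Seniority level': 'Entry level'}",
    {"Seniority level": "Associate"},
]})

# Map each cell to the name of its Python type and count the occurrences.
type_counts = df["seniority_level"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

If every cell is a `str`, the values need parsing first; if they are already `dict` objects, the value can be read directly.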

You can translate all the text with the googletrans package. It piggybacks off Google Translate, so use it while it lasts. Make sure to install version 4.0.0rc1.

$ pip install googletrans==4.0.0rc1

translate:

from googletrans import Translator

translator = Translator()

def translate_to_english(words):
    # Only call the translator for strings containing a non-ASCII character
    if any(ord(character) > 127 for character in words):
        return translator.translate(words, dest="en").text
    return words

df_africa["new_seniority_level"] = df_africa["seniority_level"].map(translate_to_english)
print(df_africa)

                               seniority_level                                new_seniority_level
0           {'Seniority level': 'Entry level'}                 {'Seniority level': 'Entry level'}
1      {'Seniority level': 'Mid-Senior level'}            {'Seniority level': 'Mid-Senior level'}
2             {'Seniority level': 'Associate'}                   {'Seniority level': 'Associate'}
3        {'Seniority level': 'Not Applicable'}              {'Seniority level': 'Not Applicable'}
4                {'مستوى الأقدمية': 'مستوى متوسط الأقدمية'}  {'Seniority level': 'average level of seniority'}
5                    {'مستوى الأقدمية': 'مستوى المبتدئين'}            {'Seniority level': 'beginners' level'}
6                          {'نوع التوظيف': 'دوام كامل'}                  {'Recruitment type': 'full time'}
7                          {'مستوى الأقدمية': 'مساعد'}                   {'Seniority level': 'assistant'}
8                        {'مستوى الأقدمية': 'غير مطبق'}                  {'Senior level': 'unprecedented'}
9            {'Seniority level': 'Internship'}                  {'Seniority level': 'Internship'}
10             {'Employment type': 'Contract'}                    {'Employment type': 'Contract'}
11            {'Employment type': 'Full-time'}                   {'Employment type': 'Full-time'}

Then extract the text you want:

import re

# ".+" skips past the key; the group captures everything up to the last single quote
df_africa["new_seniority_level"] = df_africa["new_seniority_level"].map(lambda row: re.match(r".+: '(.*)'", row).group(1))
print(df_africa)
                               seniority_level         new_seniority_level
0           {'Seniority level': 'Entry level'}                 Entry level
1      {'Seniority level': 'Mid-Senior level'}            Mid-Senior level
2             {'Seniority level': 'Associate'}                   Associate
3        {'Seniority level': 'Not Applicable'}              Not Applicable
4                {'مستوى الأقدمية': 'مستوى متوسط الأقدمية'}  average level of seniority
5                    {'مستوى الأقدمية': 'مستوى المبتدئين'}            beginners' level
6                         {'نوع التوظيف': 'دوام كامل'}                   full time
7                         {'مستوى الأقدمية': 'مساعد'}                   assistant
8                       {'مستوى الأقدمية': 'غير مطبق'}               unprecedented
9            {'Seniority level': 'Internship'}                  Internship
10             {'Employment type': 'Contract'}                    Contract
11            {'Employment type': 'Full-time'}                   Full-time
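Calling `.group(1)` directly on the result of `re.match` raises `AttributeError` for any row the pattern does not match, because `re.match` returns `None` in that case. A minimal sketch of a guarded variant: `extract_quoted_value` is a hypothetical helper, and returning the input unchanged on a non-match is an assumption about the desired behaviour.

```python
import re

# Greedy ".+" skips past the key; the group captures up to the last single quote.
VALUE_PATTERN = re.compile(r".+: '(.*)'")

def extract_quoted_value(row):
    """Return the quoted value after the colon, or the input unchanged if nothing matches."""
    if not isinstance(row, str):   # e.g. NaN in a pandas column
        return row
    match = VALUE_PATTERN.match(row)
    return match.group(1) if match else row

# With pandas this could be applied as:
# df_africa["new_seniority_level"].map(extract_quoted_value)
print(extract_quoted_value("{'Seniority level': 'beginners' level'}"))
print(extract_quoted_value("no dict here"))
```

A regex is used here rather than `ast.literal_eval` because a translated string such as `{'Seniority level': 'beginners' level'}` contains an unescaped apostrophe and is no longer a valid Python literal.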

Look into the official Google Cloud Translation API if googletrans eventually breaks.
