If a DataFrame record is in English, move it to another column using Python

I had a list of Arabic and English elements and converted it into a DataFrame. The issue is that all the values are in one single column, and I want to move the records that contain English words to another column. This is what I have now:

COLUMN 1
هلا
السلام
WELCOMING
شي اخر

The output that I want is:

COLUMN 1	COLUMN 2
هلا	welcoming
السلام	others eng. words

I hope it's clear.

CodePudding user response：

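You could go through the DataFrame and use a regex to check whether each entry contains letters of the English alphabet:

import re

reg = re.compile(r'[a-zA-Z]')

def has_latin_letter(word):
    # search() looks for a Latin letter anywhere in the string;
    # match() would only test the first character
    return reg.search(word) is not None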

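or use isalpha():

def is_ascii_letters(word):
    # bytes.isalpha() is True only if every byte is an ASCII letter, so Arabic
    # text fails the test, but so do entries with spaces, digits or punctuation
    return word.encode().isalpha()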

Depending on that, you could create a new dataframe and populate the appropriate columns.
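As a minimal sketch of that last step: it assumes the data sits in a column named COLUMN 1 (as in the question) and that any entry containing at least one Latin letter counts as English. The sample rows and the name COLUMN 2 are only illustrative.

```python
import pandas as pd

# sample data from the question, plus one extra English entry (illustrative)
df = pd.DataFrame({'COLUMN 1': ['هلا', 'السلام', 'WELCOMING', 'شي اخر', 'others eng. words']})

# assumption: an entry counts as English if it contains at least one Latin letter
is_english = df['COLUMN 1'].str.contains(r'[a-zA-Z]', regex=True)

# split the entries and renumber each part from 0 so they line up side by side
arabic = df.loc[~is_english, 'COLUMN 1'].reset_index(drop=True)
english = df.loc[is_english, 'COLUMN 1'].reset_index(drop=True)

# concat aligns on the index; the shorter column is padded with NaN
result = pd.concat([arabic.rename('COLUMN 1'), english.rename('COLUMN 2')], axis=1)
print(result)
```

Resetting the index on both parts is what packs the values to the top of each column. Without it, every entry would stay on its original row.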

CodePudding user response：

You can check whether the first character of each entry is an ASCII character. If so, move the entry to the new column.

Disclaimer: this only works if one language contains no ASCII characters at all and the other language contains only ASCII characters.
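A rough sketch of this first-character check, assuming every entry is a non-empty string and the column is named COLUMN 1 as in the question (COLUMN 2 is a made-up name for the new column). It uses str.isascii(), which needs Python 3.7 or later.

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'COLUMN 1': ['هلا', 'السلام', 'WELCOMING', 'شي اخر']})

# True where the first character of the entry is ASCII
# (assumes no empty strings or missing values in the column)
starts_ascii = df['COLUMN 1'].str[0].map(str.isascii)

# copy the ASCII-initial entries to a new column and blank them out in the old one
df['COLUMN 2'] = df['COLUMN 1'].where(starts_ascii)
df['COLUMN 1'] = df['COLUMN 1'].mask(starts_ascii)
print(df)
```

This variant leaves each entry on its original row and fills the gaps with NaN, which may or may not be what you want.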

CodePudding user response：

You can use the langdetect library together with pandas like this (it works for any language that langdetect supports):

import pandas as pd
from langdetect import detect, DetectorFactory

# fix the seed so that langdetect gives reproducible results
DetectorFactory.seed = 0

# read data
df = pd.read_csv('data.csv')

# detect the language of each entry once
is_arabic = df['col_1'].apply(detect) == 'ar'

# split the entries and renumber each part from 0
df_ar = df[is_arabic].reset_index(drop=True)
df_other_lang = df[~is_arabic].reset_index(drop=True)

# place the two parts side by side
result = pd.concat([df_ar, df_other_lang], axis=1)

# print the result
print(result)

Output:

Before:

       col_1
0        هلا
1      hello
2     السلام
3  WELCOMING
4      other

After:

    col_1      col_1
0     هلا      hello
1  السلام  WELCOMING
2     NaN      other

You can then rename the columns afterwards.
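For the renaming step, here is a small sketch. The frame below is a hand-built stand-in for the result shown above, and the new names COLUMN 1 and COLUMN 2 are taken from the question.

```python
import pandas as pd

# stand-in for the result frame above: two columns that share the name col_1
result = pd.DataFrame([['هلا', 'hello'], ['السلام', 'WELCOMING'], [None, 'other']],
                      columns=['col_1', 'col_1'])

# rename() maps by label and cannot tell the duplicates apart,
# so assign the new names by position instead
result.columns = ['COLUMN 1', 'COLUMN 2']
print(result)
```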