I had a list of Arabic and English elements that I converted into a dataframe, but all the values ended up in one single column. I want to move the records that contain English words to another column. This is what I have now:
COLUMN 1 |
---|
هلا |
السلام |
WELCOMING |
شي اخر |
The output that I want is:
COLUMN 1 | COLUMN 2 |
---|---|
هلا | welcoming |
السلام | other English words |
I hope it's clear.
CodePudding user response:
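You could go through the dataframe and use a regex to check whether each word contains letters from the Latin alphabet:
import re
reg = re.compile(r'[a-zA-Z]')
# search() scans the whole string; match() would only test the first character
if reg.search(word):
    pass  # Matches English
else:
    pass  # Doesn't match English
or use isalpha() on the encoded bytes, which is true only when every byte is an ASCII letter (so entries with spaces or punctuation do not count):
if word.encode().isalpha():
    pass  # Matches English
else:
    pass  # Doesn't match English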
Depending on that, you could create a new dataframe and populate the appropriate columns.
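A minimal sketch of that last step, assuming pandas and treating any entry that contains at least one ASCII letter as English. The function name split_by_script and the output column names are illustrative and not taken from the question:

```python
import pandas as pd

def split_by_script(values):
    """Put entries with Latin letters in one column and the rest in another."""
    s = pd.Series(values)
    # True for entries that contain at least one ASCII letter
    is_english = s.str.contains(r"[A-Za-z]", regex=True)
    # Renumber each group from 0 so the two columns line up row by row
    arabic = s[~is_english].reset_index(drop=True)
    english = s[is_english].reset_index(drop=True)
    # Building from a dict of Series pads the shorter column with NaN
    return pd.DataFrame({"COLUMN 1": arabic, "COLUMN 2": english})

result = split_by_script(["هلا", "السلام", "WELCOMING", "شي اخر", "other"])
print(result)
```

If the two groups have different sizes, the shorter column ends with NaN values.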
CodePudding user response:
You can check for each entry whether the first character is an ASCII character. If so, move the entry to the new column.
Disclaimer: This only works if one language contains no ASCII characters at all and the other language contains only ASCII characters.
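A rough sketch of this first-character check, assuming pandas, Python 3.7+ (for str.isascii) and a column of plain strings. The helper names and the output column names are made up for illustration:

```python
import pandas as pd

def first_char_is_ascii(text):
    # Empty strings are treated as non-ASCII so they stay in the first column
    return bool(text) and text[0].isascii()

def split_on_first_char(df, column):
    """Split `column` into two columns based on each entry's first character."""
    mask = df[column].map(first_char_is_ascii).astype(bool)
    non_ascii = df.loc[~mask, column].reset_index(drop=True)
    ascii_only = df.loc[mask, column].reset_index(drop=True)
    return pd.DataFrame({"COLUMN 1": non_ascii, "COLUMN 2": ascii_only})

df = pd.DataFrame({"col_1": ["هلا", "hello", "السلام", "WELCOMING"]})
print(split_on_first_char(df, "col_1"))
```

Entries that start with a digit, a space or punctuation also count as ASCII here, which is one more reason the disclaimer above matters.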
CodePudding user response:
You can use the langdetect library together with pandas as shown below. This approach works for any language that langdetect supports, although detection can be unreliable on very short strings:
import pandas as pd
from langdetect import detect, DetectorFactory
# fix the seed so langdetect gives reproducible results
DetectorFactory.seed = 0
# read data
df = pd.read_csv('data.csv')
# detect the language of each entry once
is_arabic = df['col_1'].apply(detect) == 'ar'
# split into Arabic and non-Arabic rows, renumbering each group from 0
df_ar = df[is_arabic].reset_index(drop=True)
df_other_lang = df[~is_arabic].reset_index(drop=True)
# place the two groups side by side
result = pd.concat([df_ar, df_other_lang], axis=1)
# testing ..
print(result)
Output:
Before:
col_1
0 هلا
1 hello
2 السلام
3 WELCOMING
4 other
After:
col_1 col_1
0 هلا hello
1 السلام WELCOMING
2 NaN other
You can then rename the columns afterwards.
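Since both columns end up with the same name, one way to rename them is to assign the names by position. The small frame below is a stand-in for the concatenated result, and the new names "arabic" and "english" are only examples:

```python
import pandas as pd

# Stand-in for the concatenated result: both columns are named "col_1"
result = pd.DataFrame(
    [["هلا", "hello"], ["السلام", "WELCOMING"]],
    columns=["col_1", "col_1"],
)

# Assign the full list of names by position; rename(columns={"col_1": ...})
# would give both columns the same new name again
result.columns = ["arabic", "english"]
print(result)
```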