How to remove English-like but non-English words in Python?

Time:05-15

I have a *.csv file with 2 columns and 4 rows of data. I want to delete the rows that contain English-like but non-English words (Hinglish, e.g. "kya haal hai"). The sample data is given further down.

As a first step towards the problem above, I want to solve it for the list below.

a = [ "This is not good so mai yah row hatana chahta hun.", "Nice!, kya haal pyare friend"]

Output should be:

This is not good so row.
Nice! friend

Note: the data below is provided for copy-and-paste purposes only.

A                                                     B
This is not good so mai yah row hatana chahta hun.    ok
Nice!, kya haal pyare friend                          thik hu
Please help Me                                        Definitely
Google is a comPaNY                                   yes it is

CodePudding user response:

You will need an English dictionary library here. The enchant Python library (PyEnchant) is one option.

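import re
import enchant

d = enchant.Dict("en_US")

def all_english(s):
    # Strip common punctuation from each word, then look it up in the dictionary.
    # Empty leftovers (e.g. a bare "!") are skipped, since d.check("") raises an error.
    words = [re.sub(r'[!@#$?:;,.]', '', w.lower()) for w in s.split()]
    return all(d.check(w) for w in words if w)

# Keep only the rows whose column "A" consists entirely of English words
df = df[df["A"].map(all_english)]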
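
The function above drops whole rows. For the list in the question, where only the non-English words should be removed from each string, a word-level variant could look like the sketch below. It is only an illustration under a few assumptions: `keep_english_words` and `SAMPLE_ENGLISH` are hypothetical names, and the small word set merely stands in for a real dictionary so that the example runs without enchant installed. With PyEnchant available, `d.check` could be passed as `is_english` instead, though the result then depends on what that dictionary accepts.

```python
import re

# Hypothetical stand-in for a real dictionary such as enchant.Dict("en_US").
# With PyEnchant installed, pass d.check instead of the set's membership test.
SAMPLE_ENGLISH = {"this", "is", "not", "good", "so", "row", "nice", "friend"}

def keep_english_words(text, is_english):
    """Return text with every word rejected by is_english removed."""
    kept = []
    for token in text.split():
        # Look up only the letters, so "Nice!," is checked as "nice"
        core = re.sub(r"[^A-Za-z']", "", token).lower()
        if core and is_english(core):
            kept.append(token)  # keep the token with its original punctuation
    return " ".join(kept)

a = ["This is not good so mai yah row hatana chahta hun.",
     "Nice!, kya haal pyare friend"]

cleaned = [keep_english_words(s, SAMPLE_ENGLISH.__contains__) for s in a]
print(cleaned)
```

Note that punctuation stays attached to the word it came with, so a full stop that followed a removed word is dropped along with that word. Getting exactly the output shown in the question would need a separate step to tidy up punctuation.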

CodePudding user response:

I got the correct output, thanks to Tim Biegeleisen and tripleee.

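import re

import enchant
import pandas as pd

df = pd.read_csv(r'C:\Users\Mini-PC\Desktop\data.csv')

d = enchant.Dict("en_US")

def all_english(s):
    # Strip common punctuation from each word, then look it up in the dictionary.
    # Empty leftovers (e.g. a bare "!") are skipped, since d.check("") raises an error.
    words = [re.sub(r'[!@#$?:;,.]', '', w.lower()) for w in s.split()]
    return all(d.check(w) for w in words if w)

# Keep only the rows whose column "A" consists entirely of English words
df = df[df["A"].map(all_english)]
print(df)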

Output:

                     A           B
       Please help Me  Definitely
  Google is a comPaNY   yes it is