Remove rows with non-English items in JSON files using Python

Time:12-07

I have a JSON file (data.json) that contains rows of data in JSON format, one object per line. I want to loop through each row and remove the rows whose name field contains Thai text, using Python. How can I do this? Thanks.

input:

{"name":"John", "age":30, "car":"audi"}
{"name":"สมศักดิ์", "age":25, "car":"mercedes"}
{"name":"อาทิตย์", "age":49, "car":"bently"}
{"name":"Mark", "age":20, "car":null}
...

output:

{"name":"John", "age":30, "car":"audi"}
{"name":"Mark", "age":20, "car":null}
...

CodePudding user response:

I would use the built-in unicodedata module for this. Suppose you have a file file.txt with the following content:

{"name":"John","age":30,"car":"audi"}
{"name":"สมศักดิ์","age":25,"car":"mercedes"}
{"name":"อาทิตย์","age":49,"car":"bently"}
{"name":"Mark","age":20,"car":null}

then

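import json
import unicodedata

# Print every line whose "name" does not start with a Thai character.
with open("file.txt", encoding="utf-8") as f:
    for line in f:
        name = json.loads(line)["name"]
        # The default "" avoids a ValueError for characters without a Unicode name.
        if "THAI" not in unicodedata.name(name[0], ""):
            print(line, end="")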

output

{"name":"John","age":30,"car":"audi"}
{"name":"Mark","age":20,"car":null}

Disclaimer: I assume every line is valid JSON and has a non-empty name. Explanation: I iterate over the lines, parse each one with json.loads, and take its name. Then I use unicodedata to look up the Unicode name of the first character and print the line only if that name does not contain THAI. Since each line already ends with a newline, I pass end="" (an empty string) to print.
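
The snippet above inspects only the first character of the name, so a name that begins with a Latin letter but contains Thai later on would slip through. A minimal sketch of a stricter variant follows, assuming every non-blank line is a JSON object; the helper names contains_thai and filter_lines are made up for this illustration and are not part of any library.

```python
import json
import unicodedata


def contains_thai(text):
    """Return True if any character's Unicode name starts with 'THAI '."""
    return any(unicodedata.name(ch, "").startswith("THAI ") for ch in text)


def filter_lines(lines):
    """Yield only the lines whose "name" value contains no Thai characters."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # A missing "name" key is treated as an empty (non-Thai) name.
        if not contains_thai(str(record.get("name", ""))):
            yield line


# Small in-memory demonstration using rows from the example file.
sample = [
    '{"name":"John","age":30,"car":"audi"}\n',
    '{"name":"สมศักดิ์","age":25,"car":"mercedes"}\n',
    '{"name":"Mark","age":20,"car":null}\n',
]
kept = list(filter_lines(sample))
print("".join(kept), end="")
```

Because filter_lines accepts any iterable of lines, an open file object can be passed in directly, and the result can be written to a second file with writelines instead of being printed.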

CodePudding user response:

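I have not used this with Thai strings myself, but you could try the string method isascii() (Python 3.7+), which returns True only if every character in the string is ASCII. Note that isalpha() on its own is not enough: it returns True for letters of any alphabet, not just a-z, so a Thai name made up only of letters would pass, and it returns False for names that contain a space or a hyphen.

if row["name"].isascii():
    ...  # English (ASCII-only) name: keep the row
else:
    ...  # non-ASCII name, e.g. Thai: drop the row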

Note: Used this way, the check filters out everything that is not English; it is not specific to Thai. I am not sure whether that is a problem for you.
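
To make the difference in the note concrete, here is a rough sketch that compares a Thai-specific test with a broad ASCII-only test. It assumes that "Thai" means any character in the Thai Unicode block (U+0E00 to U+0E7F); the function names has_thai and is_ascii_only are invented for this example.

```python
import re

# The Thai Unicode block spans U+0E00 to U+0E7F.
THAI_CHAR = re.compile(r"[\u0E00-\u0E7F]")


def has_thai(name):
    """True if the name contains at least one character from the Thai block."""
    return THAI_CHAR.search(name) is not None


def is_ascii_only(name):
    """True if every character is ASCII (broader: also rejects e.g. 'José')."""
    return name.isascii()


# Print both verdicts side by side for a few sample names.
for name in ["John", "José", "สมศักดิ์"]:
    print(name, has_thai(name), is_ascii_only(name))
```

A name such as "José" is kept by the Thai-specific test but rejected by the ASCII-only one, so which of the two fits depends on whether other non-English names should survive the filter.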
