I have a JSON file (data.json) that contains rows of data in JSON format, one object per line. I want to loop through each row and remove the rows whose name field contains Thai text, using Python. How can I do this? Thanks.
input:
{"name":"John", "age":30, "car":audi}
{"name":"สมศักดิ์", "age":25, "car":mercedes}
{"name":"อาทิตย์", "age":49, "car":bently}
{"name":"Mark", "age":20, "car":null}
...
output:
{"name":"John", "age":30, "car":audi}
{"name":"Mark", "age":20, "car":null}
...
CodePudding user response:
I would use the built-in unicodedata module for this. Let's say that you have file.txt with the following content:
{"name":"John","age":30,"car":"audi"}
{"name":"สมศักดิ์","age":25,"car":"mercedes"}
{"name":"อาทิตย์","age":49,"car":"bently"}
{"name":"Mark","age":20,"car":null}
then
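import json
import unicodedata

with open("file.txt", encoding="utf-8") as f:
    for line in f:
        name = json.loads(line)["name"]
        # The Unicode name of a Thai character contains the word "THAI"
        if "THAI" not in unicodedata.name(name[0]):
            print(line, end="")  # line already ends with a newline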
output
{"name":"John","age":30,"car":"audi"}
{"name":"Mark","age":20,"car":null}
Disclaimer: I assume every line is valid JSON and contains a non-empty name. Explanation: I iterate over the lines; for each line I parse it with json.loads and get name, then use unicodedata to look up the Unicode name of its first character. If that name does not contain THAI, I print the line. Since the lines already end with a newline, I pass end="" (empty string) to print.
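The approach above only inspects the first character of the name. A slightly more defensive sketch, which is a variation and not the code from this answer, checks every character against the Unicode Thai block (U+0E00 to U+0E7F) and tolerates blank lines or a missing name. The helper names contains_thai and drop_thai_rows are made up for illustration, and the sample rows stand in for the lines of the file.

```python
import json


def contains_thai(text):
    """Return True if any character lies in the Unicode Thai block (U+0E00-U+0E7F)."""
    return any("\u0e00" <= ch <= "\u0e7f" for ch in text)


def drop_thai_rows(lines):
    """Yield only the lines whose "name" value has no Thai characters.

    Assumption: every non-blank line is a valid JSON object.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        # A missing name is treated as an empty string, so the row is kept
        if not contains_thai(str(row.get("name", ""))):
            yield line


# Sample rows standing in for the contents of file.txt
sample = [
    '{"name":"John","age":30,"car":"audi"}\n',
    '{"name":"สมศักดิ์","age":25,"car":"mercedes"}\n',
    '{"name":"Mark","age":20,"car":null}\n',
]
kept = list(drop_thai_rows(sample))
print("".join(kept), end="")
```

To filter a real file, the same generator can be fed an open file object and its results written to a second file with writelines.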
CodePudding user response:
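I have not used this with Thai strings myself, but you could try the string method isalpha(). It returns True if all the characters are letters. Note that in Python 3 this means any Unicode letter, not only a-z, so pairing it with isascii() (Python 3.7+) restricts the check to English letters.
if row["name"].isascii() and row["name"].isalpha():
    pass  # English
else:
    pass  # Thai (or any other non-English name)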
Note: Used this way, the check filters out everything that is not English; it is not specific to Thai. Not sure if that's a problem for you.
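To make the behaviour of isalpha() concrete, here is a small sketch. The helper name is_ascii_letters is hypothetical, and str.isascii() assumes Python 3.7 or newer. It shows that Thai consonants count as letters for isalpha(), so an ASCII check is needed to single out English names.

```python
def is_ascii_letters(name):
    """Return True only if name is non-empty and made of ASCII letters (a-z, A-Z)."""
    # str.isascii() requires Python 3.7+
    return name.isascii() and name.isalpha()


# Thai consonants are Unicode letters, so isalpha() alone accepts them
print("สม".isalpha())            # True
print(is_ascii_letters("สม"))    # False
print(is_ascii_letters("John"))  # True
# Spaces are not letters, so multi-word names are rejected as well
print(is_ascii_letters("John Smith"))  # False
```

Because a space or hyphen also fails the check, real-world names may need to be split or stripped of punctuation before testing.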