How to determine if a particular string or text is in US-english or UK-english using python?

Time:05-26

I want to achieve something like this:

input_text = "The body is burnt"

output = "en-uk"

input_text = "The body is burned" 

output = "en-us"

CodePudding user response:

Try TextBlob. It requires the NLTK package and uses Google Translate:

from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()

As a side note, this uses the Google Translate API, so it requires an internet connection.

CodePudding user response:

Similar to this answer, you could use the word lists from the American-British English Translator project.

import re
import requests
url = "https://raw.githubusercontent.com/hyperreality/American-British-English-Translator/master/data/"
# The two dictionaries differ slightly so we import both
uk_to_us = requests.get(url + "british_spellings.json").json()
us_to_uk = requests.get(url + "american_spellings.json").json()
us_only = requests.get(url + "american_only.json").json()
uk_only = requests.get(url + "british_only.json").json()

# Save these word lists in a local text file if you want to avoid requesting the data every time
uk_words = set(uk_to_us) | set(uk_only)
us_words = set(us_to_uk) | set(us_only)

def get_dialect(s):
    words = re.findall(r"([a-z]+)", s.lower())  # list of lowercase words only
    uk = sum(word in uk_words for word in words)
    us = sum(word in us_words for word in words)
    print("Scores", uk, us)  # You might want to return these scores instead of the final verdict
    if uk > us:
        return "en-uk"
    if us > uk:
        return "en-us"
    return "Unknown"

print(get_dialect("The color of the ax")) # en-us
print(get_dialect("The colour of the axe"))  # en-uk
print(get_dialect("The body is burned")) # Unknown
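
The comment in the code above suggests saving the word lists locally so they are not requested on every run. A minimal sketch of that idea follows. The helper names load_word_list and fetch_json, the fetch parameter, and the dialect_cache directory are all hypothetical, not part of the translator project:

```python
import json
from pathlib import Path

BASE_URL = "https://raw.githubusercontent.com/hyperreality/American-British-English-Translator/master/data/"

def fetch_json(name):
    """Download one JSON file from the repository (needs network access)."""
    import requests  # imported lazily so cached runs do not need it
    response = requests.get(BASE_URL + name, timeout=10)
    response.raise_for_status()
    return response.json()

def load_word_list(name, cache_dir="dialect_cache", fetch=fetch_json):
    """Return the parsed JSON for `name`, downloading only on a cache miss."""
    path = Path(cache_dir) / name  # hypothetical cache location
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch(name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

With this in place, a call such as load_word_list("british_spellings.json") could stand in for the corresponding requests.get(...).json() line, and only the first run would touch the network.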

Note that get_dialect only tests individual words and cannot detect differences in how words are used in grammatical context (e.g. a word that is used only as an adjective in one dialect but can also be a past-tense verb in the other).

A slight improvement would be to match the two- and three-word phrases listed in the american_only and british_only lists, which are currently ignored. Those lists also do not contain the different forms of each word (e.g. "abseil" is there but not "abseiled", "abseiling" etc.), so you would ideally convert your text to stems first.
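
Both improvements can be sketched roughly as follows. This is only an illustration under stated assumptions: uk_terms and us_terms are tiny made-up stand-ins for the downloaded lists, and crude_stems is a naive suffix stripper rather than a real stemmer, so it will miss forms with doubled consonants or a dropped "e":

```python
import re

# Hypothetical stand-ins for the downloaded lists; the real ones are far larger.
uk_terms = {"colour", "abseil", "car park", "zebra crossing"}
us_terms = {"color", "rappel", "parking lot", "crosswalk"}

SUFFIXES = ("ing", "ed", "s")

def crude_stems(word):
    """Yield the word itself plus versions with a common suffix removed."""
    yield word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            yield word[: -len(suffix)]

def count_hits(words, terms, max_len=3):
    """Count single words (after crude stemming) and 2-3 word phrases in terms."""
    hits = 0
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if n == 1:
                hits += any(stem in terms for stem in crude_stems(gram[0]))
            else:
                hits += " ".join(gram) in terms
    return hits

def get_dialect_scores(s):
    """Return (uk_score, us_score) for the text."""
    words = re.findall(r"[a-z]+", s.lower())
    return count_hits(words, uk_terms), count_hits(words, us_terms)

print(get_dialect_scores("We abseiled down next to the car park"))  # (2, 0)
```

For real use, a proper stemmer (for example one from NLTK) would probably be more reliable than suffix stripping, but the word lists would then need to be stemmed the same way so that the two sides still match.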
