I have a lot of txt files containing product descriptions written in English or other languages. I'm only interested in the txt files containing English, and I do not know all the languages the files may be written in. The non-English descriptions may also contain English characters, such as a URL. Is there an algorithm to calculate the ratio of English characters (excluding punctuation), so that a file with more than 50% non-English characters can be discarded?
CodePudding user response:
There is a Wikipedia help page that gives hints on analyzing languages based on the character set used. One would still need to convert that to code, but it is a hint. There's probably no fool-proof way of detecting English. There are other languages that do not use diacritics, or a particular text may simply not contain any. Just from the character set, Dutch and Afrikaans are very close to English, for instance.
So you can use the table to estimate the probability that a piece of text is English, but getting it 100% right is hard.
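If the crude ratio described in the question is enough, a minimal Python sketch could look like the one below. The 50% threshold, the choice to treat only ASCII letters as "English", and the file name description.txt are assumptions for illustration, not a definitive language detector.

```python
import string

def looks_english(text: str, threshold: float = 0.5) -> bool:
    # Consider only alphabetic characters; punctuation, digits and
    # whitespace are ignored, as the question asks.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False  # nothing to judge the file by
    ascii_letters = sum(ch in string.ascii_letters for ch in letters)
    return ascii_letters / len(letters) >= threshold

# Hypothetical usage: keep or discard a single file.
with open("description.txt", encoding="utf-8") as f:
    keep = looks_english(f.read())
```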
CodePudding user response:
We had a requirement to detect the language used in a given string.
We chose to use the first 5 characters of the string for determining the language:
Parse the first 5 characters and store the highest Unicode value.
Check this highest Unicode value against the known Unicode ranges of the languages.
Whichever language's range contains that value is taken as the detected language.
Obviously this solution isn't 100% accurate, but it served our purposes.
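In code, the idea looks roughly like this. This is a sketch in Python; the range table is an illustrative assumption and would need to cover the scripts you actually expect to see.

```python
# Assumed, simplified Unicode block boundaries for illustration only.
RANGES = [
    (0x0000, 0x024F, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x4E00, 0x9FFF, "CJK"),
]

def guess_script(text: str, sample_size: int = 5) -> str:
    if not text:
        return "Unknown"
    # Highest code point among the first few characters, as described above.
    highest = max(ord(ch) for ch in text[:sample_size])
    for low, high, name in RANGES:
        if low <= highest <= high:
            return name
    return "Unknown"
```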
For your case, choose some number of characters to sample when deciding whether the language is English.
Find the highest Unicode value among these characters.
Check this highest value against the Unicode ranges of English (Latin).
If it is within the ranges, you may conclude that the language is English.
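Applied to your case, a hedged sketch might look like the following; the sample length and the choice of the Basic Latin block are assumptions, and note that a single accented or non-Latin character in the sample is enough to fail the check.

```python
def probably_english(text: str, sample_size: int = 100) -> bool:
    # Assume English if the highest code point in the sample stays within
    # Basic Latin (U+0000 to U+007F).
    sample = text[:sample_size]
    return bool(sample) and max(ord(ch) for ch in sample) <= 0x7F
```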