I am working on a Python project that reads a CSV file to extract reviews, then groups the reviews by host id in order to summarize them. I managed to open the file, read it, and group the rows by host id.
However, I ran into a problem I could not solve: the reviews are written in many different languages, such as English, French, Spanish, Korean, Chinese, and others that I do not recognise. My attempt to tackle this was to translate all reviews into English, so I used a library named "translators".
Also, my best bet for reading the file was to use an encoding that would accept as many languages as possible. The best result I got was with the Latin1 encoding, but it cannot read Chinese or Korean, which come back as a lot of strange symbols. For example, a Chinese review comes back as a bunch of hexadecimal values plus some gibberish, which throws an error when I try to translate it into English because it is not recognized as a string.
Bottom line: I have two problems:
- I need some way to read all the languages, so that I can translate the reviews into English.
- I would like to improve the code's performance. For that, it would be preferable not to attempt to translate English reviews, as doing so wastes both time and processing power. Since this is a pilot project, the latter is more of a bonus than a real need, but the former has been giving me headaches for some time now.
What I have so far:
import pandas as pd
from collections import defaultdict, namedtuple
import csv
import translators as ts
import translators.server as tss

group_sentences = {}
with open(file_name, 'r', encoding='latin1') as myfile:
    reader = csv.reader(myfile, delimiter=';')
    for n, row in enumerate(reader):
        if not n:
            continue
        listing_id, id, date, reviewer_id, reviewer_name, comments = row
        if listing_id not in group_sentences:
            group_sentences[listing_id] = list()
        if len(comments) > 10:
            group_sentences[listing_id].append(tss.google(comments))
CodePudding user response:
For the first problem, try using the chardet library to automatically detect the encoding of the file:
import chardet

with open(file_name, 'rb') as f:
    data = f.read()
result = chardet.detect(data)
encoding = result['encoding']
print(f"The text in the file is encoded using {encoding}.")
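The detected encoding still has to be passed to `open()` when the CSV is read. A minimal sketch of that step follows, assuming the same `;` delimiter and six-column layout as in the question. The function names `detect_encoding` and `group_comments`, the sample size, and the UTF-8 fallback are illustrative assumptions, not part of the original code.

```python
import csv
from collections import defaultdict


def detect_encoding(path, sample_size=100_000):
    """Guess the file encoding from a leading sample of bytes using chardet."""
    import chardet  # third-party: pip install chardet

    with open(path, "rb") as f:
        sample = f.read(sample_size)
    guess = chardet.detect(sample)["encoding"]
    # chardet may return None when it cannot decide; UTF-8 is an assumed fallback.
    return guess or "utf-8"


def group_comments(path, encoding):
    """Group review comments by listing_id, skipping the header row."""
    groups = defaultdict(list)
    with open(path, "r", encoding=encoding, newline="") as f:
        reader = csv.reader(f, delimiter=";")
        next(reader, None)  # skip the header row
        for row in reader:
            listing_id, _id, _date, _reviewer_id, _reviewer_name, comments = row
            if len(comments) > 10:
                groups[listing_id].append(comments)
    return groups
```

A call would then look like `group_comments(file_name, detect_encoding(file_name))`. Detection is a statistical guess, so it is worth printing the result and checking a few non-English rows by eye before translating anything.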
For the second problem, you can use a language detection library, such as langdetect, to detect the language of each review before attempting to translate it. If the review is already in English, you can skip the translation.
from langdetect import detect

# Assume the review is stored in a variable called `review`
review_language = detect(review)
if review_language != 'en':
    review = tss.google(review)
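One thing to watch for: language detection can fail on very short or symbol-only text (langdetect raises an exception in that case), which would stop the whole loop. A minimal sketch of a wrapper that tolerates this is shown here. The helper name `translate_if_needed` is hypothetical, and the detector and translator are passed in as arguments, so that `detect` and `tss.google` can be swapped or stubbed out.

```python
def translate_if_needed(review, detect_language, translate, target="en"):
    """Return the review unchanged when it is already in the target language.

    detect_language: callable taking text and returning a language code.
    translate: callable taking text and returning the translated text.
    """
    try:
        language = detect_language(review)
    except Exception:
        # Detection can fail on very short or symbol-only text. This sketch
        # catches broadly and leaves such reviews untranslated rather than
        # aborting the run; a narrower except clause may be preferable.
        return review
    if language == target:
        return review  # already in the target language, skip the translation call
    return translate(review)
```

With langdetect and translators installed, the call inside the question's loop might look like `translate_if_needed(comments, detect, tss.google)`. Keeping the two callables as parameters also makes it easy to try a different translation backend later.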