I have a dataset of paired sentences (an English sentence and its German translation). Given an English word WORD1, I want to know which German words appear, and how many times, in the German sentences whose paired English sentence contains WORD1. I need this for every English word (I have a list of them). To do this, I created a dictionary whose keys are all the English words and whose values are dictionaries that start out empty; I want to update each one with the German words that co-occur with that English word.
The code I wrote is:
print('Building English dictionary...')
d = dict.fromkeys(words_english, {})
print('Start iterating over samples...')
for sentence_german, sentence_english in zip(data_german, data_english):
    sentence_german = sentence_german.split()
    sentence_english = sentence_english.split()
    for word in sentence_english:
        update_dict(word, sentence_german, d)
Where the function update_dict is:
def update_dict(w, sentence, d):
    if sentence:
        for paired_word in sentence:
            if paired_word in d[w].keys():
                d[w][paired_word] = d[w][paired_word] + 1
            else:
                d[w][paired_word] = 1
However, this code does not work as I expected. Every time update_dict is called, all the values of d are updated at once. For example, if I run this code (the same as before, but with prints and with breaks so that it stops after the first word):
print('Building English dictionary...')
d = dict.fromkeys(words_english, {})
print('Start iterating over samples...')
for sentence_german, sentence_english in zip(data_german, data_english):
    sentence_german = sentence_german.split()
    sentence_english = sentence_english.split()
    for word in sentence_english:
        print('WORD', word)
        print(sentence_german)
        update_dict(word, sentence_german, d)
        break
    break
The output is:
WORD we
['wir', 'glauben', 'nicht', ',', 'daß', 'wir', 'nur', 'ro', 's', 'inen', 'her', 'au', 'sp', 'ick', 'en', 'sollten', '.']
But dictionary d now has this:
{'wir': 2, 'glauben': 1, 'nicht': 1, ',': 1, 'daß': 1, 'nur': 1, 'ro': 1, 's': 1, 'inen': 1, 'her': 1, 'au': 1, 'sp': 1, 'ick': 1, 'en': 1, 'sollten': 1, '.': 1}
in all of its values. Why were the entries for words other than "we" updated?
I also tried to use a different implementation of update_dict (the one below) but the same problem keeps happening.
def update_dict(w, sentence, d):
    if sentence:
        for paired_word in sentence:
            if paired_word in d[w].keys():
                d[w].update({paired_word: (d[w][paired_word] + 1)})
            else:
                d[w].update({paired_word: 1})
Why are all the keys (English words) updated if the value of w is just "we"?
CodePudding user response:
Because
d = dict.fromkeys(words_english, {})
uses the same {} object as the value for every key. (fromkeys is generally more useful when the value is not mutable.)
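To see the sharing directly, here is a minimal sketch. The three-word vocabulary is made up for illustration and is not the asker's data; the point is only that every key ends up holding a reference to one and the same dict.

```python
# Hypothetical three-word vocabulary, just for illustration.
words_english = ["we", "believe", "not"]

shared = dict.fromkeys(words_english, {})

# Every key points at the very same dict object.
same_object = shared["we"] is shared["believe"]
print(same_object)  # True

# Mutating the value through one key is therefore visible through all keys.
shared["we"]["wir"] = 1
print(shared)  # {'we': {'wir': 1}, 'believe': {'wir': 1}, 'not': {'wir': 1}}
```

The `{}` is evaluated once, before `fromkeys` is called, so there is only ever one inner dict to mutate.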
Use a dictionary comprehension, à la
d = {word: {} for word in words_english}
instead to have a separate dict for each key.
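As a quick check, again with a made-up vocabulary, the comprehension evaluates `{}` once per key, so an update through one key leaves the others alone:

```python
# Hypothetical three-word vocabulary, just for illustration.
words_english = ["we", "believe", "not"]

separate = {word: {} for word in words_english}

# Each key now has its own dict object.
print(separate["we"] is separate["believe"])  # False

separate["we"]["wir"] = 1
print(separate)  # {'we': {'wir': 1}, 'believe': {}, 'not': {}}
```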
Even better, since you're counting occurrences, use our good friends defaultdict and Counter from the collections module, like
import collections
d = collections.defaultdict(collections.Counter)
and you can just do
d[w][paired_word] += 1
without any if in/update shenanigans, and you don't even need to know the words_english in advance.
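Putting it together, the whole loop might look roughly like the sketch below. The two sentence pairs are invented stand-ins for data_english and data_german, and using Counter.update with a list of words is one possible way to do the increments in bulk, not the only one.

```python
import collections

# Hypothetical two-pair corpus standing in for data_english / data_german.
data_english = ["we believe", "we go"]
data_german = ["wir glauben", "wir gehen"]

# Missing English words get a fresh, empty Counter on first access.
d = collections.defaultdict(collections.Counter)

for sentence_german, sentence_english in zip(data_german, data_english):
    german_words = sentence_german.split()
    for word in sentence_english.split():
        # Counter.update with an iterable adds 1 for each occurrence,
        # equivalent to doing d[word][g] += 1 for every g in german_words.
        d[word].update(german_words)

print(d["we"].most_common(1))  # [('wir', 2)]
```

As in the original loop, an English word that occurs twice in one sentence counts that sentence's German words twice; iterate over set(sentence_english.split()) instead if that is not what you want.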