Dictionary of dictionaries updates all keys simultaneously

I have a dataset of paired sentences (a sentence in English and its German translation). Suppose WORD1 is an English word. I want to find out which German words appear, and how many times, in the German sentences whose paired English sentence contains WORD1. I need to do this for every English word (I have a list of them). To do so, I created a dictionary whose keys are all the English words and whose values are dictionaries. Each inner dictionary starts out empty, and I want to update it with the German words that co-occur with that English word.

The code I wrote is:

print('Building English dictionary...')
d = dict.fromkeys(words_english, {})

print('Start iterating over samples...')
for sentence_german, sentence_english in zip(data_german, data_english):
    sentence_german = sentence_german.split()
    sentence_english = sentence_english.split()
    for word in sentence_english:
        update_dict(word, sentence_german, d)

Where the function update_dict is:

def update_dict(w, sentence, d):
    if sentence:
        for paired_word in sentence:
            if paired_word in d[w].keys():
                d[w][paired_word] = d[w][paired_word] + 1
            else:
                d[w][paired_word] = 1

However, this code is not working as I expected. Every time that the function update_dict is called, all the values of d are updated simultaneously. For example, if I execute this code (same one as before but with prints and breaking the loops for simplification):

print('Building English dictionary...')
d = dict.fromkeys(words_english, {})

print('Start iterating over samples...')
for sentence_german, sentence_english in zip(data_german, data_english):
    sentence_german = sentence_german.split()
    sentence_english = sentence_english.split()
    for word in sentence_english:
        print('WORD', word)
        print(sentence_german)
        update_dict(word, sentence_german, d)
        break
    break

The output is:

WORD we
['wir', 'glauben', 'nicht', ',', 'daß', 'wir', 'nur', 'ro', 's', 'inen', 'her', 'au', 'sp', 'ick', 'en', 'sollten', '.']

But dictionary d now has this:

{'wir': 2, 'glauben': 1, 'nicht': 1, ',': 1, 'daß': 1, 'nur': 1, 'ro': 1, 's': 1, 'inen': 1, 'her': 1, 'au': 1, 'sp': 1, 'ick': 1, 'en': 1, 'sollten': 1, '.': 1}

in all of its values. Why were the entries for words other than "we" updated?

I also tried to use a different implementation of update_dict (the one below) but the same problem keeps happening.

def update_dict(w, sentence, d):
    if sentence:
        for paired_word in sentence:
            if paired_word in d[w].keys():
                d[w].update({paired_word: d[w][paired_word] + 1})
            else:
                d[w].update({paired_word:1})

Why are all the keys (English words) updated if the value of w is just "we"?

CodePudding user response:

Because

d = dict.fromkeys(words_english, {})

uses the same {} for all keys. (fromkeys is generally more useful when the value is not mutable.)

Use a dictionary comprehension, à la

d = {word: {} for word in words_english}

instead to have a separate dict for each key.
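
To see the difference directly, here is a minimal sketch using a made-up three-word list (the names words, shared and separate are only for illustration and are not from the question):

```python
# Hypothetical word list, for illustration only.
words = ['we', 'believe', 'not']

# fromkeys evaluates {} once and stores that single dict under every key.
shared = dict.fromkeys(words, {})
shared['we']['wir'] = 1

# Every key refers to the same inner dict object, so the write "leaks".
same_object = shared['we'] is shared['believe']
leaked = shared['believe']

# A comprehension evaluates {} once per key, so each key gets its own dict.
separate = {word: {} for word in words}
separate['we']['wir'] = 1
untouched = separate['believe']

print(same_object)  # True
print(leaked)       # {'wir': 1}
print(untouched)    # {}
```

The is check confirms that nothing is being copied: with fromkeys there is only one inner dictionary, reachable through every key.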

Even better, since you're counting occurrences, use our good friends defaultdict and Counter from the collections module, like

import collections

d = collections.defaultdict(collections.Counter)

and you can just do

d[w][paired_word] += 1

without any if in/update shenanigans, and you don't need to even know the words_english in advance.
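
Put together, the whole loop could look roughly like this. It is only a sketch: the two-sentence data_english and data_german lists are toy stand-ins for your real data, and Counter.update is used in place of the explicit += 1 loop.

```python
import collections

# Hypothetical toy data standing in for the real data_english / data_german.
data_english = ['we believe', 'we go']
data_german = ['wir glauben', 'wir gehen']

# A missing English word gets a fresh Counter; a missing German word counts as 0.
d = collections.defaultdict(collections.Counter)

for sentence_german, sentence_english in zip(data_german, data_english):
    german_words = sentence_german.split()
    for word in sentence_english.split():
        # Counter.update with an iterable adds 1 for each occurrence.
        d[word].update(german_words)

print(d['we'].most_common(1))  # [('wir', 2)]
```

As in the original code, an English word that occurs twice in one sentence has that sentence's German words counted twice.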