Dictionary comprehension Bigram-CodePudding

I have a python assignment to extract bigrams from a string into a dictionary and I think I have found the solution online but cant remember where I found it. But it seems to work but I am having trouble understanding it as I am new to python. Can anyone explain the code below which takes a string and extracts chars into tuples and counts instances and puts it into a dictionary

'''

s = 'Mississippi' # Your string
# Dictionary comprehension
dic_ = {k : s.count(k) for k in{s[i] s[i 1] for i in range(len(s)-1)}}

'''

CodePudding user response：

First let's understand comprehensions:

A list, dict, set, etc. can be made with a comprehension. Basically a comprehension is taking a generator and using it to form a new variable. A generator is just an object that returns a different value each iteration so to use list as an example: to make a list with a list comprehension we take the values that the generator outputs and put them into their own spot in a list. Take this generator for example:

x for x in range(0, 10)

This will just give 0 on the first iteration, then 1, then 2, etc. so to make this a list we would use [] (list brakets) like so:

[x for x in range(0, 10)]

This would give:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] #note: range does not include the second input

for a dictionary and for a set we use {}, but since dictionaries uses key-value pairs our generator will be different for sets and dictionaries. For a set it is the same as a list:

{x for x in range(0, 10)} #gives the set --> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

but for a dictionary we need a key and a value. Since enumerate gives two items this could be useful for dictionaries in some cases:

{key: value for key, value in enumerate([1,2,3])}

In this case the keys are the indexes and the values are the items in the list. So this gives:

{0: 1, 1: 2, 2: 3} #dictionary

It doesn't make a set because we denote x : y which is the format for items in a dictionary, not a set.

Now, let's break this down:

This part of the code:

{s[i] s[i 1] for i in range(len(s)-1)}

is making a set of values that is every pair of touching letters, s[i] is one letter, s[i 1] is the letter after, so it is saying get this pair (s[i] s[i 1]) and do it for every item in the string (for i in range(len(s)-1) Notice there is a -1 since the last letter does not have a touching letter after it (so we don't want to run it for the last letter).

Now that we have a set let's save it to a variable so it's easier to see:

setOfPairs = {s[i] s[i 1] for i in range(len(s)-1)}

Then our original comprehension would change to:

{k : s.count(k) for k in setOfPairs}

This is saying we want to make a dictionary that has keys of k and values of s.count(k) since we get every k from our pairs list: for k in setOfPairs the keys of the dictionary are, then, the pairs. Since s.count(k) returns the number of times k is in s, the values of the dictionary are the number of times the key appears in s.

CodePudding user response：

Let's take this apart one step at a time:

s[i] is the code to select the i-th letter in the string s.
s[i] s[i 1] concatenates the letter at position i and the letter at position i 1.
s[i] s[i 1] for i in range(len(s)-1) iterates each index i (except the last one) and so computes all the bigrams.
Since the expression in 3 is surrounded by curly brackets, the result is a set, meaning that all duplicate bigrams are removed.
for k in {s[i] s[i 1] for i in range(len(s)-1)} therefore iterates over all unique bigrams in the given string s.
Lastly, {k : s.count(k) for k in{s[i] s[i 1] for i in range(len(s)-1)}} maps each each bigram k to the amount of times it appears in s, because the str.count function returns the number of times a substring appears in a string.

I hope that helps. If you want to know more about list/set/dict comprehensions in Python, the relevant entry in the Python documentation is here: https://docs.python.org/3/tutorial/datastructures.html?highlight=comprehension#list-comprehensions

CodePudding user response：

dic_ = {k : s.count(k) for k in{s[i] s[i 1] for i in range(len(s)-1)}}

Read backwards

    dic_ = {k : s.count(k) 
      ## Step 3  with each of the pair of characters  
count how many are in the string 
store the 2 char string as the key and the count as the value for the dictionary.
    
    
    for k in{s[i] s[i 1]  
     # Step 2  From each of the position take 2 characters out of the string
    
    
    
    for i in range(len(s)-1)}} 
     # Step 1 loop over all but the last character of the string.

The code may be inefficient for long strings with many repetitions. Step 2 takes every pair so the count and store will be repeated count times. Refactoring so you can test if the key already exists and not repeating the count may speed it up. bench mark time on ... say a billion base pair DNA sequence.