How to find and count common letters between words in pandas

I have a dataset with some words in it and I want to compare 2 columns and count common letters between them.

For example, I have:

import pandas as pd

data = {'Col_1': ['Heaven', 'Jako', 'Sm', 'apizza'],
        'Col_2': ['Heaven', 'Jakob', 'Smart', 'pizza']}
df = pd.DataFrame(data)

| Col_1  | Col_2  |
-------------------
| Heaven | Heaven |
| Jako   | Jakob  |
| Sm     | Smart  |
| apizza | pizza  |

And I want to get something like this:

| Col_1  | Col_2  | Match                          | Count |
------------------------------------------------------------
| Heaven | Heaven | ['H', 'e', 'a', 'v', 'e', 'n'] | 6     |
| Jako   | Jakob  | ['J', 'a', 'k', 'o']           | 4     |
| Sm     | Smart  | ['S', 'm']                     | 2     |
| apizza | pizza  | []                             | 0     |

CodePudding user response：

You can use a list comprehension with help of itertools.takewhile:

from itertools import takewhile
df['Match'] = [[x for x,y in takewhile(lambda x: x[0]==x[1], zip(a,b))]
               for a,b in zip(df['Col_1'], df['Col_2'])]
df['Count'] = df['Match'].str.len()

output:

    Col_1   Col_2               Match  Count
0  Heaven  Heaven  [H, e, a, v, e, n]      6
1    Jako   Jakob        [J, a, k, o]      4
2      Sm   Smart              [S, m]      2
3  apizza   pizza                  []      0

NB. the logic was not fully clear, so this version stops as soon as there is a mismatch.
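
To see why the "pizza" row comes out empty, here is a minimal sketch of the same takewhile logic on plain strings, outside pandas. The helper name common_prefix is hypothetical and only used for illustration:

```python
from itertools import takewhile

def common_prefix(a, b):
    """Return the leading characters that a and b share, position by position."""
    # zip pairs the characters by index; takewhile stops at the first unequal pair
    return [x for x, y in takewhile(lambda pair: pair[0] == pair[1], zip(a, b))]

print(common_prefix("Jako", "Jakob"))    # ['J', 'a', 'k', 'o']
print(common_prefix("apizza", "pizza"))  # [] because 'a' != 'p' at index 0
```

Since zip stops at the shorter string, the extra 'b' in "Jakob" is never compared.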

If you want to continue after a mismatch (which doesn't seem to fit the "pizza" example):

df['Match'] = [[x for x,y in zip(a,b) if x==y]
               for a,b in zip(df['Col_1'], df['Col_2'])]
df['Count'] = df['Match'].str.len()

output:

    Col_1   Col_2               Match  Count
0  Heaven  Heaven  [H, e, a, v, e, n]      6
1    Jako   Jakob        [J, a, k, o]      4
2      Sm   Smart              [S, m]      2
3  apizza   pizza                 [z]      1
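
The lone 'z' in the last row can be traced by hand. This is a small sketch of the same position-by-position comparison on plain strings; the helper name positional_matches is hypothetical:

```python
def positional_matches(a, b):
    """Keep every character of a that equals the character of b at the same index."""
    # no early stop here: all index pairs are checked up to the shorter length
    return [x for x, y in zip(a, b) if x == y]

# pairs for "apizza"/"pizza": (a,p) (p,i) (i,z) (z,z) (z,a) -> only index 3 matches
print(positional_matches("apizza", "pizza"))  # ['z']
```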

CodePudding user response：

I took a different route and used SequenceMatcher from the underrated difflib module, which is aimed at comparisons between sequences.

from difflib import SequenceMatcher

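def common(a, b):
    s = SequenceMatcher(None, a, b)
    # explicit bounds keep this working on Python versions before 3.9
    m = s.find_longest_match(0, len(a), 0, len(b))
    # the match must start at the same index in both strings
    if m.a != m.b:
        return []
    # slice from the start of the match to its end (start + size)
    return list(a[m.a:m.a + m.size])

df["Match"] = df.apply(lambda x: common(x["Col_1"], x["Col_2"]), axis=1)
df["Count"] = df["Match"].apply(len)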

Results in:

0    [H, e, a, v, e, n]
1          [J, a, k, o]
2                [S, m]
3                    []
dtype: object

It takes the two strings and finds the longest matching block between them. The a attribute of the result is the index at which the match starts in the left-hand string, and the b attribute is the equivalent for the right-hand string. The size attribute tells us how long the match is.

If a and b are different, the strings start matching at different places, which goes against the rule here, so we return an empty list. Otherwise we keep the left-hand string from that index until the match ends.
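
As a rough illustration of those three attributes, the sketch below prints them for two of the example pairs. The helper name longest_match is hypothetical, and the bounds are passed explicitly because the no-argument form of find_longest_match is only available from Python 3.9:

```python
from difflib import SequenceMatcher

def longest_match(a, b):
    """Return (start in a, start in b, length) of the longest common block."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return m.a, m.b, m.size

print(longest_match("Jako", "Jakob"))    # (0, 0, 4): both start at index 0
print(longest_match("apizza", "pizza"))  # (1, 0, 5): different start indices
```

For "apizza" and "pizza" the block "pizza" is found, but it starts at index 1 on the left and index 0 on the right, which is why that row ends up with an empty match.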

The final df looks like this:

    Col_1   Col_2               Match  Count
0  Heaven  Heaven  [H, e, a, v, e, n]      6
1    Jako   Jakob        [J, a, k, o]      4
2      Sm   Smart              [S, m]      2
3  apizza   pizza                  []      0