I have a dataset with some words in it, and I want to compare two columns and count the common letters between them.
For example, I have:
import pandas as pd
data = {'Col_1': ['Heaven', 'Jako', 'Sm', 'apizza'],
        'Col_2': ['Heaven', 'Jakob', 'Smart', 'pizza']}
df = pd.DataFrame(data)
| Col_1 | Col_2 |
-------------------
| Heaven | Heaven |
| Jako | Jakob |
| Sm | Smart |
| apizza | pizza |
And I want to get something like this:
| Col_1 | Col_2 | Match | Count |
------------------------------------------------------------
| Heaven | Heaven | ['H', 'e', 'a', 'v', 'e', 'n'] | 6 |
| Jako | Jakob | ['J', 'a', 'k', 'o'] | 4 |
| Sm | Smart | ['S', 'm'] | 2 |
| apizza | pizza | [] | 0 |
CodePudding user response:
You can use a list comprehension with the help of itertools.takewhile:
from itertools import takewhile
df['Match'] = [[x for x, y in takewhile(lambda pair: pair[0] == pair[1], zip(a, b))]
               for a, b in zip(df['Col_1'], df['Col_2'])]
df['Count'] = df['Match'].str.len()
output:
Col_1 Col_2 Match Count
0 Heaven Heaven [H, e, a, v, e, n] 6
1 Jako Jakob [J, a, k, o] 4
2 Sm Smart [S, m] 2
3 apizza pizza [] 0
NB. The logic was not fully clear, so here the match stops as soon as there is a mismatch.
If you want to continue after a mismatch (which doesn't seem to fit the "pizza" example):
df['Match'] = [[x for x, y in zip(a, b) if x == y]
               for a, b in zip(df['Col_1'], df['Col_2'])]
df['Count'] = df['Match'].str.len()
output:
Col_1 Col_2 Match Count
0 Heaven Heaven [H, e, a, v, e, n] 6
1 Jako Jakob [J, a, k, o] 4
2 Sm Smart [S, m] 2
3 apizza pizza [z] 1
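To see the difference between the two behaviours without a DataFrame, here is a minimal sketch on plain strings. The helper names prefix_match and positional_match are made up for this illustration and are not part of the answer's code; only takewhile and zip are the tools actually used above.

```python
from itertools import takewhile

def prefix_match(a, b):
    # Stop at the first position where the characters differ
    return [x for x, y in takewhile(lambda pair: pair[0] == pair[1], zip(a, b))]

def positional_match(a, b):
    # Keep every position where the characters agree, even after a mismatch
    return [x for x, y in zip(a, b) if x == y]

pairs = [('Heaven', 'Heaven'), ('Jako', 'Jakob'), ('Sm', 'Smart'), ('apizza', 'pizza')]
for a, b in pairs:
    print(a, b, prefix_match(a, b), positional_match(a, b))
```

Only the last pair differs between the two: 'apizza' and 'pizza' disagree at the first position, so the prefix version returns an empty list, while the positional version still picks up the 'z' that lines up at index 3.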
CodePudding user response:
I took a different route and used SequenceMatcher from the underrated difflib module, which is designed for comparing sequences.
from difflib import SequenceMatcher
def common(a, b):
    s = SequenceMatcher(None, a, b)
    # Look up the longest match once; explicit bounds also work before Python 3.9
    m = s.find_longest_match(0, len(a), 0, len(b))
    # The match must start at the same index in both strings
    if m.a != m.b:
        return []
    return list(a[m.a:m.a + m.size])
df["Match"] = df.apply(lambda x: common(x["Col_1"], x["Col_2"]), axis=1)
df["Count"] = df["Match"].apply(len)
Results in:
0 [H, e, a, v, e, n]
1 [J, a, k, o]
2 [S, m]
3 []
dtype: object
What it does is take the two strings and find the longest match between them. The a attribute is the index at which the match starts in the left-hand string, and the b attribute is the equivalent for the right-hand string. There is also a size attribute that tells us how long the match is.
If a and b are different, the strings start matching at different positions, which is against the rule here. Otherwise we take the left-hand string from the start of the match and keep going until the match ends.
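A quick way to check what the match object holds is to print it for two of the sample pairs. This is only a sketch: longest_match is a hypothetical wrapper added for illustration, and it passes explicit bounds because the no-argument form of find_longest_match is only available from Python 3.9.

```python
from difflib import SequenceMatcher

def longest_match(a, b):
    # Hypothetical helper: returns the Match(a, b, size) named tuple
    return SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))

m = longest_match('Jako', 'Jakob')
print(m.a, m.b, m.size)  # 0 0 4

m = longest_match('apizza', 'pizza')
print(m.a, m.b, m.size)  # 1 0 5
```

For 'apizza' and 'pizza' the match starts at index 1 on the left and index 0 on the right, which is why that row ends up with an empty list.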
The final df looks like this:
Col_1 Col_2 Match Count
0 Heaven Heaven [H, e, a, v, e, n] 6
1 Jako Jakob [J, a, k, o] 4
2 Sm Smart [S, m] 2
3 apizza pizza [] 0