I have a pandas dataframe full of strings (300k) similar to this.
index | original | modified |
---|---|---|
0 | This is the original sentence | This is the changed sentence |
1 | This is a different sentence | This is the same sentence |
and I want to diff the strings. Ideally I'd create a third column with the changes, such as:
index | original | modified | change |
---|---|---|---|
0 | This is the original sentence | This is the changed sentence | original -> changed |
1 | This is a different sentence | This is the same sentence | a different -> the same |
But even just being able to output the differences would already be great.
I tried df.apply-ing difflib.ndiff(), but it outputs
<generator object Differ.compare at 0x7ff8121a...
Answer:
I'm not sure whether difflib has a shrink-wrapped way to do what you want, but it definitely has some ingredients to work with.
Here's an example of what you can do with SequenceMatcher from difflib:
import difflib
import pandas as pd

records = [
    {'original':'This is the original sentence', 'modified':'This is the changed sentence'},
    {'original':'This is a different sentence', 'modified':'This is the same sentence'}
]
df = pd.DataFrame(records)
print(df)

def getDiff(o, m):
    # Build one line per opcode describing how a slice of o maps to a slice of m
    diffStr = ''
    sm = difflib.SequenceMatcher(None, o, m)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        diffStr += '\n' if diffStr else ''
        diffStr += f'{tag:7} o[{i1}:{i2}] --> m[{j1}:{j2}] {o[i1:i2]!r:>6} --> {m[j1:j2]!r}'
    return f"original: {o}\nmodified: {m}\n" + diffStr

out = df.apply(lambda x: getDiff(x['original'], x['modified']), axis=1)
for x in out:
    print(x)
Output:
original modified
0 This is the original sentence This is the changed sentence
1 This is a different sentence This is the same sentence
original: This is the original sentence
modified: This is the changed sentence
equal o[0:12] --> m[0:12] 'This is the ' --> 'This is the '
replace o[12:15] --> m[12:16] 'ori' --> 'chan'
equal o[15:16] --> m[16:17] 'g' --> 'g'
replace o[16:20] --> m[17:19] 'inal' --> 'ed'
equal o[20:29] --> m[19:28] ' sentence' --> ' sentence'
original: This is a different sentence
modified: This is the same sentence
equal o[0:8] --> m[0:8] 'This is ' --> 'This is '
insert o[8:8] --> m[8:13] '' --> 'the s'
equal o[8:9] --> m[13:14] 'a' --> 'a'
replace o[9:14] --> m[14:15] ' diff' --> 'm'
equal o[14:15] --> m[15:16] 'e' --> 'e'
delete o[15:19] --> m[16:16] 'rent' --> ''
equal o[19:28] --> m[16:25] ' sentence' --> ' sentence'
The replace, insert and delete opcodes could help you do what you're asking. However, note that "original" and "changed" get compared at a character-by-character level (so that the letter "g" in both words is detected as being unchanged), so it may take some additional work to get to the exact sample output in your question.
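One way to sidestep the character-level matching is to hand SequenceMatcher lists of words instead of raw strings, since it accepts any sequences of hashable items. The following is only a rough sketch: the helper name word_diff is made up for illustration, and it assumes that splitting on whitespace is an acceptable definition of a "word" (punctuation stays attached to its neighbouring word).

```python
import difflib

def word_diff(original, modified):
    """Return one 'old -> new' string per changed run of words (hypothetical helper)."""
    # Assumption: whitespace-separated tokens are good enough as "words".
    o_words, m_words = original.split(), modified.split()
    sm = difflib.SequenceMatcher(None, o_words, m_words)
    changes = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            old = ' '.join(o_words[i1:i2])
            new = ' '.join(m_words[j1:j2])
            changes.append(f'{old} -> {new}')
    return changes

print(word_diff('This is the original sentence', 'This is the changed sentence'))
print(word_diff('This is a different sentence', 'This is the same sentence'))
```

Because the comparison now happens on whole tokens, the shared "g" in "original" and "changed" no longer counts as a match.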
UPDATED:
I've thought about this a bit more (since it's certainly an appealing capability to build onto difflib) and have come up with a strategy that uses get_opcodes() from SequenceMatcher to give the exact "changed" output specified in the example in the question. I don't know that it would give satisfying results for every possible example, but the same can be said of many problems and solutions:
import difflib
import pandas as pd

records = [
    {'original':'This is the original sentence', 'modified':'This is the changed sentence'},
    {'original':'This is a different sentence', 'modified':'This is the same sentence'}
]
df = pd.DataFrame(records)
print(df)

def getDiff(o, m):
    sm = difflib.SequenceMatcher(None, o, m)
    oStart, mStart, oEnd, mEnd = None, None, None, None
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            # Remember where the first change starts ...
            if oStart is None:
                oStart, mStart = i1, j1
            # ... and keep extending the span to the end of the latest change
            oEnd, mEnd = i2, j2
    diffStr = '<no change>' if oStart is None else o[oStart:oEnd] + ' -> ' + m[mStart:mEnd]
    return diffStr

df['changed'] = df.apply(lambda x: getDiff(x['original'], x['modified']), axis=1)
print(df)
Output:
original modified
0 This is the original sentence This is the changed sentence
1 This is a different sentence This is the same sentence
original modified changed
0 This is the original sentence This is the changed sentence original -> changed
1 This is a different sentence This is the same sentence a different -> the same
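To make the caveat about "every possible example" concrete: because this strategy reports a single span running from the first change to the last one, two unrelated edits in the same string get merged together with the unchanged text between them. Here is a small sketch of that behaviour. The function first_to_last_diff just restates the getDiff logic above under a different name, and the input strings are invented for illustration.

```python
import difflib

def first_to_last_diff(o, m):
    # Same idea as getDiff above: one span from the first to the last non-equal opcode.
    sm = difflib.SequenceMatcher(None, o, m)
    oStart = mStart = oEnd = mEnd = None
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            if oStart is None:
                oStart, mStart = i1, j1
            oEnd, mEnd = i2, j2
    if oStart is None:
        return '<no change>'
    return o[oStart:oEnd] + ' -> ' + m[mStart:mEnd]

# Two separate edits ("red" -> "blue" and "big" -> "small") in one string:
result = first_to_last_diff('red cat, big dog', 'blue cat, small dog')
print(result)
```

The unchanged " cat, " in the middle ends up inside the reported change, which may or may not be what you want for longer sentences.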
UPDATE #3:
OK, now for a solution that treats punctuation, white space and string boundaries as word delimiters and decides whether or not to merge opcodes with tag == 'equal' into adjacent diffs depending on whether they are standalone (i.e., bordered by "string boundaries").
I have added a more complex example to illustrate what it does. I have also wrapped all result substrings in single quotes for clarity.
import difflib
import string
import pandas as pd

records = [
    {'original':'This, my good friend, is a very small piece of cake', 'modified':'That, my friend, is a very, very large piece of work'},
    {'original':'This is the original sentence', 'modified':'This is the changed sentence'},
    {'original':'This is a different sentence', 'modified':'This is the same sentence'}
]
df = pd.DataFrame(records)
print(df.to_string(index=False))

def isStandalone(x, i1, i2):
    # True if x[i1:i2], ignoring delimiters at its edges, is bordered by
    # punctuation, white space or the start/end of the string
    puncAndWs = string.punctuation + string.whitespace
    while i1 < i2 and x[i1] in puncAndWs:
        i1 += 1
    while i1 < i2 and x[i2 - 1] in puncAndWs:
        i2 -= 1
    return (i1 == 0 or x[i1 - 1] in puncAndWs) and (i2 == len(x) or x[i2] in puncAndWs)

def makeDiff(o, m, oStart, oEnd, mStart, mEnd):
    oChange = "'" + o[oStart:oEnd] + "'"
    mChange = "'" + m[mStart:mEnd] + "'"
    return '<no change>' if oStart is None else oChange + ' -> ' + mChange

def getDiff(o, m):
    sm = difflib.SequenceMatcher(None, o, m)
    diffList = []
    oStart, mStart, oEnd, mEnd = None, None, None, None
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        bothStandalone = isStandalone(o, i1, i2) and isStandalone(m, j1, j2)
        if bothStandalone:
            if oStart is not None:
                if tag == 'equal':
                    # A standalone equal block closes the diff in progress
                    diffList.append(makeDiff(o, m, oStart, oEnd, mStart, mEnd))
                    oStart, mStart, oEnd, mEnd = None, None, None, None
                else:
                    oEnd, mEnd = i2, j2
            elif tag != 'equal':
                oStart, mStart = i1, j1
                oEnd, mEnd = i2, j2
        elif oStart is not None:
            # Not standalone: merge this opcode into the diff in progress
            oEnd, mEnd = i2, j2
        else:
            # Not standalone and no diff in progress: start a new one
            oStart, mStart = i1, j1
            oEnd, mEnd = i2, j2
    if oStart is not None:
        diffList.append(makeDiff(o, m, oStart, oEnd, mStart, mEnd))
    diffStr = ', '.join(diffList)
    return diffStr

df['changed'] = df.apply(lambda x: getDiff(x['original'], x['modified']), axis=1)
#print(df.to_string(index=False))
df.drop(['original', 'modified'], axis=1, inplace=True)
print(df.to_string(index=False))
Output:
original modified
This, my good friend, is a very small piece of cake That, my friend, is a very, very large piece of work
This is the original sentence This is the changed sentence
This is a different sentence This is the same sentence
changed
'This' -> 'That', ' good' -> '', ' small' -> ', very large', 'cake' -> 'work'
'original' -> 'changed'
'a different' -> 'the same'
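As for the <generator object Differ.compare ...> output mentioned in the question: difflib.ndiff() returns a lazy generator, so it has to be consumed (for example with list() or ''.join()) before it shows anything useful. A minimal sketch of that, assuming whitespace-split words as the units to compare. The helper name ndiff_words is made up for illustration.

```python
import difflib

def ndiff_words(original, modified):
    """Collect the words ndiff marks as removed ('- ') and added ('+ ')."""
    # ndiff() is lazy: wrapping it in list() forces it to produce its lines.
    lines = list(difflib.ndiff(original.split(), modified.split()))
    # Lines starting with '  ' are unchanged and '? ' lines are hints; skip both.
    removed = [line[2:] for line in lines if line.startswith('- ')]
    added = [line[2:] for line in lines if line.startswith('+ ')]
    return removed, added

removed, added = ndiff_words('This is the original sentence',
                             'This is the changed sentence')
print(' '.join(removed), '->', ' '.join(added))
```

This only lumps all removed and all added words together, so it is closer to the simpler "just output the differences" goal than to the per-change output above.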