diff strings in dataframe

Time:03-21

I have a pandas dataframe full of strings (300k) similar to this.

index  original                       modified
0      This is the original sentence  This is the changed sentence
1      This is a different sentence   This is the same sentence

and I want to diff the strings. Ideally I'd create a third column with the changes, such as:

index  original                       modified                      change
0      This is the original sentence  This is the changed sentence  original -> changed
1      This is a different sentence   This is the same sentence     a different -> the same

But even just being able to output the differences would already be great.

I tried using df.apply with difflib.ndiff(), but it outputs

<generator object Differ.compare at 0x7ff8121a...

CodePudding user response:

I'm not sure whether difflib has a shrink-wrapped way to do what you want, but it definitely has some ingredients to work with.
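As for the generator shown in the question: difflib.ndiff() is lazy, so it returns a generator rather than the diff lines themselves, and it has to be consumed (for example with list()) before anything useful shows up. Here is a minimal sketch of that; ndiff_words is a name I made up for illustration, and splitting on whitespace is my assumption about the granularity you want:

```python
import difflib

def ndiff_words(original, modified):
    """Return the word-level ndiff lines as a list instead of a generator."""
    # ndiff compares sequences of strings; splitting on whitespace makes
    # each word one element, so the diff is word by word.
    return list(difflib.ndiff(original.split(), modified.split()))

lines = ndiff_words('This is the original sentence',
                    'This is the changed sentence')
for line in lines:
    # Lines starting with '- ' exist only in the original,
    # lines starting with '+ ' exist only in the modified string.
    print(line)
```

Inside df.apply, the same list(...) call would give you a list per row instead of a generator object.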

Here's an example of what you can do with SequenceMatcher from difflib (docs):

records = [
    {'original':'This is the original sentence', 'modified':'This is the changed sentence'},
    {'original':'This is a different sentence', 'modified':'This is the same sentence'}
]

import pandas as pd
df = pd.DataFrame(records)
print(df)

import difflib
def getDiff(o, m):
    diffStr = ''
    sm = difflib.SequenceMatcher(None, o, m)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        # Separate opcode lines with a newline (none before the first line).
        diffStr += '\n' if diffStr else ''
        diffStr += f'{tag:7} o[{i1}:{i2}] --> m[{j1}:{j2}] {o[i1:i2]!r:>6} --> {m[j1:j2]!r}'
    return f"original: {o}\nmodified: {m}\n" + diffStr


out = df.apply(lambda x: getDiff(x['original'], x['modified']), axis=1)
for x in out:
    print(x)

Output:

                        original                      modified
0  This is the original sentence  This is the changed sentence
1   This is a different sentence     This is the same sentence
original: This is the original sentence
modified: This is the changed sentence
equal   o[0:12] --> m[0:12] 'This is the ' --> 'This is the '
replace o[12:15] --> m[12:16]  'ori' --> 'chan'
equal   o[15:16] --> m[16:17]    'g' --> 'g'
replace o[16:20] --> m[17:19] 'inal' --> 'ed'
equal   o[20:29] --> m[19:28] ' sentence' --> ' sentence'
original: This is a different sentence
modified: This is the same sentence
equal   o[0:8] --> m[0:8] 'This is ' --> 'This is '
insert  o[8:8] --> m[8:13]     '' --> 'the s'
equal   o[8:9] --> m[13:14]    'a' --> 'a'
replace o[9:14] --> m[14:15] ' diff' --> 'm'
equal   o[14:15] --> m[15:16]    'e' --> 'e'
delete  o[15:19] --> m[16:16] 'rent' --> ''
equal   o[19:28] --> m[16:25] ' sentence' --> ' sentence'

The replace, insert and delete opcodes could help you do what you're asking. However, note that "original" and "changed" get compared at a character-by-character level (so that the letter "g" in both words is detected as being unchanged), so it may take some additional work to get to the exact sample output in your question.
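One way around the character-level matching is to hand SequenceMatcher lists of words instead of raw strings, since it works on any sequence of hashable items. The following is only a sketch under that assumption: word_diff is a hypothetical helper name, and splitting on whitespace means punctuation stays attached to the neighbouring word.

```python
import difflib

def word_diff(original, modified):
    """Describe the changes between two strings at word granularity."""
    o_words, m_words = original.split(), modified.split()
    # Comparing lists of words means each opcode covers whole words.
    sm = difflib.SequenceMatcher(None, o_words, m_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            changes.append(' '.join(o_words[i1:i2]) + ' -> ' + ' '.join(m_words[j1:j2]))
    return ', '.join(changes) if changes else '<no change>'

print(word_diff('This is the original sentence', 'This is the changed sentence'))
print(word_diff('This is a different sentence', 'This is the same sentence'))
```

For the two sample rows this yields "original -> changed" and "a different -> the same". Pure insertions or deletions come out with an empty string on one side of the arrow.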

UPDATED: I've thought about this a bit more (since it's certainly an appealing capability to build onto difflib) and have come up with a strategy that uses get_opcodes() from SequenceMatcher to give the exact "changed" output specified in the example in the question. It reports everything from the start of the first non-equal opcode to the end of the last one as a single change. I don't know that it would give satisfying results for every possible example, but the same can be said of many problems and solutions:

records = [
    {'original':'This is the original sentence', 'modified':'This is the changed sentence'},
    {'original':'This is a different sentence', 'modified':'This is the same sentence'}
]

import pandas as pd
df = pd.DataFrame(records)
print(df)

import difflib
def getDiff(o, m):
    sm = difflib.SequenceMatcher(None, o, m)
    oStart, mStart, oEnd, mEnd = None, None, None, None
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            if oStart is None:
                oStart, mStart = i1, j1
            oEnd, mEnd = i2, j2
    diffStr = '<no change>' if oStart is None else o[oStart:oEnd] + ' -> ' + m[mStart:mEnd]
    return diffStr

df['changed'] = df.apply(lambda x: getDiff(x['original'], x['modified']), axis=1)
print(df)

Output:

                        original                      modified
0  This is the original sentence  This is the changed sentence
1   This is a different sentence     This is the same sentence
                        original                      modified                  changed
0  This is the original sentence  This is the changed sentence      original -> changed
1   This is a different sentence     This is the same sentence  a different -> the same
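Since the question mentions about 300k rows: df.apply with axis=1 builds a Series for every row, so a plain list comprehension over the two columns may well be faster, though I have not timed it on your data. The sketch below uses plain lists as stand-ins for df['original'] and df['modified'] so that it runs without pandas; first_to_last_diff is a hypothetical name for the same first-to-last-opcode logic as getDiff above.

```python
import difflib

def first_to_last_diff(o, m):
    """Span from the first to the last non-equal opcode, as 'old -> new'."""
    sm = difflib.SequenceMatcher(None, o, m)
    o_start = m_start = o_end = m_end = None
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            if o_start is None:
                o_start, m_start = i1, j1
            o_end, m_end = i2, j2
    if o_start is None:
        return '<no change>'
    return o[o_start:o_end] + ' -> ' + m[m_start:m_end]

# Stand-ins for the two DataFrame columns (the sample rows from the question).
originals = ['This is the original sentence', 'This is a different sentence']
modifieds = ['This is the changed sentence', 'This is the same sentence']

changes = [first_to_last_diff(o, m) for o, m in zip(originals, modifieds)]
print(changes)

# With a DataFrame, the equivalent assignment would be:
# df['changed'] = [first_to_last_diff(o, m)
#                  for o, m in zip(df['original'], df['modified'])]
```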

UPDATE #3: OK, now for a solution that treats punctuation, white space and string boundaries as word delimiters and decides whether or not to merge opcodes with tag == 'equal' into adjacent diffs depending on whether they are standalone (i.e., bordered by "string boundaries").

I have added a more complex example to illustrate what it does. I have also wrapped all result substrings in single quotes for clarity.

records = [
    {'original':'This, my good friend, is a very small piece of cake', 'modified':'That, my friend, is a very, very large piece of work'},
    {'original':'This is the original sentence', 'modified':'This is the changed sentence'},
    {'original':'This is a different sentence', 'modified':'This is the same sentence'}
]

import pandas as pd
df = pd.DataFrame(records)
print(df.to_string(index=False))

import difflib
import string
def isStandalone(x, i1, i2):
    puncAndWs = string.punctuation + string.whitespace
    # Trim leading and trailing delimiters before checking the boundaries.
    while i1 < i2 and x[i1] in puncAndWs:
        i1 += 1
    while i1 < i2 and x[i2 - 1] in puncAndWs:
        i2 -= 1
    return (i1 == 0 or x[i1 - 1] in puncAndWs) and (i2 == len(x) or x[i2] in puncAndWs)
def makeDiff(o, m, oStart, oEnd, mStart, mEnd):
    oChange = "'" + o[oStart:oEnd] + "'"
    mChange = "'" + m[mStart:mEnd] + "'"
    return '<no change>' if oStart is None else oChange + ' -> ' + mChange
def getDiff(o, m):
    sm = difflib.SequenceMatcher(None, o, m)
    diffList = []
    oStart, mStart, oEnd, mEnd = None, None, None, None
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        bothStandalone = isStandalone(o, i1, i2) and isStandalone(m, j1, j2)
        if bothStandalone:
            if oStart is not None:
                if tag == 'equal':
                    diffList.append(makeDiff(o, m, oStart, oEnd, mStart, mEnd))
                    oStart, mStart, oEnd, mEnd = None, None, None, None
                else:
                    oEnd, mEnd = i2, j2
            elif tag != 'equal':
                oStart, mStart = i1, j1
                oEnd, mEnd = i2, j2
        elif oStart is not None:
            oEnd, mEnd = i2, j2
        else:
            oStart, mStart = i1, j1
            oEnd, mEnd = i2, j2
    if oStart is not None:
        diffList.append(makeDiff(o, m, oStart, oEnd, mStart, mEnd))
    diffStr = ', '.join(diffList)
    return diffStr

df['changed'] = df.apply(lambda x: getDiff(x['original'], x['modified']), axis=1)
#print(df.to_string(index=False))

df.drop(['original', 'modified'], axis=1, inplace=True)
print(df.to_string(index=False))

Output:

                                           original                                             modified
This, my good friend, is a very small piece of cake That, my friend, is a very, very large piece of work
                      This is the original sentence                         This is the changed sentence
                       This is a different sentence                            This is the same sentence
                                                                      changed
'This' -> 'That', ' good' -> '', ' small' -> ', very large', 'cake' -> 'work'
                                                      'original' -> 'changed'
                                                  'a different' -> 'the same'