Home > Blockchain >  Compare two strings of data for overlap
Compare two strings of data for overlap

Time:01-17

I am trying to compare a set of data from Report 1 (in column A) with a set of data from Report 2 (in column B). The reports contain data such as name, DOB, SSN, address, etc. Both reports have instances when there will be multiple data points within a single cell, such as multiple possible dates of birth, separated by a pipe |. I am trying to find all instances of overlap where data from Report 1 is also on Report 2.

I've set up a VBA script to do all of the formatting, adding columns, dropping in a comparison formula, sorting, etc but the comparison formula I've been using fails if there are any differences at all (such as multiple DOB options in Report 1 but only one DOB in Report 2). I know that there has to be a decent way to do this (some kind of string match formula), but I'm striking out and the manual way is slow and goofy. Any help would be greatly appreciated!

Doing this manually, I can add a 'helper' column to find the number of pipes in each line of the report: =LEN(A2)-LEN(SUBSTITUTE(A2,"|","")). Find the max number of pipes in that helper column, add that many columns, and use the Text to Columns button to split the info from Report 1 into multiple columns with a single piece of data in each. Repeat for Report 2. Then I can use a Countif to compare any of the cells from Report 1 with those from Report 2 in yet another column. This works more or less, but it adds a boatload of columns and that will throw off my VBA script for everything further to the right for the other data points.

  1. Is there a formula that can be used to compare the two columns of data without splitting the data via the delimiters? Something along the lines of 'find the same pattern of characters from cell B2 anywhere within cell A2'. This would be my absolute preference if possible.

  2. A slight change to #1, is it possible to change that slightly to 'find 7 or more characters from the pattern of characters in cell B2 within cell A2'? This would be greatly helpful in comparing SSN's for when some have leading 0's and some don't! A similar tweak would be for 'find 5 or more numbers in a pattern from B2 in A2' (zip code). Not just the same characters, but the same order of characters.

  3. If there really is no good way to compare the two sets of data without splitting the data into multiple columns, how do I add the right numbers of columns based on the max number of pipes and then have later steps referencing to columns that are now moved over?

Thank you for any ideas and help!!

Image below for reference:

DOBs from Report 1 DOBs from Report 2 Output
1/1/1984 5/7/1986 No Match
6/7/1976 6/7/1976 Match
9/12/1999 | 12/9/1999 5/9/1999 No Match
1/1/1987 | 1/2/1988 | 1/3/1989 1/2/1988 | 1/3/1989 Match

Edits (thank you all so very much for the help and comments, I really appreciate it!):

InStr to find col B in col A from Maya's help: 
Sub InStr_Match() Dim LastRow As Long, i As Long 
LastRow = Cells(Rows.Count, "A").End(xlUp).Row ' get last row with data in column A 
For i = 2 To LastRow 
If InStr(1, Range("A" & i), Range("B" & i), vbt) > 0 Then Range("C" & i).Value2 = "Match" 
Else Range("C" & i).Value2 = "No Match" 
End If 
Next i 
End Sub 

This is adding an answer of Match or No Match in column C after trying to find B in A. I can add an OR and send it to find A in B, which will help with some of the misses.

@JvdV - I need to match only within each row please. A DOB match between the reports but on different rows is not relevant for this purpose. Each row is a 'person' and I'm trying to find if any of their data on Report 1 is a match to any of their data on Report 2. The formatting is a bit of a mix, but I think I can try to standardize that earlier in the macro. The dates tend to be either "mm/dd/yyyy" or "mm-dd-yyyy | mm-dd-yyyy" so I can replace the -'s with /'s to standardize that. I can also throw '=text(A2,"mm/dd/yyyy") at them to standardize them a bit better (I need to tweak that so that it's VBA rather than a formula). I'm on Excel 365 usually.

@Ike - thanks for fixing the chart! The output I'm looking for is match/no match, or true/false. I've added that to the example.

CodePudding user response:

You will have to split the Value in B into single data-points and compare each of them. If one match is enough (and you dont want the next miss-hit to erase it), it should something like:

Sub InStr_Match()

 Dim LastRow As Long, i As Long
 LastRow = Cells(Rows.Count, "A").End(xlUp).Row ' get last row with data in column A
 For i = 2 To LastRow
  Dim arr() As String
  arr = Split(Range("B" & i), " | ")
  Dim name As Variant
    For Each name In arr
      If InStr(1, Range("A" & i), name, vbt) > 0 Then
        Range("C" & i).Value2 = "Match"
        GoTo matchfound
        Else
        Range("C" & i).Value2 = "No Match"
      End If
    Next
   matchfound:
  Next i
End Sub

CodePudding user response:

If you have various of the Office 365 functions, you can do this with a single worksheet formula in one cell:
(You will need to replace Report1 and Report2 with the appropriate range references)

=LET(rpt_1,--TRIM(TEXTSPLIT(TEXTJOIN("|",TRUE,Report1),,"|")),
     rpt_2,--TRIM(TEXTSPLIT(TEXTJOIN("|",TRUE,Report2),,"|")),
     FILTER(rpt_1,ISNUMBER(XMATCH(rpt_1,rpt_2))))

enter image description here

CodePudding user response:

Short answer is that this would be much easier to do via a macro.

This solution is a little "over-engineered", but I wanted to demonstrate converting strings to arrays in VBA and doing quick comparisons.

It will compare the dates in the two columns, and paste the matches in another column:

Public Function InArray(theValue As Variant, theArray) As Boolean
'________________________________________________________________________________________________
' Purpose:
'   -> to assess if a value is in an array
'________________________________________________________________________________________________
' Parameters:
'   - theValue, any type
'   -----> the value to search for in the array
'   - theArray, Array
'   -----> The array to search for the value in
'________________________________________________________________________________________________
' Returns:
'   - Boolean
'   -----> Boolean is True if value is in array, False if not
'________________________________________________________________________________________________
On Error GoTo Err
    Dim iter As Long
    For iter = LBound(theArray) To UBound(theArray)
        If theArray(iter) = theValue Then
            InArray = True
            Exit Function
        End If
    Next iter
    InArray = False
    Exit Function
Err:
    ' on error, tell the user the error and stop the code for debugging
    Call MsgBox("An error occurred in the Public Function InArray. The code will stop:" _
                & vbCrLf & vbCrLf & Err.Description, vbCritical)
    Stop
End Function

Sub CompareDates()
    Dim reportWorksheet As Worksheet
    Dim reportColumnOneInt As Integer
    Dim reportColumnTwoInt As Integer
    Dim matchColumnInt As Integer
    Dim rowPosition As Long
    Dim startingRow As Integer
    Dim endRow As Integer
    Dim columnOneArray As Variant
    Dim columnTwoArray As Variant
    Dim columnOneString As String
    Dim columnTwoString As String
    Dim matchesString As String
    Dim arrayIter As Long
    
    ' Set objects
    Set reportWorksheet = ActiveWorkbook.Sheets("Sheet1")
    ' Set variables
    startingRow = 2
    reportColumnOneInt = 1
    reportColumnTwoInt = 2
    matchColumnInt = 3
    endRow = reportWorksheet.Range("A" & reportWorksheet.Rows.Count).End(xlUp).Row
    ' Iterate down the rows
    For rowPosition = startingRow To endRow
        columnOneString = Replace(CStr(reportWorksheet.Cells(rowPosition, reportColumnOneInt).Value), " ", "")
        columnTwoString = Replace(CStr(reportWorksheet.Cells(rowPosition, reportColumnTwoInt).Value), " ", "")
        ' convert first column values to array
        If InStr(1, columnOneString, "|") = 0 Then
            columnOneArray = Array(columnOneString)
        Else
            columnOneArray = Split(columnOneString, "|")
            
        End If
        ' convert second column values to array
        If InStr(1, columnTwoString, "|") = 0 Then
            columnTwoArray = Array(columnTwoString)
        Else
            columnTwoArray = Split(columnTwoString, "|")
        End If
        ' Iterate over first column, checking to see if values in second column
        matchesString = ""
        For arrayIter = LBound(columnOneArray) To UBound(columnOneArray)
            columnOneArray(arrayIter) = Trim(columnOneArray(arrayIter))
            Debug.Print (columnOneArray(arrayIter))
            'Check if values in first column match any in second
            If InArray(columnOneArray(arrayIter), columnTwoArray) = True Then
                If matchesString = "" Then
                    matchesString = columnOneArray(arrayIter)
                Else
                    matchesString = matchesString & " | " & columnOneArray(arrayIter)
                End If
            End If
        Next arrayIter
        ' set value of match column
        reportWorksheet.Cells(rowPosition, matchColumnInt).Value = matchesString
    Next rowPosition
End Sub

The output would look like this: enter image description here

  • Related