Home > Enterprise >  Fastest way to delete records on excel file based on value in another excel file
Fastest way to delete records on excel file based on value in another excel file

Time:12-08

I need guidance on the following. I have a file with 150000 records (excel). Another excel file with 5000-6000 records is received and need to delete the row based on some criteria of the info from the second file.

I use Dictionary function to collect the second file data in dictionary -

IntI = 2
Do While wbk.Sheets("Sheet1").Cells(IntI, 1).Value <> ""
    strAgNo = wbk.Sheets("Sheet1").Cells(IntI, 8).Value
    If Dict.Exists(strAgNo) Then
    Else
        Dict.Add Key:=strAgNo, Item:=IntI
    End If
    IntI = IntI   1
Loop
wbk.Close SaveChanges:=False

Then based on criteria of second file record, sering the first file using Range Find command (rgFound is Object)-

For n = 0 To Dict.Count - 1
    strAgNo = Dict.Keys(n)
    Set rgFound = Range("G:G").Find(strAgNo)
    If rgFound Is Nothing Then
        intNotSetlAg = intNotSetlAg   1
    Else
        FoundRow = rgFound.Row
        intSetlAg = intSetlAg   1
        Rows(FoundRow).Select
        wbk.Sheets("Details").Rows(FoundRow).Delete
    End If
Next n

This is working fine. However for 160000 to 180000 records in first file and 5 to 6K rows (to be deleted in first file) it takes 40-45 minutes. Need guidance for this in excel vba.

CodePudding user response:

Following from my comment above. This ran in ~20sec for me (150k rows of data, 5k random values to be deleted)

EDIT: refactored a bit...

Sub DeleteMatches()
    
    Dim dict As Object, arr, n As Long, t
    Dim col As New Collection
    
    'create some sample data
    With Sheet1.[A2:A150000]
        .Formula = "=""Val_"" & TEXT(ROW()-1,""00000000"")"
        .Value = .Value
    End With
    
    t = Timer
    
    'load the ids to be deleted
    'tested with 5k rows of `="Val_" & TEXT(RANDBETWEEN(1,150000),"00000000")`
    Set dict = UniquesFromColumn(Sheet2.Range("A2"))
    Debug.Print "Loaded Ids: " & Timer - t
    
    'load the sheet1 id column into an array and scan through it,
    '  collecting any matched rows in the Collection
    arr = Sheet1.Range("A1", Sheet1.Cells(Rows.Count, 1).End(xlUp)).Value
    For n = 2 To UBound(arr, 1) 'skip header row if present
        If dict.exists(arr(n, 1)) Then col.Add Sheet1.Cells(n, 1)
    Next n
    Debug.Print "Scanned sheet1 for matches: " & Timer - t
    
    DeleteRows col 'delete the collected rows
    Debug.Print "Deleted " & col.Count & " rows: " & Timer - t
        
End Sub

'return a dictionary of unique values from a column, starting at `startCell`
Function UniquesFromColumn(startCell As Range) As Object
    Dim dict As Object, arr, n As Long, v
    Set dict = CreateObject("scripting.dictionary")
    With startCell.Parent
        arr = .Range(startCell, .Cells(.Rows.Count, 1).End(xlUp)).Value
    End With
    For n = 1 To UBound(arr)
        v = arr(n, 1)
        If Len(v) > 0 Then dict(v) = dict(v)   1
    Next n
    Set UniquesFromColumn = dict
End Function

'delete all rows based on a collection of cells
Sub DeleteRows(col As Collection)
    Dim rng As Range, n As Long, i As Long
    If col.Count = 0 Then Exit Sub
    'loop over the cells in the collection, building ranges for deletion
    For n = col.Count To 1 Step -1
        If rng Is Nothing Then
            Set rng = col(n)
            i = 1
        Else
            Set rng = Application.Union(rng, col(n))
            i = i   1
            If i > 200 Then 'union gets slow after a point, so delete and reset
                rng.EntireRow.Delete
                Set rng = Nothing
            End If
        End If
    Next n
    If Not rng Is Nothing Then rng.EntireRow.Delete 'any last rows?
End Sub

CodePudding user response:

Let Excel do all the dirty work using native features.

Logic:

  1. Use Remove duplicates to get unique entries from Sheet1
  2. Store the above data in an array
  3. Store Col G from 2nd File in an array
  4. Search 1st array in 2nd array and if found replace value by say "DELME"
  5. Write the 2nd array back to the worksheet
  6. Autofilter column G on "DELME" and delete all rows in 1 go.

Test Conditions I used

The total number of unique IDs in File A and the number of these IDs in File B will always influence the time taken by the code.

The other thing that will influence the time taken by the code is your hardware specs.

I test the below code on

  1. ★ CPU ★ Ryzen 5 5800X
  2. ★ GPU ★ NVIDIA GeForce RTX 3080 TI Founder Edition
  3. ★ RAM ★ G. Skill Trident Z RGB 32GB 3600MHz

See what time does this code give against your data?

File A: 6000 records out of which there are 2500 unique values.

File B: 150000 records which has 20830 duplicates values to be removed.

To do a stess test, I used these 2 files.

Sample Test Files

Code:

Here is the code that I tested

Option Explicit

'~~> This is the 2nd file. Change as applicable
Private Const fileA As String = "C:\Users\routs\Desktop\Delete Me Later\FileA.xlsx"
'~~> This is the 1st file. Change as applicable
Private Const fileB As String = "C:\Users\routs\Desktop\Delete Me Later\FileB.xlsx"

Sub Sample()
    Debug.Print Now
    
    Dim wbA As Workbook
    Dim wsA As Worksheet
    
    Set wbA = Workbooks.Open(fileA)
    
    '~~> This is the relevant sheet
    Set wsA = wbA.Sheets("Sheet1")
    
    Dim lRow As Long
    Dim lCol As Long
    Dim arA As Variant
    
    With wsA
        '~~> Remove any filters
        .AutoFilterMode = False
        
        '~~> Find last row and last column
        lCol = .Cells(1, .Columns.Count).End(xlToLeft).Column
        lRow = .Range("H" & .Rows.Count).End(xlUp).Row
    
        '~~> Use Excel remove duplicates to delete duplicates
        .Range("A1:" & Split(.Cells(, lCol).Address, "$")(1) & lRow).RemoveDuplicates Columns:=8, Header:=xlYes
    
        '~~> Find the next last row
        lRow = .Range("H" & .Rows.Count).End(xlUp).Row
    
        '~~> Store the data in an array
        arA = .Range("H2:H" & lRow).Value2
    End With
    
    Debug.Print "ID Array has " & lRow & " items"
    
    wbA.Close (False)
    
    Dim wbB As Workbook
    Dim wsB As Worksheet
    
    Set wbB = Workbooks.Open(fileB)
    
    '~~> This is the relevant sheet
    Set wsB = wbB.Sheets("Sheet1")
    
    Dim arB As Variant
    Dim lastCol As String
    Dim oldRow As Long, newRow As Long
    
    With wsB
        '~~> Remove any filters
        .AutoFilterMode = False
            
        '~~> Find last row and last column
        lCol = .Cells(1, .Columns.Count).End(xlToLeft).Column
        lastCol = Split(.Cells(, lCol).Address, "$")(1)
        
        lRow = .Range("G" & .Rows.Count).End(xlUp).Row
        oldRow = lRow
        
        Debug.Print "Main Array has " & lRow & " items"
    
        '~~> Store the data in an array
        arB = .Range("G2:G" & lRow).Value2
    End With
    
    Dim i As Long
    Dim j As Long
    Dim k As Long
    
    For i = LBound(arA) To UBound(arA)
        For j = LBound(arB) To UBound(arB)
            If arB(j, 1) = arA(i, 1) Then arB(j, 1) = "DELME"
        Next j
    Next i
    
    Dim Rng As Range
    
    With wsB
        .Range("G2").Resize(UBound(arB), 1).value = arB
        
        Set Rng = .Range("A1:" & lastCol & lRow)
        
        With Rng
            '~~> Filter, offset(to exclude headers) and delete visible rows
            With Rng
              .AutoFilter Field:=7, Criteria1:="DELME"
              .Offset(1, 0).SpecialCells(xlCellTypeVisible).EntireRow.Delete
            End With
        End With
        
        '~~> Remove any filters
        .AutoFilterMode = False
        
        lRow = .Range("G" & .Rows.Count).End(xlUp).Row
        newRow = lRow
        
        Debug.Print "Total " & (oldRow - newRow) & " items were removed."
    End With
    
    Debug.Print Now
End Sub

Output

The code took 58 seconds on this particular test data.

08-12-2021 13:16:51 
ID Array has 2500 items
Main Array has 150000 items
Total 20830 items were removed.
08-12-2021 13:17:49 
  • Related