Home > Enterprise >  Script too slow. How to handle big files?
Script too slow. How to handle big files?

Time:11-10

I need help in working with big Excel Tables.

Description


I have an export of Data from our ERP System that has 400K Rows at least. In this report the format is quite messed up and I want to write a script that will clean up all the data.

I started to write little sub just to delete empty rows and such that have a special behavior. please see below:

Sub Main()
OptimizeVBA (True)
DeleteLastRows
OptimizeVBA (False)
End Sub

Sub DeleteLastRows()
'Achtung, diese Funktion dauert sehr lange
Dim total
total = ActiveSheet.UsedRange.Rows.Count
Dim Tim1 As Single
Tim1 = Timer
For i = total To total - 100 Step -1
    If ThatSpecialLine("0", i, 1, 9) Then
        'DeleteRow (i)
        Rows(i).EntireRow.Delete
        ElseIf EmptyRow(i, 1, 13) Then
        'DeleteRow (i)
        Rows(i).EntireRow.Delete
    End If
Next
Tim1 = Timer - Tim1
MsgBox ("Anzahl der Zeilen nach der Bearbeitung: " & ActiveSheet.UsedRange.Rows.Count & vbNewLine & "Dafür wurde: " & Tim1 & " gebraucht")

End Sub

Function EmptyRow(ByVal Row As Long, ByVal startc As Integer, ByVal EndC As Integer) As Boolean
EmptyRow = True
Dim temp As String
    For i = startc To EndC
     temp = Cells(Row, i).Value
     temp = Trim(temp)
        If temp <> "" Then
            EmptyRow = False
            Exit Function
        End If
    Next
End Function

Function ThatSpecialLine(ByVal val As String, ByVal Row As Long, ByVal startc As Integer, ByVal EndC As Integer) As Boolean
ThatSpecialLine = False
If EmptyRow(Row, startc, EndC) Then
    If Cells(Row, EndC   1).Value = val Then
        ThatSpecialLine = True
    End If
End If
End Function

Sub OptimizeVBA(isOn As Boolean)
    Application.Calculation = IIf(isOn, xlCalculationManual, xlCalculationAutomatic)
    Application.EnableEvents = Not (isOn)
    Application.ScreenUpdating = Not (isOn)
    ActiveSheet.DisplayPageBreaks = Not (isOn)
End Sub

This code needs about 14 seconds to execute for just 100 lines. I am wondering why the performance is so bad. I have no experience in making an application performance optimizing so please be kind if my question is very stupid :).

Questions


  1. Would it be better / faster to export this .xlsx file to .txt file and process with a programm i write in Visual studio with vb.net or C#? this would be my next idea.
  2. How to improve my vba code?

Would it be better / faster to export this .xlsx file to .txt file and process with a programm i write in Visual studio with vb.net or C#? this would be my next idea.

Thanks in advance

CodePudding user response:

There are 2 things in your code that makes the execution slow.

The first thing has something to do with Excel vs VBA. Every time your VBA code needs something from Excel, it has to call an internal interface and that is rather slow. You can't measure this when you have a sheet with a few rows/columns, but in a sheet with 400k rows and (at least) 13 columns of data, you have 5 millions cells, and your code reads most of them 2 times. This can be speed up by reading large chunks of data into an array. This is only one read and for that 5 million cells it's a matter of maybe a second.

The second thing is pure Excel: Deleting a row of data from a worksheet is painfully slow, even if you switch off recalculation and screen update. That means, you should decrease the number of deletes by "collecting" rows to be deleted into a Range variable and then delete them all at once. However, the number of rows collected shouldn't bee too high. I experimented a little bit and 1000 seemed to be reasonable.

Sub DeleteLastRows()
    Const DeleteChunkSize = 1000
    
    Dim lastRow As Long
    With ThisWorkbook.Sheets(1)
        lastRow = .Cells(.Rows.Count, "A").End(xlUp).row
    
        ' Read all Data into Memory
        Dim AllData As Variant
        AllData = .Range(.Cells(1, 1), .Cells(lastRow, 13))
        Debug.Print "data read"
        Dim row As Long
        For row = lastRow To 2 Step -1
            If row Mod 100 = 0 Then DoEvents
            Dim deleteRange As Range, deleteCount As Long
            
            Dim toBeDeleted As Boolean
            toBeDeleted = ThatSpecialLine(AllData, "0", row, 1, 9) Or EmptyRow(AllData, row, 1, 13)
            If toBeDeleted Then
                deleteCount = deleteCount   1
                If deleteRange Is Nothing Then
                    Set deleteRange = .Cells(row, 1).EntireRow
                Else
                    Set deleteRange = Union(deleteRange, .Cells(row, 1).EntireRow)
                End If

                ' Delete only if a certain number of rows to be deleted is reached to speed up runtime
                If deleteCount >= DeleteChunkSize Then
                    DoEvents
                    deleteRange.Delete xlUp
                    Set deleteRange = Nothing
                    deleteCount = 0
                End If
            End If
        
        Next row
    End With
    ' delete the last chunk of data if any
    If Not deleteRange Is Nothing Then
        deleteRange.Delete xlUp
    End If

End Sub

I adapted your helper routine so that they work on the array of data which is passed as argument:

Function EmptyRow(data As Variant, row As Long, startc As Long, EndC As Long) As Boolean
    EmptyRow = True
    Dim temp As String
    
    Dim i As Long
    For i = startc To EndC
        temp = Trim(data(row, i))
        If temp <> "" Then
            EmptyRow = False
            Exit Function
        End If
    Next
End Function

Function ThatSpecialLine(data As Variant, val As String, row As Long, startc As Long, EndC As Long) As Boolean
    If Not EmptyRow(data, row, startc, EndC) Then Exit Function
    ThatSpecialLine = (data(row, EndC   1) = val)
End Function

That code took more or less 1s for 1000 rows that where to be deleted - my example sheet had approx 30% of such rows. That would lead to a runtime in the range of few minutes.

But there is a much faster attempt, assuming that you are only interested in the data, not in formatting. Instead of deleting rows in the Excel sheet, copy the data you want to keep in a second array. When done, delete all data of your sheet and write the copied data back to Excel. This took maybe 2 or 3 seconds in my example sheet with > 800k rows:

Sub CopyRelevantData()
    Dim lastRow As Long
    With ThisWorkbook.Sheets(1)
        lastRow = .Cells(.Rows.Count, "A").End(xlUp).row
    
        ' Read all Data into Memory
        Dim AllData As Variant, newData As Variant
        AllData = .Range(.Cells(1, 1), .Cells(lastRow, 13))
        ' Create a second array where you copy the data you want to keep
        ReDim newData(LBound(AllData, 1) To UBound(AllData, 1), LBound(AllData, 2) To UBound(AllData, 2))
        Debug.Print "data read"
        Dim row As Long, newRow As Long
        For row = 1 To lastRow
            Dim toBeDeleted As Boolean
            toBeDeleted = ThatSpecialLine(AllData, "0", row, 1, 9) Or EmptyRow(AllData, row, 1, 13)
            If Not toBeDeleted Then
                ' Copy this row of data
                newRow = newRow   1
                Dim col As Long
                For col = LBound(AllData, 2) To UBound(AllData, 2)
                    newData(newRow, col) = AllData(row, col)
                Next col
            End If
            If row Mod 100 = 0 Then DoEvents
        Next row
        
        .UsedRange.Clear
        .Cells(1, 1).Resize(UBound(AllData, 1), UBound(AllData, 2)) = newData
    End With
End Sub

CodePudding user response:

Trying adding in this, it speeds up by turning off screen updating, events, animations etc, this should speed it up a bit!

At the start of your code add in this sub

Call TurnOffCode

At the end of your code add in this sub

Call TurnOnCode

This is what they should both look like

Sub TurnOffCode() 'Used to turn off settings to make workbook run faster
Application.Calculation = xlCalculationManual 'Set calculations to manual
Application.ScreenUpdating = False 'Turns off screen updating
Application.EnableEvents = False 'Turns off events
Application.EnableAnimations = False 'Turns off animations
Application.DisplayStatusBar = False 'Turns off display status bar
Application.PrintCommunication = False 'Turns off print communications
End Sub

Sub TurnOnCode() 'Used to turn settings back on to normal
Application.Calculation = xlCalculationAutomatic 'Set calculations to automatic
Application.ScreenUpdating = True 'Turns on screen updating
Application.EnableEvents = True 'Turns on events
Application.EnableAnimations = True 'Turns on animations
Application.DisplayStatusBar = True 'Turns on display status bar
Application.PrintCommunication = True 'Turns on print communications
End Sub

CodePudding user response:

The Solution was to make a quick program in VB.Net with just reading the needed Lines. Also i made some improvements to the code.

The following code needs just 1 sec to read the File in the List(of string()) and to write it back again to .csv

I don't think i will use vba again for big data. Feel free to change my mind.

Imports System.IO
Imports System.IO.File
Imports System.Text

Public Class Form1

    Public Datas As New List(Of String())

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        FileToList()
        DataToFile()
    End Sub


    Sub FileToList()
        Using sr As StreamReader = New StreamReader("Bestand 31.10.2022.CSV", Encoding.Default)
            Dim Time As DateTime = Now
            Dim span As TimeSpan
            Do Until sr.Peek() = -1
                Dim s As String = sr.ReadLine()
                Dim a() As String = s.Split(";")
                If Not EmptyRow(a) Then
                    Datas.Add(a)
                End If
            Loop
            span = Now - Time

            Dim i As Long = Datas.Count
            MessageBox.Show(String.Format("Es sind: {0} Zeilen vorhanden in der Liste" & vbCrLf &
                                          "Dies benötigte: {1}s", i, span.TotalSeconds))
        End Using

    End Sub

    Sub DataToFile()
        Dim Time As DateTime = Now
        Dim span As TimeSpan
        Using fs As FileStream = New FileStream("Test_" & DateTime.Now.ToShortDateString & ".csv", FileMode.Create)
            Using sw As StreamWriter = New StreamWriter(fs, Encoding.Default)
                For i = 0 To Datas.Count - 1
                    sw.WriteLine(Join(Datas(i), ";"))
                Next
            End Using
        End Using

        span = Now - Time
        MessageBox.Show(String.Format("Das Erstellen der neuen Datei hat: {0}s gedauert", span.TotalSeconds))
    End Sub

    Function EmptyRow(ByVal Array As String()) As Boolean

        For i = 0 To Array.Count - 1
            If Array(i) <> "" Then
                Return False
            End If
            Return True
        Next

    End Function


End Class
  • Related