I need help in working with big Excel Tables.
Description
I have an export of Data from our ERP System that has 400K Rows at least. In this report the format is quite messed up and I want to write a script that will clean up all the data.
I started to write little sub just to delete empty rows and such that have a special behavior. please see below:
Sub Main()
OptimizeVBA (True)
DeleteLastRows
OptimizeVBA (False)
End Sub
Sub DeleteLastRows()
'Achtung, diese Funktion dauert sehr lange
Dim total
total = ActiveSheet.UsedRange.Rows.Count
Dim Tim1 As Single
Tim1 = Timer
For i = total To total - 100 Step -1
If ThatSpecialLine("0", i, 1, 9) Then
'DeleteRow (i)
Rows(i).EntireRow.Delete
ElseIf EmptyRow(i, 1, 13) Then
'DeleteRow (i)
Rows(i).EntireRow.Delete
End If
Next
Tim1 = Timer - Tim1
MsgBox ("Anzahl der Zeilen nach der Bearbeitung: " & ActiveSheet.UsedRange.Rows.Count & vbNewLine & "Dafür wurde: " & Tim1 & " gebraucht")
End Sub
Function EmptyRow(ByVal Row As Long, ByVal startc As Integer, ByVal EndC As Integer) As Boolean
EmptyRow = True
Dim temp As String
For i = startc To EndC
temp = Cells(Row, i).Value
temp = Trim(temp)
If temp <> "" Then
EmptyRow = False
Exit Function
End If
Next
End Function
Function ThatSpecialLine(ByVal val As String, ByVal Row As Long, ByVal startc As Integer, ByVal EndC As Integer) As Boolean
ThatSpecialLine = False
If EmptyRow(Row, startc, EndC) Then
If Cells(Row, EndC 1).Value = val Then
ThatSpecialLine = True
End If
End If
End Function
Sub OptimizeVBA(isOn As Boolean)
Application.Calculation = IIf(isOn, xlCalculationManual, xlCalculationAutomatic)
Application.EnableEvents = Not (isOn)
Application.ScreenUpdating = Not (isOn)
ActiveSheet.DisplayPageBreaks = Not (isOn)
End Sub
This code needs about 14 seconds to execute for just 100 lines. I am wondering why the performance is so bad. I have no experience in making an application performance optimizing so please be kind if my question is very stupid :).
Questions
- Would it be better / faster to export this .xlsx file to .txt file and process with a programm i write in Visual studio with vb.net or C#? this would be my next idea.
- How to improve my vba code?
Would it be better / faster to export this .xlsx file to .txt file and process with a programm i write in Visual studio with vb.net or C#? this would be my next idea.
Thanks in advance
CodePudding user response:
There are 2 things in your code that makes the execution slow.
The first thing has something to do with Excel vs VBA. Every time your VBA code needs something from Excel, it has to call an internal interface and that is rather slow. You can't measure this when you have a sheet with a few rows/columns, but in a sheet with 400k rows and (at least) 13 columns of data, you have 5 millions cells, and your code reads most of them 2 times. This can be speed up by reading large chunks of data into an array. This is only one read and for that 5 million cells it's a matter of maybe a second.
The second thing is pure Excel: Deleting a row of data from a worksheet is painfully slow, even if you switch off recalculation and screen update. That means, you should decrease the number of deletes by "collecting" rows to be deleted into a Range variable and then delete them all at once. However, the number of rows collected shouldn't bee too high. I experimented a little bit and 1000 seemed to be reasonable.
Sub DeleteLastRows()
Const DeleteChunkSize = 1000
Dim lastRow As Long
With ThisWorkbook.Sheets(1)
lastRow = .Cells(.Rows.Count, "A").End(xlUp).row
' Read all Data into Memory
Dim AllData As Variant
AllData = .Range(.Cells(1, 1), .Cells(lastRow, 13))
Debug.Print "data read"
Dim row As Long
For row = lastRow To 2 Step -1
If row Mod 100 = 0 Then DoEvents
Dim deleteRange As Range, deleteCount As Long
Dim toBeDeleted As Boolean
toBeDeleted = ThatSpecialLine(AllData, "0", row, 1, 9) Or EmptyRow(AllData, row, 1, 13)
If toBeDeleted Then
deleteCount = deleteCount 1
If deleteRange Is Nothing Then
Set deleteRange = .Cells(row, 1).EntireRow
Else
Set deleteRange = Union(deleteRange, .Cells(row, 1).EntireRow)
End If
' Delete only if a certain number of rows to be deleted is reached to speed up runtime
If deleteCount >= DeleteChunkSize Then
DoEvents
deleteRange.Delete xlUp
Set deleteRange = Nothing
deleteCount = 0
End If
End If
Next row
End With
' delete the last chunk of data if any
If Not deleteRange Is Nothing Then
deleteRange.Delete xlUp
End If
End Sub
I adapted your helper routine so that they work on the array of data which is passed as argument:
Function EmptyRow(data As Variant, row As Long, startc As Long, EndC As Long) As Boolean
EmptyRow = True
Dim temp As String
Dim i As Long
For i = startc To EndC
temp = Trim(data(row, i))
If temp <> "" Then
EmptyRow = False
Exit Function
End If
Next
End Function
Function ThatSpecialLine(data As Variant, val As String, row As Long, startc As Long, EndC As Long) As Boolean
If Not EmptyRow(data, row, startc, EndC) Then Exit Function
ThatSpecialLine = (data(row, EndC 1) = val)
End Function
That code took more or less 1s for 1000 rows that where to be deleted - my example sheet had approx 30% of such rows. That would lead to a runtime in the range of few minutes.
But there is a much faster attempt, assuming that you are only interested in the data, not in formatting. Instead of deleting rows in the Excel sheet, copy the data you want to keep in a second array. When done, delete all data of your sheet and write the copied data back to Excel. This took maybe 2 or 3 seconds in my example sheet with > 800k rows:
Sub CopyRelevantData()
Dim lastRow As Long
With ThisWorkbook.Sheets(1)
lastRow = .Cells(.Rows.Count, "A").End(xlUp).row
' Read all Data into Memory
Dim AllData As Variant, newData As Variant
AllData = .Range(.Cells(1, 1), .Cells(lastRow, 13))
' Create a second array where you copy the data you want to keep
ReDim newData(LBound(AllData, 1) To UBound(AllData, 1), LBound(AllData, 2) To UBound(AllData, 2))
Debug.Print "data read"
Dim row As Long, newRow As Long
For row = 1 To lastRow
Dim toBeDeleted As Boolean
toBeDeleted = ThatSpecialLine(AllData, "0", row, 1, 9) Or EmptyRow(AllData, row, 1, 13)
If Not toBeDeleted Then
' Copy this row of data
newRow = newRow 1
Dim col As Long
For col = LBound(AllData, 2) To UBound(AllData, 2)
newData(newRow, col) = AllData(row, col)
Next col
End If
If row Mod 100 = 0 Then DoEvents
Next row
.UsedRange.Clear
.Cells(1, 1).Resize(UBound(AllData, 1), UBound(AllData, 2)) = newData
End With
End Sub
CodePudding user response:
Trying adding in this, it speeds up by turning off screen updating, events, animations etc, this should speed it up a bit!
At the start of your code add in this sub
Call TurnOffCode
At the end of your code add in this sub
Call TurnOnCode
This is what they should both look like
Sub TurnOffCode() 'Used to turn off settings to make workbook run faster
Application.Calculation = xlCalculationManual 'Set calculations to manual
Application.ScreenUpdating = False 'Turns off screen updating
Application.EnableEvents = False 'Turns off events
Application.EnableAnimations = False 'Turns off animations
Application.DisplayStatusBar = False 'Turns off display status bar
Application.PrintCommunication = False 'Turns off print communications
End Sub
Sub TurnOnCode() 'Used to turn settings back on to normal
Application.Calculation = xlCalculationAutomatic 'Set calculations to automatic
Application.ScreenUpdating = True 'Turns on screen updating
Application.EnableEvents = True 'Turns on events
Application.EnableAnimations = True 'Turns on animations
Application.DisplayStatusBar = True 'Turns on display status bar
Application.PrintCommunication = True 'Turns on print communications
End Sub
CodePudding user response:
The Solution was to make a quick program in VB.Net with just reading the needed Lines. Also i made some improvements to the code.
The following code needs just 1 sec to read the File in the List(of string()) and to write it back again to .csv
I don't think i will use vba again for big data. Feel free to change my mind.
Imports System.IO
Imports System.IO.File
Imports System.Text
Public Class Form1
Public Datas As New List(Of String())
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
FileToList()
DataToFile()
End Sub
Sub FileToList()
Using sr As StreamReader = New StreamReader("Bestand 31.10.2022.CSV", Encoding.Default)
Dim Time As DateTime = Now
Dim span As TimeSpan
Do Until sr.Peek() = -1
Dim s As String = sr.ReadLine()
Dim a() As String = s.Split(";")
If Not EmptyRow(a) Then
Datas.Add(a)
End If
Loop
span = Now - Time
Dim i As Long = Datas.Count
MessageBox.Show(String.Format("Es sind: {0} Zeilen vorhanden in der Liste" & vbCrLf &
"Dies benötigte: {1}s", i, span.TotalSeconds))
End Using
End Sub
Sub DataToFile()
Dim Time As DateTime = Now
Dim span As TimeSpan
Using fs As FileStream = New FileStream("Test_" & DateTime.Now.ToShortDateString & ".csv", FileMode.Create)
Using sw As StreamWriter = New StreamWriter(fs, Encoding.Default)
For i = 0 To Datas.Count - 1
sw.WriteLine(Join(Datas(i), ";"))
Next
End Using
End Using
span = Now - Time
MessageBox.Show(String.Format("Das Erstellen der neuen Datei hat: {0}s gedauert", span.TotalSeconds))
End Sub
Function EmptyRow(ByVal Array As String()) As Boolean
For i = 0 To Array.Count - 1
If Array(i) <> "" Then
Return False
End If
Return True
Next
End Function
End Class