Read a very large delimited file and insert data into SQL after a number of lines


In my project, I'm reading a very large delimited file and building a StringBuilder full of INSERT statements. Once the entire file has been read, I execute the contents of the StringBuilder, which by then holds a large number of INSERT statements with data. However, when I execute the statement, it times out because of the large amount of data (about 132K rows or more). I would like to know the best way to split that into multiple chunks and execute them separately. The original code executed one row at a time, which creates a lot of traffic on the database for just one process.

Each row in the text file has a SQL statement like this:

INSERT INTO dbo.EmployeesPayrollDataImport (SSN, EID, FullName,  [Status],DepartmentCode, SSNPrior) VALUES ( ....

This is the code that executes the script, but when I run it, I get a timeout error before it finishes. I could increase the command timeout, but if the file gets larger I would run into the same problem. I have also tried Bulk Insert, but I don't have admin permissions on the SQL Server, only read/write permissions, and I think with BulkCopy the file has to reside on the SQL Server, which I don't have access to.

            Using Cn As New SqlConnection(sConnection)
                Cn.Open()
                Using sqlCmd As New SqlCommand(s_SQLScript.ToString, Cn)
                    sqlCmd.CommandTimeout = 2000
                    rows = CInt(sqlCmd.ExecuteNonQuery)

                End Using 'SqlCmd
            End Using 'Cn

I would like to be able to execute my command in multiple chunks.
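
To make this concrete, the kind of chunked execution I have in mind looks roughly like this (just a sketch; s_Statements stands in for the individual INSERT lines I currently append to the StringBuilder, and the batch size is arbitrary):

    'Sketch only: run the accumulated INSERT statements in batches of N
    'instead of one huge script.
    Const BatchSize As Integer = 5000
    Dim batch As New StringBuilder()
    Dim count As Integer = 0

    Using Cn As New SqlConnection(sConnection)
        Cn.Open()
        For Each statement As String In s_Statements 'stand-in for my INSERT lines
            batch.AppendLine(statement)
            count += 1
            If count = BatchSize Then
                Using sqlCmd As New SqlCommand(batch.ToString(), Cn)
                    sqlCmd.ExecuteNonQuery()
                End Using
                batch.Clear()
                count = 0
            End If
        Next
        'Flush whatever is left over after the loop
        If batch.Length > 0 Then
            Using sqlCmd As New SqlCommand(batch.ToString(), Cn)
                sqlCmd.ExecuteNonQuery()
            End Using
        End If
    End Using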

CodePudding user response:

You want to look into either the SQLBulkCopy type or the BULK INSERT SQL command. Either of those will give you far faster performance, and should also help eliminate the issue in the question.
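
For example, a minimal SqlBulkCopy sketch might look something like this. SqlBulkCopy streams rows from the client, so unlike BULK INSERT the file does not need to reside on the SQL Server. The column layout and types here are guessed from the INSERT statement in the question, and the parsing is simplified in the same way as the example further down (this needs Imports System.Data, System.Data.SqlClient, System.IO and System.Linq):

Public Sub BulkLoadEmployeePayrollData(filePath As String, connectionString As String)
    ' Build an in-memory DataTable shaped like the destination table (types are guesses)
    Dim table As New DataTable()
    table.Columns.Add("SSN", GetType(String))
    table.Columns.Add("EID", GetType(String))
    table.Columns.Add("FullName", GetType(String))
    table.Columns.Add("Status", GetType(Integer))
    table.Columns.Add("DepartmentCode", GetType(String))
    table.Columns.Add("SSNPrior", GetType(String))

    For Each line In File.ReadLines(filePath).Skip(1) 'skip the header
        Dim parts = line.Split(","c)
        table.Rows.Add(parts(0), parts(1), parts(2), CInt(parts(3)), parts(4), parts(5))
    Next

    Using bulk As New SqlBulkCopy(connectionString)
        bulk.DestinationTableName = "dbo.EmployeesPayrollDataImport"
        bulk.BatchSize = 5000 'send rows to the server in chunks
        For Each col As DataColumn In table.Columns
            bulk.ColumnMappings.Add(col.ColumnName, col.ColumnName)
        Next
        bulk.WriteToServer(table)
    End Using
End Sub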

But honestly, what I tend to do is code that looks more like this:

' Needs Imports System.Data, System.Data.SqlClient, System.IO and System.Linq at the top of the file
Public Sub InsertEmployeePayrollDataRecords(filePath As String)
    Dim SQL As String = "INSERT INTO dbo.EmployeesPayrollDataImport (SSN, EID, FullName,  [Status],DepartmentCode, SSNPrior) VALUES (@SSN, @EID, @FullName, @Status, @DepartmentCode, @SSNPrior)"

    Dim lines = File.ReadLines(filePath).
          Skip(1).
          Select(Function(line) line.Split(","c))
        ' Skip(1) skips the header row.
        ' In real code, I NEVER use String.Split for this
        '  (pull a real CSV parser from NuGet instead),
        '  but I find it useful for examples on Stack Overflow.
     
    Using con As New SqlConnection("connection string here"), _
          cmd As New SqlCommand(SQL, con)

        cmd.Parameters.Add("@SSN", SqlDbType.Char, 9)
        cmd.Parameters.Add("@EID", SqlDbType.Char, 14)
        cmd.Parameters.Add("@FullName", SqlDbType.NVarChar, 60)
        cmd.Parameters.Add("@Status" SqlDbType.Int)
        cmd.Parameters.Add("@DepartmentCode", SqlDbType.Char, 5)
        cmd.Parameters.Add("@SSNPrior", SqlDbType.Char, 9)

        con.Open()
        For Each line() As String In lines
            cmd.Parameters("@SSN").Value = line(0)
            cmd.Parameters("@EID").Value = line(1)
            cmd.Parameters("@FullName").Value = line(2)
            cmd.Parameters("@Status").Value = line(3)
            cmd.Parameters("@DepartmentCode").Value = line(4)
            cmd.Parameters("@SSNPrior").Value = line(5)             
            cmd.ExecuteNonQuery()
        Next
    End Using
End Sub

Some notes on this:

  • It only opens ONE connection to the database, which will save your system a lot of work.
  • Because of the query parameters, SQL Server will see each of these as the same query. This will enable it to re-use the execution plan and save most of the compilation step.
  • Even though SQL Server sees the same query, it will still use the correct values on each iteration
  • These values are safe from any possibility of injection
  • This can start executing INSERT statements as soon as the first line is ready, instead of needing to wait until the last line as with the original code
  • This only needs enough memory to handle one line from the file at a time, instead of the entire file
  • Notice the loop itself does the minimum work possible: update existing parameter values and then execute the statement. This makes it as fast as possible using an iterative solution (BULK solutions are still faster)
  • You don't have to mess with CommandTimeout anymore, because each iteration is a quick single INSERT
  • I had to guess at column types. You should use actual column types from the database.

The one thing you should NOT do is use string concatenation or StringBuilder to include the values in your INSERT statements. This is begging to have problems. It's only a matter of time until someone puts a malicious single quote into your file and uses it to run their own queries. String concatenation/StringBuilder can still have a place in an operation like this, but only for adding query parameter markers. NEVER the values themselves.

Since you are interested in using chunks: if you wanted to, you could further optimize this by keeping multiple copies of the same statement with a number appended to the parameter names, along with an internal counter that tracks which set of numbered parameters to use for each row and when to execute the full batch. But I would only expect a minimal performance improvement.
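
A rough sketch of that idea is below. The ExecuteBatch helper, the batch size of 100 and the use of AddWithValue are illustrative rather than anything from your code; keep in mind SQL Server allows at most 2100 parameters per command, so rows-per-batch times columns-per-row has to stay under that limit.

' Illustrative only: build one multi-row INSERT per chunk, with numbered parameters.
' The StringBuilder adds parameter MARKERS only - never the values themselves.
Private Sub ExecuteBatch(con As SqlConnection, rows As List(Of String()))
    Dim sql As New StringBuilder(
        "INSERT INTO dbo.EmployeesPayrollDataImport " &
        "(SSN, EID, FullName, [Status], DepartmentCode, SSNPrior) VALUES ")

    Using cmd As New SqlCommand()
        cmd.Connection = con
        For i = 0 To rows.Count - 1
            If i > 0 Then sql.Append(", ")
            sql.Append($"(@SSN{i}, @EID{i}, @FullName{i}, @Status{i}, @DepartmentCode{i}, @SSNPrior{i})")
            cmd.Parameters.AddWithValue($"@SSN{i}", rows(i)(0))
            cmd.Parameters.AddWithValue($"@EID{i}", rows(i)(1))
            cmd.Parameters.AddWithValue($"@FullName{i}", rows(i)(2))
            cmd.Parameters.AddWithValue($"@Status{i}", CInt(rows(i)(3)))
            cmd.Parameters.AddWithValue($"@DepartmentCode{i}", rows(i)(4))
            cmd.Parameters.AddWithValue($"@SSNPrior{i}", rows(i)(5))
        Next
        cmd.CommandText = sql.ToString()
        cmd.ExecuteNonQuery()
    End Using
End Sub

The caller would buffer rows into a List(Of String()), call ExecuteBatch every 100 rows, and call it once more for the remainder, reusing the already-open connection from the example above.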

CodePudding user response:

You could try this a few ways. One would be to split the work into tasks, limiting the number of entries per task, and then use a sleep to wait for one task to end before starting the next.
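
Something along these lines, as a sketch: the ChunkSize value and the InsertChunk helper are made up for illustration, and Task.Wait stands in for the sleep, since it blocks until the previous chunk has finished.

' Sketch: process the file in fixed-size chunks, one task at a time.
' InsertChunk is a hypothetical helper that inserts one chunk of parsed rows.
Const ChunkSize As Integer = 10000
Dim chunk As New List(Of String())

For Each line In File.ReadLines(filePath).Skip(1) 'skip the header
    chunk.Add(line.Split(","c))
    If chunk.Count = ChunkSize Then
        Dim work = chunk.ToArray()
        Task.Run(Sub() InsertChunk(work)).Wait() 'wait for this chunk before starting the next
        chunk.Clear()
    End If
Next

If chunk.Count > 0 Then
    Task.Run(Sub() InsertChunk(chunk.ToArray())).Wait() 'leftover rows
End If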
