Bulk insert csv file with semicolon as delimiter


I'm trying to import data from a semicolon-separated CSV file into a SQL Server database. Here is the table structure:

CREATE TABLE [dbo].[waste_facility] 
(
    [Id]           INT             IDENTITY (1, 1) NOT NULL,
    [postcode]     VARCHAR (50)    NULL,
    [name]         VARCHAR (50)    NULL,
    [type]         VARCHAR (255)   NULL,
    [street]       VARCHAR (255)   NULL,
    [suburb]       VARCHAR (255)   NULL,
    [municipality] VARCHAR (255)   NULL,
    [telephone]    VARCHAR (255)   NULL,
    [website]      VARCHAR (255)   NULL,
    [longtitude]   DECIMAL (18, 8) NULL,
    [latitude]     DECIMAL (18, 8) NULL,
    PRIMARY KEY CLUSTERED ([Id] ASC)
);

The CSV file is shown below:

Location Coordinate;Feature Extent;Projection;Postcode;Name Of Facility;Type Of Facility;Street;Suburb;Municipality;Telephone Number;Website;Easting Coordinate;Northing Coordinate;Longitude Coordinate;Latitude Coordinate;Google Maps Direction
-37.9421182892,145.3193857967;"{""coordinates"": [145.3193857967, -37.9421182892], ""type"": ""Point""}";MGA zone 55;3156;Cleanaway Lysterfield Resource Recovery Centre;Recovery Centre;840 Wellington Road;LYSTERFIELD;Yarra Ranges;9753 5411;https://www.cleanaway.com.au/location/lysterfield/;352325;5799275;145.31938579674124;-37.94211828921733;https://www.google.com.au/maps/dir//-37.94211828921733,145.31938579674124/@your location,17z/data=!4m2!4m1!3e0
-38.0529529215,145.2433557709;"{""coordinates"": [145.2433557709, -38.0529529215], ""type"": ""Point""}";MGA zone 55;3175;Smart Recycling (South Eastern Depot);Recycling Centre;185 Dandenong-Hastings Rd;LYNDHURST;Greater Dandenong;8787 3300;https://smartrecycling.com.au/;345876;5786853;145.24335577090602;-38.05295292152536;https://www.google.com.au/maps/dir//-38.05295292152536,145.24335577090602/@your location,17z/data=!4m2!4m1!3e0
-38.0533129717,145.267610135;"{""coordinates"": [145.267610135, -38.0533129717], ""type"": ""Point""}";MGA zone 55;3976;Hampton Park Transfer Station (Outlook Environmental);Transfer Station;274 Hallam Road;HAMPTON PARK;Casey;9554 4502;https://www.suez.com.au/en-au/who-we-are/suez-in-australia-and-new-zealand/our-locations/waste-management-hampton-park-transfer-station;348005;5786853;145.2676101350274;-38.053312971691255;https://www.google.com.au/maps/dir//-38.053312971691255,145.2676101350274/@your location,17z/data=!4m2!4m1!3e0
-38.1243050577,145.2183465487;"{""coordinates"": [145.2183465487, -38.1243050577], ""type"": ""Point""}";MGA zone 55;3977;Frankston Regional Recycling and Recovery Centre;Recycling Centre;20 Harold Road;SKYE;Frankston;1300 322 322;https://www.frankston.vic.gov.au/Environment-and-Waste/Waste-and-Recycling/Frankston-Regional-Recycling-and-Recovery-Centre-FRRRC/Accepted-Items-at-FRRRC;343833;5778893;145.21834654873447;-38.12430505770815;https://www.google.com.au/maps/dir//-38.12430505770815,145.21834654873447/@your location,17z/data=!4m2!4m1!3e0
-38.0973208774,145.4920399066;"{""coordinates"": [145.4920399066, -38.0973208774], ""type"": ""Point""}";MGA zone 55;3810;Pakenham Waste Transfer Station (Future Recycling);Transfer Station;30-32 Exchange Drive;PAKENHAM;Cardinia;13Recycling;https://www.futurerecycling.com.au/;367776;5782313;145.4920399066473;-38.09732087738631;https://www.google.com.au/maps/dir//-38.09732087738631,145.4920399066473/@your location,17z/data=!4m2!4m1!3e0

There are some columns that I don't need, so I created a format file to import only the columns I want. The format file is shown below:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharFixed" LENGTH="50"/>
  <FIELD ID="12" xsi:type="CharFixed" LENGTH="50"/>
  <FIELD ID="13" xsi:type="CharFixed" LENGTH="50"/>
  <FIELD ID="2" xsi:type="CharFixed" LENGTH="50" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="3" xsi:type="CharFixed" LENGTH="50" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="4" xsi:type="CharFixed" LENGTH="255" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="5" xsi:type="CharFixed" LENGTH="255" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="6" xsi:type="CharFixed" LENGTH="255" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="7" xsi:type="CharFixed" LENGTH="255" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="8" xsi:type="CharFixed" LENGTH="255" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="9" xsi:type="CharFixed" LENGTH="255" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  <FIELD ID="14" xsi:type="CharFixed" LENGTH="50"/>
  <FIELD ID="15" xsi:type="CharFixed" LENGTH="50"/>
  <FIELD ID="10" xsi:type="CharFixed" LENGTH="41"/>
  <FIELD ID="11" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="41"/>
  <FIELD ID="16" xsi:type="CharFixed" LENGTH="50"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="2" NAME="postcode" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="3" NAME="name" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="4" NAME="type" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="5" NAME="street" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="6" NAME="suburb" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="7" NAME="municipality" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="8" NAME="telephone" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="9" NAME="website" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="10" NAME="longtitude" xsi:type="SQLDECIMAL" PRECISION="18" SCALE="8"/>
  <COLUMN SOURCE="11" NAME="latitude" xsi:type="SQLDECIMAL" PRECISION="18" SCALE="8"/>
 </ROW>
</BCPFORMAT>

Then I tried both BULK INSERT and bcp in, but neither of them works.

Here is the BULK INSERT command:

USE [waste-facility-locations];  

BULK INSERT [dbo].[waste_facility]   
FROM 'E:\onboardingIteration\waste-facility-locations.csv'   
WITH (FORMATFILE = 'E:\onboardingIteration\waste_facility_formatter.xml',
      FIRSTROW = 2,
      LASTROW = 6,
      FIELDTERMINATOR = ';',
      ROWTERMINATOR = '\n',
      ERRORFILE = 'E:\onboardingIteration\myRubbishData.log');  

But unluckily, an error file was generated. Here is what the myRubbishData.log error says:

Row 2 File Offset 1993 ErrorFile Offset 0 - HRESULT 0x80004005

And here is the actual row stored in myRubbishData.txt:

;Pakenham Waste Transfer Station (Future Recycling);Transfer Station;30-32 Exchange Drive;PAKENHAM;Cardinia;13Recycling;https://www.futurerecycling.com.au/;367776;5782313;145.4920399066473;-38.09732087738631;https://www.google.com.au/maps/dir//-38.09732087738631,145.4920399066473/@your location,17z/data=!4m2!4m1!3e0;Pakenham Waste Transfer Station (Future Recycling);Transfer Station;30-32 Exchange Drive;PAKENHAM;Cardinia;13Recycling;https://www.futurerecycling.com.au/;367776;5782313;145.4920399066473;-38.09732087738631;https://www.google.com.au/maps/dir//-38.09

As you can see, it seems the rows are not being separated correctly. So I tried changing the row terminator to "\n", "\r", "\n\r" and "\r\n", but none of them works.

I also tried bcp; it did not work either.

Here is the bcp command I used:

bcp [waste-facility-locations].[dbo].[waste_facility] in "E:\onboardingIteration\waste-facility-locations.csv" -f "E:\onboardingIteration\waste_facility_formatter.xml" -T -S "(LocalDB)\MSSQLLocalDB" -F 2 -t ";" -r "\n"

Then I got an error that says much the same thing:

SQLState = S1000, NativeError = 0
Error = [Microsoft][ODBC Driver 17 for SQL Server]Unexpected EOF encountered in BCP data-file

0 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 1

One interesting thing is that if I create a new Excel workbook and use the "Get Data" option to import the CSV file, it is parsed correctly.

Basically, I can't figure out what I did wrong here. Can someone help me with this?

CodePudding user response:

The SQL Server import facilities are very intolerant of bad data and even of minor formatting variations. In my career, I have literally spent thousands of work-hours developing and debugging import procedures for customers, and I can tell you right now that trying to fix this with SQL alone is both difficult and time-consuming.

When you have this problem (bad data and/or inconsistent formatting), it is almost always easier to find or develop a more flexible tool that pre-processes the data into the rigid format SQL expects. So I would say: if Excel can parse the file, use Excel automation to pre-process it and then have SQL import the Excel output. If that's not practical for you, I'd advise writing your own tool in a client language (C#, VB, Java, Python, etc.) to pre-process the files.

You can do it in SQL (and I have done it many times), but I promise you it is a long, complicated trek.
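
If you do want to stay in T-SQL, a minimal sketch of one common route (load everything into an all-text staging table, then filter and convert) is below. It is not a fix for your format file: the staging table name is only illustrative, and it assumes that no field in the file ever contains an embedded semicolon (true of the sample rows you posted) and that the file's line endings match the ROWTERMINATOR you pick.

-- A throwaway staging table that mirrors the 16 fields of the file as plain text.
-- The table name is only illustrative.
CREATE TABLE [dbo].[waste_facility_staging]
(
    [location_coordinate] VARCHAR (100) NULL,
    [feature_extent]      VARCHAR (500) NULL,
    [projection]          VARCHAR (50)  NULL,
    [postcode]            VARCHAR (50)  NULL,
    [name]                VARCHAR (255) NULL,
    [type]                VARCHAR (255) NULL,
    [street]              VARCHAR (255) NULL,
    [suburb]              VARCHAR (255) NULL,
    [municipality]        VARCHAR (255) NULL,
    [telephone]           VARCHAR (255) NULL,
    [website]             VARCHAR (500) NULL,
    [easting]             VARCHAR (50)  NULL,
    [northing]            VARCHAR (50)  NULL,
    [longitude]           VARCHAR (50)  NULL,
    [latitude]            VARCHAR (50)  NULL,
    [maps_direction]      VARCHAR (500) NULL
);

-- Load every field as text; no format file needed, only the terminators.
-- Quoted fields keep their quotes, but those columns are discarded anyway.
-- If the file uses bare LF line endings, use ROWTERMINATOR = '0x0a' instead.
BULK INSERT [dbo].[waste_facility_staging]
FROM 'E:\onboardingIteration\waste-facility-locations.csv'
WITH (FIRSTROW = 2,
      FIELDTERMINATOR = ';',
      ROWTERMINATOR = '\n');

-- Copy only the wanted columns into the real table, converting the numerics.
INSERT INTO [dbo].[waste_facility]
    ([postcode], [name], [type], [street], [suburb], [municipality],
     [telephone], [website], [longtitude], [latitude])
SELECT [postcode], [name], [type], [street], [suburb], [municipality],
       [telephone], [website],
       TRY_CAST([longitude] AS DECIMAL (18, 8)),
       TRY_CAST([latitude]  AS DECIMAL (18, 8))
FROM [dbo].[waste_facility_staging];

Once every field lands as plain text, the unwanted columns simply never get selected, and a bad coordinate shows up as a NULL from TRY_CAST instead of killing the whole load.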

SSIS has more flexible error handling for problems like this, but if you are not already familiar with it, it has a very steep learning curve, and your first SSIS project is likely to be very time-consuming as well.
