Handle extra column while importing delimited file in adf


I have a CSV file that uses '|' as the delimiter. In some rows, a string value itself contains '|', so those rows end up with an additional column. Whenever I copy the data using a Copy activity, ADF throws an error. How can I skip these particular rows during the copy?

I have tried deleting these rows in the file itself, but the main problem is that I receive new files every day that have to be loaded into the database.

CodePudding user response:

This problem comes up frequently, usually with commas, and there aren't any good answers. Below are my recommendations in order of preference.

If you can control the input file format, I would recommend doing both of these:

  1. Change the file delimiter. Switch to a delimiter that does not occur in your data. This issue arises most often with comma (,) delimiters because commas frequently appear in the underlying data. Pipe (|) is usually a good choice since it rarely occurs organically in text; since that is not the case here, you may need to get more creative and use something like caret (^). Tab (\t) is also a solid option and probably the easiest change to implement.
  2. Wrap the fields in quotes. This allows the text inside the quotes to contain the delimiter character. It is good practice regardless of the delimiter, but it can add significant bloat to the file size depending on the number of rows and columns. You can also choose to quote only the fields that contain the delimiter in their text (see the sketch after this list).
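
As a rough illustration of point 2, here is a minimal Python sketch showing how minimal quoting protects fields that contain the delimiter. The column names and values are purely illustrative, and this assumes you (or the upstream system) can control how the file is written:

```python
import csv
import io

# Illustrative rows: the second record's description contains the pipe delimiter.
rows = [
    ["id", "name", "description"],
    ["1", "Widget", "Standard part"],
    ["2", "Gadget", "Ships with adapter | cable"],
]

buffer = io.StringIO()
# QUOTE_MINIMAL quotes only the fields that contain the delimiter or quote character.
writer = csv.writer(buffer, delimiter="|", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)

print(buffer.getvalue())
# id|name|description
# 1|Widget|Standard part
# 2|Gadget|"Ships with adapter | cable"
```

With quoting in place, ADF's delimited-text dataset can be configured with a quote character so the embedded pipe no longer splits the row into an extra column.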

If you cannot change the input file, you'll need a preprocessing step to remove the offending rows. Essentially, read each line of the original file as a single text value (without parsing) and count the delimiters. If a row has the correct delimiter count, write it out to a secondary file, then use the secondary file for your downstream processing. This would be my last resort because of the data loss, but it might be tolerable in your situation. You could use a Data Flow with a schema-less source dataset to accomplish this step.
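
If you prefer to run the preprocessing outside ADF (for example in an Azure Function or a small batch job before the copy), a minimal Python sketch of the delimiter-count idea might look like the following. The file names and the expected column count are placeholders for your own values:

```python
# Sketch of a preprocessing step: keep only rows with the expected delimiter count.
EXPECTED_COLUMNS = 10          # placeholder: number of columns in the target schema
DELIMITER = "|"

with open("input.csv", "r", encoding="utf-8") as source, \
     open("clean.csv", "w", encoding="utf-8") as clean, \
     open("rejected.csv", "w", encoding="utf-8") as rejected:
    for line in source:
        # A well-formed row has exactly EXPECTED_COLUMNS - 1 delimiters.
        if line.count(DELIMITER) == EXPECTED_COLUMNS - 1:
            clean.write(line)
        else:
            rejected.write(line)   # keep offending rows for later inspection
```

Writing the rejected rows to a separate file, as sketched above, at least preserves them for manual review, which may soften the data-loss concern.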
