Home > Software design >  How to convert CSV to a nested JSON array using Azure Data Factory?
How to convert CSV to a nested JSON array using Azure Data Factory?

Time:10-14

I'm facing a pretty interesting task to convert an arbitrary CSV file to a JSON structure following this schema:

{
    "Data": [
        ["value_1", "value_2"],
        ["value_3", "value_4"]
    ]
}

In this case, the input file will look like this:

value_1,value_2
value_3,value_4

The requirement is to use Azure Data Factory and I won't be able to delegate this task to Azure Functions or other services. I'm thinking about using 'Copy data' activity but can't get my mind around the configuration. TabularTranslator seems to only work with a definite number of columns but the CSV that I can receive can contain any number of columns. Maybe DataFlows can help me but their setup doesn't look to be an easy one either. Plus, if I get it correctly, DataFlows take more time to start up.

So, basically, I just need to take the CSV content and put it into "Data" 2d array.

Any ideas on how to accomplish this?

CodePudding user response:

To achieve this requirement, using Copy data or TabularTranslator is complicated. This can be achieved using dataflows in the following way.

  • First create a source dataset using the following configurations. This allows us to read entire row as a single column value (string):

enter image description here

  • Import the projection and name the column as data. The following is how the data preview looks like:

enter image description here

  • Now, first split these column values using split function in derived column transformations. I am replacing the same column using split(data,',').

enter image description here

  • Then, I have added a key column with a constant value 'x' so that I can group all rows and covert the grouped data into array of arrays.

enter image description here

  • The data would look like this after the above step:

enter image description here

  • Use aggregate transformation to group by the above created column and use collect aggregate function to create array of arrays (collect(data)).

enter image description here

  • Use select transformation to select only the above created column Data.

enter image description here

  • Finally, in the sink, select your destination and create a sink JSON dataset. Choose output to single file in settings and give a file name.

enter image description here

  • Create dataflow pipeline activity and run the above dataflow. The file will be created, and it looks like the following:

enter image description here

  • Related