Home > Software engineering >  How to create a pipeline and perform copy activity in Azure data factory for the tables related to f
How to create a pipeline and perform copy activity in Azure data factory for the tables related to f

Time:10-12

I am new to the Azure data factory and looking to copy CSV data into my table having a foreign key relationship.
Here are my tables:

Customer table

CREATE TABLE [dbo].[Customer]
(
    [Id] UNIQUEIDENTIFIER NOT NULL PRIMARY KEY, -- Primary Key column
    [CustomerNumber] NVARCHAR(50) NOT NULL,
    [FirstName] NVARCHAR(50) NOT NULL,
    [LastName] NVARCHAR(50) NOT NULL,
    [CreatedOn] datetime,
    [CreatedBy] NVARCHAR(255),
    [ModifiedOn] datetime,
    [ModifiedBy] NVARCHAR(255)
);
GO

-- Insert rows into table 'Customer' in schema '[dbo]'
INSERT INTO [dbo].[Customer]
VALUES
( 
 NEWID(),'Tom123', 'Tom', 'Shehu',GETDATE(),'test',GETDATE(),'admin'
),
( 
 NEWID(),'Harol234', 'Harold', 'Haoxa',GETDATE(),'test',GETDATE(),'admin'
),
( 
 NEWID(),'Peter345', 'Peter', 'Begu',GETDATE(),'test',GETDATE(),'admin'
),
( 
 NEWID(),'Marlin09', 'Marlin', 'Hysi',GETDATE(),'test',GETDATE(),'admin'
)

GO

Product Table

CREATE TABLE [dbo].[Product]
(
    [Id] UNIQUEIDENTIFIER NOT NULL PRIMARY KEY, -- Primary Key column
    [Name] NVARCHAR(50) NOT NULL,
    [ErpNumber] NVARCHAR(50) NOT NULL,
    [Description] NVARCHAR(50) NOT NULL,
    [CreatedOn] datetime,
    [CreatedBy] NVARCHAR(255),
    [ModifiedOn] datetime,
    [ModifiedBy] NVARCHAR(255)
);
GO

-- Insert rows into table 'Product' in schema '[dbo]'
INSERT INTO [dbo].[Product]
VALUES
( 
 NEWID(), 'EI500CMZ', 'EI500CMZ','7-Day test product',GETDATE(),'Tom',GETDATE(),'Tom'
),
( 
 NEWID(), 'ST0SMX', 'ST0SMX','7-Day heavy duty product',GETDATE(),'Tom',GETDATE(),'Tom'
),
( 
 NEWID(), 'EH30MZ', 'EH30MZ','Electronic water test product',GETDATE(),'Tom',GETDATE(),'Tom'
)

CustomerProduct  table

CREATE TABLE [dbo].[CustomerProduct]
(
    [Id] UNIQUEIDENTIFIER NOT NULL PRIMARY KEY, -- Primary Key column
    [CustomerId] UNIQUEIDENTIFIER NOT NULL,
    [ProductId] UNIQUEIDENTIFIER NOT NULL,
    [Name] NVARCHAR(255) NOT NULL,
    [CreatedOn] datetime,
    FOREIGN KEY(CustomerId) REFERENCES Customer(Id),
    FOREIGN KEY(ProductId) REFERENCES Product(Id)

);
GO

Below is my CSV file data:
CustomerNumber,ErpNumber,Name
Tom123,EI500CMZ,EI500CMZ2340
Harol234,ST0SMX,ST0SMX74770
Peter345,EH30MZ,EH30MZ00234

Now I am looking to insert data into my 3rd table i.e CustomerProduct but I am not understanding how that "CustomerId", "ProductId" and "Name" will get inserted. In the above CSV data, I am getting the "CustomerNumber" and "ErpNumber" but during the insertion "CustomerId" and "ProductId" should go in the table.

Not understanding how to do this.

So far I have done this in the Azure data factory:

  1. Created a blob storage account. Added a container in blob storage and uploaded my CSV file.

  2. Created a linked service of type Azure blob storage called "CustomerProductInputService" that will talk to blob storage

  3. Created a linked service of type Azure SQL database called "CustomerProductOutputService" that will communicate with the "CustomerProduct" table.

  4. Created a dataset of type azure blob. This will receive the data from "CustomerProductInputService".

  5. Created a dataset of type azure SQL database.

Now I am stuck at copy activity. I am not understanding how to create pipeline for this scenario and insert data into the CustomerProduct table. As I explained I am getting "CustomerNumber" and "ErpNumber" in the CSV file but I want to insert "CustomerId" and "ProductId" into my "CustomerProduct" table.

Can anybody help me?

CodePudding user response:

You can Insert the CustomerProduct data from CSV to the table using dataflow activity using lookup transformations to get the CustomerID and ProductID from Customer and Product table respectively.

enter image description here

Source:

enter image description here

enter image description here

enter image description here

  1. Add 3 source transformations in dataflow, 1 for CSV source file, 1 for Customer table, and 1 for Product table.

a) Source1 (CSV): Create and CSV dataset to source1 to get Input file data.

enter image description here

b) Source2 (CustomerTable): Connect to Customer table and get all the existing data from the Customer table.

enter image description here

• As we only need ID and CustomerNumber columns from the Customer table, add select transformation (Customer) after source2 to select only the required column list.

enter image description here

enter image description here

c) Source3 (ProductTable): Connect Source3 to Product Table to pull all the existing data from dbo.Product.

enter image description here

enter image description here

• Add Select transformation (Product) after Source3 to get only the required columns ID & ERPNumber from the column list.

enter image description here

  1. Add Lookup transformation to Source1 (CSV) with Primary stream as CSV source and Lookup Stream as Customer (Source2 Select transformation) and Lookup Condition as CSV column “CustomerNumber” equals to (==) Customer table Column “CustomerNumber”.

enter image description here

enter image description here

  1. As Lookup is like Left join here, it includes all columns from Source1 and Lookup columns from Source2 in the select list (which include duplicate columns).

a) So, using select transformation (CustomerSelectList) to select only the required columns in the Output. Also renaming the Column name “ID” which is pulled from Customer table to CustomerID to match with Sink table.

enter image description here

enter image description here

  1. Add another Lookup transformation after Select (CustomerSelectList) to get the data from Product table.

a) Select Primary stream as CustomerSelectList (Select transformation) and Lookup stream as Product (Select of Source3)

enter image description here

b) With Lookup condition as CSV Source column “ErpNumber” equals to (==) Product table column “ErpNumber”.

enter image description here

enter image description here

  1. Again, using the select transformation to ignore other columns and select only required columns from the select list. Also renaming column “ID” from Product table to ProductID.

enter image description here

enter image description here

enter image description here

  1. Add Derived Column transformation to the select (CustomerProductSelectList) to add new columns ID and CreatedOn.

a) ID: as this is UNIQUEIDENTIFIER in the sink table, we can add an expression to generate the id using UUID() in the derived column.

b) CreatedOn: adding expression to get the **Current timestamp** to Sink table.

enter image description here

enter image description here

  1. Finally add Sink transformation to Insert data onto CustomerProduct table.

enter image description here

enter image description here

enter image description here

  1. Add this dataflow to a pipeline and run the pipeline to insert data.

enter image description here

Output:

enter image description here

CodePudding user response:

First thing you would need to identify a key connection between customer and product. Next, create a pipeline from data factory and create 2 sources "Product" and "Customer" apply ADF transformation Join and Alter and sink it to dbo.CustomerProduct.

  • Related