Home > Software engineering >  Filter records from Redshift based on records from S3
Filter records from Redshift based on records from S3

Time:12-03

I am new to AWS. I am working on designing a solution for the below use case.

I have a copy of a relational database in S3, there are several files in a bucket, each file represents a table and the columns are pipe delimited. The total size of the files would be around 2 GB. I also have several billions of records distributed across multiple tables in redshift.

I need to filter records from redshift(by joining several tables) based on the records available in S3 (by joining several files) and write it to a DB/S3.

Example - let's say S3 has Product file that has the product details and criteria file that has the criteria details, when criteria matches the product should be shown as a suggestion to the customer. Criteria and product table are linked using product id. Example, display ipad accessories (product) as a suggestion when the customer has already bought ipad (criteria)

Redshift has Customer table that has the list of customers and Product table that has the list of products that they bought. Both products and customers are linked using customerid.

For every product available in S3 based on the criteria, i need to find the list of customers to whom the product should be suggested based on the product that they have already bought.

Can someone please help with a high level idea of how to achieve this in AWS?

CodePudding user response:

Look into Redshift Spectrum which will allow you to define external tables that live in S3. An external table will allow you to pull only those records from S3 that meet your criteria and then join this info to your customer tables.

This should work well for Spectrum as your S3 data isn't very large and you are paring it down by your criteria before the result is passed to Redshift proper. Spectrum is most powerful when your S3 data is pared down with simple WHERE clauses and reduced by GROUP BY. This keeps the network traffic for the results small when passed back to the RS cluster.

  • Related