I have multiple data sources from which I need to build and implement a DWH in AWS. My challenge is with one unstructured data source (data coming from different APIs). How can I ingest data from this source into Amazon Redshift? Can we first pull it into an Amazon S3 bucket and then integrate S3 with Amazon Redshift? Or is there a better approach?
CodePudding user response:
Yes, S3 first. Your APIs can write to S3 directly, and/or you can use a service like Kinesis (with or without Firehose) to populate S3. From there, the rest of the work happens in Redshift.
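As a rough illustration of the "write to S3" step, here is a minimal sketch that turns a batch of API records into newline-delimited JSON and puts it under a date-partitioned key. The key layout, the function names, and the bucket and source arguments are all assumptions made for the example, not something from the question; boto3 is the AWS SDK for Python and would need to be installed and configured with credentials.

```python
import json
from datetime import datetime, timezone


def to_ndjson(records):
    """Serialize a list of dicts as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


def build_key(source, fetched_at):
    """Build a date-partitioned S3 key (hypothetical layout: raw/<source>/YYYY/MM/DD/HHMMSS.json)."""
    return f"raw/{source}/{fetched_at:%Y/%m/%d}/{fetched_at:%H%M%S}.json"


def upload(records, bucket, source):
    """Upload one batch of API records to S3 and return the key that was written."""
    # Imported here so the pure helpers above can be used without the AWS SDK installed.
    import boto3

    key = build_key(source, datetime.now(timezone.utc))
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=to_ndjson(records).encode("utf-8"),
    )
    return key
```

One JSON object per line keeps each file easy to load later, and a common prefix per source lets a single load pick up every file under it.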
CodePudding user response:
Without knowing more about the sources, yes, S3 is likely the right approach. Whether you need latency of seconds, minutes, or hours will be an important consideration.
If latency is not a driving concern, simply:
- Set up an S3 bucket to use as a destination for data from your initial source(s).
- Create tables in your Redshift database (loading data from S3 to Redshift requires a pre-existing destination table).
- Use the COPY command to load from S3 to Redshift.
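The last two steps can be sketched as follows. This only builds the SQL text; the table name, columns, bucket, prefix, and IAM role ARN are placeholders invented for the example, and it assumes the files in S3 are JSON objects whose keys match the column names (which is what `JSON 'auto'` expects).

```python
# Hypothetical destination table; the columns must match the JSON keys in the S3 files.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS api_events (
    id         BIGINT,
    source     VARCHAR(64),
    fetched_at TIMESTAMP
);
""".strip()


def build_copy_sql(table, bucket, prefix, iam_role_arn):
    """Build a Redshift COPY statement that loads every JSON file under an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM 's3://{bucket}/{prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS JSON 'auto';"
    )


if __name__ == "__main__":
    # Placeholder values only; substitute your own bucket, prefix, and role.
    print(CREATE_TABLE_SQL)
    print(build_copy_sql(
        "api_events",
        "my-dwh-landing-bucket",
        "raw/orders_api/",
        "arn:aws:iam::123456789012:role/RedshiftCopyRole",
    ))
```

Run the two statements in order through whatever SQL client you use against the cluster. The IAM role attached to the cluster needs read access to the bucket for the COPY to succeed.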
As noted, there may be value in Kinesis, especially if you're working with real-time data streams (the service recently introduced support for skipping S3 and streaming directly to Redshift).
S3 is probably the easier approach, if you're not trying to analyze real-time streams.