Add a primary key to a DynamoDB table using an AWS Glue job


I'm trying to create an AWS Glue job whose input is a table in a Redshift cluster and whose output table is in DynamoDB. I have mapped the columns, but the input table has no primary key, while DynamoDB requires one. For instance, the input table has the columns owning_buyer, owner_code, business_name, owner_group_id, and owner_group_status.

Is there a way to generate a random primary key value for the output table in the AWS Glue job? Or do I have to edit my input table in the Redshift cluster beforehand (perhaps by concatenating the input fields in the cluster, though I'm not sure that's an efficient solution)?

CodePudding user response:

In PySpark you can simply use pyspark.sql.functions.monotonically_increasing_id:

.withColumn("id", monotonically_increasing_id 123)

However, it's important that your DynamoDB primary key suits your read access patterns. If you use a random unique ID, it may be difficult to read your items back later, because DynamoDB can only fetch an item efficiently if you know its key. For that reason, I would suggest introducing a PK that matches your read access patterns, so you can query your data efficiently and avoid Scan, which reads the entire table.
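For example, if items will be looked up by owner group, a deterministic composite key built from existing columns can serve as the DynamoDB partition key. A minimal PySpark sketch, assuming owner_group_id plus owner_code uniquely identifies a row (the right column combination depends on your data and how you read it):

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws

spark = SparkSession.builder.getOrCreate()

# Sample rows standing in for the Redshift input (hypothetical values)
df = spark.createDataFrame(
    [("BUYER1", "OC-7", "Acme Ltd", "G-42", "ACTIVE")],
    ["owning_buyer", "owner_code", "business_name",
     "owner_group_id", "owner_group_status"],
)

# Build a deterministic primary key from columns you already query by,
# e.g. "G-42#OC-7"; DynamoDB can then fetch the item by this key directly
df = df.withColumn("pk", concat_ws("#", "owner_group_id", "owner_code"))

df.select("pk", "business_name").show(truncate=False)

A deterministic key also makes the job idempotent: re-running it overwrites the same items instead of inserting duplicates, which a randomly generated key would not give you.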
