In my scenario, every MongoDB collection has an ObjectId column. I read the collections with pymongo and convert them into pandas DataFrames.
When I try to write a DataFrame as Parquet using the AWS Data Wrangler (awswrangler) library or PyArrow, it fails with an error ending in:
"with type ObjectId: did not recognize Python value type when inferring an Arrow data type"
Is there a way to convert ObjectId values to strings dynamically, whenever a column's type is ObjectId?
myresult = collection.find(query)
df1 = pd.DataFrame(list(myresult))  # build the DataFrame from the cursor
wr.s3.to_parquet(df1, path="s3://abcd/parquet.parquet")
Sample MongoDB schema:
_id: ObjectId
id: string
createTimestamp: timestamp
updateTimestamp: timestamp
deleteTimestamp: timestamp
Desired Parquet schema:
_id: string
id: string
createTimestamp: timestamp
updateTimestamp: timestamp
deleteTimestamp: timestamp
CodePudding user response:
You can try converting the _id column to string before saving it to Parquet.
wr.s3.to_parquet(
    df1.astype({"_id": str}),
    path="s3://abcd/parquet.parquet",
)
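If you don't want to hard-code the column name, one possible approach is to scan the object-dtype columns and cast only those that actually hold ObjectId values. The sketch below is a minimal illustration, not a tested recipe for your data: `stringify_columns` and its `value_type` parameter are hypothetical names introduced here, and it assumes you pass in `ObjectId` imported from `bson` (which ships with pymongo).

```python
import pandas as pd


def stringify_columns(df, value_type):
    """Return a copy of df in which values of value_type are replaced by str(value).

    Hypothetical helper: pass bson.ObjectId as value_type for MongoDB data.
    """
    out = df.copy()
    for col in out.columns:
        # Arbitrary Python objects such as ObjectId only live in object-dtype columns.
        if out[col].dtype != object:
            continue
        non_null = out[col].dropna()
        if non_null.map(lambda v: isinstance(v, value_type)).any():
            # Convert matching values only, so missing values stay missing
            # instead of turning into the string "None".
            out[col] = out[col].map(
                lambda v: str(v) if isinstance(v, value_type) else v
            )
    return out
```

With that in place, the write would look something like `from bson import ObjectId` followed by `wr.s3.to_parquet(stringify_columns(df1, ObjectId), path="s3://abcd/parquet.parquet")`. Checking values rather than column names means nested reference columns that also hold ObjectId values get converted too, at the cost of one extra pass over each object column.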