MongoDB collection to pandas DataFrame

My MongoDB document structure is as follows; some of the factors are NaN.

_id: ObjectId("5feddb959297bb2625db1450")
factors: Array
  0: Object
    factorId: "C24"
    Index: 0
    weight: 1
  1: Object
    factorId: "C25"
    Index: 1
    weight: 1
  2: Object
    factorId: "C26"
    Index: 2
    weight: 1
name: "Growth Led Momentum"

I want to convert it to a pandas DataFrame like the following, using pymongo and pandas.

| name                | factorId | Index | weight |
---------------------------------------------------
| Growth Led Momentum | C24      | 0     | 1      |
---------------------------------------------------
| Growth Led Momentum | C25      | 1     | 1      |
---------------------------------------------------
| Growth Led Momentum | C26      | 2     | 1      |
---------------------------------------------------

Thank you

CodePudding user response:

You could use the aggregation pipeline to unwind factors and then project the fields you want.

Something like this should do the trick.

Database Structure

[
  {
    "_id": 1,
    "name": "Growth Lead Momentum",
    "factors": [
      {
        "factorId": "C24",
        "index": 0,
        "weight": 1
      },
      {
        "factorId": "D74",
        "index": 7,
        "weight": 9
      }
    ]
  }
]

Query

db.collection.aggregate([
  {
    $unwind: "$factors"
  },
  {
    $project: {
      _id: 1,
      name: 1,
      factorId: "$factors.factorId",
      index: "$factors.index",
      weight: "$factors.weight"
    }
  }
])

Results

(.csv friendly)

[
  {
    "_id": 1,
    "factorId": "C24",
    "index": 0,
    "name": "Growth Lead Momentum",
    "weight": 1
  },
  {
    "_id": 1,
    "factorId": "D74",
    "index": 7,
    "name": "Growth Lead Momentum",
    "weight": 9
  }
]
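
From Python, the same pipeline could be run through pymongo and the resulting cursor handed to pandas. This is only a minimal sketch: the function name `factors_frame` is made up for illustration, `collection` is assumed to be a pymongo `Collection` (anything with an `aggregate` method works), and the field names follow the question's document (`Index` capitalized), not the demo data above. Note that `$unwind` by default drops documents whose `factors` field is missing, null, or an empty array.

```python
import pandas as pd

# Same two stages as the shell query, written as Python dicts.
# Field names follow the question's documents ("Index" is capitalized there).
PIPELINE = [
    {"$unwind": "$factors"},
    {"$project": {
        "_id": 0,
        "name": 1,
        "factorId": "$factors.factorId",
        "Index": "$factors.Index",
        "weight": "$factors.weight",
    }},
]


def factors_frame(collection):
    """Run PIPELINE on `collection` and return one DataFrame row per factor."""
    rows = list(collection.aggregate(PIPELINE))
    # Passing `columns` fixes the column order shown in the question.
    return pd.DataFrame(rows, columns=["name", "factorId", "Index", "weight"])


# Usage with pymongo (connection string, database and collection names
# are placeholders, not taken from the question):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   df = factors_frame(client["mydb"]["mycollection"])
```

Doing the flattening in the aggregation keeps the work on the server and means pandas only ever sees flat rows.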

CodePudding user response：

Wonderful answer by Matt. In case you want to do the flattening in pandas instead:

Use this after you have retrieved the documents from the database into a list called data:

import pandas as pd
# data: list of documents, e.g. list(collection.find({}, {"_id": 0})); explode gives one row per factor
df = pd.json_normalize(data)
df = df['factors'].explode().apply(pd.Series).join(df).drop(columns=['factors'])

Output:

  factorId  Index  weight                 name
0      C24      0       1  Growth Led Momentum
0      C25      1       1  Growth Led Momentum
0      C26      2       1  Growth Led Momentum
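
An alternative sketch uses `pd.json_normalize` with `record_path` and `meta`, which flattens the array in a single call. The sample `data` below is copied from the question's document; in practice it would come from something like `list(collection.find({}, {"_id": 0}))`. The second document and the filter for a non-list `factors` field are assumptions about what "some of the factors are NaN" means in the question.

```python
import pandas as pd

data = [
    {
        "name": "Growth Led Momentum",
        "factors": [
            {"factorId": "C24", "Index": 0, "weight": 1},
            {"factorId": "C25", "Index": 1, "weight": 1},
            {"factorId": "C26", "Index": 2, "weight": 1},
        ],
    },
    # Hypothetical document whose factors field is NaN instead of an array.
    {"name": "No factors yet", "factors": float("nan")},
]

# record_path expects a list under "factors", so skip documents without one.
usable = [doc for doc in data if isinstance(doc.get("factors"), list)]

# One row per element of "factors"; "name" is repeated on every row.
df = pd.json_normalize(usable, record_path="factors", meta=["name"])

# Reorder the columns to match the layout asked for in the question.
df = df[["name", "factorId", "Index", "weight"]]
print(df)
```

Unlike the explode-based version, this yields a fresh 0..n-1 row index instead of repeating the parent document's index.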