MongoDB collection to pandas DataFrame

Time:10-13

My MongoDB document structure is as follows and some of the factors are NaN.

_id: ObjectId("5feddb959297bb2625db1450")
factors: Array
  0: Object
    factorId: "C24"
    Index: 0
    weight: 1
  1: Object
    factorId: "C25"
    Index: 1
    weight: 1
  2: Object
    factorId: "C26"
    Index: 2
    weight: 1
name: "Growth Led Momentum"

I want to convert it to a pandas DataFrame like the one below, using pymongo and pandas.

| name                | factorId | Index | weight |
----------------------------------------------------
| Growth Led Momentum | C24      | 0     | 1      |
| Growth Led Momentum | C25      | 1     | 1      |
| Growth Led Momentum | C26      | 2     | 1      |
----------------------------------------------------

Thank you

CodePudding user response:

You could use the aggregation pipeline to unwind factors and then project the fields you want.

Something like this should do the trick.


Database Structure

[
  {
    "_id": 1,
    "name": "Growth Lead Momentum",
    "factors": [
      {
        "factorId": "C24",
        "index": 0,
        "weight": 1
      },
      {
        "factorId": "D74",
        "index": 7,
        "weight": 9
      }
    ]
  }
]

Query

db.collection.aggregate([
  {
    $unwind: "$factors"
  },
  {
    $project: {
      _id: 1,
      name: 1,
      factorId: "$factors.factorId",
      index: "$factors.index",
      weight: "$factors.weight"
    }
  }
])

Results

(.csv friendly)

[
  {
    "_id": 1,
    "factorId": "C24",
    "index": 0,
    "name": "Growth Lead Momentum",
    "weight": 1
  },
  {
    "_id": 1,
    "factorId": "D74",
    "index": 7,
    "name": "Growth Lead Momentum",
    "weight": 9
  }
]
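
Since the question asks for pymongo and pandas, the same pipeline can be run from Python and handed straight to a DataFrame. The following is only a minimal sketch under a few assumptions: the function name `factors_to_frame` is made up for illustration, `collection` is whatever pymongo collection you already have, and the field paths use the lowercase `index` from the sample data above (the question's documents use `Index`, so adjust the path to match yours). It also differs from the shell query in two ways: `_id` is excluded because the desired table does not show it, and `preserveNullAndEmptyArrays` is switched on, on the assumption that documents without usable factors should still appear as a row.

```python
import pandas as pd

# Same stages as the shell query above, written as Python dicts.
PIPELINE = [
    # Keep documents whose "factors" is null, missing, or empty (assumption).
    {"$unwind": {"path": "$factors", "preserveNullAndEmptyArrays": True}},
    {"$project": {
        "_id": 0,
        "name": 1,
        "factorId": "$factors.factorId",
        "index": "$factors.index",
        "weight": "$factors.weight",
    }},
]


def factors_to_frame(collection):
    """Run PIPELINE on a pymongo-style collection and return a DataFrame.

    `collection` only needs an aggregate(pipeline) method that yields dicts,
    which is what a pymongo collection provides.
    """
    rows = list(collection.aggregate(PIPELINE))
    # Fix the column order; fields missing from a row become NaN.
    return pd.DataFrame(rows, columns=["name", "factorId", "index", "weight"])
```

With a real connection you would call it as `factors_to_frame(client["mydb"]["mycollection"])`, where `client` is a `pymongo.MongoClient` and the database and collection names are placeholders.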

CodePudding user response:

Wonderful answer by Matt. In case you would rather do the reshaping in pandas:

Use this after you have retrieved the documents from the database:

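import pandas as pd
df = pd.json_normalize(data)  # data: list of documents already fetched from MongoDB
# explode() yields one row per factor dict; apply(pd.Series) expands each dict into columns
df = df['factors'].explode().apply(pd.Series).join(df).drop(columns=['factors'])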

Output:

  factorId  Index  weight                 name
0      C24      0       1  Growth Led Momentum
0      C25      1       1  Growth Led Momentum
0      C26      2       1  Growth Led Momentum
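
Another option worth considering is to let `pd.json_normalize` do the unwinding itself through its `record_path` and `meta` arguments. The sketch below assumes `data` is a list of dicts shaped like the document in the question; the helper name `flatten_factors` and the second sample document are hypothetical. Because the question says some factors are NaN, it first drops any document whose `factors` is not a list, which is an assumption about how you want those rows treated.

```python
import pandas as pd

COLUMNS = ["name", "factorId", "Index", "weight"]


def flatten_factors(data):
    """Return one row per factor, with the parent document's name attached."""
    # Keep only documents whose "factors" is a real list (skips NaN/None/missing).
    usable = [doc for doc in data if isinstance(doc.get("factors"), list)]
    if not usable:
        return pd.DataFrame(columns=COLUMNS)
    # record_path walks into the list; meta copies parent fields onto each row.
    df = pd.json_normalize(usable, record_path="factors", meta=["name"])
    # Put the parent field first, as in the desired table.
    return df.reindex(columns=COLUMNS)


data = [
    {"name": "Growth Led Momentum", "factors": [
        {"factorId": "C24", "Index": 0, "weight": 1},
        {"factorId": "C25", "Index": 1, "weight": 1},
        {"factorId": "C26", "Index": 2, "weight": 1},
    ]},
    {"name": "No Factors Yet", "factors": float("nan")},  # made-up NaN case
]
print(flatten_factors(data))
```

This prints one row per factor with a fresh 0..n-1 row index, and the made-up NaN document contributes no rows.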