How to get mongodb nested documents in Python Pandas dataframe table format

Time:10-28

   name     age  address
1  "Steve"  27   {"number": 4, "street": "Main Road", "city": "Oxford"}
2  "Adam"   32   {"number": 78, "street": "High St", "city": "Cambridge"}

However, the subdocuments just appear as JSON inside the address cell:

from pandas import DataFrame

df = DataFrame(list(db.collection_name.find({})))
print(df)

How can I get a second table like the one below using Python? What is the approach after this step?

   name   age  address.number  address.street  address.city
1  Steve  27   4               "Main Road"     "Oxford"
2  Adam   32   78              "High St"       "Cambridge"

CodePudding user response:

You can use pd.DataFrame to expand the dicts in the address column into a dataframe of their own, then join the result back to the other columns using .join(), as follows:

Optional step: If the values in the address column are actually strings, convert them to proper dicts first. Otherwise, skip this step.

import ast
df['address'] = df['address'].map(ast.literal_eval)
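
A minimal sketch of the optional step, assuming the address values arrived as strings. The sample data below is made up for illustration. Note that ast.literal_eval parses Python literals; if the strings are strict JSON (for example, containing true, false or null), json.loads is the usual choice instead.

```python
import ast
import json

import pandas as pd

# Hypothetical sample: the address column holds strings rather than dicts
df_str = pd.DataFrame({
    'name': ['Steve', 'Adam'],
    'address': ['{"number": 4, "street": "Main Road", "city": "Oxford"}',
                '{"number": 78, "street": "High St", "city": "Cambridge"}'],
})

# Python-literal strings -> dicts
parsed_literal = df_str['address'].map(ast.literal_eval)

# Strict JSON strings -> dicts (also accepts true/false/null)
parsed_json = df_str['address'].map(json.loads)

print(type(parsed_literal.iloc[0]))
```

For these sample strings both parsers should give the same dicts, so either result can be assigned back to the address column.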

Main code:

import pandas as pd

df[['name', 'age']].join(pd.DataFrame(df['address'].tolist(), index=df.index).add_prefix('address.'))

Result:

    name  age  address.number address.street address.city
1  Steve   27               4      Main Road       Oxford
2   Adam   32              78        High St    Cambridge

Alternatively, if you only need a few fields from the JSON/dict, you can add them one by one using the string accessor str[], as follows:

df['address.number'] = df['address'].str['number']
df['address.street'] = df['address'].str['street']
df['address.city'] = df['address'].str['city']
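
MongoDB documents do not always share the same fields, so it may help to see how str[] behaves when a key is absent. The sketch below uses a made-up frame (df_gap is a hypothetical name) in which one address has no city; as far as I know, the lookup then yields a missing value rather than raising.

```python
import pandas as pd

# Hypothetical sample: the second document has no 'city' in its address
df_gap = pd.DataFrame({
    'name': ['Steve', 'Adam'],
    'address': [{'number': 4, 'street': 'Main Road', 'city': 'Oxford'},
                {'number': 78, 'street': 'High St'}],
})

# str[] looks the key up in each dict; an absent key gives a missing value
df_gap['address.city'] = df_gap['address'].str['city']
print(df_gap[['name', 'address.city']])
```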

Setup:

import pandas as pd

data = {'name': {1: 'Steve', 2: 'Adam'},
        'age': {1: 27, 2: 32},
        'address': {1: {"number": 4, "street": "Main Road", "city": "Oxford"},
                    2: {"number": 78, "street": "High St", "city": "Cambridge"}}}
df = pd.DataFrame(data)
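
A different technique from the one in this answer, but possibly worth knowing: pd.json_normalize flattens nested dicts straight from a list of documents, without building the intermediate dataframe first. The sketch below uses hand-written documents shaped like those in the question, with the _id field left out.

```python
import pandas as pd

# Documents shaped like the ones in the question ('_id' omitted here)
docs = [
    {'name': 'Steve', 'age': 27,
     'address': {'number': 4, 'street': 'Main Road', 'city': 'Oxford'}},
    {'name': 'Adam', 'age': 32,
     'address': {'number': 78, 'street': 'High St', 'city': 'Cambridge'}},
]

# Nested keys become columns named parent<sep>child
flat = pd.json_normalize(docs, sep='.')
print(flat.columns.tolist())
```

With pymongo, the list would presumably come from something like list(db.collection_name.find({}, {'_id': 0})).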

CodePudding user response:

Depending on the use case, it may make more sense to set up an aggregation pipeline and $project the necessary nested fields up to the top level:

df = pd.DataFrame(db.collection_name.aggregate([{
    '$project': {
        '_id': 0,
        'name': '$name',
        'age': '$age',
        # Raise sub-document fields to the top level under new names
        'address_number': '$address.number',
        'address_street': '$address.street',
        'address_city': '$address.city'
    }
}]))

df:

    name  age  address_number address_street address_city
0  Steve   27               4      Main Road       Oxford
1   Adam   32              78        High St    Cambridge
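
If the list of nested fields is known but long, the $project stage could be generated rather than typed out. This is only a sketch: build_projection is a hypothetical helper, not part of pymongo, and it uses 1 to include the top-level fields unchanged instead of the '$name' form shown above.

```python
def build_projection(top_fields, parent, sub_fields, sep='_'):
    """Hypothetical helper: build a $project stage that lifts parent.sub fields."""
    projection = {'_id': 0}
    # Keep the listed top-level fields as they are
    for field in top_fields:
        projection[field] = 1
    # Lift each nested field to the top level as parent<sep>sub
    for sub in sub_fields:
        projection[f'{parent}{sep}{sub}'] = f'${parent}.{sub}'
    return {'$project': projection}


stage = build_projection(['name', 'age'], 'address', ['number', 'street', 'city'])
print(stage)
```

The resulting stage can then be passed in a list, for example db.collection_name.aggregate([stage]).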

Or, if there are too many fields to list manually, we could also use $replaceRoot and $mergeObjects:

df = pd.DataFrame(db.collection_name.aggregate([
    {'$replaceRoot': {'newRoot': {'$mergeObjects': ["$$ROOT", "$address"]}}},
    {'$project': {'_id': 0, 'address': 0}}
]))

df:

    name  age  number     street       city
0  Steve   27       4  Main Road     Oxford
1   Adam   32      78    High St  Cambridge
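
One thing to keep in mind with this pipeline: $mergeObjects lets the last document win when field names clash, so a key inside address that shares a name with a top-level field would overwrite it. The plain-Python function below is only a rough per-document analogue for illustration (merge_root_with_address is a hypothetical name, and no database is involved).

```python
def merge_root_with_address(doc):
    """Rough analogue of the $replaceRoot / $mergeObjects / $project pipeline."""
    # Later keys win on a clash, as with $mergeObjects
    merged = {**doc, **(doc.get('address') or {})}
    # Mirror the final $project that drops '_id' and 'address'
    merged.pop('_id', None)
    merged.pop('address', None)
    return merged


doc = {'_id': 1, 'name': 'Steve', 'age': 27,
       'address': {'number': 4, 'street': 'Main Road', 'city': 'Oxford'}}
print(merge_root_with_address(doc))

# Hypothetical clash: the address also carries a 'name' key
clash = {'name': 'Steve', 'address': {'name': 'Rose Cottage', 'city': 'Oxford'}}
print(merge_root_with_address(clash))
```

In the second call the address name replaces the person's name, which is why the explicit $project version may be safer when field names can overlap.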

collection_name setup:

# Drop Collection if exists
db.collection_name.drop()
# Insert Sample Documents
db.collection_name.insert_many([{
    'name': 'Steve', 'age': 27,
    'address': {"number": 4, "street": "Main Road", "city": "Oxford"}
}, {
    'name': 'Adam', 'age': 32,
    'address': {"number": 78, "street": "High St", "city": "Cambridge"}
}])