Trying to open .bson file and read to pandas df but getting 'bson.errors.InvalidBSON: objsize t

Time:06-12

#This is my code

import pandas as pd
import bson

FILE="users_(1).bson"

with open(FILE,'rb') as f:
    data = bson.decode_all(f.read())

main_df=pd.DataFrame(data)
main_df.describe()

#This is my .bson file

[{'_id': ObjectId('999f24f260f653401b'),
    'isV2': False,
    'isBeingMigratedToV2': False,
    'firstName': 'Jezz',
    'lastName': 'Bezos',
    'subscription': {'_id': ObjectId('999f24f260f653401b'),
     'chargebeeId': 'AzZdd6T847kHQ',
     'currencyCode': 'EUR',
     'customerId': 'AzZdd6T847kHQ',
     'nextBillingAt': datetime.datetime(2022, 7, 7, 10, 14, 6),
     'numberOfMonthsPaid': 1,
     'planId': 'booster-v3-eur',
     'startedAt': datetime.datetime(2022, 6, 7, 10, 14, 6),
     'addons': [],
     'campaign': None,
     'maskedCardNumber': '************1234'},
    'email': '[email protected]',
    'groupName': None,
    'username': 'jeffbezy',
    'country': 'DE'},
   {'_id': ObjectId('999f242660f653401b'),
    'isV2': False,
    'isBeingMigratedToV2': False,
    'firstName': 'Caterina',
    'lastName': 'Fake',
    'subscription': {'_id': ObjectId('999f242660f653401b'),
     'chargebeeId': '16CGLYT846t99',
     'currencyCode': 'GBP',
     'customerId': '16CGLYT846t99',
     'nextBillingAt': datetime.datetime(2022, 7, 7, 10, 10, 41),
     'numberOfMonthsPaid': 1,
     'planId': 'personal-v3-gbp',
     'startedAt': datetime.datetime(2022, 6, 7, 10, 10, 41),
     'addons': [],
     'campaign': None,
     'maskedCardNumber': '************4311'},
    'email': '[email protected]',
    'groupName': None,
    'username': 'cfake',
    'country': 'GB'}]

I get the error

'bson.errors.InvalidBSON: objsize too large'

Is it something to do with the datetime values, or with the structure of the .bson file? I have been at this for hours and can't find the cause. I know how to work with JSON and tried converting the file to JSON, but without success. Any tips would be appreciated.

CodePudding user response:

If the main goal is to read the data into a pandas DataFrame, you could convert the text to valid JSON and parse it with bson.json_util.loads. Note that the ObjectId and datetime values end up as plain strings:

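import pandas as pd
from bson.json_util import loads

filepath = "users_(1).bson"  # same file as FILE in the question

# Read the file as text, not as binary BSON
with open(filepath, 'r') as f:
    data = f.read()

# Literal replacements that turn the Python-style dump into valid JSON.
# They are applied in insertion order, and the order matters.
mapper = {
    "'": '"',            # JSON requires double quotes
    'False': 'false',
    'True': 'true',      # not present in the sample, added for completeness
    'None': '"None"',    # None becomes the string "None"
    # wrap ObjectId(...) and datetime.datetime(...) in quotes so they load as strings
    '("': '(',
    '")': ')',
    ')': ')"',
    'ObjectId': '"ObjectId',
    'datetime.datetime': '"datetime.datetime',
}
for k, v in mapper.items():
    data = data.replace(k, v)

data = loads(data)
df = pd.DataFrame(data)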
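
As for why bson.decode_all fails in the first place: a binary BSON document starts with a little-endian int32 giving its total length. If the file really contains the text shown in the question (an assumption, since the question only shows the printed form), the decoder reads the first four characters as that length and gets a number far larger than the file, which is consistent with the "objsize too large" error. A minimal sketch of that check using only the standard library follows; the helper names declared_bson_size and looks_like_bson are made up for illustration and are not part of the bson package:

```python
import struct


def declared_bson_size(first_bytes: bytes) -> int:
    """Return the document length a BSON decoder would read from the header."""
    if len(first_bytes) < 4:
        raise ValueError("need at least 4 bytes")
    # A BSON document starts with a little-endian signed int32 total length.
    (size,) = struct.unpack("<i", first_bytes[:4])
    return size


def looks_like_bson(raw: bytes) -> bool:
    """Cheap plausibility check, not a full validation."""
    if len(raw) < 5:
        return False
    size = declared_bson_size(raw)
    # The declared size must fit in the data and the document must end with NUL.
    return 5 <= size <= len(raw) and raw[size - 1] == 0


# A text dump like the one in the question (shortened)
sample = b"[{'_id': ObjectId('999f24f260f653401b'), 'isV2': False}]"
print(declared_bson_size(sample), "declared vs", len(sample), "actual bytes")
print(looks_like_bson(sample))

# The smallest valid BSON document: length 5 plus the terminating NUL
empty_doc = b"\x05\x00\x00\x00\x00"
print(looks_like_bson(empty_doc))
```

To apply this to the real file, pass the result of open(FILE, 'rb').read() to looks_like_bson. If it returns False, the file is probably a text dump rather than binary BSON, and a text-based approach such as the one above is the way to go.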