I have a large amount of data in a collection in mongodb which I need to analyze, using pandas and pymongo in jupyter. I am trying to import specific data in a dataframe.
Sample data.
{
"stored": "2022-04-xx",
...
...
"completedQueues": [
"STATEMENT_FORWARDING_QUEUE",
"STATEMENT_PERSON_QUEUE",
"STATEMENT_QUERYBUILDERCACHE_QUEUE"
],
"activities": [
"https://example.com
],
"hash": "xxx",
"agents": [
"mailto:[email protected]"
],
"statement": { <=== I want to import the data from "statement"
"authority": {
"objectType": "Agent",
"name": "xxx",
"mbox": "mailto:[email protected]"
},
"stored": "2022-04-xxx",
"context": {
"platform": "Unknown",
"extensions": {
"http://example.com",
"xxx.com": {
"user_agent": "xxx"
},
"http://example.com": ""
}
},
"actor": {
"objectType": "xxx",
"name": "xxx",
"mbox": "mailto:[email protected]"
},
"timestamp": "2022-04-xxx",
"version": "1.0.0",
"id": "xxx",
"verb": {
"id": "http://example.com",
"display": {
"en-US": "viewed"
}
},
"object": {
"objectType": "xxx",
"id": "https://example.com",
"definition": {
"type": "http://example.com",
"name": {
"en-US": ""
},
"description": {
"en-US": "Viewed"
}
}
}
}, <=== up to here
"hasGeneratedId": true,
...
...
}
Notice that I am only interested in data nested under "statement", and not in any data containing the string, ie the "STATEMENT_FORWARDING_QUEUE" above it.
What I am trying to accomplish is import the data from "statement" (as indicated above) in a dataframe, and arrange them in a manner, like:
id | authority objectType | authority name | authority mbox | stored | context platform | context extensions | actor objectType | actor name | ... |
---|---|---|---|---|---|---|---|---|---|
00 | Agent | xxx | mailto | 2022- | Unknown | http://1 | xxx | xxx | ... |
01 | Agent | yyy | mailto | 2022- | Unknown | http://2 | yyy | yyy | ... |
The idea is to be able to access any data like "authority name" or "actor objectType".
I have tried:
df = pd.DataFrame(list(collection.find(query)(filters)))
df = json_normalize(list(collection.find(query)(filters)))
with various queries, filter and slices, and also aggregate and map/reduce, but nothing produces the correct output.
I would also like to sort (newest to oldest) based on the "stored" field (sort('$natural',-1) ?), and maybe apply limit(xx) to the dataframe as well.
Any ideas?
Thanks in advance.
CodePudding user response:
Try this
df = json_normalize(list(
collection.aggregate([
{
"$match": query
},
{
"$replaceRoot": {
"newRoot": "$statement"
}
}
])
)
CodePudding user response:
Thanks for the answer, @pavel. It is spot on and pretty much solves the problem.
I also added sorting and limit, so if anyone is interested, the final code looks like this:
df = json_normalize(list(
statements_coll.aggregate([
{
"$match": query
},
{
"$replaceRoot": {
"newRoot": "$statement"
}
},
{
"$sort": {
"stored": -1
}
},
{
"$limit": 10
}
])
))