I'm a newbie to MongoDB and Python scripts. I'm confused how a $match
term is handled in a pipeline.
Let's say I manage a library, where books are tracked as JSON files in a MongoDB. There is one JSON for each copy of a book. The book.JSON files look like this:
{
"Title": "A Tale of Two Cities",
"subData":
{
"status": "Checked In"
...more data here...
}
}
Here, status
will be one string from a finite set of strings, perhaps just: { "Checked In", "Checked Out", "Missing", etc. } But also note also that there may not be a status
field at all:
{
"Title": "Great Expectations",
"subData":
{
...more data here...
}
}
Okay: I am trying to write a MongoDB pipeline within a Python script that does the following:
- For each book in the library:
- Groups and counts the different instances of the
status
field
- Groups and counts the different instances of the
So my target output from my Python script would be something like this:
{ "A Tale of Two Cities" 'Checked In' 3 }
{ "A Tale of Two Cities" 'Checked Out' 4 }
{ "Great Expectations" 'Checked In' 5 }
{ "Great Expectations" '' 7 }
Here's my code:
mydatabase = client.JSON_DB
mycollection = mydatabase.JSON_all_2
listOfBooks = mycollection.distinct("bookname")
for book in listOfBooks:
match_variable = {
"$match": { 'Title': book }
}
group_variable = {
"$group":{
'_id': '$subdata.status',
'categories' : { '$addToSet' : '$subdata.status' },
'count': { '$sum': 1 }
}
}
project_variable = {
"$project": {
'_id': 0,
'categories' : 1,
'count' : 1
}
}
pipeline = [
match_variable,
group_variable,
project_variable
]
results = mycollection.aggregate(pipeline)
for result in results:
print(str(result['Title']) " " str(result['categories']) " " str(result['count']))
As you can probably tell, I have very little idea what I'm doing. When I run the code, I get an error because I'm trying to reference my $match
term:
Traceback (most recent call last):
File "testScript.py", line 34, in main
print(str(result['Title']) " " str(result['categories']) " " str(result['count']))
KeyError: 'Title'
So a $match
term is not included in the pipeline? Or am I not including it in the group_variable
or project_variable
?
And on a general note, the above seems like a lot of code to do something relatively easy. Does anyone see a better way? Its easy to find simple examples online, but this is one step of complexity away from anything I can locate. Thank you.
CodePudding user response:
Here's one aggregation pipeline to "$group"
all the books by "Title"
and "subData.status"
.
db.collection.aggregate([
{
"$group": {
"_id": {
"Title": "$Title",
"status": {"$ifNull": ["$subData.status", ""]}
},
"count": { "$count": {} }
}
},
{ // not really necessary, but puts output in predictable order
"$sort": {
"_id.Title": 1,
"_id.status": 1
}
},
{
"$replaceWith": {
"$mergeObjects": [
"$_id",
{"count": "$count"}
]
}
}
])
Example output for one of the "books":
{
"Title": "mumblecore",
"count": 3,
"status": ""
},
{
"Title": "mumblecore",
"count": 3,
"status": "Checked In"
},
{
"Title": "mumblecore",
"count": 8,
"status": "Checked Out"
},
{
"Title": "mumblecore",
"count": 6,
"status": "Missing"
}
Try it on mongoplayground.net.