Home > OS >  What happens to a $match term in a pipeline?
What happens to a $match term in a pipeline?

Time:06-23

I'm a newbie to MongoDB and Python scripts. I'm confused how a $match term is handled in a pipeline.

Let's say I manage a library, where books are tracked as JSON files in a MongoDB. There is one JSON for each copy of a book. The book.JSON files look like this:

{
    "Title": "A Tale of Two Cities",
    "subData":
        {
            "status": "Checked In"
            ...more data here...
        }
}

Here, status will be one string from a finite set of strings, perhaps just: { "Checked In", "Checked Out", "Missing", etc. } But also note also that there may not be a status field at all:

{
    "Title": "Great Expectations",
    "subData":
        {
            ...more data here...
        }
}

Okay: I am trying to write a MongoDB pipeline within a Python script that does the following:

  • For each book in the library:
    • Groups and counts the different instances of the status field

So my target output from my Python script would be something like this:

{ "A Tale of Two Cities"   'Checked In'    3 }
{ "A Tale of Two Cities"   'Checked Out'   4 }
{ "Great Expectations"     'Checked In'    5 }
{ "Great Expectations"     ''    7 }

Here's my code:

mydatabase = client.JSON_DB
mycollection = mydatabase.JSON_all_2

listOfBooks = mycollection.distinct("bookname")
for book in listOfBooks:
    match_variable = {
        "$match": { 'Title': book }
    }
    group_variable = {
        "$group":{
            '_id': '$subdata.status',
            'categories' : { '$addToSet' : '$subdata.status' },
            'count': { '$sum': 1 }
        }
    }
    project_variable = {
        "$project": {
            '_id': 0,
            'categories' : 1,
            'count' : 1
        }
    }
    pipeline = [
        match_variable,
        group_variable,
        project_variable
    ]
    results = mycollection.aggregate(pipeline)
    for result in results:
        print(str(result['Title']) "  " str(result['categories']) "  " str(result['count']))

As you can probably tell, I have very little idea what I'm doing. When I run the code, I get an error because I'm trying to reference my $match term:

Traceback (most recent call last):
  File "testScript.py", line 34, in main
    print(str(result['Title']) "  " str(result['categories']) "  " str(result['count']))
KeyError: 'Title'

So a $match term is not included in the pipeline? Or am I not including it in the group_variable or project_variable ?

And on a general note, the above seems like a lot of code to do something relatively easy. Does anyone see a better way? Its easy to find simple examples online, but this is one step of complexity away from anything I can locate. Thank you.

CodePudding user response:

Here's one aggregation pipeline to "$group" all the books by "Title" and "subData.status".

db.collection.aggregate([
  {
    "$group": {
      "_id": {
        "Title": "$Title",
        "status": {"$ifNull": ["$subData.status", ""]}
      },
      "count": { "$count": {} }
    }
  },
  { // not really necessary, but puts output in predictable order
    "$sort": {
      "_id.Title": 1,
      "_id.status": 1
    }
  },
  {
    "$replaceWith": {
      "$mergeObjects": [
        "$_id",
        {"count": "$count"}
      ]
    }
  }
])

Example output for one of the "books":

  {
    "Title": "mumblecore",
    "count": 3,
    "status": ""
  },
  {
    "Title": "mumblecore",
    "count": 3,
    "status": "Checked In"
  },
  {
    "Title": "mumblecore",
    "count": 8,
    "status": "Checked Out"
  },
  {
    "Title": "mumblecore",
    "count": 6,
    "status": "Missing"
  }

Try it on mongoplayground.net.

  • Related