Visit The School

How MongoDB Aggregation Pipeline Operators Work

Alexander Paterson
|
Posted 8 months ago
|
14 minutes
How MongoDB Aggregation Pipeline Operators Work
They're confusing at first, but actually not that tricky

MongoDB pipeline operators are extremely powerful and getting your head around them really pays off, so here I'm going to show you exactly how a few of the most important operators actually work. By using the $unwind, $group, $project$sort, and $limit operators, we can form queries that extract exactly what we need from complex models.

We'll look at an example model where we have users, posts and comments. Posts are a subcollection of users, and comments are a subcollection of posts. Take a minute to read through the following setup code which defines this schema, generates some test data, and in the last few lines gives us a nice spot to experiment in.

// index.js

var mongoose = require('mongoose'),
    Schema = mongoose.Schema,
    uuid = require('uuid');

var postSchema = new Schema({
  text: {
    type: String,
    required: true
  },
  comments: [
    {
      time: {
        type: Date,
        required: true,
        default: Date.now()
      },
      text: {
        type: String,
        required: true
      }
    }
  ]
});

var userSchema = new Schema({
    username: {
      type: String,
      required: true,
      unique: true
    },
    posts: [
      postSchema
    ]
});

var User = mongoose.model('User', userSchema);

var users = [1,2,3,4,5].map(function(_) {
  return {
    username: uuid.v4(),
    posts: [1,2,3,4].map(function(_) {
      return {
        text: uuid.v4(),
        comments: [1,2,3,4,5,6,7,8,9].map(function(_) {
          return {
            time: Date.now() + Math.floor(Math.random() * 100),
            text: uuid.v4()
          }
        })
      }
    })
  }
});

function generateData(callback) {
  User.create(users, function(err) {
    if (err) { return console.log(err) }
    callback(err);
  });
}

mongoose.connect('mongodb://localhost:testmongo/testmongo', function(err) {
  // Reset database
  var db = mongoose.connection;
  db.collection('users').drop();

  generateData(function(err) {
    if (err) { return console.log(err) }

    // -------------------------------------------
    // Work here
    // -------------------------------------------

    // Exit application
    // mongoose.disconnect();
  });
});

Any following code snippets should be placed in the Work here section.

Basic Match Queries

The simplest aggregation we can perform is a simple find operation using the $match operator. Here, I'll find a user with the username of the first user in our seed data:

User.aggregate([
  {$match: {username: users[0].username}}
]).exec(function(err, users) {
  if (err) { return console.log(err) }
  console.log(users);
});

Executing this, you should see this user printed out:

[ { _id: 5812de03d214f698d805c9f8,
    username: 'a40dc799-2760-4128-9433-7c90a2aea919',
    posts: [ [Object], [Object], [Object], [Object] ],
    __v: 0 } ]

Easy enough. Now, what if we wanted to replace this array containing a single user with an array containing this user's posts? To achieve this we're going to have to use both $unwind and $project.

Unwinding

This is this tricky bit. Right now, we have one user object which has a subcollection of four posts. What we want is 4 post objects. To achieve this effect, we have to go through an intermediate step where we actually have four user objects with one post each. This is what $unwind does: it's going to create a copy of the parent object for each element in the array. Just watch what happens. 

We wish to unwind the posts property of the user object, so the syntax for this is: {$unwind: '$posts'}, not forgetting the second $. We add this to our aggregate pipeline:

User.aggregate([
  {$match: {username: users[0].username}},
  {$unwind: '$posts'}
]).exec(function(err, users) {
  if (err) { return console.log(err) }
  console.log(users);
});

The output will look like this (take a close look at it):

[ { _id: 5812deccc7c427992566d5d1,
    username: '668eabc9-d190-45b8-b01e-d45a6b2e821e',
    posts: 
     { text: '74040111-8c8e-4652-8403-ef40ae5107c3',
       _id: 5812deccc7c427992566d5f0,
       comments: [Object] },
    __v: 0 },
  { _id: 5812deccc7c427992566d5d1,
    username: '668eabc9-d190-45b8-b01e-d45a6b2e821e',
    posts: 
     { text: 'baaee1a4-46a6-48d4-b159-001b5f9c4944',
       _id: 5812deccc7c427992566d5e6,
       comments: [Object] },
    __v: 0 },
  { _id: 5812deccc7c427992566d5d1,
    username: '668eabc9-d190-45b8-b01e-d45a6b2e821e',
    posts: 
     { text: '3f72ca1c-fc28-4cae-9e6d-27bf1b61c040',
       _id: 5812deccc7c427992566d5dc,
       comments: [Object] },
    __v: 0 },
  { _id: 5812deccc7c427992566d5d1,
    username: '668eabc9-d190-45b8-b01e-d45a6b2e821e',
    posts: 
     { text: '3d70af54-a333-4369-9a23-ac5b6b0e1e1b',
       _id: 5812deccc7c427992566d5d2,
       comments: [Object] },
    __v: 0 } ]

The above is an array of user objects, but posts is no longer an array, it's a single object. This takes us half way to the outcome we desire, which is to have four post objects being displayed. All we need to do now is simply map these user objects to their posts attribute, and we're done. We can achieve this with $project.

Projecting

The projection operator simply allows us to add and remove fields from objects. We want to get rid of _id and username properties from the objects above, and map each user.posts.text field to user.text (and similarly for user.posts.comments and user.posts._id). Recapping again: we turned a single user object into multiple user objects with one post each using $unwind, and now we're turning these numerous user objects into the desired post objects using $project.

The object passed to a $project operator is of the form of the desired output, and can reference values from the input object by prepending a $ to property names, as such:

User.aggregate([
  {$match: {username: users[0].username}},
  {$unwind: '$posts'},
  {$project: {
    _id: '$posts._id',
    text: '$posts.text',
    comments: '$posts.comments',
  }},
]).exec(function(err, users) {
  if (err) { return console.log(err) }
  console.log(users);
});

And the output now looks like this:

[ { _id: 5812e312c9b61f9aba90ccf1,
    text: '5be64832-bd3d-4631-9b30-e2697acce36f',
    comments: 
     [ [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object] ] },
  //...
  { _id: 5812e312c9b61f9aba90ccd3,
    text: '68714da2-18fe-4022-b2bf-b61a4e261a87',
    comments: 
     [ [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object] ] } ]

This is an array of post objects, just like we wanted! Now, let's make sure those comments are in chronological order.

Sorting

This goes another level deeper. We're going to need to first unwind the comments, then sort them, then group them again (undoing the unwind).

So, first the unwind:

User.aggregate([
  {$match: {username: users[0].username}},
  {$unwind: '$posts'},
  {$project: {
    _id: '$posts._id',
    text: '$posts.text',
    comments: '$posts.comments',
  }},
  {$unwind: '$comments'},
]).exec(function(err, users) {
  if (err) { return console.log(err) }
  console.log(users);
});

Here's the output; it's a super long array of postsone for each comment, and with one comment each (though this property is still confusingly named comments):

[ { _id: 5812e40697788e9afe4a057a,
    text: '7af9488f-8f22-4aa3-a6ca-c7fbdfbf7762',
    comments: 
     { text: 'cfc1212c-b4cc-48ca-a789-32aa39a4ddcc',
       _id: 5812e40697788e9afe4a0583,
       time: 2016-10-28T05:37:10.228Z } },
  //...
  { _id: 5812e40697788e9afe4a055c,
    text: '85e73fd4-e9a3-42f4-aa9c-e99b695a9335',
    comments: 
     { text: '2e9f77c9-563d-4b78-9c88-b3e7e547c201',
       _id: 5812e40697788e9afe4a055d,
       time: 2016-10-28T05:37:10.287Z } } ]

Now, we apply the $sort operator, to which we'll specify the field we wish to sort on (this gives us newest last):

User.aggregate([
  {$match: {username: users[0].username}},
  {$unwind: '$posts'},
  {$project: {
    _id: '$posts._id',
    text: '$posts.text',
    comments: '$posts.comments',
  }},
  {$unwind: '$comments'},
  {$sort: {'comments.time': 1}},
]).exec(function(err, users) {
  if (err) { return console.log(err) }
  console.log(users);
});

The output will be like above, but in chronological order according to the comment time.

Now it's time to re-group all these comments back into their original post objects. Unsurprisingly, we use the $group operator. Basically, this operator's going to take in a bunch of documents, and we're going to tell it how to calculate a new _id for each document. If two documents have the same calculated new _id, then they're merged into one.

We also have to specify how to merge properties together; for example, when merging two posts, do we take the first one's posts.text property, or should we take whichever's longest? Obviously in this situation if two posts.text properties are being merged together, they'll be equal, so we can take any.

We're looking at a very simple example here, but here are some more complicated ones. What we're going to do is group any posts with matching _id properties by simply calculating the output _id from the input _id. We'll just take the first posts.text property, and we'll specify that posts.comments properties should be pushed ontop of eachother into an array:

User.aggregate([
  {$match: {username: users[0].username}},
  {$unwind: '$posts'},
  {$project: {
    _id: '$posts._id',
    text: '$posts.text',
    comments: '$posts.comments',
  }},
  {$unwind: '$comments'},
  {$sort: {'comments.time': 1}},
  {$group: {
    _id : "$_id",
    text: { $first: '$text' },
    comments: { $push: '$comments' },
   }}
]).exec(function(err, users) {
  if (err) { return console.log(err) }
  console.log(users);
});

And the output is as expected, an array of posts:

[ { _id: 5812e7a2a465aa9bc8dc7b87,
    text: '81492ebb-964c-47a6-8903-855c41bbb9fd',
    comments: 
     [ [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object] ] },
  //...
  { _id: 5812e7a2a465aa9bc8dc7b91,
    text: '5694f608-aef9-4a62-9956-ec1a1428e7f0',
    comments: 
     [ [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object],
       [Object] ] } ]

Nice!

While we're at it, let's take the first two posts with $limit

User.aggregate([
  {$match: {username: users[0].username}},
  {$unwind: '$posts'},
  {$project: {
    _id: '$posts._id',
    text: '$posts.text',
    comments: '$posts.comments',
  }},
  {$unwind: '$comments'},
  {$sort: {'comments.time': 1}},
  {$group: {
    _id : "$_id",
    text: { $first: '$text' },
    comments: { $push: '$comments' },
  }},
  {$limit: 2}
]).exec(function(err, users) {
  if (err) { return console.log(err) }
  console.log(users);
});

Summary

That's as far as I'm going to go. I hope overall this made sense; I found these a bit tricky at first, and the documentation didn't really help me get over that.

Subscribe to the newsletter.



ALEXANDER
PATERSON