Let's start with the basics
The key to building performant applications is understanding how to model your data based on your unique access patterns. In MongoDB, you store data as JSON-like documents, which lets you model your data the same way you structure objects in your application code. This makes it intuitive to embed related fields in the same document when they’re frequently accessed together, or to reference data across collections when relationships are complex or the related data is large or will grow significantly over time. These two data modeling strategies are defined as follows:
Embed: Store related data in one document when it's often read together.
Reference: Store related data in separate collections when it grows large, would push a document past the 16MB size limit, or involves complex relationships.
Embedding is the recommended default: store data the same way you model objects in your code. Any duplication that results from embedding is intentional. It enables faster reads, and MongoDB gives you the flexibility to enforce consistency across the duplicated data. For operations that span multiple documents, ACID transactions are also available.
For example, if you often query movies by the number of awards they have won, embedding in a JavaScript object might look like:
const movie = {
  type: "movie",
  awards: {
    wins: 127,
    nominations: 63,
    text: "Won 11 Oscars. Another 116 wins & 63 nominations.",
  }
};
This can map directly to a MongoDB document, where awards is embedded as an object with nested fields:
{
  "_id": { "$oid": "68191862ad207d83dd04d916" },
  "type": "movie",
  "awards": {
    "wins": 127,
    "nominations": 63,
    "text": "Won 11 Oscars [...]"
  },
  ...
}
This allows you to write a simple and performant query that returns the intended results with no joins:
{"type": "movie", "awards.wins": {"$gt": 100}}
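To make the dot-notation semantics concrete, here is a minimal in-memory sketch in plain JavaScript of how a filter like the one above selects documents. The helpers `getPath` and `matches` are hypothetical, the sample data is invented, and only equality and `$gt` are handled. It illustrates the matching rule; it is not how MongoDB evaluates queries internally.

```javascript
// Hypothetical helper: resolve a dotted path such as "awards.wins" on a document.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value == null ? undefined : value[key]),
    doc
  );
}

// Hypothetical helper: supports only plain equality and the $gt operator.
function matches(doc, filter) {
  return Object.entries(filter).every(([path, condition]) => {
    const value = getPath(doc, path);
    if (condition !== null && typeof condition === "object" && "$gt" in condition) {
      return typeof value === "number" && value > condition.$gt;
    }
    return value === condition;
  });
}

// Sample data invented for illustration.
const sampleMovies = [
  { type: "movie", title: "A", awards: { wins: 127, nominations: 63 } },
  { type: "movie", title: "B", awards: { wins: 3, nominations: 10 } },
  { type: "series", title: "C", awards: { wins: 150, nominations: 20 } },
];

const filter = { type: "movie", "awards.wins": { $gt: 100 } };
const matched = sampleMovies.filter((doc) => matches(doc, filter));

// Only "A" is both a movie and has more than 100 wins.
console.log(matched.map((doc) => doc.title));
```

Because awards lives inside each movie document, both conditions are checked against a single document and no second collection is consulted.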
For data that can grow without bound, such as the comments on a movie, referencing is a better fit, and you can join the data at query time using MongoDB’s $lookup aggregation stage.
An example of referencing is in the comments collection. In this collection, movie_id points to a document in the movies collection:
{
  "_id": { "$oid": "681917e1ad207d83dd04d915" },
  "name": "Yolanda Owen",
  "email": "yolanda_owen@fakegmail.com",
  "movie_id": {
    "$oid": "573a1391f29313caabcd6d40"
  },
  "text": "Sample comment."
}
With movies stored separately, you can use $lookup to join these collections, matching each comment’s movie_id to the _id in movies. The following example pipeline, run against the movies collection, retrieves award-winning movies together with their comments:
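db.movies.aggregate([
  // Keep only movies with at least one award win
  { $match: { "awards.wins": { $gt: 0 } } },
  // Attach each movie's comments as an array field
  {
    $lookup: {
      from: "comments",
      localField: "_id",
      foreignField: "movie_id",
      as: "movie_comments"
    }
  }
])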
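The left outer join that $lookup performs can be sketched in memory with plain JavaScript. The `lookup` helper below is hypothetical and handles only a single-field equality match, the sample data is invented, and string ids stand in for ObjectIds. It is meant to show the shape of the result, not MongoDB's actual implementation.

```javascript
// In-memory illustration of a left outer join in the style of $lookup.
// Hypothetical helper; it handles only a single-field equality match.
function lookup(localDocs, foreignDocs, { localField, foreignField, as }) {
  return localDocs.map((doc) => ({
    ...doc,
    // Every local document is kept; unmatched ones get an empty array.
    [as]: foreignDocs.filter((other) => other[foreignField] === doc[localField]),
  }));
}

// Sample data invented for illustration; string ids stand in for ObjectIds.
const movies = [
  { _id: "m1", title: "A", awards: { wins: 2 } },
  { _id: "m2", title: "B", awards: { wins: 0 } },
  { _id: "m3", title: "C", awards: { wins: 5 } },
];
const comments = [
  { _id: "c1", movie_id: "m1", text: "Sample comment." },
  { _id: "c2", movie_id: "m1", text: "Another comment." },
  { _id: "c3", movie_id: "m2", text: "Comment on B." },
];

// Equivalent of the $match stage, followed by the $lookup stage.
const awardWinners = movies.filter((movie) => movie.awards.wins > 0);
const joined = lookup(awardWinners, comments, {
  localField: "_id",
  foreignField: "movie_id",
  as: "movie_comments",
});

// "A" carries two comments; "C" is kept with an empty movie_comments array.
console.log(joined.map((m) => [m.title, m.movie_comments.length]));
```

Note that movie "C" still appears in the output even though it has no comments, which is what distinguishes a left outer join from an inner join.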
To make $lookup operations efficient, ensure the fields involved in the join are indexed. MongoDB indexes _id automatically, so in this example the index to create is the one on movie_id in comments. An index helps MongoDB quickly locate matching values without scanning the entire collection, improving query and join performance.
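In mongosh, an index on the join field can be created with `db.comments.createIndex({ movie_id: 1 })`. The effect on lookups can be illustrated with a rough in-memory analogy, using a `Map` as a stand-in for the index. The `buildIndex` helper and the sample data are hypothetical, and real MongoDB indexes are B-tree structures rather than hash maps, so treat this as a sketch of the idea only.

```javascript
// Hypothetical in-memory stand-in for an index on comments.movie_id:
// a Map from each movie_id to the comments that reference it.
function buildIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    const key = doc[field];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc);
  }
  return index;
}

// Sample data invented for illustration.
const commentDocs = [
  { _id: "c1", movie_id: "m1" },
  { _id: "c2", movie_id: "m2" },
  { _id: "c3", movie_id: "m1" },
];

// Without an index: every lookup scans the whole collection.
const scanned = commentDocs.filter((c) => c.movie_id === "m1");

// With an index: build once, then each lookup is a single Map access.
const byMovieId = buildIndex(commentDocs, "movie_id");
const indexed = byMovieId.get("m1") || [];

// Both approaches find the same comments; only the work per lookup differs.
console.log(scanned.map((c) => c._id), indexed.map((c) => c._id));
```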
A strong understanding of data modeling in MongoDB is key to choosing the right approach for your application’s access patterns. It helps you build performant, cost-optimized, and highly scalable systems, streamline development workflows, and reduce the time needed to deploy new features and applications.
If you’re interested in learning more, build your data modeling skills with a free hands-on, self-paced module that guides you through modeling for workloads, designing relationships, and validating schemas in under 75 minutes. Upon completion of the skill check, you’ll earn a badge that you can share with your network!
In this FAQ, we’ll dive deeper into when and why to use embedding or referencing, how to enforce data consistency, model relationships by joining data, and more.
Frequently Asked Questions (FAQ)
Question | Answer |
When should I embed data in the same collection vs. store data in different collections with references? | How you store data should reflect how your application uses it. With JSON, it’s easy to structure documents around your access patterns — just like you do with objects in code. This alignment is critical for performance and development velocity. When designing schemas in MongoDB, you should gravitate towards embedding data that’s accessed together in the same document. This works especially well when one entity “contains” another, or when there is a one-to-many relationship.
Use references linking documents across collections only when the tradeoffs justify it. This might be the case when data duplication would not deliver meaningful read performance benefits, you are modeling complex many-to-many relationships, or the related entity is large, updated frequently, or often queried on its own.
|
How can I ensure data consistency? | Data consistency can always be maintained with MongoDB. There are different strategies to reflect updates across duplicated data:
The best approach depends on your application's tolerance for looser consistency and some level of stale data, how frequently updates happen, and whether strong referential integrity is needed. MongoDB also lets you define schema validation rules to enforce structure and consistency where needed.
|
How can I JOIN related data? | You can use the $lookup aggregation stage to join related data between collections. $lookup performs a left outer join between two collections, allowing you to include documents from a "joined" collection in your query results. Beyond using $lookup directly in individual queries, you can save complex aggregation pipelines as a MongoDB view, similar to a view in SQL.
Ensure that the fields involved in the join are indexed to help quickly locate the value without scanning the entire collection, improving performance. If your pipeline processes many documents, consider adding stricter $match filters before $lookup. Keep in mind that the power of the document model lies in storing data used together in one place to increase compute efficiency and simplicity. Only use $lookup when necessary.
|
Are there schema design patterns I should know about? | Yes, MongoDB supports several schema design patterns that can help optimize your data model based on your application’s needs. The most common patterns include:
Additional schema patterns include the approximation pattern, computed pattern, tree pattern, and more.
|
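The schema validation rules mentioned in the data consistency answer above could look roughly like the following `$jsonSchema` validator for the comments collection shown earlier. The choice of required fields and types is an assumption made for illustration, not something the sample dataset prescribes.

```javascript
// Hypothetical validator for the comments collection; the required fields
// and types are assumptions chosen to match the sample comment document.
const commentValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["name", "email", "movie_id", "text"],
    properties: {
      name: { bsonType: "string" },
      email: { bsonType: "string" },
      movie_id: { bsonType: "objectId" },
      text: { bsonType: "string" },
    },
  },
};

// In mongosh, this object would be passed when creating the collection:
// db.createCollection("comments", { validator: commentValidator });
console.log(JSON.stringify(commentValidator, null, 2));
```

Requiring movie_id to be an objectId guards against comments that carry a malformed reference, although it does not check that the referenced movie exists.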
