MongoDB Guide for Beginners


MongoDB is the most popular NoSQL database solution. Make no mistake, SQL still dominates production web projects, but interest in MongoDB is growing.

Be it because of fashion or due to a real technical difficulty faced by web developers, MongoDB now has many big customers and ambassadors.

In this MongoDB guide for beginners, I will try to show you the essentials of this database technology: the need it meets, how MongoDB works, and how to use it in your projects.

The origin of MongoDB

The original project

It was in the fall of 2007 that Kevin Ryan, Dwight Merriman and Eliot Horowitz founded the company 10gen, with the aim of offering a Platform as a Service product, similar to Heroku, AWS Elastic Beanstalk or Google App Engine, but based on open source components.

Their experience on web projects such as DoubleClick and ShopWiki had taught them that an application that becomes popular will run into scalability issues at the database level. When they searched for a database to integrate into their PaaS product, no open source solution met their needs for scalability and compatibility with a cloud architecture.

That’s why the 10gen team developed a new document-oriented NoSQL database technology in-house. They named it MongoDB, after the word “humongous”, meaning gigantic, like the data it is supposed to host.

The birth of MongoDB

The PaaS product developed by 10gen, named ed, never really found a buyer, so the founders decided to extract the database technology from it.

In February 2009, MongoDB became open source and the number of users grew rapidly within the Google group that brought together the community of early adopters. On March 1, 2009, only 9 threads existed in this group; three months later, the number had risen to 613.

MongoDB was built for speed. Data is stored as BSON documents, short for binary JSON. The BSON format allows MongoDB to find data in documents faster.

To make its queries even more efficient, MongoDB encourages denormalizing the data in its documents. Where good practice in SQL is to have specific tables and foreign keys that refer to data during joins, MongoDB encourages duplicating the data wherever it is read.

Although MongoDB offers reference mechanisms, they must be used wisely in order to benefit from the performance provided by the MongoDB database.

MongoDB was designed for the age of cloud and distributed infrastructure. To ensure stability, one of the key concepts of MongoDB is to always have more than one copy of the database available in order to ensure availability even in the event of host machine failure. This ability to replicate the database on several machines in several places easily improves the horizontal scalability of the database.

MongoDB was designed for flexibility. Unlike SQL databases, data in a MongoDB collection can be completely heterogeneous. This is known as being schemaless. Without a strict data structure, you can change the shape of your data quickly.

This flexibility is much appreciated in projects at the prototype stage that are still discovering how their data should be structured. However, going schemaless has its drawbacks. It becomes more difficult to analyze the data if the documents do not all follow the same structure. This is why it is also possible to impose a schema on a collection.
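
As an illustration of imposing a schema, here is a minimal sketch that relies on MongoDB's $jsonSchema validator. The collection name customers and the fields lastName, email and fax are assumptions made up for this example, and db is assumed to be a database handle from the shell or the Node.js driver.

```javascript
// Hypothetical validator: every customer needs lastName and email,
// while "fax" stays optional, so documents can still differ from each other.
const customerValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["lastName", "email"],
    properties: {
      lastName: { bsonType: "string" },
      email: { bsonType: "string" },
      fax: { bsonType: "string" }
    }
  }
};

// `db` is assumed to be a shell or Node.js driver database handle.
function createCustomersCollection(db) {
  return db.createCollection("customers", { validator: customerValidator });
}
```

With such a validator in place, an insert that lacks an email would be rejected, while extra fields remain allowed.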

MongoDB products

MongoDB

MongoDB’s core business is its document-oriented NoSQL database technology. It is now used in more than 100 million projects.

MongoDB Atlas

MongoDB Atlas is a Database as a Service (DBaaS). Atlas helps you to deploy a managed MongoDB server on Amazon Web Services, Google Cloud Platform or Microsoft Azure cloud, in the region of your choice.

You choose the size of your cluster, and your database is managed by the MongoDB engineering team.

Atlas even offers a free plan for 500MB, ideal for your personal projects or for experimenting.

MongoDB Stitch

MongoDB Stitch is a serverless platform that allows you to build an application directly from MongoDB Atlas.

MongoDB Compass

MongoDB Compass is the GUI client developed by MongoDB. This comprehensive tool lets you browse and modify your data and run queries or aggregations on your local or cloud database directly from its graphical interface. The work of MongoDB's designers provides a pleasant user experience and interface for manipulating your data.

MongoDB Realm

Realm is a lightweight database embedded in the mobile client.

In the case of a mobile application, Realm allows you to store part of the data directly on the device and to coordinate synchronizations with the main database according to different events. This is often used to avoid network requests and allow better offline use of the application.

Charts

Charts lets you create graphs to visualize your data directly from MongoDB. You can create several types of charts based on the data in your Atlas cluster and embed them in the HTML of your site. It also lets you exploit your data quickly without having to develop a specific frontend interface for this need.

Cloud Manager

Cloud Manager is a comprehensive performance monitoring and optimization tool for your cluster on MongoDB Atlas. You have access to a dozen indicators across your database in order to analyze performance and understand the queries made by your application. An alert system can be configured to notify you of emergencies and connects natively to Slack, DataDog or PagerDuty.

Cloud Manager can also identify slow queries and suggest indexes to add in order to improve your database performance.

Atlas Search

Atlas Search is one of the newest members of the MongoDB Cloud family. It aims to compete with Algolia and Elasticsearch in the field of search engines. It allows you to index your data differently in order to have a finer and more intelligent search function than a simple query with filters.

The MongoDB community

Present since the earliest days of MongoDB, the community now lives on a very active forum. You will be able to find helpful resources, ask questions and discover best practices.

This large community is an asset for getting unstuck during development and for finding solutions to thorny problems thanks to collective intelligence.

The technical advantages of MongoDB

Reading speed

Duplicate data

For years, the best practice in SQL has been data normalization. To keep the data as reliable as possible, you had to avoid duplication and refer to another table containing the definitions. For example, in a customer management tool, there would be a customer table and a city table. The customer table would use a foreign key to refer to a city rather than storing the city name directly.

Customer table:

id | Client    | city_id
1  | François  | 1
2  | Marcel    | 2
3  | Guillaume | 1

City table:

id | city
1  | Paris
2  | Lille
This paradigm avoided data duplication and input errors, such as a typo or a difference in letter case that would skew analysis. It was also a response to hardware constraints from an era when infrastructure was on-premises and storage cost more than it does today.

MongoDB has made the opposite choice of encouraging data duplication. Indeed, in SQL as in Mongo, the slowest queries are those that involve references to other tables or other collections. Today, with the cost of storage being much lower and the distributed infrastructure being well supported by MongoDB, there are far fewer problems with duplicating data.

Although it is possible to refer to other documents in MongoDB, you are encouraged to write as much data as possible within a single document. Since a document is limited to 16MB, references become useful when a document would grow too large to include everything.

Let’s take the example of a blog using a MongoDB database. Where we would be tempted to have separate documents for Posts and Comments, Mongo encourages us to include comments in the post document. There is no use case where we would need to load comments without posts so it is entirely possible to embed comments in a blog post document.

This nesting allows MongoDB to improve its reading performance by reading just one document rather than having to iterate through several documents in order to put together an object which will be delivered to the server.

Diversity of index types

At the heart of its reading speed lies the diversity of the indexes offered. Using and combining these indexes allows your queries to scan a subset of the data instead of searching the entire collection. MongoDB can index a document on several fields, index a field containing an array of elements, index GPS coordinates, and index a block of text in order to search its content.

As with any database, multiplying indexes increases database size and slows down writes. This is why an indexing strategy is key: the goal is to have as few indexes as possible while covering as many of the application's queries as possible.
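
To make these index types concrete, here is a minimal sketch of how they could be declared with createIndex. The field names (author, createdAt, tags, location, body) belong to a hypothetical articles collection invented for this example, and collection is assumed to be a collection handle from the shell or the Node.js driver.

```javascript
// Hypothetical index definitions; all field names are made up for illustration.
const indexSpecs = [
  { author: 1, createdAt: -1 }, // compound index on two fields
  { tags: 1 },                  // becomes a multikey index when `tags` holds an array
  { location: "2dsphere" },     // geospatial index on GeoJSON coordinates
  { body: "text" }              // text index to search inside the body
];

// Creates each index in turn and returns whatever createIndex returns
// (promises with the Node.js driver).
function createAllIndexes(collection) {
  return indexSpecs.map((keys) => collection.createIndex(keys));
}
```

Each entry is one more structure to maintain on every write, which is why the list should stay as short as the application's queries allow.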

Write Performance

As we have seen above, a MongoDB database cluster is replicated several times, with a primary database and replicas considered secondary. The more a database is replicated and/or sharded (the notion of sharding is explained below), the more MongoDB has to write data to different places. MongoDB can still offer good write performance thanks to the notion of write concern.

Write concern is the notion of write confirmation. Historically, the default write concern is 1 (since MongoDB 5.0, the default is majority). With a write concern of 1, MongoDB writes the data on the primary database first and returns its confirmation to you. It then coordinates the replication behind the scenes.

Depending on your needs, you could increase the write concern if you want to ensure that the data has been replicated to the secondary databases of your cluster, or on the contrary reduce it to zero to deactivate this write acknowledgment in the database and increase the performance.
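
The trade-off can be sketched as follows. This is an illustrative example, not a recommendation: the level names (fireAndForget, primaryOnly, majority) are invented for this sketch, and collection is assumed to be a collection handle from the shell or the Node.js driver, both of which accept a writeConcern option on insertOne.

```javascript
// Three acknowledgment levels, from fastest to safest.
const writeConcerns = {
  fireAndForget: { w: 0 },     // no acknowledgment requested
  primaryOnly: { w: 1 },       // acknowledged once the primary has written
  majority: { w: "majority" }  // acknowledged once most members have written
};

// Inserts a document with the chosen acknowledgment level.
function insertWithConcern(collection, doc, level) {
  return collection.insertOne(doc, { writeConcern: writeConcerns[level] });
}
```

For example, insertWithConcern(orders, order, "majority") would wait for replication to most members, whereas "fireAndForget" returns without waiting for any confirmation.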

Scalability

In computing as in life, the only certainty is that something will die.

Based on this observation and its cloud-first positioning, MongoDB replicates your database across several mongod instances: one primary and several secondaries. If the (virtual) machine hosting the primary were to fail, your data would quickly become accessible again thanks to the secondaries. A new primary would be elected among them, the faulty VM would be removed and a new secondary would be created.

The easiest way to scale a database is vertical scaling, which involves making the machine hosting the database more powerful by adding more RAM, disk space, and CPU power. This method is relatively expensive because the more powerful or higher-capacity the components, the more they cost.

The other method is to scale horizontally, adding more machines to host the same database. This method is more complicated because SQL databases often lose performance during queries that involve several machines.

MongoDB offers the notion of sharding. This method distributes the data across several machines in order to optimize the performance of incoming requests. The data is split into different fragments (shards) of the database, and thanks to a shard key, queries can efficiently retrieve data directly from the fragment that contains it.
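
As a rough sketch, enabling sharding from the shell could look like the following. The blog database, the comments collection and the post_id shard key are hypothetical, and sh is assumed to be the sharding helper that the shell exposes when it is connected to the router of a sharded cluster.

```javascript
// Intended to run from the shell, connected to the router of a sharded cluster.
// The "blog" database, "comments" collection and `post_id` key are hypothetical.
function shardComments(sh) {
  sh.enableSharding("blog");
  // A hashed shard key spreads documents evenly across the shards.
  return sh.shardCollection("blog.comments", { post_id: "hashed" });
}
```

With this key, a query that filters on an exact post_id can be routed to the single shard that holds the matching documents instead of being broadcast to all of them.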

Flexibility

Unlike SQL databases, a MongoDB database does not necessarily need to have a fixed data structure for all the objects present in a collection.

Indeed, in SQL, if you want to add a property to your Customers, you have to add a column to the table and leave it NULL for all customers that do not have this field filled in.

With MongoDB, you can do Schemaless design. A client may have the “FAX” field because he is one of the last specimens on earth to own such a machine without all clients having to have this property.

The fact of proposing Schemaless allows rapid prototyping. This is a great feature for projects that are still figuring out how their data is going to be structured.

On the other hand, a schemaless design presents certain limitations when it comes to aggregating data for analysis.

Must-Have Features

Aggregations

The purpose of any database is to store and order data in order to be able to exploit it. To be able to exploit this data, MongoDB offers aggregation operators.

Aggregations are series of manipulation operations on data in order to produce a specific document. They make it possible to extract certain data from objects, to group them, to assemble them in a new format in order to be able to recover the structured data as desired.

A text index allows users to search for text inside a document, for example to search for a keyword in a series of blog posts.
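
A minimal sketch of such a keyword search, assuming a hypothetical posts collection whose documents have title and body fields; posts is assumed to be a collection handle from the shell or the Node.js driver.

```javascript
// One text index covering both fields (a collection can only have one text index).
const textIndexKeys = { title: "text", body: "text" };

// Builds the filter for a keyword search through the text index.
function searchFilter(keyword) {
  return { $text: { $search: keyword } };
}

// Creates the index, then returns the cursor of matching posts.
function searchPosts(posts, keyword) {
  posts.createIndex(textIndexKeys);
  return posts.find(searchFilter(keyword));
}
```

In a real application the index would be created once, at deployment time, rather than before every search.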

GPS coordinates

MongoDB also offers Geo-spatial indexes that allow you to define a point via GPS coordinates, or an area, from a central point and a radius or from several points to define a specific area.

This feature is useful for storing locations and calculating the distance between multiple points.
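
Here is a small sketch of a proximity query, assuming a hypothetical places collection whose documents store a GeoJSON point in a location field. The $near operator used here needs a geospatial index on that field.

```javascript
// GeoJSON stores coordinates as [longitude, latitude].
const eiffelTower = { type: "Point", coordinates: [2.2945, 48.8584] };

// Filter for documents whose `location` lies within `meters` of `point`.
function nearFilter(point, meters) {
  return { location: { $near: { $geometry: point, $maxDistance: meters } } };
}

// `places` is assumed to be a shell or Node.js driver collection handle.
function findNearby(places, point, meters) {
  places.createIndex({ location: "2dsphere" }); // required by $near
  return places.find(nearFilter(point, meters));
}
```

Calling findNearby(places, eiffelTower, 1000) would return the places within one kilometer of the point, nearest first.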

Transactions

MongoDB still has the reputation of not supporting transactions, and of being a poor choice compared with an SQL database if they are essential to your application.

What is a transaction?
When an application needs several database operations to all succeed or all fail together in order to complete a task, that group of operations is called a transaction.

For example, when a bank transfer takes place, it is imperative that the money be deducted from account A AND credited to account B. If the write operation failed in the middle, account A would be debited, account B would not be credited, and the transferred amount would vanish. During a transaction, if the operation does not finish completely, the database performs a rollback before reporting the error.

MongoDB hasn’t always been good at doing transactions.
Remember, MongoDB is designed to be distributed across multiple machines. In the example of the banking transaction, one can imagine that the accounts are located on different machines, in different regions of the world. It is therefore all the more complicated to perform this type of operation and to guarantee its atomicity.

MongoDB introduced multi-document transactions on replica sets with version 4.0 and extended them to sharded collections with version 4.2.
It is now possible to perform transactions on multiple documents, even if the collection is sharded on multiple machines.
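
The bank transfer described above could be sketched like this with the Node.js driver. The bank database, the accounts collection and the balance field are assumptions for the example, client is assumed to be an already connected MongoClient, and the sketch deliberately leaves out checks such as insufficient funds.

```javascript
// `client` is assumed to be a connected MongoClient from the Node.js driver;
// the "bank" database and "accounts" collection are hypothetical.
async function transfer(client, fromId, toId, amount) {
  const accounts = client.db("bank").collection("accounts");
  const session = client.startSession();
  try {
    // withTransaction commits if the callback resolves and aborts if it throws,
    // so both updates are applied together or not at all.
    await session.withTransaction(async () => {
      await accounts.updateOne({ _id: fromId }, { $inc: { balance: -amount } }, { session });
      await accounts.updateOne({ _id: toId }, { $inc: { balance: amount } }, { session });
    });
  } finally {
    await session.endSession();
  }
}
```

Passing the session to each operation is what ties it to the transaction; an update issued without it would be written immediately, outside the transaction.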

The limitations of MongoDB

All these nice features of MongoDB seem to make it a foolproof tech. But in tech, the only truth is that there is no “Silver bullet”. All functionality comes at a cost.

Here are the limitations of MongoDB.

Excessive de-normalization

As we have seen, MongoDB encourages denormalization. Whether in the official documentation or in the MongoDB University training resources, you are encouraged to duplicate your data whenever possible.

Although it seems relevant given the low cost of storage capacity today, this duplication will create a new problem: data integrity.

Imagine having to update several collections each time a user corrects their address. This complexity adds to the workload of the developers who, each time they touch a feature that refers to addresses, have to coordinate the update across all the collections.

Sooner or later, this kind of duplication causes errors where two collections hold contradictory data, and cleanup batch jobs have to be written to correct it.

Joins

Joins were introduced in MongoDB in version 3.2 with the $lookup aggregation stage. This feature meets the need in cases where denormalization is not an option and there is no choice but to reference another document.

But MongoDB was not designed for this type of need, and a query with a $lookup stage performs noticeably worse. You lose the initial advantage that MongoDB offers.

If your application needs to relate a lot of data, MongoDB might not be the right choice.

Over-indexing

MongoDB’s performance is good. But to achieve this performance on a maximum of queries, you will be encouraged to create more and more indexes.

This over-indexing will tax the writing performance of your database.

Each document added to a collection requires updating every index defined on that collection, which consumes your database's resources.

You might be tempted to keep MongoDB for reads only and dedicate another database, such as Cassandra, to writes, feeding the MongoDB database in the background. This solution, although functional, adds immense complexity to your application and is often the result of misusing indexes.

Remember, you’re not Facebook, you shouldn’t have their level of complexity.

Why care about MongoDB?

Skill sought

MongoDB’s success with developers has carried over to enterprise projects. Today, MongoDB is used in startups, large groups, government projects and associations.

French users of MongoDB include AXA, Bouygues Telecom and Leroy Merlin, among others. At the time I wrote this article, more than 2000 job offers in France mentioned MongoDB on LinkedIn.

Who should learn to use MongoDB?

Any Backend developer, regardless of the technology used, would benefit from training on the basics of MongoDB. Learning how to create a database and make the first CRUD queries is enough at first if you don’t have a project to implement it.

Fullstack JavaScript developers also have a lot to gain. Stacks used for JavaScript projects most often include MongoDB, so it is a skill you will probably be asked for more often than a PHP developer would be.

MongoDB quick guide

First CRUD queries

Getting started with MongoDB is very simple. All you have to do is launch a MongoDB server on your workstation with the command mongod or through a Docker image. Then enter the command mongo (mongosh in recent versions) to open the Mongo Shell and run your first MongoDB queries.
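
As a first taste, the four CRUD operations could be sketched as follows. The posts collection and its title and views fields are invented for this example, and posts is assumed to be a collection handle from the Node.js driver (the same method names exist in the shell).

```javascript
// Runs one Create, Read, Update and Delete operation in sequence.
async function crudDemo(posts) {
  // Create: insert a new document.
  await posts.insertOne({ title: "Hello MongoDB", views: 0 });
  // Read: fetch the first document matching the filter.
  const found = await posts.findOne({ title: "Hello MongoDB" });
  // Update: increment a field on the matching document.
  await posts.updateOne({ title: "Hello MongoDB" }, { $inc: { views: 1 } });
  // Delete: remove the matching document.
  await posts.deleteOne({ title: "Hello MongoDB" });
  return found;
}
```

Every operation takes a filter document describing which documents it applies to, which is the pattern to remember from this example.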

Organize your collections

If you used to develop your applications using RDBMS databases such as MySQL or PostgreSQL, the way data is modeled in MongoDB is going to make you question your way of doing things.

In SQL, the good practice has long been to normalize its data. A table will refer to data in another via a Foreign Key, thus avoiding data duplication and possible inconsistencies between different database objects.

In MongoDB, this strategy is kept as a last resort and data duplication is favored. Depending on the volume of data to be linked, we will opt for different strategies.

MongoDB supports both strategies: a reference to another document, as we know it in RDBMS databases, or an embed, which means including the data in the original document.

For example, a blog post with its related comments:

{
  "_id":"iqdsfjaf9043ngfdslk9",
  "title":"MongoDB: The complete guide",
  "body":"MongoDB is a noSQL database...",
  "comments": ["6qd77a03faf1f8436048q8", "5db57a04faf1f8434098f7f9"]
}

in relation to comments stored in the corresponding collection:

[{
  "_id": "6qd77a03faf1f8436048q8",
  "username": "Marc",
  "text": "Super article",
  "createdAt": "2019-10-27T11:05:39.898Z"
},
{
  "_id": "5db57a04faf1f8434098f7f9",
  "username": "Jean",
  "text": "MongoDB is not good for very large projects",
  "createdAt": "2019-10-27T11:05:40.710Z"
}]

Alternatively, MongoDB lets you include the comments directly in the post, like so:

{
  "_id":"iqdsfjaf9043ngfdslk9",
  "title":"MongoDB: The complete guide",
  "body":"MongoDB is a noSQL database...",
  "comments": [ 
    {
      "_id": "6qd77a03faf1f8436048q8",
      "username": "Marc",
      "text": "Super article",
      "createdAt": "2019-10-27T11:05:39.898Z"
    },
    {
      "_id": "5db57a04faf1f8434098f7f9",
      "username": "Jean",
      "text": "MongoDB is not good for very large projects",
      "createdAt": "2019-10-27T11:05:40.710Z"
    }
  ]
}

Depending on the nature of your documents and the number of relationships, one strategy will be preferable to the other.

1-to-1 Relations

In the case of a one-to-one relationship, we will most often favor an embed strategy over a reference. Let’s take the example of a customer and an order:

// client object
{
  _id: "abc",
  lastName: "Martin",
  firstName: "Peter",
  email: "peter.martin@gmail.com"
}

// order object, with the whole client embedded
{
  _id: 'def',
  client: {
    lastName: 'Martin',
    firstName: 'Peter',
    email: 'peter.martin@gmail.com'
  },
  currency: 'Eur',
  amount: 1000,
  created_at: new Date('2020-07-31'),
  updated_at: new Date('2020-07-31')
}

In this case, the client object is small and relevant enough to be included completely in the order object.

Let us now assume that our client object is much richer in data:

// client object
{
  _id: "abc",
  lastName: "Martin",
  firstName: "Peter",
  email: "peter.martin@gmail.com",
  dateOfBirth: new Date("1987-05-06"),
  jobTitle: "engineer",
  company: "Société Générale",
  address: {
    number: "4ter",
    street: "Rue Ranelagh",
    postalCode: "75016",
    city: "Paris"
  },
  CIN: 187057534298717,
  premium: true,
  lastVisit: new Date("2020-07-12")
}

Many of these fields are irrelevant to an order.

To avoid weighing down our order object unnecessarily, you can use a subset embed, which consists of including in the order document only the fields that are relevant to processing the order.

// order object, with only a subset of the client embedded
{
  _id: 'def',
  client: {
    _id: 'abc',
    lastName: 'Martin',
    firstName: 'Peter'
  },
  currency: 'Eur',
  amount: 1000,
  created_at: new Date('2020-07-31'),
  updated_at: new Date('2020-07-31')
}

This strategy keeps the client's _id at hand in case we need to query for all the customer details, while making the most frequently used data directly available in the order document.

1-to-Many Relations

In most cases, the embed strategy will provide the best performance in 1-to-many relationships. As in the 1-to-1 example above, embedding the data from the target object into the original object makes reads faster.

For example, an e-commerce site could include reviews in the article object:

{
  _id: "abc",
  name: "Clean Code",
  price: 19.99,
  reviews: [
    {client:"Tom", rating:5, comment:"very good book"},
    {client:"Marc", rating:4, comment:"pertinent examples"},
  ]
}

The most common use case for an online store is to display the product page along with its customer reviews.

With a single query, the page therefore has all the information it needs to render.

If the frontend wants to offer a page grouping Marc’s reviews on all the products, it would suffice to make an aggregation request on the reviews containing the client Marc in the article collection.
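
Such an aggregation request could look like the sketch below. It reuses the name and reviews fields of the article example above; the output field names article, rating and comment are choices made for this illustration.

```javascript
// Builds a pipeline listing every review written by one client,
// across all the articles of the collection.
function reviewsByClient(clientName) {
  return [
    { $match: { "reviews.client": clientName } }, // keep articles this client reviewed
    { $unwind: "$reviews" },                      // one document per review
    { $match: { "reviews.client": clientName } }, // keep only this client's reviews
    { $project: {
        _id: 0,
        article: "$name",
        rating: "$reviews.rating",
        comment: "$reviews.comment"
    } }
  ];
}
```

It would be run with something like articles.aggregate(reviewsByClient("Marc")), where articles is the article collection handle.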

On the other hand, embedding the target objects in the original object can cause problems if the array is set to grow significantly. MongoDB does not cope well with a document containing an array of 10,000 elements.

Let’s take the example of an Instagram post from Ariana Grande, who has one of the accounts with the most followers.

The post in question has over 68k comments, and including every comment in the post document would exceed the 16MB limit that MongoDB imposes on each document.

In this case, the solution is to reverse the problem and include a reference to the post in the comment object. Each comment then has a 1-to-1 relationship with its post: we only store the id of the post in the comment document.

[
  {
    _id: 'abc',
    post_id: 'xyz',
    user: 'khloekardashian',
    comment: 'Happy birthday beautiful ❤️❤️❤️'
  },
  {
    _id: 'def',
    post_id: 'xyz',
    user: 'djsnake',
    comment: '?'
  }
]

In this case, the reference points to the _id of the post document, and an aggregation query can retrieve and format the post together with its comments.
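
One way to write that aggregation is with the $lookup stage, sketched below. The collection name comments is an assumption; post_id is the reference field from the example above.

```javascript
// Builds a pipeline that loads one post and attaches its comments to it.
function postWithComments(postId) {
  return [
    { $match: { _id: postId } },
    { $lookup: {
        from: "comments",        // hypothetical name of the comments collection
        localField: "_id",       // the post's id...
        foreignField: "post_id", // ...matched against each comment's reference
        as: "comments"           // the matches are added as an array field
    } }
  ];
}
```

It would be run with something like posts.aggregate(postWithComments("xyz")). An index on post_id in the comments collection helps keep the lookup fast.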

Understanding aggregations

Sometimes the information we need is not stored as it is in our database. You may need a sum, an average, or group only part of the documents.

To do this, MongoDB offers aggregations. These are operations chained one after another to manipulate the data and return one or more transformed documents that match your needs.

Imagine an orders collection for a McDonald’s:

[
  {
    _id: "abc",
    contenu: [
      {sku: "BigMac", quantity: 1},
      {sku: "medium frite", quantity: 1},
      {sku: "Medium Coca Zero", quantity: 1},
      {sku: "big Coca Cola", quantity: 1},
      {sku: "big frite", quantity: 1},
      {sku: "280", quantity: 1}
    ],
    total: 21.40,
    type: "drive",
    created_at: "2020-07-27T13:07:41.657Z"
  },
  {
    _id: "bcd",
    contenu: [
      {sku: "CBO", quantity: 1},
      {sku: "Medium potatoe", quantity: 1},
      {sku: "medium Fanta", quantity: 1},
      {sku: "big Coca Cola", quantity: 1},
      {sku: "big frite", quantity: 1},
      {sku: "BigMac", quantity: 1},
      {sku: "Sundae Chocolate", quantity: 1}
    ],
    total: 26.10,
    type: "drive",
    created_at: "2020-07-27T13:07:55.657Z"
  },
  {
    _id: "cde",
    contenu: [
      {sku: "CBO", quantity: 1},
      {sku: "Medium frite", quantity: 1},
      {sku: "Medium Sprite", quantity: 1},
      {sku: "big Coca Cola", quantity: 1},
      {sku: "big frite", quantity: 1},
      {sku: "280", quantity: 1}
    ],
    type: "for delivery",
    total: 23.80,
    created_at: "2020-07-27T13:08:41.657Z"
  },
  {
    _id: "def",
    contenu: [
      {sku: "big Coca Cola", quantity: 1},
      {sku: "big frite", quantity: 1},
      {sku: "280", quantity: 1}
    ],
    type: "for delivery",
    total: 9.70,
    created_at: "2020-07-27T13:09:41.657Z"
  },
  {
    _id: "efg",
    contenu: [
      {sku: "big Coca Cola", quantity: 1},
      {sku: "big frite", quantity: 1},
      {sku: "BigMac", quantity: 1}
    ],
    type: "for delivery",
    total: 8.70,
    created_at: "2020-07-27T13:09:42.657Z"
  },
  {
    _id: "fgh",
    contenu: [
      {sku: "big Coca Cola", quantity: 2},
      {sku: "big frite", quantity: 2},
      {sku: "BigMac", quantity: 2}
    ],
    type: "for delivery",
    total: 9.70,
    created_at: "2020-07-27T13:09:42.657Z"
  }
]

MongoDB aggregations would let you bring out data such as the number of BigMacs sold per day, the ranking of the best-selling products over a month, or the average ticket by type of order.
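
Two of these questions could be answered with pipelines like the following sketch. It reuses the field names of the sample above (contenu, sku, quantity, total, type, created_at) and assumes that created_at is stored as an ISO 8601 string, as in the sample; the output field names units and averageTicket are choices made for this illustration.

```javascript
// Number of units of one product sold per day.
function unitsSoldPerDay(sku) {
  return [
    { $unwind: "$contenu" },            // one document per order line
    { $match: { "contenu.sku": sku } }, // keep only the requested product
    { $group: {
        // The first 10 characters of an ISO string are the day (YYYY-MM-DD).
        _id: { $substrBytes: ["$created_at", 0, 10] },
        units: { $sum: "$contenu.quantity" }
    } }
  ];
}

// Average ticket for each type of order.
const averageTicketByType = [
  { $group: { _id: "$type", averageTicket: { $avg: "$total" } } }
];
```

On the six sample orders, orders.aggregate(unitsSoldPerDay("BigMac")) should produce a single group for 2020-07-27 with 5 units, since the BigMac appears in four orders, one of them with a quantity of 2.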

Know how to index your data to improve performance

Indexes are the key to a successful MongoDB database. It is thanks to them that a request will be able to provide the data in the most efficient way possible.

What is an index?

Take the example of an encyclopedia.

Without an index, to find the page that talks about the “solar system”, you would need to browse through all the pages of the encyclopedia until you found the right one. With an alphabetical index at the back, you look up the term and jump straight to the page.

Whatever the database, an index is a partial copy of the stored data, organized so that the engine can find what it is looking for quickly.

Indexes in MongoDB

By following a relevant indexing strategy, you will greatly improve the reading performance of your database. The more of your queries are covered by an index, the faster your database can respond to them.

Be careful, however, not to abuse the tool. Adding an index amounts to making a copy of a fragment of the data. Not only does this index add weight to your entire database, it also slows down write operations. If the User collection has 14 different indexes, MongoDB has to update all 14 of them each time you insert a new user, which has a performance cost.

MongoDB shines with the diversity of its indexes. Thanks to its different options, you will be able to categorize the key data of your database through different types of indexes. This diversity makes it possible to have an optimal reading performance while minimizing the writing cost.

Alternatives to MongoDB

DynamoDB is the NoSQL database service built by AWS. It is often contrasted with MongoDB although they are not really similar. DynamoDB is completely managed by AWS, and it is not possible to run it on your own premises or on a competing cloud. Moreover, DynamoDB is closer to a key-value database than to a document-oriented database. Each item in the database is stored in a table containing an id, and the tables can be indexed to perform queries faster. DynamoDB integrates seamlessly with other AWS products and allows some logic to be delegated to the infrastructure. For example, if you need to react to an update of an item in the database, DynamoDB gives you that ability without you having to code that logic in your application.

RethinkDB is a document-oriented NoSQL database that focuses on real-time data. It pushes data to your application whenever an event occurs. RethinkDB is open source and can therefore be deployed on-premises or on the cloud of your choice. It’s a great choice if you need to create a data-responsive application like a dashboard, but it supports neither ACID principles nor a strict data schema. Also, if you do a lot of calculations on your data, a columnar database such as Cassandra will be more suitable.

FaunaDB is a transactional, document-oriented database that specializes in serverless. With Fauna, developers no longer have to worry about setting up a cluster. Data is automatically replicated across multiple regions and accessible via an API. Unlike other technologies, FaunaDB is operated exclusively by Fauna, which chooses where to run its infrastructure. Particularly suited to serverless infrastructures, FaunaDB has made a name for itself in the Next.js ecosystem by competing with DynamoDB and with the serverless offer from MongoDB Atlas, which arrived with version 5.0, after FaunaDB had already found a place for itself.

How to learn MongoDB:

MongoDB University is an excellent English-language resource for learning and practicing MongoDB. Through a series of videos, quizzes and exercises, you will learn the different notions of MongoDB at your own pace.

The difficulty is progressive and you can skip the first modules if you are already familiar with the basics of MongoDB. They also offer to use MongoDB Atlas for free to host your database and provide the datasets for you to practice.

Conclusion

Although MongoDB was designed for a very specific purpose, namely use in the distributed cloud, its evolution and the practice of web development have made it a multi-purpose database.

Today, unless you have a clear reason not to use MongoDB, it is the most flexible and adaptable database technology to start a project.