2107.9 Reference Materials

3.2.8 Backup

Even with replica sets and journaling, it is still a good idea to back up your data regularly. You can find an overview of the topic and possible strategies here.

Passive MongoDB node

One approach is to run a passive MongoDB node that is dedicated to backups and to use filesystem snapshots to take the actual backup. If journaling is enabled, you can take hot snapshots of a MongoDB data directory. Without journaling, it is recommended to fsync and lock the passive node first and then take the snapshot from there. See the code below for an example:

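from pymongo import MongoClient

def do_backup():
    # Placeholder: insert your snapshot and backup code here.
    raise NotImplementedError("insert your snapshot and backup code here")

def lock_and_backup():
    # Connect directly to the passive node (localhost:27017 by default).
    client = MongoClient(directConnection=True)
    try:
        # Flush pending writes to disk and block further writes.
        client.admin.command("fsync", lock=True)
        do_backup()
    finally:
        # Always release the lock, even if the backup fails.
        client.admin.command("fsyncUnlock")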

Example 3.4. Snapshot from a passive node

A more detailed example of how this pattern can be used with Amazon S3 can be found here.

Backup Tools

MongoDB provides tools to dump and restore the current content of the databases. mongodump and mongorestore allow you to create exact copies of your current database. You can find a detailed description here.
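As a rough sketch, a scheduled dump could be scripted as shown below. The host, port and output directory are placeholder assumptions, and build_mongodump_command and run_dump are hypothetical helpers that are not part of MongoDB or Elastic Social. Only the --host, --port, --db and --out options of mongodump are used.

```python
import subprocess

def build_mongodump_command(out_dir, host="localhost", port=27017, db=None):
    """Build the argument list for a mongodump run (nothing is executed here)."""
    cmd = ["mongodump", "--host", host, "--port", str(port), "--out", out_dir]
    if db is not None:
        # Restrict the dump to a single database.
        cmd += ["--db", db]
    return cmd

def run_dump(out_dir, **kwargs):
    """Run mongodump and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_mongodump_command(out_dir, **kwargs), check=True)
```

Keeping the command construction separate from its execution makes it easy to log or review the exact call before it runs against a live node.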

Incremental backup

Incremental backup is only useful in rare cases. Usually you need to restore data because your primary is down, and in that situation you want to restore it as quickly as possible. Restoring an old state and then slowly applying the incremental backup parts takes time that you usually do not have in these moments. Incremental backups make restoring your data more complicated and slower. All you gain is slightly lower disk usage. Look here for a more detailed discussion on incremental backups.

Sharding

MongoDB sharding can be used when one MongoDB replica set becomes too small to handle the application load. Sharding does not need to be configured in advance: servers can be added during normal operation, and the configuration can then be updated to enable sharding. Make sure to read the MongoDB sharding documentation for deeper insight.

For an efficient sharding configuration you need to know which databases and collections are used by Elastic Social.

Four databases are created for each tenant. The database names are generated from the mongodb.prefix setting, the tenant name and the service name, separated by underscores. The service name is one of blobs, counters, models and tasks. For example, if mongodb.prefix is "blueprint" and the tenant name is "media", four databases named "blueprint_media_blobs", "blueprint_media_counters", "blueprint_media_models" and "blueprint_media_tasks" are created.
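The naming scheme can be sketched in a few lines of Python. The function name tenant_database_names is a hypothetical helper used only for illustration; the prefix, tenant and service names are the ones given above.

```python
SERVICES = ("blobs", "counters", "models", "tasks")

def tenant_database_names(prefix, tenant):
    """Return the four database names for one tenant, in service order."""
    return ["_".join((prefix, tenant, service)) for service in SERVICES]

print(tenant_database_names("blueprint", "media"))
# ['blueprint_media_blobs', 'blueprint_media_counters',
#  'blueprint_media_models', 'blueprint_media_tasks']
```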

The BlobService uses MongoDB GridFS for storing blobs and metadata. Please refer to the MongoDB documentation on how to configure sharding for GridFS. The following example configures sharding for the GridFS chunks collection:

db.runCommand( { shardCollection : "blueprint_media_blobs.fs.chunks", key : { files_id : 1 } } );

The counter services create six collections in the counters database. The highest_average_counters and highest_histogram_counters collections cannot be sharded. They contain only aggregated counter values and are therefore rather small, so this is no practical limitation. The other collections in the counters database can be sharded with the name attribute as shard key. An example is given below:

db.runCommand( { shardCollection : "blueprint_media_counters.average_counters",
                 key : { name : 1 } } );
db.runCommand( { shardCollection : "blueprint_media_counters.average_histogram_counters",
                 key : { name : 1 } } );
db.runCommand( { shardCollection : "blueprint_media_counters.counters",
                 key : { name : 1 } } );
db.runCommand( { shardCollection : "blueprint_media_counters.histogram_counters",
                 key : { name : 1 } } );

Example 3.5. Shard other collections
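Because the four commands differ only in the collection name, they can also be generated per tenant. The following is a minimal sketch and not the product's own tooling: counter_shard_commands and apply_commands are hypothetical helpers, and apply_commands assumes a pymongo client that is connected to a mongos router.

```python
SHARDABLE_COUNTER_COLLECTIONS = (
    "average_counters",
    "average_histogram_counters",
    "counters",
    "histogram_counters",
)

def counter_shard_commands(prefix, tenant):
    """Build one shardCollection command per shardable counters collection."""
    database = f"{prefix}_{tenant}_counters"
    return [
        {"shardCollection": f"{database}.{name}", "key": {"name": 1}}
        for name in SHARDABLE_COUNTER_COLLECTIONS
    ]

def apply_commands(client, commands):
    """Send each command to the admin database of the given client."""
    for command in commands:
        client.admin.command(command)
```

Building the command documents first and sending them in a separate step allows the list to be inspected before anything is changed on the cluster.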

The models database contains one collection per model collection. Sharding of the blacklist and complaints collections is not recommended because they are comparatively small. For the other model collections the following shard keys are recommended:

Collection	Shard Key
comments	target : 1
likes	target : 1
ratings	target : 1
shares	target : 1
users	name : 1 or email : 1
notes	user : 1

Table 3.2. Recommended shard keys

An example is given below:

db.runCommand( { shardCollection : "blueprint_media_models.comments",
                 key : { target : 1 } } );
db.runCommand( { shardCollection : "blueprint_media_models.likes",
                 key : { target : 1 } } );
db.runCommand( { shardCollection : "blueprint_media_models.ratings",
                 key : { target : 1 } } );
db.runCommand( { shardCollection : "blueprint_media_models.users",
                 key : { name : 1 } } );

Example 3.6. Creating shard keys
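The recommendations from Table 3.2 can also be kept as data and turned into commands for any tenant. The sketch below is illustrative only: model_shard_commands and as_shell are hypothetical helpers, and the users collection is keyed by name here although the table equally allows email.

```python
import json

# Collection name -> shard key field, taken from Table 3.2.
RECOMMENDED_SHARD_KEYS = {
    "comments": "target",
    "likes": "target",
    "ratings": "target",
    "shares": "target",
    "users": "name",  # assumption: name is chosen; email is the alternative
    "notes": "user",
}

def model_shard_commands(prefix, tenant, keys=RECOMMENDED_SHARD_KEYS):
    """Build one shardCollection command per recommended model collection."""
    database = f"{prefix}_{tenant}_models"
    return [
        {"shardCollection": f"{database}.{collection}", "key": {field: 1}}
        for collection, field in keys.items()
    ]

def as_shell(command):
    """Render a command document as a mongo shell statement."""
    return f"db.runCommand({json.dumps(command)});"

for command in model_shard_commands("blueprint", "media"):
    print(as_shell(command))
```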

The tasks database contains one collection per task queue. Configuring sharding for the task collections is not recommended, because tasks are removed after successful execution, which keeps the collections small.

If you are running a multi-tenant application, you should consider spreading the databases of each tenant across the cluster so that the load is distributed evenly.
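One possible way to plan such a distribution is to assign the tenant databases to shards in round-robin order and to express each assignment as a MongoDB movePrimary admin command. This is only a sketch under assumptions: the manual does not prescribe movePrimary, plan_primary_shards is a hypothetical helper, and the shard names in the usage example are placeholders. The function only builds the command documents and does not execute them.

```python
from itertools import cycle

SERVICES = ("blobs", "counters", "models", "tasks")

def plan_primary_shards(prefix, tenants, shards):
    """Assign every tenant database to a shard in round-robin order."""
    shard_cycle = cycle(shards)
    plan = []
    for tenant in tenants:
        for service in SERVICES:
            database = f"{prefix}_{tenant}_{service}"
            plan.append({"movePrimary": database, "to": next(shard_cycle)})
    return plan

# Hypothetical tenants and shard names, for illustration only.
for command in plan_primary_shards("blueprint", ["media", "sports"], ["shardA", "shardB"]):
    print(command)
```

Review such a plan against the actual load of each tenant before applying it, because an even number of databases per shard does not guarantee an even load.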

Elastic Social Manual / Version 2107