Elastic Social Manual / Version 2010
Even with replica sets and journaling, it is still a good idea to back up your data regularly. You can find an overview of the topic and possible strategies here.
Passive MongoDB node
One approach is to run a passive MongoDB node dedicated to backups and to use filesystem snapshots to take the actual backup. If journaling is enabled, you can take hot snapshots of the MongoDB data directory. Without journaling, it is recommended to fsync and lock the passive node and then take the snapshot from there. See the code below for an example:
from pymongo import Connection

def do_backup():
    # <insert your snapshot and backup code here>
    pass

def lock_and_backup():
    conn = Connection(slave_okay=True)
    try:
        # flush pending writes to disk and block further writes
        conn.admin.command("fsync", lock=True)
        do_backup()
    finally:
        # release the write lock again
        conn.admin["$cmd.sys.unlock"].find_one()
Example 3.4. Snapshot from a passive node
A more detailed example of how this pattern can be used with Amazon S3 can be found here.
Backup Tools
MongoDB provides tools to dump and restore the current content of the databases. mongodump and mongorestore allow you to create exact copies of your current database. You can find a detailed description here.
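For example, a full dump and restore of a server can be performed with these tools from the command line. This is a minimal sketch; the host name and backup directory are placeholders to adjust to your environment:

# dump the content of all databases into the given directory
mongodump --host localhost --port 27017 --out /backups/dump
# restore the dump, dropping each collection before it is restored
mongorestore --host localhost --port 27017 --drop /backups/dump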
Incremental backup
Incremental backup is only useful in rare cases. Usually you want to restore data when your primary is down, and in that situation you want to restore it as quickly as possible. Restoring an old state and then slowly replaying the incremental backup parts takes time that you usually do not have in these moments. Incremental backups make restoring your data more complicated and slower; all you gain is slightly lower disk usage. Look here for a more detailed discussion of incremental backups.
Sharding
MongoDB sharding can be used when a single MongoDB replica set becomes too small to handle the application load. Sharding does not need to be configured in advance: servers can be added during normal operation, and the configuration can be updated to enable sharding. Make sure to read the MongoDB sharding documentation for a deeper insight.
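For example, a new shard can be added to a running cluster from a mongo shell connected to a mongos router. This is a sketch only; the replica set name and host below are placeholders:

// add a replica set as an additional shard to the running cluster
db.runCommand({ addshard : "shard2/mongo3.example.com:27018" });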
For an efficient sharding configuration you need to know which databases and collections are used by Elastic Social.
Four databases are created for each tenant. The database names are generated from the mongodb.prefix setting, the tenant name and the service name, separated by underscores. The service name is one of blobs, counters, models and tasks. When mongodb.prefix is "blueprint" and the tenant name is "media", then four databases named "blueprint_media_blobs", "blueprint_media_counters", "blueprint_media_models" and "blueprint_media_tasks" will be created.
The BlobService uses MongoDB GridFS for storing blobs and metadata. Please refer to the MongoDB documentation on how to configure sharding for GridFS. Example for configuring sharding for GridFS:
db.runCommand({ shardcollection : "blueprint_media_blobs.fs.chunks", key : { files_id : 1 }});
The counter services create six collections within the counters database. The highest_average_counters and highest_histogram_counters collections cannot be sharded. They contain aggregated counter values, so these collections are rather small and this imposes no limitation. The other collections in the counters database can be sharded with the name attribute as shard key. An example is given below:
db.runCommand( { shardcollection : "blueprint_media_counters.average_counters", key : { name : 1 } } );
db.runCommand( { shardcollection : "blueprint_media_counters.average_histogram_counters", key : { name : 1 } } );
db.runCommand( { shardcollection : "blueprint_media_counters.counters", key : { name : 1 } } );
db.runCommand( { shardcollection : "blueprint_media_counters.histogram_counters", key : { name : 1 } } );
Example 3.5. Shard other collections
The models database contains one collection per model collection. Sharding of the blacklist and complaints collections is not recommended because they are comparatively small. For the other model collections the following shard keys are recommended:
Collection | Shard Key
---------- | ---------------------
comments | target : 1
likes | target : 1
ratings | target : 1
shares | target : 1
users | name : 1 or email : 1
notes | user : 1
Table 3.2. Recommended shard keys
An example is given below:
db.runCommand( { shardcollection : "blueprint_media_models.comments", key : { target : 1 } } );
db.runCommand( { shardcollection : "blueprint_media_models.likes", key : { target : 1 } } );
db.runCommand( { shardcollection : "blueprint_media_models.ratings", key : { target : 1 } } );
db.runCommand( { shardcollection : "blueprint_media_models.users", key : { name : 1 } } );
Example 3.6. Creating shard keys
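The remaining collections from Table 3.2 can be sharded in the same way:

db.runCommand( { shardcollection : "blueprint_media_models.shares", key : { target : 1 } } );
db.runCommand( { shardcollection : "blueprint_media_models.notes", key : { user : 1 } } );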
The tasks database contains one collection per task queue. Configuring sharding for the task collections is not recommended because tasks are removed after successful execution, which keeps the collections small.
If you are running a multi-tenant application, you should consider spreading the databases of each tenant across the cluster so that the load is distributed evenly.
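One way to do this is to assign a different primary shard to each tenant's databases, since the primary shard holds all unsharded collections of a database. This is a sketch using the moveprimary command; the shard name is a placeholder taken from your own cluster's configuration:

// move the unsharded collections of this tenant's models database to another shard
db.runCommand({ moveprimary : "blueprint_media_models", to : "shard0001" });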