MongoDB Replication & Sharding

Geoffrey Berrard, LesFurets.com, tp-bigdata@lesfurets.com

MongoDB Replication&Sharding

  1. MongoDB durability

  2. Replication principles

  3. Replica set Read and Write Semantics

  4. Sharding

  5. Replica&sharding practice

MongoDB durability

journaling

MongoDB Journal vs Oplog

  • journal

    • low level log of an operation for crash recovery (can be turned off)

  • oplog

    • similar to RDBMS binlog

      • stores (idempotent) high-level transactions that modify the database

      • kept on the master and used for replication

Plan

  1. MongoDB write path

  2. Replication principles

  3. Replica set Read and Write Semantics

  4. Sharding

  5. Replica&sharding practice

Master-slave replication

  • Replica set = a group of mongod processes that provide redundancy and high availability

  • Writes: write to single node replicated to the others members of the replica set

  • Read: read from a single member of the replica set

Replica set members

  • Primary

    • acceptes all writes and reads

    • 1 primary per replica set

  • Secondaries replicates data (and can serve reads ⇒ reads preferences)

    • Priority 0 ⇒ Hidden members ⇒ Delayed

  • Arbiters (usually at most one) : break ties

Primary and secondary members

  • Primary acceptes all writes + reads + records them in oplog

  • Secondary replicates primary oplogs (also accept reads)

replica set read write operations primary

Replication data flow

  • asynchronous oplog replication

  • heartbeat for monitoring status

replica set primary with two secondaries

Automatic failover via new primary election

replica set trigger election

Strategy for election

  • member’s priority

  • latest optime in the oplog

  • uptime

  • break the tie rules

Secondary members: Priority 0

  • cannot become primary

  • cannot trigger elections

  • can vote in elections

  • copy of data + accepts reads

replica set three members geographically distributed

Secondary members: Hidden replica set member

  • Priority 0 members that don’t accept reads

replica set hidden member

Secondary members: Delayed replica set members

  • reflect an delayed state of the set

    • must be priority 0 ⇒ prevent them to become primary

    • should be hidden ⇒ prevent application to query stale data

replica set delayed member

Elections on odd number of nodes

  • a replica cannot become primary with only 1 vote

  • majority with even numbers of members ?

replica set trigger election
  • use Arbitrers to break ties

    • does not hold data

    • cannot became a primary

Arbiters

replica set four members add arbiter

Fault tolerance

  • No primary ⇒ writes no longer possible, reads still accepted

  • Fault tolerance : number of members that can become unavailable and still be able to elect a primary

diag f086404c8d35a32883efe909898e44c5
https://docs.mongodb.org/manual/core/replica-set-architectures/

Rollbacks during replica set failover

  • a rollback reverts write operations on a former primary when the member rejoins its replica set after a failover

    • the primary accepted a write that was not sucessfuly replicated to secondaries !

Cause of the problem ?

default write semantics { w:1 } ⇒ the primary acknowledge the write after the local write (local Journal!)

How to handle rollbacks

  • manually apply/discard rollbacks ( rollback/ folder)

  • avoid rollbacks use { w:majority }

    • READ UNCOMMITED SEMANTICS

      • ! Regardless of write concern, other clients can see the result of the write operations before the write operation is acknowledged to the issuing client.

      • ! Clients can read data which may be subsequently rolled back.

Plan

  1. MongoDB write path

  2. Replication principles

  3. Replica set Read and Write Semantics

    1. Write concerns

    2. Read concerns

    3. Read preferences

  4. Sharding

  5. Replica&sharding practice

Replica set and consistency

  • parameters to move the CAP cursor

    • write concern

      • is the guarantee an application requires from MongoDB to consider a write operation successful

    • read concern

      • specify the desired consistency for read operations

    • read preference

      • applications specify read preference to control how drivers direct read operations to members of the replica set

Write semantics

  • w:1 (default)

    • the primary acknowledge the write after the local write

  • w:N

    • ack the write after the ack of N members

  • w:majority

    • ack the write after the ack of the majority of the members

  • wtimeout (optional)

    • prevents write operations from blocking indefinitely if the write concern is unachievable

W:2 write semantics

crud write concern w2

Changing the write semantics

  • at the query level

db.products.insert(
   { item: "envelopes", qty : 100, type: "Clasp" },
   { writeConcern: { w: 2, wtimeout: 5000 } }
)
  • change the default write concern:

cfg = rs.conf()
cfg.settings = {}
cfg.settings.getLastErrorDefaults = { w: "majority", wtimeout: 5000 }
rs.reconfig(cfg)

Read concern

  • local(default) ⇒ returns the instance’s(P/S) most recent data

    • no guarantee that the data has been written to a majority of the replica set

    • may be rolled back).

  • majority ⇒ the most recent data acknowledged by a majority of members in RS

  • linearizable ⇒ successful writes issued with a write concern of "majority" and acknowledged prior to the start of the read operation

Tunning the read concern

db.restaurants.find( { _id: 5 } ).readConcern("linearizable").maxTimeMS(10000)

Read preference

  • primary (default): read from the current replica set primary.

  • primaryPreferred: read from primary (or secondary iff no primary)

  • secondary: read from secondary members

  • secondaryPreferred: read from secondary(or primary iff no secondary)

  • nearest ⇒ least network latency

  • async replication ⇒ stale data !

Read preferences example

replica set read preference

Read preferences example

db.emails.find().readPref("secondary")
db.collection.find().readPref('nearest', [ { 'dc': 'east' } ])

Read preferences use cases

  • Maximize Consistencyprimary read preference

  • Maximize AvailabilityprimaryPreferred read preference

  • Minimize Latencynearest read preference

Plan

  1. MongoDB write path

  2. Replication principles

  3. Replica set Read and Write Semantics

  4. Sharding

  5. Replication and sharding practice

Sharding terminology

  • shard: server/replica set that contains a subset of the data

  • mongos: servers that acts as a query router

  • config servers: server/replica set that store metadata and configuration

  • chunk: mongo partitions sharded data into chunks

    • lower and exclusive upper range based on the shard key

    • mongo migrates chunks across the shards (cluster balancer)

Sharding architecture

sharded cluster production architecture

Sharded and Non-Sharded Collections

sharded cluster primary shard

Sharded and Non-Sharded Collections

sharded cluster mixed

Data Partitioning with Chunks

  • default chunk size 64MB (configurable)

    • small chunks ⇒ even distribution but frequent migrations (network/mongos overhead)

    • large chunks_ lead ⇒ fewer migrations, potentially uneven distribution of data (preffered)

  • Chunk size

    • ⇒ maximum collection size when sharding an existing collection

Shard key

  • indexed field/compound fields that exists for every document in the collection

    • becames immutable after sharding

  • determines the distribution of documents among the shards

  • !Note!

    • cannot shard a collection with unique indexes on other fields.

    • cannot create unique indexes on other fields on sharded collection

Shard key

  • Ideal :

    • mutations and queries should be distributed uniformly across all of the shards

    • all operations would only target the shards of interest

    • the load (queries/data) should be uniformly distributed

  • The shard key that matches your workload:

Shard key properties

  • cardinality ⇒ determines the maximum number of chunks the balancer can create

  • frequency ⇒ how often a given value occurs in the data

  • monotonicity ⇒ likelihood to distribute inserts to a single shard

Shard key: low cardinality

sharded cluster ranged distribution low cardinal

Shard key: high frequency

sharded cluster ranged distribution frequency

Shard key monotonicity

sharded cluster monotonic distribution

Sharding strategies: Ranged

  • base on the domain of the shard key

  • documents with shard key values close to one another are likely to be co-located on the same shard

  • optimize range based queries

sharding range based

Sharding strategies: Ranged

mongos> db.temperature.createIndex( { creation_date : 1 } ); //create index
mongos> sh.enableSharding( "test" ); //enable sharding for database
mongos> sh.shardCollection("test.temperature",{"creation_date":1}); //shard collection
sharded cluster monotonic distribution

Sharding strategies: Hashed

  • documents distributed according to an MD5 hash of the shard key value

  • guarantees a uniform distribution of writes across shards

    • but is less optimal for range-based queries

sharding hash based

Sharding strategies: Hashed

mongos> db.emails.createIndex({sender:"hashed"}); //create hash-index
mongos> sh.enableSharding( "test" ); //enable sharding for database
mongos> sh.shardCollection("test.emails",{sender:"hashed"}); //shard collection
sharding hash based

Sharding strategies: Zones

sharded cluster zones

Sharding: chunk split + migration

mongos> sh.shardCollection("test.big_collection",{"number":1});
mongos> db.big_collection.getShardDistribution();

Shard rs0 at rs0/localhost:27017,localhost:27018,localhost:27019,localhost:27020
 data : 122.41MiB docs : 1812256 chunks : 5
 estimated data per chunk : 24.48MiB
 estimated docs per chunk : 362451

Shard rs1 at rs1/localhost:27026,localhost:27027,localhost:27028
 data : 32.36MiB docs : 479181 chunks : 2
 estimated data per chunk : 16.18MiB
 estimated docs per chunk : 239590

Totals
 data : 154.78MiB docs : 2291437 chunks : 7
 Shard rs0 contains 79.08% data, 79.08% docs in cluster, avg obj size on shard : 70B
 Shard rs1 contains 20.91% data, 20.91% docs in cluster, avg obj size on shard : 70B
sharding splitting
sharding migrating

Sharding migration outcome

mongos> db.big_collection.getShardDistribution();

Shard rs0 at rs0/localhost:27017,localhost:27018,localhost:27019,localhost:27020
 data : 85.85MiB docs : 1270900 chunks : 4
 estimated data per chunk : 21.46MiB
 estimated docs per chunk : 317725

Shard rs1 at rs1/localhost:27026,localhost:27027,localhost:27028
 data : 49.25MiB docs : 729100 chunks : 3
 estimated data per chunk : 16.41MiB
 estimated docs per chunk : 243033

Totals
 data : 135.1MiB docs : 2000000 chunks : 7
 Shard rs0 contains 63.54% data, 63.54% docs in cluster, avg obj size on shard : 70B
 Shard rs1 contains 36.45% data, 36.45% docs in cluster, avg obj size on shard : 70B

MongoDB Replication

  1. MongoDB write path

  2. Replication principles

  3. Replica set Read and Write Semantics

  4. Sharding

  5. Replication and sharding practice

Replica set in practice

replicaset slides

Replication&Sharding

sharding slides

Ressources

Read the documentation for the systems you depend on thoroughly–then verify their claims for yourself. You may discover surprising results!
— Kyle Kingsbury(Aphyr)