MongoDB Replication & Sharding

Geoffrey Berrard, LesFurets.com, tp-bigdata@lesfurets.com

MongoDB Replication&Sharding

MongoDB durability
Replication principles
Replica set Read and Write Semantics
Sharding
Replica&sharding practice

MongoDB durability

MongoDB Journal vs Oplog

journal
- low level log of an operation for crash recovery (can be turned off)
oplog
- similar to RDBMS binlog
  - stores (idempotent) high-level transactions that modify the database
  - kept on the master and used for replication

https://docs.mongodb.org/manual/core/read-isolation-consistency-recency/

Plan

MongoDB write path
Replication principles
Replica set Read and Write Semantics
Sharding
Replica&sharding practice

Master-slave replication

Replica set = a group of mongod processes that provide redundancy and high availability
Writes: write to single node replicated to the others members of the replica set
Read: read from a single member of the replica set

Replica set members

Primary
- acceptes all writes and reads
- 1 primary per replica set
Secondaries replicates data (and can serve reads ⇒ reads preferences)
- Priority 0 ⇒ Hidden members ⇒ Delayed
Arbiters (usually at most one) : break ties

Primary and secondary members

Primary acceptes all writes + reads + records them in oplog
Secondary replicates primary oplogs (also accept reads)

replica set read write operations primary

Replication data flow

asynchronous oplog replication
heartbeat for monitoring status

replica set primary with two secondaries

Automatic failover via new primary election

Strategy for election

member’s priority
latest optime in the oplog
uptime
break the tie rules

Secondary members: Priority 0

cannot become primary
cannot trigger elections
can vote in elections
copy of data + accepts reads

replica set three members geographically distributed

Secondary members: Hidden replica set member

Priority 0 members that don’t accept reads

Secondary members: Delayed replica set members

reflect an delayed state of the set
- must be priority 0 ⇒ prevent them to become primary
- should be hidden ⇒ prevent application to query stale data

Elections on odd number of nodes

a replica cannot become primary with only 1 vote
majority with even numbers of members ?

use Arbitrers to break ties
- does not hold data
- cannot became a primary

Arbiters

Fault tolerance

No primary ⇒ writes no longer possible, reads still accepted
Fault tolerance : number of members that can become unavailable and still be able to elect a primary

https://docs.mongodb.org/manual/core/replica-set-architectures/

Rollbacks during replica set failover

a rollback reverts write operations on a former primary when the member rejoins its replica set after a failover
- the primary accepted a write that was not sucessfuly replicated to secondaries !

Cause of the problem ?

default write semantics { w:1 } ⇒ the primary acknowledge the write after the local write (local Journal!)

How to handle rollbacks

manually apply/discard rollbacks ( rollback/ folder)
avoid rollbacks use { w:majority }
- READ UNCOMMITED SEMANTICS
  - ! Regardless of write concern, other clients can see the result of the write operations before the write operation is acknowledged to the issuing client.
  - ! Clients can read data which may be subsequently rolled back.

https://docs.mongodb.org/manual/core/replica-set-rollbacks/

https://docs.mongodb.org/manual/core/read-isolation-consistency-recency/

Plan

MongoDB write path
Replication principles
Replica set Read and Write Semantics
1. Write concerns
2. Read concerns
3. Read preferences
Sharding
Replica&sharding practice

Replica set and consistency

parameters to move the CAP cursor
- write concern
  - is the guarantee an application requires from MongoDB to consider a write operation successful
- read concern
  - specify the desired consistency for read operations
- read preference
  - applications specify read preference to control how drivers direct read operations to members of the replica set

Write semantics

w:1 (default)
- the primary acknowledge the write after the local write
w:N
- ack the write after the ack of N members
w:majority
- ack the write after the ack of the majority of the members
wtimeout (optional)
- prevents write operations from blocking indefinitely if the write concern is unachievable

W:2 write semantics

Changing the write semantics

at the query level

db.products.insert(
   { item: "envelopes", qty : 100, type: "Clasp" },
   { writeConcern: { w: 2, wtimeout: 5000 } }
)

change the default write concern:

cfg = rs.conf()
cfg.settings = {}
cfg.settings.getLastErrorDefaults = { w: "majority", wtimeout: 5000 }
rs.reconfig(cfg)

Read concern

local(default) ⇒ returns the instance’s(P/S) most recent data
- no guarantee that the data has been written to a majority of the replica set
- may be rolled back).
majority ⇒ the most recent data acknowledged by a majority of members in RS
linearizable ⇒ successful writes issued with a write concern of "majority" and acknowledged prior to the start of the read operation

Tunning the read concern

db.restaurants.find( { _id: 5 } ).readConcern("linearizable").maxTimeMS(10000)

Read preference

primary (default): read from the current replica set primary.
primaryPreferred: read from primary (or secondary iff no primary)
secondary: read from secondary members
secondaryPreferred: read from secondary(or primary iff no secondary)
nearest ⇒ least network latency
async replication ⇒ stale data !

Read preferences example

Read preferences example

db.emails.find().readPref("secondary")
db.collection.find().readPref('nearest', [ { 'dc': 'east' } ])

Read preferences use cases

Maximize Consistency ⇒ primary read preference
Maximize Availability ⇒ primaryPreferred read preference
Minimize Latency ⇒ nearest read preference

Plan

MongoDB write path
Replication principles
Replica set Read and Write Semantics
Sharding
Replication and sharding practice

Sharding terminology

shard: server/replica set that contains a subset of the data
mongos: servers that acts as a query router
config servers: server/replica set that store metadata and configuration
chunk: mongo partitions sharded data into chunks
- lower and exclusive upper range based on the shard key
- mongo migrates chunks across the shards (cluster balancer)

Sharding architecture

Sharded and Non-Sharded Collections

Sharded and Non-Sharded Collections

Data Partitioning with Chunks

default chunk size 64MB (configurable)
- small chunks ⇒ even distribution but frequent migrations (network/mongos overhead)
- large chunks_ lead ⇒ fewer migrations, potentially uneven distribution of data (preffered)
Chunk size
- ⇒ maximum collection size when sharding an existing collection

Shard key

indexed field/compound fields that exists for every document in the collection
- becames immutable after sharding
determines the distribution of documents among the shards
!Note!
- cannot shard a collection with unique indexes on other fields.
- cannot create unique indexes on other fields on sharded collection

Shard key

Ideal :
- mutations and queries should be distributed uniformly across all of the shards
- all operations would only target the shards of interest
- the load (queries/data) should be uniformly distributed
The shard key that matches your workload:

https://www.mongodb.com/blog/post/on-selecting-a-shard-key-for-mongodb

Shard key properties

cardinality ⇒ determines the maximum number of chunks the balancer can create
frequency ⇒ how often a given value occurs in the data
monotonicity ⇒ likelihood to distribute inserts to a single shard

Shard key: low cardinality

sharded cluster ranged distribution low cardinal

Shard key: high frequency

sharded cluster ranged distribution frequency

Shard key monotonicity

Sharding strategies: Ranged

base on the domain of the shard key
documents with shard key values close to one another are likely to be co-located on the same shard
optimize range based queries

Sharding strategies: Ranged

mongos> db.temperature.createIndex( { creation_date : 1 } ); //create index
mongos> sh.enableSharding( "test" ); //enable sharding for database
mongos> sh.shardCollection("test.temperature",{"creation_date":1}); //shard collection

Sharding strategies: Hashed

documents distributed according to an MD5 hash of the shard key value
guarantees a uniform distribution of writes across shards
- but is less optimal for range-based queries

Sharding strategies: Hashed

mongos> db.emails.createIndex({sender:"hashed"}); //create hash-index
mongos> sh.enableSharding( "test" ); //enable sharding for database
mongos> sh.shardCollection("test.emails",{sender:"hashed"}); //shard collection

Sharding strategies: Zones

Sharding: chunk split + migration

mongos> sh.shardCollection("test.big_collection",{"number":1});
mongos> db.big_collection.getShardDistribution();

Shard rs0 at rs0/localhost:27017,localhost:27018,localhost:27019,localhost:27020
 data : 122.41MiB docs : 1812256 chunks : 5
 estimated data per chunk : 24.48MiB
 estimated docs per chunk : 362451

Shard rs1 at rs1/localhost:27026,localhost:27027,localhost:27028
 data : 32.36MiB docs : 479181 chunks : 2
 estimated data per chunk : 16.18MiB
 estimated docs per chunk : 239590

Totals
 data : 154.78MiB docs : 2291437 chunks : 7
 Shard rs0 contains 79.08% data, 79.08% docs in cluster, avg obj size on shard : 70B
 Shard rs1 contains 20.91% data, 20.91% docs in cluster, avg obj size on shard : 70B

Sharding migration outcome

mongos> db.big_collection.getShardDistribution();

Shard rs0 at rs0/localhost:27017,localhost:27018,localhost:27019,localhost:27020
 data : 85.85MiB docs : 1270900 chunks : 4
 estimated data per chunk : 21.46MiB
 estimated docs per chunk : 317725

Shard rs1 at rs1/localhost:27026,localhost:27027,localhost:27028
 data : 49.25MiB docs : 729100 chunks : 3
 estimated data per chunk : 16.41MiB
 estimated docs per chunk : 243033

Totals
 data : 135.1MiB docs : 2000000 chunks : 7
 Shard rs0 contains 63.54% data, 63.54% docs in cluster, avg obj size on shard : 70B
 Shard rs1 contains 36.45% data, 36.45% docs in cluster, avg obj size on shard : 70B

MongoDB Replication

MongoDB write path
Replication principles
Replica set Read and Write Semantics
Sharding
Replication and sharding practice

Replica set in practice

Replication&Sharding

Ressources

Read the documentation for the systems you depend on thoroughly–then verify their claims for yourself. You may discover surprising results!

— Kyle Kingsbury(Aphyr)