4.2. CouchDB Replication Protocol
The CouchDB Replication protocol is a protocol for synchronizing
documents between 2 peers over HTTP/1.1.
4.2.1. Language
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”,
“SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this
document are to be interpreted as described in RFC 2119.
4.2.2. Goals
The CouchDB Replication protocol synchronizes documents between two
peers over HTTP/1.1.
In theory, the CouchDB protocol can be used between any products that
implement it. The reference implementation, written in Erlang, is
provided by the couch_replicator module available in Apache CouchDB.
The CouchDB replication protocol uses the CouchDB REST API, so it is
based on HTTP and on the Apache CouchDB MVCC data model. The primary
goal of this specification is to describe the CouchDB replication algorithm.
4.2.3. Definitions
- ID:
- An identifier (it could be a UUID) as described in RFC 4122
- Sequence:
- An ID provided by the changes feed. It can be numeric, but it is not
necessarily so.
- Revision:
- (to define)
- Document
- A document is a JSON entity with a unique ID and revision.
- Database
- A collection of documents with a unique URI
- URI
- A URI as defined by RFC 2396. It can be a URL as defined
in RFC 1738.
- Source
- Database from which the Documents are replicated
- Target
- Database to which the Documents are replicated
- Checkpoint
- Last source sequence ID
4.2.4. Algorithm
- Get unique identifiers for the Source and Target based on their URI if
replication task ID is not available.
- Save this identifier in a special Document named _local/<uniqueid>
on the Target database. This document isn’t replicated. It records
the last Source sequence ID, the Checkpoint, from the
previous replication process.
- Get the Source changes feed by calling the /<source>/_changes URL,
passing it the Checkpoint in the since parameter. The
changes feed only returns a list of current revisions.
Note
This step can be done continuously using the feed=longpoll or
feed=continuous parameters. The feed will then keep delivering
changes as they occur.
- Collect a group of Document/Revision ID pairs from the changes
feed and send them to the Target database on the
/<target>/_revs_diff URL. The result will contain the list of
revisions NOT in the Target.
- GET each revision from the Source database by calling the URL
/<source>/<docid>?revs=true&open_revs=<revision>. This
returns the document with its parent revisions. Also remember to
get the attachments that aren’t already stored on the Target. As an
optimisation, you can use the HTTP multipart API to get everything
in a single request.
- Collect a group of the revisions fetched in the previous step and store
them on the Target database using the Bulk Docs API
with the new_edits: false JSON property to preserve their revision
IDs.
- After the group of revisions is stored on the Target, save
the new Checkpoint on the Source.
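The steps above can be sketched end to end without a network. The following is a minimal in-memory illustration, not the couch_replicator implementation: the MemoryDB class, its method names, the integer sequences, and the replicate function are all hypothetical stand-ins for the HTTP calls named in each step. As an assumption, the sketch records the Checkpoint in the _local store of both peers and reads it back from the Target.

```python
class MemoryDB:
    """Toy stand-in for a database (hypothetical, not a CouchDB API)."""

    def __init__(self):
        self.revs = {}      # (doc_id, rev) -> document body
        self.changes = []   # ordered list of (seq, doc_id, rev)
        self.local = {}     # non-replicated _local documents

    def put(self, doc_id, rev, body):
        # Store a revision and append an entry to the changes feed.
        self.revs[(doc_id, rev)] = body
        self.changes.append((len(self.changes) + 1, doc_id, rev))

    def changes_since(self, since):
        # Stands in for GET /<source>/_changes?since=<checkpoint>.
        return [c for c in self.changes if c[0] > since]

    def revs_diff(self, pairs):
        # Stands in for the revisions diff call: pairs NOT stored here.
        return [p for p in pairs if p not in self.revs]

    def bulk_docs(self, docs):
        # Stands in for Bulk Docs with new_edits false: the revision IDs
        # supplied by the caller are kept unchanged.
        for doc_id, rev, body in docs:
            if (doc_id, rev) not in self.revs:
                self.put(doc_id, rev, body)


def replicate(source, target, rep_id, batch_size=100):
    """Copy missing revisions from source to target; return how many."""
    checkpoint = target.local.get(rep_id, 0)
    changes = source.changes_since(checkpoint)
    copied = 0
    for start in range(0, len(changes), batch_size):
        batch = changes[start:start + batch_size]
        pairs = [(doc_id, rev) for _, doc_id, rev in batch]
        missing = target.revs_diff(pairs)
        docs = [(d, r, source.revs[(d, r)]) for d, r in missing]
        target.bulk_docs(docs)
        copied += len(docs)
        # The Checkpoint advances even when some revisions were skipped.
        # Assumption: it is recorded on both peers.
        source.local[rep_id] = batch[-1][0]
        target.local[rep_id] = batch[-1][0]
    return copied
```

Running replicate twice in a row copies nothing the second time, because the stored Checkpoint makes the changes feed start after the last processed sequence.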
Note
- Even if some revisions have been ignored, their sequence should still
be taken into consideration for the Checkpoint.
- To compare the ordering of non-numeric sequences, you will have to keep
an ordered list of the sequence IDs as they appear in the _changes
feed and compare their indices.
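The index comparison described in the note can be sketched as follows. This is an illustration only: the function name make_seq_comparator and the sample sequence strings are hypothetical, and the sketch assumes every sequence being compared has already been seen in the feed.

```python
def make_seq_comparator(seen_sequences):
    """Build a comparator from sequence IDs in the order the feed gave them."""
    # Map each opaque sequence ID to its position in the feed.
    index = {seq: i for i, seq in enumerate(seen_sequences)}

    def compare(a, b):
        # Returns -1 if a came first, 1 if b came first, 0 if equal.
        return (index[a] > index[b]) - (index[a] < index[b])

    return compare


# Hypothetical opaque sequence IDs, listed in feed order.
feed_order = ["3-g1AAA", "1-zzz", "10-abc"]
compare = make_seq_comparator(feed_order)
```

Here compare("3-g1AAA", "10-abc") is negative because the first ID appeared earlier in the feed, even though neither string nor numeric comparison of the IDs would give a meaningful order.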
4.2.5. Filter replication
Replication can be filtered by passing the filter parameter, set to a
function name, to the changes feed. The function is then called on each
change. If the function returns true, the document is added to the
feed.
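The effect of a filter can be modelled in a few lines. This sketch only illustrates the semantics described above; it does not show how a filter function is defined or registered in CouchDB. The function names and the shape of the change rows (a dictionary with seq, id, and doc keys) are assumptions made for the example.

```python
def filtered_changes(changes, filter_fn):
    """Keep only the changes whose document passes the filter function."""
    return [change for change in changes if filter_fn(change["doc"])]


def only_posts(doc):
    # Hypothetical filter: accept documents whose "type" field is "post".
    return doc.get("type") == "post"


# Hypothetical change rows, in feed order.
changes = [
    {"seq": 1, "id": "a", "doc": {"type": "post"}},
    {"seq": 2, "id": "b", "doc": {"type": "comment"}},
    {"seq": 3, "id": "c", "doc": {"type": "post"}},
]
visible = filtered_changes(changes, only_posts)
```

Only the documents that pass the filter reach the replicator, so the Target never sees the revisions of the filtered-out documents.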
4.2.6. Optimisations
- The system should run the steps in parallel to reduce latency.
- The number of revisions passed to steps 3 and 6 should be large
enough to reduce both bandwidth use and latency.
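One possible way to combine both optimisations is to group items into batches and hand the batches to a pool of workers. The helper names batches and process_in_parallel, the batch size, and the worker count are illustrative assumptions; the worker would be whatever function performs one step for one batch.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_in_parallel(items, size, worker, max_workers=4):
    """Apply `worker` to each batch concurrently, keeping batch order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the order the batches were submitted.
        return list(pool.map(worker, batches(items, size)))
```

Larger batches mean fewer requests, while the pool lets slow requests overlap. Suitable values depend on document size and network conditions.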
4.2.7. API Reference
- HEAD /{db} – Check Database existence
- POST /{db}/_ensure_full_commit – Ensure that all changes are stored
on disk
- GET /{db}/_local/{id} – Read the last Checkpoint
- PUT /{db}/_local/{id} – Save a new Checkpoint
Pull Only
- GET /{db}/_changes – Locate changes on Source since the last pull.
The request uses the following query parameters:
- style=all_docs
- feed=feed, where feed is normal or
longpoll
- limit=limit
- heartbeat=heartbeat
- GET /{db}/{docid} – Retrieve a single Document from Source with attachments.
The request uses the following query parameters:
- open_revs=revid, where revid is the actual Document Revision at the
moment of the pull request
- revs=true
- atts_since=lastrev
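The pull requests listed above can be assembled with the standard library. This is a sketch under stated assumptions: the function names changes_url and document_url and the example host are hypothetical, the since parameter comes from the Algorithm section rather than this list, and encoding the open_revs and atts_since values as JSON arrays is an assumption about how the server expects them.

```python
import json
from urllib.parse import quote, urlencode


def changes_url(db_url, since, feed="normal", limit=None, heartbeat=None):
    """Build the GET /{db}/_changes URL used to locate changes on Source."""
    params = {"style": "all_docs", "feed": feed, "since": since}
    if limit is not None:
        params["limit"] = limit
    if heartbeat is not None:
        params["heartbeat"] = heartbeat
    return f"{db_url}/_changes?{urlencode(params)}"


def document_url(db_url, doc_id, revid, lastrev=None):
    """Build the GET /{db}/{docid} URL for one revision of a Document."""
    # Assumption: revision values are sent as JSON arrays.
    params = {"open_revs": json.dumps([revid]), "revs": "true"}
    if lastrev is not None:
        params["atts_since"] = json.dumps([lastrev])
    # Percent-encode the document ID so it is safe as a path segment.
    return f"{db_url}/{quote(doc_id, safe='')}?{urlencode(params)}"


# Hypothetical Source database URL.
SOURCE = "http://localhost:5984/source"
```

For example, changes_url(SOURCE, 42, limit=100) yields a URL whose query string carries style, feed, since, and limit, in that order.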