Data Synchronization Primer
Step by step introduction to data synchronization strategies and solutions
In my long career of software developer (but I’m still young:)) I have often dealt with issues like these:
- “This wonderful app works great when you’re connected to internet (and bandwidth is pretty large) but is a mess when connection is intermittent or, worse, stops working at all when is disconnected”.
- Back-end systems, designed to be responsive even when managing GB’s of records, often require some forms of caching (for example using a NoSql database in front of an RDMS): how keep synchronized cache and main db?
- PWA or Progressive Web Application, depending on how are implemented, could require some sort of local data storage so that it’s able to display and manage already downloaded data allowing users to work even when browser is disconnected from internet.
In above scenarios and many other you may have faced we, as software engineers, are required to synchronize data between a central data source and a local data storage or between two or more database replicas.
In past I’ve often based my solutions on Microsoft Synchronization Framework (https://msdn.microsoft.com/en-us/library/mt763482.aspx): a powerful and flexible .NET toolkit written ten years ago, for windows platforms and especially designed to work well with SQL Server and SQL Server Compact Edition. Years are gone, world is no more “Windows-only”, Microsoft itself is a completely different company playing a big role in opensource world: it’s not surprising that this as other technologies are left behind. Nevertheless many concepts introduced with that framework are still valid and applicable to today solutions.
Database or more generally data synchronization tools help developers to keep synchronized and consistent data between 2 or more data sources, given them a standardized way of dealing with a recurrent pattern avoiding custom in-house solutions.
When synchronizing we can identify 2 components involved:
- Source data store: is the component which initiates the synchornization.
- Destination or target data store: is the component which the source “wants” to synchronize to.
Generally any store, in any configuration can act as Source or Destination: we can consider synchronization a bidirectional process. When synchronizing more than 2 data store you can always split the process in one or more 2-component synchronization.
Following I’m listing 3 general rules that a data store should satisfy before it can be synchronized:
- It should maintain a version of data: any changes to any record/file etc must increment in some way a counter/timestamp (in a transaction) so that one can always detect which changes were made to the data store in order to move its data from one version to the next.
- It should allows a developer to get changes occurred to the data store between two versions. These ‘changes’ must returned in a format public and documented.
- It should allows a developer to apply changes that were read from another source. It also should have a developer friendly way to handle data conflicts.
Every data store may implement different change tracking strategies: some have native implementation like Microsoft SQL Server (https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/track-data-changes-sql-server?view=sql-server-2017). In most RDMS anyway, stored procedures, triggered by add/edit/delete operations on records, could track changes at same time.
List of changes occurred to data is usually maintained in specific internal structure (like hidden or system tables/columns/files).
For capacity reasons these hidden structures are cleaned up with regularity.
Considering that, it’s easy to understand why data stores can “get” or “apply” changes for/to version that are “not too old”.
This is usually called Retention Period and essentially means how back in time you can go in term of versions and still be able to get or apply changes to the data repository.
In this article I’ll show you which steps are involved when synchronizing 2 data source. Finally I’ll present a synchronization library I’ve developed in .net core. Same concepts, of course, can be implemented in any language/OS with any tools you may like.
Let’s take an example that, for simplicity, deals with local-central database synchronization.
Say we are working to a Medium clone, so we have a “Posts” table in a central database. We have an app that should let users edits their post even when disconnected so we created a local database with a “Posts” table: app will update local database and synchronize back changes to central database. Of course changes occurred to central database (for example because same user edits its post via web app) should be reflected locally.
Synchronization process should be initiated from app and should take these steps:
- Read local database changes since last synchronization
- Send changes to server that applies them to database
- Handle any conflict may occur
- Read changes from central db since last time client synchronized and send back to app
- Apply changes received from server to local database
Initially both databases contains a table for Post like this:
Posts
+----+-------+---------+-------+
| Id | Title | Content | Claps |
+----+-------+---------+-------+
and a table to track changes that can be something like following:
Posts-Changes
+---------+----+----+
| Version | Id | Op |
+---------+----+----+
Lastly local database should save version of server from last successfully synchronization; it should be used every time it get or apply data from/to server. Initially this value is set to 0.
Synchronizations
+----------------+
| Server Version |
+----------------+
| 0 |
+----------------+
User makes a Post using app (notes that Id is an unique id):
Posts local-db
+----------+----------+-----------------+-------+
| Id | Title | Content | Claps |
+----------+----------+-----------------+-------+
| 3ea52d6c | My Title | My amazing post | 0 |
+----------+----------+-----------------+-------+
Posts-Changes
+---------+----------+----+
| Version | Id | Op |
+---------+----------+----+
| 1 | 3ea52d6c | I |
+---------+----------+----+
Synchronizations
+----------------+
| Server Version |
+----------------+
| 0 |
+----------------+
Synchronization is initiated from app (source) towards central database (destination).
As first step, it get changes from local database starting from version 0 to version 1: it should be something like:
{
"Server-Version": 0,
"Changes": [
{
"Operation" : "I",
"Id" : "3ea52d6c",
"Title" : "My Title",
"Content" : "My amazing post",
"Claps" : 0
}]
}
Server receive above message and applies changes to its database:
Posts central-db
+----------+----------+-----------------+-------+
| Id | Title | Content | Claps |
+----------+----------+-----------------+-------+
| 3ea52d6c | My Title | My amazing post | 0 |
+----------+----------+-----------------+-------+
Posts-Changes
+---------+----------+----+
| Version | Id | Op |
+---------+----------+----+
| 1 | 3ea52d6c | I |
+---------+----------+----+
A couple of important notes:
- Server can apply changes because Server-Version proposed by client is “not too old”; actually is the current version of central database
- There is no conflict: actually no conflicts can occur while we have Insert operations. As we’ll see in a moment, conflicts can only be triggered when Update or Delete operations are applied.
Server get changes from central database since version 1: no changes are detected and this is the message going back to client:
{
"Server-Version": 1
}
Server version on local database is updated as well.
Synchronizations local-db
+----------------+
| Server Version |
+----------------+
| 1 |
+----------------+
To make things a bit more complicated, suppose user edit his post on app while a couple of other Medium-clone users “claps” his article:
Posts local-db
+----------+----------+-----------------------------+-------+
| Id | Title | Content | Claps |
+----------+----------+-----------------------------+-------+
| 3ea52d6c | My Title | My amazing post with images | 0 |
+----------+----------+-----------------------------+-------+
Posts-Changes
+---------+----------+----+
| Version | Id | Op |
+---------+----------+----+
| 1 | 3ea52d6c | I |
| 2 | 3ea52d6c | U |
+---------+----------+----+Posts central-db
+----------+----------+-----------------+-------+
| Id | Title | Content | Claps |
+----------+----------+-----------------+-------+
| 3ea52d6c | My Title | My amazing post | 2 |
+----------+----------+-----------------+-------+
Posts-Changes
+---------+----------+----+
| Version | Id | Op |
+---------+----------+----+
| 1 | 3ea52d6c | I |
| 2 | 3ea52d6c | U |->first clap
| 3 | 3ea52d6c | U |->second clap
+---------+----------+----+
When synchronization starts this is the message posted to server:
{
"Server-Version": 1,
"Changes": [
{
"Operation" : "U",
"Id" : "3ea52d6c",
"Title" : "My Title",
"Content" : "My amazing post with images",
"Claps" : 0
}]
}
Changes are applied but server soon realize that record with Id=3ea52d6c has been changed since version 1 (version knew by client): in other words that record has a version higher than 1:
Posts left join Posts-Changes central-db
+----------+-------+---------+-------+---------+----+
| Id | Title | Content | Claps | Version | Op |
+----------+-------+---------+-------+---------+----+
| 3ea52d6c | ... | ... | 2 | 3 | U |
+----------+-------+---------+-------+---------+----+
A conflict can be of 3 different types:
- Update-Update: occur when a data store is trying to update a record with an higher version i.e. it has been updated since last time client has read it.
- Update-Delete: occur when a data store wants to update a record that has been removed
- Delete-Update: occur when a data store is going to delete a record with an higher version, i.e. it has been updated since last synchronization.
There can be different strategies to handle conflicts and mainly depends on developer choices:
- Server-wins: Automatically let the server or target data store keeps their data as it is.
Using this approach an update-update conflict is resolved not updating the record, an update-delete and a delete-update simply ignoring it. - Client-wins: Automatically and forcefully apply changes: update-update conflict is resolved updating the record, update-delete resolution actually causes a record restore with data coming from client, finally the delete-update resolves in a record deletion. All the operations causes server version to upgrade.
- Manually: let developer handle conflicts according to application/cases strategies.
In our example I would follow last approach so to be able to merge changes coming from client app with central database record:
Posts central-db
+----------+----------+-----------------------------+-------+
| Id | Title | Content | Claps |
+----------+----------+-----------------------------+-------+
| 3ea52d6c | My Title | My amazing post with images | 2 |
+----------+----------+-----------------------------+-------+
Posts-Changes
+---------+----------+----+
| Version | Id | Op |
+---------+----------+----+
| 1 | 3ea52d6c | I |
| 2 | 3ea52d6c | U |->first clap
| 3 | 3ea52d6c | U |->second clap
| 4 | 3ea52d6c | U |->content change
+---------+----------+----+
As last step server get changes from central database using version posted by client (that is 1) to current server version (that is 3).
This time it find 3 changes: first clap, second clap and content change. Of course 2 or more consecutive updates to same record can be collapsed in a message like the following:
{
"Server-Version": 4,
"Changes": [
{
"Operation" : "U",
"Id" : "3ea52d6c",
"Title" : "My Title",
"Content" : "My amazing post with images",
"Claps" : 2
}]
}
Finally app applies changes to its local database:
Posts local-db
+----------+----------+-----------------------------+-------+
| Id | Title | Content | Claps |
+----------+----------+-----------------------------+-------+
| 3ea52d6c | My Title | My amazing post with images | 2 |
+----------+----------+-----------------------------+-------+
Posts-Changes
+---------+----------+----+
| Version | Id | Op |
+---------+----------+----+
| 1 | 3ea52d6c | I |
| 2 | 3ea52d6c | U |
| 3 | 3ea52d6c | U |
+---------+----------+----+
Synchronizations
+----------------+
| Server Version |
+----------------+
| 4 |
+----------------+
Now we have our happy writer looking at those 2 new claps!
Recently I’ve developed a small .NET core library called CoreSync (https://github.com/adospace/CoreSync) that implement most of the synchronization process.
CoreSync uses a plug-in approach to allow synchronization between different data sources: currently I’ve developed providers for SQL Server and SQLite.
Install also from nuget:
Install-Package CoreSync -Version 0.1.12-beta
Install-Package CoreSync.Sqlite -Version 0.1.12-beta
Install-Package CoreSync.SqlServer -Version 0.1.12-beta
There are other important aspects to deal with while synchronizing, like data filtering: it can be the subject for a second part of the article.
I hope you enjoined my first story on Medium, please reach me here or on GitHub if have any question.
✉️ Subscribe to CodeBurst’s once-weekly Email Blast, 🐦 Follow CodeBurst on Twitter, view 🗺️ The 2018 Web Developer Roadmap, and 🕸️ Learn Full Stack Web Development.