How to Automate your DWH Deployments
Efficient development processes are based on automated processes to test, integrate, and deploy applications into production. For typical applications, this is a relatively easy task; for database applications, it is harder. This article explains why, and how to establish such a process successfully.
Some time ago, I wrote about the Challenges in Automating Multi-Schema Database Deployments. It was about why automated database deployments are crucial to efficient development processes and why they are challenging to establish with multi-schema applications.
After implementing many automated DWH deployment processes, I have summarized the most important steps to get there. I also explain why technological changes alone are not enough: process changes are needed as well, and they are crucial.
A deployment process from hell
Manual deployments are deployment processes that require manual steps to install your software into production. Those manual steps can be:
- Emailing the install scripts to other people (lack of centralized repository).
- Manual execution of installation scripts (lack of automated pipelines).
- Copying files around (lack of centralized repository).
- Sending emails (lack of automated processes).
- People outside the team execute the installation (lack of ownership).
The picture above shows an example of a manual deployment process. A team member sends an installation package over email to the DBA. The DBA manually installs the scripts on the target database and then emails the team member to say whether the deployment was successful. If it wasn’t, they attach a log file with the error. The team member then has to start looking for the problem and sends a fix to the DBA again…
Such a deployment process has many problems:
- Involving other teams is very time-consuming. The development team members spend long periods waiting, which makes the process very expensive when repeated often.
- Manual executions are prone to errors. They are not repeatable and therefore don’t reduce risk throughout the deployment pipeline.
- Automated testing is not possible because no mechanism ensures automatic test execution.
Multi-Schema Database Applications
Multi-Schema Database Applications are applications that include more than one database schema. They share data or access data across schemas but are physically on the same database instance. Typically this is seen in data warehouses as shown in the picture below where each non-grey box shows a database schema. Those schemas are used by different applications that share and exchange data (DWH Core and Marts).
If a data warehouse contains separate schemas for its layers, there is most likely one schema that contains all the metadata and configuration data for the ETL pipelines and the data loads. Standard functionality like logging is usually installed in such a separate schema as well.
This approach leads to an application that stretches across multiple schemas and is therefore called a multi-schema database application.
The Data Marts will most likely also have some conformed dimensions that are physically referenced from multiple tables. They provide consistent structures, attributes, values, and definitions for shared concepts like time and dates, and are therefore used across many fact tables. So even if the individual data marts could be split into separate schemas, they would all still depend on the conformed dimensions.
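To make those cross-schema dependencies concrete, here is a minimal sketch of how a deployment tool might compute a safe installation order for the schemas. All schema names and the dependency map are hypothetical; a real DWH would derive them from its own structure.

```python
from graphlib import TopologicalSorter

# Hypothetical multi-schema DWH: each schema lists the schemas it depends on.
# STAGE and METADATA stand alone; CORE reads from both; the marts depend on
# CORE and on the conformed dimensions shared by all marts.
schema_dependencies = {
    "STAGE": set(),
    "METADATA": set(),
    "CORE": {"STAGE", "METADATA"},
    "DIM_CONFORMED": {"CORE"},
    "MART_SALES": {"CORE", "DIM_CONFORMED"},
    "MART_FINANCE": {"CORE", "DIM_CONFORMED"},
}

def deployment_order(deps):
    """Return an order in which the schemas can be installed safely,
    so every schema is deployed after everything it depends on."""
    return list(TopologicalSorter(deps).static_order())

print(deployment_order(schema_dependencies))
```

The point of the sketch is that a multi-schema application cannot be deployed schema by schema in an arbitrary order; the dependencies between schemas must be part of the deployment logic.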
Advantages of automated deployment
The goal of continuous integration and automated deployments is to achieve stable releases with high quality. Eliminating repetitive manual tasks through automation saves time, and the improved stability and quality lead to less unplanned work. The automation also saves money, because developers’ time can go into creating actual value for the product we are working on.
In their book “Accelerate: The Science of Lean Software and DevOps”, Nicole Forsgren, Jez Humble, and Gene Kim published the research findings from their “State of DevOps Reports”. They found that high performance in development teams is possible with all kinds of systems and is independent of the technologies they use.
“High performance is possible with all kinds of systems and is independent of the technologies the teams use!” — Accelerate: The Science of Lean Software and DevOps
Another interesting finding is that where code deployments are most painful, one finds the poorest organizational performance and culture. So if your deployment process hurts, most likely the culture around it isn’t the best either.
“Where code deployments are most painful, you’ll find the poorest organizational performance and culture” — Accelerate: The Science of Lean Software and DevOps
The third finding I want to highlight is that software delivery performance positively impacts organizational performance. This finding implies that with automation and better development and delivery processes, not only do the software teams improve their performance but the whole organization profits from those improvements.
“Software delivery performance positively impacts organizational performance!” — Accelerate: The Science of Lean Software and DevOps
Three stages of optimizing your database deployment
If you want to improve your continuous integration and delivery process, certain preconditions have to be established. The most basic one concerns how changes are handled. If multiple changes are bundled into one change package (see the image below — left stage), this is the first thing to optimize: every change in the application code needs to be treated as an individual change, for example one change or deployment per implemented feature. This way, failing features don’t stop the successful features from getting installed in production. This splitting leads to the second stage in the image below. The third step is automating the deployment of all those changes.
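The move from one big change package to independent per-feature changes can be sketched as follows. The change list and the `apply_change` stand-in are hypothetical; the point is only that each feature is installed on its own, so one broken feature never blocks the others.

```python
def apply_change(change):
    """Placeholder for running one feature's install scripts.
    Raises on failure, like a real script execution would."""
    if change["sql"] is None:  # stand-in for a broken install script
        raise RuntimeError(f"installation of {change['feature']} failed")

def deploy_independently(changes):
    """Deploy each feature's change package on its own, so a single
    failure doesn't stop the other features from reaching production."""
    results = {}
    for change in changes:
        try:
            apply_change(change)
            results[change["feature"]] = "deployed"
        except RuntimeError:
            results[change["feature"]] = "failed"
    return results

changes = [
    {"feature": "add_customer_dim", "sql": "ALTER TABLE ..."},
    {"feature": "broken_feature", "sql": None},
    {"feature": "add_sales_fact", "sql": "CREATE TABLE ..."},
]
# broken_feature fails on its own; the other two features still deploy
print(deploy_independently(changes))
```

With one bundled package, the same broken script would have made the whole deployment fail, including the two healthy features.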
How to get there
But how to accomplish automated deployments? There are two crucial things to implement:
- Establish the practices of Continuous Integration
Martin Fowler describes very well what is necessary to implement a continuous integration process that’s robust and reliable.
- Build Deployment Pipelines
Pipelines are needed to automate the installations all the way to production. Three kinds of pipelines are most likely required, as described in the image below.
Continuous Integration Pipelines ensure that changes on feature branches are deployable and don’t break existing functionality.
If your application is a database, a build is the same thing as a deployment.
Nightly Build Pipelines ensure the health of the files in your main branch and verify that the whole application is installable from scratch.
Deployment Pipelines give you the functionality to deploy automatically through your environments, from test to integration to production, in a safe and repeatable way. Always the same way.
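The promotion logic of such a deployment pipeline can be sketched in a few lines. The environment names and the `deploy`/`smoke_test` hooks are assumptions; in practice these would call your actual installation scripts and automated tests.

```python
ENVIRONMENTS = ["test", "integration", "production"]

def deploy(release, environment):
    """Placeholder: run the same automated installation scripts
    in every environment, never a hand-crafted variant."""
    print(f"deploying {release} to {environment}")
    return True  # in reality: execute scripts, capture logs, report status

def smoke_test(release, environment):
    """Placeholder: automated checks that must pass before promotion."""
    return True

def promote(release):
    """Push one release through all environments, always the same way.
    Stop at the first failure so a broken release never reaches production."""
    for env in ENVIRONMENTS:
        if not deploy(release, env) or not smoke_test(release, env):
            return f"{release} stopped at {env}"
    return f"{release} in production"

print(promote("release-1.4.0"))
```

Because every environment is handled by the same code path, a release that reaches production has already been installed and verified the same way twice, which is exactly what makes the process repeatable and low-risk.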
More about the DevOps Report Studies can be found on the DORA Research Program website.