First let me tell you a story.
Imagine you're an account manager trying to help a customer solve their problem. It seems to be a bug in the software system. They use a version that is over a year old (a couple of releases back) with some custom features and client-specific configuration. So far, every attempt to replicate the bug has been futile.
The idea is to re-create the system setup exactly as the customer has it, so we can reproduce the problem, debug the system and provide a fix. However, the person who installed it no longer works at the company, the 100-page operating manual is out of date and there's no record of what exactly has been customized.
Does it sound familiar?
Basically, that was the situation our customer was dealing with. People tried to cope by producing those 100+ page operating manuals, writing down how they configured the system (if they still remembered what worked after the 8th try), and gathering fact sheets and diagrams that might even reflect how the system is set up.
Documentation is great, but who likes to write it? Not to mention keeping it up to date and verifying that what is written there works?
I'm going to explain how using DevOps tools changed the way people approached their work and how they felt about it. Then I'm going to unveil how we did it.
In the previous life our account manager would need to ask the people responsible for the various components to assemble them the way this particular customer's setup looked. That assumes he was lucky enough to actually find out the configuration and feature customizations, of course. If someone was on holiday, our poor AM was out of luck and had to wait.
By using DevOps tools our AM could do all of that himself. All versions of the software components were stored in an artifact repository. Even if deleted, they could be re-created from source code exactly as they were, thanks to reproducible builds on the CI server. Configuration was stored alongside the code, so the full setup was in the source code repository.
Assembling the customer-specific setup was as easy as checking out the branch with the configuration code for this customer and running the release pipeline for that branch on one of the testing environments. Scripts would deploy the correct component versions and apply the customer-specific configuration automatically.
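To make that step a bit more tangible, here is a minimal sketch of what such a script might do with the configuration found on a customer branch. Everything in it is an assumption made for illustration: the layout of the configuration (a "versions" and a "settings" section), the function name build_deploy_plan and the registry address are hypothetical and not taken from our actual pipeline.

```python
def build_deploy_plan(customer_config: dict, registry: str) -> dict:
    """Turn a customer's configuration into a list of pinned images plus settings.

    The config layout ("versions" and "settings" keys) is a hypothetical
    example, not the format used in the real project.
    """
    versions = customer_config.get("versions", {})

    # Refuse anything that is not pinned to an exact version, otherwise the
    # setup could not be re-created later exactly as the customer has it.
    unpinned = sorted(name for name, v in versions.items() if not v or v == "latest")
    if unpinned:
        raise ValueError(f"components without a pinned version: {', '.join(unpinned)}")

    images = sorted(f"{registry}/{name}:{version}" for name, version in versions.items())
    return {"images": images, "settings": dict(customer_config.get("settings", {}))}


if __name__ == "__main__":
    # Hypothetical content of a customer branch, already loaded into a dict.
    config = {
        "versions": {"api": "2.3.1", "web": "2.1.0"},
        "settings": {"feature_x": True},
    }
    plan = build_deploy_plan(config, registry="registry.example.com")
    for image in plan["images"]:
        print("would deploy", image)
```

The point of the sketch is only that exact versions and settings live in the branch as data, so the same plan can be rebuilt at any time by anyone.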
Better yet, such a configuration branch can be turned into a pull request and handed to someone for review. There's no need to check whether the code and configuration are valid - automatic tools on the CI server have already done that. It's more about asking others whether those exact changes should have been applied for this customer's use case in the first place.
At first, using source control and coding tools was quite overwhelming for people like account managers. However, we learned fast that sometimes it's easier to read YAML configuration than to wade through a 100+ page manual to find one specific setting. Asking someone for a review also turned out to be a great learning and collaboration experience.
Hunting an esoteric bug for a customer makes a good story, but it's not everyday work. Normally we'd release much smaller features, fixes and improvements on a day-to-day basis. The configuration of the system also evolved at a steady pace. Since the scope of each change was small, a release was no longer a scary process.
The fact that it was done on a daily basis also contributed to increased confidence. The typical flow - create a pull request with the changes, let CI validate it automatically, have peers review it, and finally approve, merge and deploy - became routine.
By incorporating a review process we gain more than an additional verification before deploy. It is also a learning and collaboration event - both parties are able to contribute improvements. I saw improvements and simplifications introduced as a result of the change review process.
The great benefit of the DevOps approach is that if something is wrong you learn about it early in the process. You no longer need to wait weeks to see if a change yields the expected result on the production setup. Personally, I used to forget why a change had been made by the time I got feedback, months later, on whether it worked or not. If it had not, the work had to be started over.
A fast feedback loop is the essence of the DevOps approach. A typo in configuration is caught by automatic validation within minutes. Misapplication of some setting is pointed out during review within a few hours. The automatic deploy routine tells you whether the change worked the same day.
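As an illustration of the first, fastest loop, here is a rough sketch of the kind of check a CI job could run against a configuration file to catch typos. It is not the validator we actually used; the setting names in ALLOWED_SETTINGS and the function validate_config are made up for the example.

```python
from difflib import get_close_matches

# Hypothetical schema: allowed setting names and their expected types.
ALLOWED_SETTINGS = {
    "replicas": int,
    "image": str,
    "log_level": str,
}


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the config passed."""
    problems = []
    for key, value in config.items():
        if key not in ALLOWED_SETTINGS:
            # Suggest the closest known name so a typo is easy to fix.
            hint = get_close_matches(key, list(ALLOWED_SETTINGS), n=1)
            suggestion = f" (did you mean '{hint[0]}'?)" if hint else ""
            problems.append(f"unknown setting '{key}'{suggestion}")
        elif not isinstance(value, ALLOWED_SETTINGS[key]):
            expected = ALLOWED_SETTINGS[key].__name__
            problems.append(f"setting '{key}' should be of type {expected}")
    return problems


if __name__ == "__main__":
    # A config with a typo in one key; a CI job would fail the build on any problem.
    for problem in validate_config({"replicsa": 3, "image": "app:1.4.2"}):
        print(problem)
```

A check like this is cheap to run on every pull request, which is what turns a typo into a minutes-long detour instead of a failed deploy.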
The change review process saved me many times from breaking the system or doing something stupid because I had missed something. Code review comments are an invaluable way of learning about the system and about your craft too. The necessary bit is to set your ego aside and take feedback as it is. Simple, but not easy.
In the DevOps flow, you (the owner of the change) are responsible for deploying it. This is how you learn how the system actually works or in which ways it breaks. Observing deployment progress and the behaviour of the system (i.e. how metrics and logs change right after a deploy) becomes second nature after a while.
Observability is the new hot trend in the DevOps world. But observability without an observer is just an empty slogan.
Lead time reduced from months to days
Such a dramatic shift in delivery pace was the observed benefit of the approach we took.
The cultural change was enabled by the DevOps practices and tools we used.
Here are the tools we used and how we used them.
Our rule of thumb was that the whole infrastructure setup had to be automated, no exceptions. It started with virtual machines and networks in the AWS cloud that were spun up by Terraform - a cloud-agnostic infrastructure as code tool. We chose Terraform because customer infrastructure could be on various providers: in some cases it was cloud, in others it was on-premise dedicated or virtual machines.
Once the basic virtual machine was running, Ansible applied the operating system configuration and installed the necessary tooling. We kept that layer thin, with only a few small roles in Ansible. This decision improved maintainability and security by keeping the attack surface narrow.
The heavy lifting was done by the Docker Swarm orchestrator. Every application had a dedicated Docker image with all its runtime dependencies, and Swarm managed the workload over a fleet of VM nodes.
Why Docker Swarm? Back then it was under heavy development at Docker Inc. Swarm was much simpler than Kubernetes, and back then K8s was not that fully featured yet. However, in 2021 I'd discourage using Docker Swarm for a greenfield project.
Infrastructure as Code
Having all the infrastructure as code in one big repository was a big enabler for collaboration and shared responsibility. Gone were the days of the "this is my special machine, so don't touch it" approach. Transparency was a key value in fighting the knowledge silos that would otherwise form.
An additional benefit was the ability to view and compare all the development and testing environments at once. It helped our teams wrap their heads around which feature was being deployed or tested at which stage. Our QA engineers developed their own tools to make it easier.
Even though the infrastructure was pretty big and complex, there was only one dedicated person to manage all of it. Well, that's not entirely true - the whole team was responsible for the software and the environment it ran on. What I'm trying to say is: thanks to efficient DevOps tools, one high-class specialist was enough to manage it.
As DevOps states, there's no distinction between dev and ops, so everyone was encouraged to perform configuration changes, deploy and contribute to the infrastructure setup. And this is what happened. I personally added the ability to verify the artifact version before it gets deployed, because I missed that feature in the
People with various skills installed, configured and operated our system. That's why we focused on tools that are easy to reason about. The majority of the setup was YAML configuration files for Docker Swarm. Those have exactly the same format as Docker Compose files, and Docker Compose makes it very easy to install the system on any machine running Docker.
It was a similar story with the deploy.sh Bash script. It basically encodes the steps an operator could type on the machine themselves. With additional comments it became executable documentation that was run every day - so we made sure it worked. Gone are the days of re-typing commands from the operating manual only to find out they do not work in this version of the system.
Clear separation of the layers (VMs via Terraform, OS via Ansible, apps via Docker Swarm) made it easy for customers to pick and choose how much of it they wanted to use. That was extremely important for some closed setups where public cloud was out of the question.
The bonus point I wanted to mention was a script that generated release notes from code and source control metadata. That was yet another benefit of sticking to the rule that every commit message should reference a Jira ticket, so the changelog in the release notes could be generated from that data. The installation instructions were also a copy-paste of the scripts we had prepared. Great documentation with minimal effort.
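For the curious, the changelog part can be sketched in a few lines. This is not our original script, just a minimal illustration under a few assumptions: commit subjects arrive as a plain list of strings (for example collected with git log --format=%s), ticket keys look like PROJ-123, and the names build_changelog and render_release_notes are invented for the example.

```python
import re

# Jira-style ticket key: uppercase project prefix, a dash, then a number.
TICKET_PATTERN = re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b")


def build_changelog(commit_subjects):
    """Group commit subjects by the first ticket key found in each of them."""
    entries = {}
    for subject in commit_subjects:
        match = TICKET_PATTERN.search(subject)
        # Commits that break the "always reference a ticket" rule are kept
        # visible under a separate label instead of being dropped.
        key = match.group(1) if match else "(no ticket)"
        entries.setdefault(key, []).append(subject)
    return entries


def render_release_notes(version, commit_subjects):
    """Render a plain-text changelog section for the given version."""
    lines = [f"Release {version}"]
    for ticket, subjects in build_changelog(commit_subjects).items():
        lines.append(f"{ticket}:")
        lines.extend(f"  - {subject}" for subject in subjects)
    return "\n".join(lines)


if __name__ == "__main__":
    subjects = [
        "PROJ-12 Fix login timeout",
        "Bump base image",
        "PROJ-12 Add regression test",
        "OPS-7 Rotate certificates",
    ]
    print(render_release_notes("1.4.2", subjects))
```

The discipline in commit messages does the real work here; the script only groups and prints what is already recorded in source control.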
Is DevOps approach worth it?
Well, it depends. Such extensive automation pays off for any medium-sized project (with dozens of people involved). For large projects I'd dare to say it is a requirement: otherwise we lose so much time on basic things that are repeated every day (doing manual deploys, reading operational docs, finding where things are, etc.). On the other hand, we've applied the same principles (without extensive tooling) on a small team (5-7 devs) and got most of the benefits too.
Beyond the technical and process advantages of DevOps I'd say: do it for the people! This way you gain a more engaged team that feels empowered and responsible for the product. In turn, that leads to many learning experiences fed by honest feedback and increased confidence. I must admit it's a pleasure to work on a team in such an environment.