<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Adam on DevOps]]></title><description><![CDATA[Documenting my learning of SRE / DevOps topics like: Kubernetes, Docker, AWS, Cloud, etc. Writing on software from (A)nalysis to (T)ermination.]]></description><link>https://adambrodziak.pl</link><generator>RSS for Node</generator><lastBuildDate>Sun, 17 May 2026 23:42:31 GMT</lastBuildDate><atom:link href="https://adambrodziak.pl/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Log retention in ELK stack]]></title><description><![CDATA[Developers kept complaining that they can't find recent logs in Kibana. It happened before for many reasons (worthy of another post), but this time was different. There was no evident problem with the log structure or FluentD log shipper anymore.
We'...]]></description><link>https://adambrodziak.pl/log-retention-in-elk-stack</link><guid isPermaLink="true">https://adambrodziak.pl/log-retention-in-elk-stack</guid><category><![CDATA[logstash]]></category><category><![CDATA[elasticsearch]]></category><category><![CDATA[logging]]></category><category><![CDATA[Bash]]></category><category><![CDATA[elk]]></category><dc:creator><![CDATA[Adam Brodziak]]></dc:creator><pubDate>Sun, 07 May 2023 15:32:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683473248840/9abb5384-14aa-4bd0-a99f-38e84defc5e1.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Developers kept complaining that they can't find recent logs in Kibana. It happened before for many reasons (worthy of another post), but this time was different. There was no evident problem with the log structure or FluentD log shipper anymore.</p>
<p>We noticed that one app went haywire and started sending logs like crazy. Because of that, disks on Elasticsearch nodes got full and ES started to reject new logs. When a disk gets full, Elasticsearch switches every index that has shards on that node into read-only mode.</p>
<h2 id="heading-curator-for-logstash">Curator for Logstash</h2>
<p>Curator is a solution that allows you to set how many days you want to keep your log indexes in Elasticsearch. That is the most popular configuration.</p>
<p>Keeping the last 2 weeks of logs using Curator was our setup. Normally it worked just fine. With a predictable log influx, it can be managed with Curator. You can calculate how many days of logs to keep so as not to overflow the storage on ES nodes.</p>
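<p>As a rough illustration of that calculation, here is a minimal sketch. The function name and the sample numbers are assumptions made up for this example, not values from our cluster, and it ignores replicas and index overhead.</p>

```shell
#!/bin/bash
# Hypothetical helper: estimate how many full days of logs fit on disk.
# Arguments: total disk size in GB, usable share in percent, daily log volume in GB.
max_retention_days() {
    local disk_gb=$1 usable_pct=$2 daily_gb=$3
    # Integer division rounds down, which errs on the safe side.
    echo $(( disk_gb * usable_pct / 100 / daily_gb ))
}

# Example: 1000 GB of storage, stay below 85% usage, 50 GB of logs per day.
max_retention_days 1000 85 50   # prints 17
```

<p>The estimate only holds while the daily volume stays close to the assumed value, which is exactly what breaks when an app starts flooding the cluster with logs.</p>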
<p>The problem is when more events are arriving due to some problem with the app or some kind of DoS attack for example. In such case, fresh logs are the most valuable to detect the attack progress or how an outage is spreading across the system.</p>
<p>The other option is to remove an index (or apply any other supported action) once it grows to a certain size in gigabytes. That gets closer to an ideal scenario where we maximize disk space utilization. It only works, though, if you know the disk size for the whole cluster upfront and are willing to manage those values across clusters (dev, test, prod).</p>
<p>What I wanted was simple. Keep <em>as many logs</em> as available space allows, but <em>do not drop log events</em> when disks are full (more on that later).</p>
<p>Curator does not have such a mode, unfortunately. I've been looking around for alternatives but found nothing.</p>
<h2 id="heading-the-solution-is-bash-script">The solution is a Bash script</h2>
<p>Fortunately, checking disk usage on nodes is fairly easy with the Elasticsearch API. So with a few <code>curl</code> calls and a sprinkle of Bash scripting, here's the solution to avoid losing data because of a full disk.</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
<span class="hljs-comment"># Newline\tab as only separator, required for for loop</span>
IFS=$<span class="hljs-string">'\n\t'</span>
<span class="hljs-comment"># Fail on first error</span>
<span class="hljs-built_in">set</span> -euo pipefail

ELASTIC_URL=<span class="hljs-variable">${ELASTIC_URL:=localhost:9200}</span>
<span class="hljs-comment"># At 90% usage ES will try to move shards to other nodes. See `disk.watermark.high` in docs.</span>
DISK_WATERMARK=88

NODES_UTILIZATION=$(curl --fail-with-body -s -X GET <span class="hljs-string">"<span class="hljs-variable">$ELASTIC_URL</span>/_cat/allocation?h=disk.percent&amp;pretty"</span>)
<span class="hljs-keyword">for</span> DISK_USAGE <span class="hljs-keyword">in</span> <span class="hljs-variable">$NODES_UTILIZATION</span>; <span class="hljs-keyword">do</span>
    <span class="hljs-keyword">if</span> [ <span class="hljs-string">"<span class="hljs-variable">$DISK_USAGE</span>"</span> -gt <span class="hljs-string">"<span class="hljs-variable">$DISK_WATERMARK</span>"</span> ]; <span class="hljs-keyword">then</span>
        OLDEST_INDEX=<span class="hljs-string">"<span class="hljs-subst">$(curl --fail-with-body -s -X GET <span class="hljs-string">"<span class="hljs-variable">$ELASTIC_URL</span>/_cat/indices/logstash-*?h=index&amp;s=index"</span> | head -n 1)</span>"</span>
        curl --fail-with-body -s -X DELETE <span class="hljs-string">"<span class="hljs-variable">$ELASTIC_URL</span>/<span class="hljs-variable">$OLDEST_INDEX</span>"</span>
        <span class="hljs-built_in">exit</span> 0
    <span class="hljs-keyword">fi</span>
<span class="hljs-keyword">done</span>
</code></pre>
<p>As you can see, it is pretty straightforward. One caveat: it uses the <code>--fail-with-body</code> option, which was added in curl 7.76.0, so it might not be available in older Linux distributions. The option is there only to show the error response from the ES server for debugging.</p>
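<p>If the script has to run on hosts with an older curl, one option is to pick the flag based on the installed version. The sketch below is an illustration only: <code>version_ge</code> and <code>curl_fail_flag</code> are hypothetical helper names, and it assumes plain numeric <code>MAJOR.MINOR.PATCH</code> version strings.</p>

```shell
#!/bin/bash
# Returns success when version $1 is greater than or equal to version $2.
# Both arguments are expected in numeric MAJOR.MINOR.PATCH form, e.g. 7.76.0.
version_ge() {
    local IFS=.
    local -a have=($1) want=($2)
    local i
    for i in 0 1 2; do
        # 10# forces base 10 so that a part like "08" is not read as octal.
        if (( 10#${have[i]:-0} > 10#${want[i]:-0} )); then return 0; fi
        if (( 10#${have[i]:-0} < 10#${want[i]:-0} )); then return 1; fi
    done
    return 0
}

# Prints the strictest failure flag that the given curl version supports.
curl_fail_flag() {
    if version_ge "$1" "7.76.0"; then
        echo "--fail-with-body"
    else
        echo "--fail"
    fi
}

curl_fail_flag "7.68.0"   # prints --fail
curl_fail_flag "7.76.0"   # prints --fail-with-body
```

<p>The installed version is the second field on the first line of <code>curl --version</code>. With plain <code>--fail</code> the script still stops on HTTP errors, it just does not print the error body.</p>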
<h2 id="heading-run-script-periodically">Run script periodically</h2>
<p>Logstash indexes are created daily. Actually, the index name follows <code>logstash-YYYY-MM-DD</code> format by default. This is also the assumption in the script above in <code>_cat/indices/logstash-*</code> GET query.</p>
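<p>The date in the index name is what makes picking the oldest index so cheap: with zero-padded dates, alphabetical order is also chronological order. The script asks Elasticsearch to do the sorting, but the same idea can be sketched locally. The function name and the sample index names below are made up for illustration.</p>

```shell
#!/bin/bash
# Reads index names from stdin, one per line, and prints the oldest one.
# Relies on zero-padded dates, so lexical order equals chronological order.
oldest_index() {
    LC_ALL=C sort | head -n 1
}

# Sample names in no particular order.
printf '%s\n' logstash-2023-05-07 logstash-2023-04-30 logstash-2023-05-01 | oldest_index
# prints logstash-2023-04-30
```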
<p>However, to make the script effective it should run more often than once a day. The reason is that some app could go haywire with logging and fill up the storage in the evening. In such a case we would lose the data on what happened around that failure.</p>
<p>The solution is simple. Make the script <strong>run by cron every hour</strong>. It worked for us flawlessly.</p>
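<p>For reference, an hourly crontab entry could look like the one printed by the sketch below. The script path and the log file are hypothetical, so adjust them to wherever the script lives on your host.</p>

```shell
#!/bin/bash
# Hypothetical install path and log file; adjust to your environment.
SCRIPT_PATH="/usr/local/bin/es-free-disk.sh"
LOG_PATH="/var/log/es-free-disk.log"

# Five cron fields: minute 0 of every hour, every day.
CRON_LINE="0 * * * * $SCRIPT_PATH >> $LOG_PATH 2>&1"
echo "$CRON_LINE"
```

<p>The printed line can be pasted into the editor opened by <code>crontab -e</code>. Appending output to a log file keeps the error responses from ES around for later debugging.</p>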
<h2 id="heading-why-such-disk-usage-values">Why such disk usage values?</h2>
<p>Why delete an index when circa 90% disk usage is reached? It is related to how Elasticsearch behaves when very little storage space is left.</p>
<p>The <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/8.7/modules-cluster.html#disk-based-shard-allocation">official Elasticsearch docs</a> are not very clear, so let me briefly explain what happens when usage reaches a given level for default values.</p>
<p>Assuming on any given node disk is being filled:</p>
<ul>
<li><p>at 85% - ES will stop allocating shards to that node, see <code>disk.watermark.low</code> setting.</p>
</li>
<li><p>at 90% - ES will try to re-allocate shards to other nodes, see <code>disk.watermark.high</code> setting.</p>
</li>
<li><p>at 95% - ES enforces read-only index block, see <code>disk.watermark.flood_stage</code> setting.</p>
</li>
</ul>
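<p>The three thresholds above can be expressed as a small lookup. This is only a sketch of the default values from the list, not of how Elasticsearch implements the checks; the function name is made up and treating each threshold as inclusive is an assumption.</p>

```shell
#!/bin/bash
# Default watermark percentages from the list above; clusters can override them.
WATERMARK_LOW=85
WATERMARK_HIGH=90
WATERMARK_FLOOD=95

# Prints the stage a node is in for a given disk usage percentage (integer).
disk_stage() {
    local usage=$1
    if (( usage >= WATERMARK_FLOOD )); then
        echo "flood_stage"   # read-only index block is enforced
    elif (( usage >= WATERMARK_HIGH )); then
        echo "high"          # ES tries to move shards away from the node
    elif (( usage >= WATERMARK_LOW )); then
        echo "low"           # no new shards are allocated to the node
    else
        echo "ok"
    fi
}

# Walk through a few sample values to see where each stage starts.
for pct in 80 85 88 90 95; do
    echo "$pct% -> $(disk_stage "$pct")"
done
```

<p>Note that the <code>DISK_WATERMARK=88</code> used in the cleanup script falls into the "low" band, so the script acts before ES starts relocating shards.</p>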
<p>Staying below 90% is the goal here, but even that might not help. Imagine one node's disk is over 90%, so ES will try to move shards, but it will fail. Most likely other nodes will be over 85% already, so allocation is blocked. That is for equal disk sizes and shards being spread evenly - something to strive for anyway.</p>
<p>Let's assume Elasticsearch could move a shard from a node that is filling up to another one. Now think of the load caused by moving a giant slab of data (shards with logs are pretty big) from one ES node to another. Such an operation can grind the cluster to a halt. We don't want that.</p>
<p>To be on the safe side we target 85% disk usage then. If that level is reached nothing active is being done, the node is just cordoned (to use Kubernetes lingo). Elasticsearch will not try to shuffle shards around and we have some room to spare before it does.</p>
<h2 id="heading-is-it-that-simple">Is it that simple?</h2>
<p>Well yes, but actually no ;) The idea behind it is so brilliantly simple that I was sure somebody had already implemented it. However, I have found nothing, not even a post on some obscure blog ;)</p>
<p>On the other hand, log aggregation in the ELK stack is not an easy job. You have to define the log event structure, decide what is indexed and what is not, and create a template for indexes. That is if you have control over logging clients in apps; if not, it gets much worse.</p>
<p>On top of that shard replicas, hot and cold indexes, and archived indexes are probably on your mind too. That's a lot and something that deserves another blog post. Let me know if you're interested.</p>
<p>Make your logs like Pokemons. Gotta Catch 'Em All</p>
]]></content:encoded></item><item><title><![CDATA[IT maturity levels]]></title><description><![CDATA[This post was written for my colleagues that only ever worked at software house. It was supposed to be short and sweet, but giving a kick for self-reflection too.
Why does a company need IT?
Today we're going to look at IT as a whole, a bit more broa...]]></description><link>https://adambrodziak.pl/it-maturity-levels</link><guid isPermaLink="true">https://adambrodziak.pl/it-maturity-levels</guid><category><![CDATA[IT]]></category><category><![CDATA[Business Technology]]></category><category><![CDATA[business]]></category><category><![CDATA[Investment]]></category><dc:creator><![CDATA[Adam Brodziak]]></dc:creator><pubDate>Mon, 03 Oct 2022 18:09:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1664820487128/jUXfNQsW9.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post was written for my colleagues that only ever worked at software house. It was supposed to be short and sweet, but giving a kick for self-reflection too.</p>
<h2 id="heading-why-does-a-company-need-it">Why does a company need IT?</h2>
<p>Today we're going to look at IT as a whole, a bit more broadly than just software development. As it happens, modern IT is a business unit that is supposed to (at the very least) support the organization in executing its strategy. Software development itself is not a business unit, unless we're talking about a company that sells software. But that's not what this post is about...</p>
<p>So what else counts as IT:</p>
<ul>
<li>systems maintenance</li>
<li>support (helpdesk)</li>
<li>proxy (purchasing, licensing)</li>
<li>hardware</li>
</ul>
<p>There may be more, depending on the size of the company.</p>
<h2 id="heading-it-maturity-levels">IT maturity levels</h2>
<p>But let's focus on this: what does IT (and therefore software) give to an organization to fulfill its mission? That depends on the level of development of that IT. From the beginning.</p>
<h3 id="heading-level-0-no-it">Level 0: No IT</h3>
<p>It's 2022 and some businesses work without any IT involvement. Reality check.</p>
<h3 id="heading-level-1-it-is-a-cost">Level 1: IT is a cost</h3>
<p>A typical situation: a company needs a server to run a website that someone built. The company pays and that's it. The sad reality.</p>
<p>More investment in IT = more cost.</p>
<h3 id="heading-level-2-it-cuts-costs">Level 2: IT cuts costs</h3>
<p>Newly purchased invoicing software means that accounting has less to do and more time to drink coffee. Since coffee is expensive, this leads to a reduction in FTEs. The result is a reduction in costs for the company.</p>
<p>IT expenditures only make sense up to the amount of expected savings.</p>
<h3 id="heading-level-3-it-makes-a-profit">Level 3: IT makes a profit</h3>
<p>The implementation of an online e-commerce store has been a success and customers are buying vacuum cleaners like crazy. Salespeople have less work to do because the customer chooses the model and color himself. No one gets fired because salespeople work on commission.</p>
<p>Returns on IT expenditures diminish, following the law of decreasing marginal utility.</p>
<h3 id="heading-level-4-it-creates-a-new-market">Level 4: IT creates a new market.</h3>
<p>Our mobile app lets passenger and driver figure out where they are and where they are going. Customers no longer want to wait for a questionably fresh cab that is who knows where. Salespeople stroll around Facebook making viral videos, accountants transfer profits to tax havens.</p>
<p>More expenses = more revenue (to some point, of course).</p>
<h2 id="heading-well-wheres-the-software-development">Well, where's the software development?</h2>
<p>I don't know if you've noticed, but it's only at Level 4 that there is software that is owned by the company. For an organization like Uber, software is not only a competitive advantage, but is even essential to the company's existence. In other words: Uber would not be possible without their proprietary, unique system.</p>
<p>The other levels have been simplified. For each of the problems in Levels 1-3, it is possible to find a better or worse existing solution that can be bought. Sure, the cash may be in the millions, but the product is already in place and possibly needs to be implemented. I'm talking about all those SAS, SAP and similar businesses.</p>
<h2 id="heading-what-does-this-mean-for-a-software-house">What does this mean for a software house?</h2>
<p>That's a very good question :) I myself am curious about your opinions on what level a software house (SH) operates at. Specifically, at what level are the projects you work on? A separate question is what level the SH wants to be at. Feel free to comment!</p>
]]></content:encoded></item><item><title><![CDATA[Cloud-Native Platforms]]></title><description><![CDATA[When talking about prominent technology trends it's good to ask ourselves who is going to benefit from that. In the case of Cloud-Native Platforms those are billion-dollar businesses (AWS, GCP, Azure), owned by trillion-dollar organisations (Amazon, ...]]></description><link>https://adambrodziak.pl/cloud-native-platforms</link><guid isPermaLink="true">https://adambrodziak.pl/cloud-native-platforms</guid><category><![CDATA[cloud native]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[containers]]></category><category><![CDATA[platforms]]></category><dc:creator><![CDATA[Adam Brodziak]]></dc:creator><pubDate>Fri, 19 Aug 2022 15:19:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1660922247826/tzvfKr5Eg.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When talking about prominent technology trends it's good to ask ourselves who is going to benefit from that. In the case of Cloud-Native Platforms those are billion-dollar businesses (AWS, GCP, Azure), owned by trillion-dollar organisations (Amazon, Alphabet, Microsoft). Promoting cloud-native meme is in the best interest of their shareholders.</p>
<p>But the cloud-native platform, in practice, is a modern way to build complex distributed systems. Truth be told, it has more to do with advanced system architecture and practical software engineering than with the cloud. It just so happens that those systems are being deployed to the cloud nowadays. Is that going to happen the same way in the future?</p>
<p>The cloud revolution was an important breakthrough. The timeline is unfortunate though: cloud offerings started circa 2005, while containerization exploded a decade later. In fact it's the containers that offered the scalability, portability and cost-efficiency that VMs or cloud instances promised, but could not deliver.</p>
<p>Experienced corporations are starting to realize the cost of cloud and the risks related to vendor lock-in. As a result, companies have started to revisit their IT strategies regarding cloud. Some decided to invest in their own data centers, using software-based networking and container orchestrators as main building blocks. That is one of the trends.</p>
<p>Another trend started with the multi-cloud approach. Currently, implementing such a solution is complicated, because various cloud providers have incompatible APIs. There's hope for a standardized layer on top of those, called sky computing. I'd like to see that happening, but I dare to ask a question: is it in the best interest of the cloud behemoths?</p>
<p>Why bother investing in hardware sitting in some data centre or building a portability layer between cloud offerings? The so-called vendor lock-in is not only about the cloud outages that we've experienced in the last few months. It's also about decisions where cloud providers refuse to host your business due to political reasons. Politics change, so who knows what the future line of thought would be?</p>
<p>That brings us to what stands behind the cloud-native moniker: distributed architecture and solid software engineering. When choosing your next software solution vendor, ask yourself whether they want to sell a specific cloud solution, or rather understand how to build a resilient distributed system that is independent and gives you freedom.</p>
]]></content:encoded></item><item><title><![CDATA[DevOps skills for Medium, Senior and Architect levels]]></title><description><![CDATA[I've been asked to prepare an outline of DevOps skills for different levels of experience (and salary). To be honest I have no idea how those should be, as DevOps is such a broad area that it's almost impossible. Also DevOps is a little bit different...]]></description><link>https://adambrodziak.pl/devops-skills-for-medium-senior-and-architect-levels</link><guid isPermaLink="true">https://adambrodziak.pl/devops-skills-for-medium-senior-and-architect-levels</guid><category><![CDATA[Devops]]></category><category><![CDATA[skills]]></category><category><![CDATA[learning]]></category><category><![CDATA[Career]]></category><dc:creator><![CDATA[Adam Brodziak]]></dc:creator><pubDate>Sun, 13 Feb 2022 19:21:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1644779945260/AH1dD0gtJ.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I've been asked to prepare an outline of DevOps skills for different levels of experience (and salary). To be honest I have no idea how those should be, as DevOps is such a broad area that it's almost impossible. Also DevOps is a little bit different, so the skills described below are biased towards what we do at my company <a target="_blank" href="https://www.future-processing.com/">Future Processing</a>. In this list I've focused on technical skills, because for "soft skills" we've got separate matrix applicable for all positions.</p>
<p>The levels in question could be defined as follows (just to give you some context):</p>
<ul>
<li><strong>Medium</strong> has some experience and is able to deliver simple, well-defined tasks. Usually requires mentoring from more experienced colleagues.</li>
<li><strong>Senior</strong> is someone who can work on their own, being able to deliver projects rather than just tasks.</li>
<li><strong>Architect</strong> can design and deliver complex projects and also advise, train and mentor colleagues and clients.</li>
</ul>
<p>As you may notice, the <strong>Junior</strong> level is missing, on purpose. I strongly believe there's no <em>Junior DevOps</em> but rather <em>Junior Ops</em> or <em>Junior Dev</em> that gains experience. After some time they can transition to <em>Medium DevOps</em> to earn their stripes. Such a process is something that I went through myself and see in the wild.</p>
<p>Most of the skills listed for a particular level apply to the level above too. So an Architect should know everything a Senior does. In some cases I've used the same skill to describe the expected change of attitude or understanding when evaluating for promotion.</p>
<p>Next to each skill there's a grade stating how important certain skill is:</p>
<ul>
<li><strong>(5)</strong> means critical, must have on this level.</li>
<li><strong>(3)</strong> means important, but can live without if other skills are strong.</li>
<li><strong>(1)</strong> means optional, nice to have but not required.</li>
</ul>
<h2 id="heading-development">Development</h2>
<p>Dictionary for the terms used:</p>
<ul>
<li>Programming language is any general purpose language like Python, Java or Go.</li>
<li>Software development life cycle (SDLC) is all the steps needed to deliver a software product including: design, development, build, test, deploy.</li>
<li>Git actually means any (distributed) version control system, but it's shorter just to write git ;)</li>
</ul>
<h3 id="heading-medium">Medium</h3>
<ul>
<li>(3) Knows at least basics of one programming language.</li>
<li>(3) Knows what SDLC is and can explain steps.</li>
<li>(1) Is aware of Agile and Waterfall approach to SDLC.</li>
<li>(3) Can use the basic git commands (commit, pull, push).</li>
</ul>
<h3 id="heading-senior">Senior</h3>
<ul>
<li>(3) Is fluent in at least one programming language.</li>
<li>(1) Can point out problems in existing software development life cycle.</li>
<li>(1) Is able to judge if project is Agile or Waterfall and explain the consequences.</li>
<li>(3) Knows various branching models in git and how to use them.</li>
</ul>
<h3 id="heading-architect">Architect</h3>
<ul>
<li>(5) Took active part in writing a complex, enterprise-grade system.</li>
<li>(5) Can read code and provide a hotfix in more than one scripting language.</li>
<li>(3) Can propose and design optimizations in software development life cycle.</li>
<li>(1) Can design SDLC in the Agile or Waterfall approach.</li>
<li>(5) Knows git-flow is a lie and trunk-based development is the way to achieve proper CI setup.</li>
</ul>
<h2 id="heading-cicd">CI/CD</h2>
<p>Dictionary:</p>
<ul>
<li>CI/CD stands for Continuous Integration / Continuous Delivery (or Deployment) - an approach to test, build and deliver software.</li>
<li>CI system (Continuous Integration) is software (or a service) that runs tests, builds and other automation around SDLC. Examples are Jenkins, GitLab, GitHub Actions, AWS CodeBuild.</li>
<li>Package manager is a system to manage dependencies (i.e. libraries, modules) during software development, used to build a software artifact. Examples are Maven or Gradle in Java/JVM, NPM or Yarn for JavaScript or TypeScript.</li>
</ul>
<h3 id="heading-medium">Medium</h3>
<ul>
<li>(5) Knows the CI/CD terms.</li>
<li>(5) Can set up a simple CI pipeline (few steps) based on an existing setup.</li>
<li>(3) Can use at least one package manager.</li>
</ul>
<h3 id="heading-senior">Senior</h3>
<ul>
<li>(5) Can tell the difference between Continuous Integration and Delivery.</li>
<li>(5) Can design and deliver CI pipeline (with many steps) based on requirements.</li>
<li>(3) Can use many package managers and is fluent in at least one.</li>
</ul>
<h3 id="heading-architect">Architect</h3>
<ul>
<li>(5) Can differentiate between Continuous Delivery and Deployment and advise which one is better.</li>
<li>(3) Can propose and build CI setup (many pipelines) based on developer team needs.</li>
<li>(3) Can point out common pitfalls in package manager usage and optimize it.</li>
</ul>
<h2 id="heading-observability">Observability</h2>
<ul>
<li>Log aggregation is the process of gathering logs from many nodes (typically in a cluster) into a single place for processing. Examples are ELK stack (Elasticsearch, Logstash, Kibana), Grafana Loki, Splunk, AWS CloudWatch.</li>
<li>Metrics visualization is a way to use graphs to display application or system metrics over time. Example tools are Grafana, Kibana.</li>
<li>Alert is a notification of some issue (incident) in the system.</li>
<li>Runbook is a tutorial that describes how to react to an alert to troubleshoot or mitigate an incident.</li>
<li>Trace is detailed information about what caused a software issue (i.e. exception stack trace). Examples of distributed tracing software are OpenTelemetry, Jaeger, Grafana Tempo.</li>
</ul>
<h3 id="heading-medium">Medium</h3>
<ul>
<li>(5) Knows the benefits of log aggregation and can use it.</li>
<li>(5) Can use metrics visualization to troubleshoot some problems.</li>
<li>(3) Knows how to react to an alert using existing runbooks and can update / create runbooks.</li>
<li>(1) Knows what a trace is and why it is useful.</li>
</ul>
<h3 id="heading-senior">Senior</h3>
<ul>
<li>(5) Can build a log aggregation system using a self-hosted solution based on a provided design.</li>
<li>(5) Can build metrics visualization in some popular tool.</li>
<li>(3) Can tell the difference between a log event and a metric and is able to advise the dev team on that.</li>
<li>(3) Can advise on actions that will prevent an alert from firing on known issues.</li>
<li>(1) Uses traces to discover problems with the software to help dev teams fix them.</li>
</ul>
<h3 id="heading-architect">Architect</h3>
<ul>
<li>(5) Can provide many designs of log aggregation and explain their pros and cons.</li>
<li>(5) Can design and advise on system or application metrics, where tracking those will prevent some issues.</li>
<li>(3) Can advise the dev team on alerts that will help them react before an issue happens.</li>
<li>(1) Can set up a trace aggregation system.</li>
</ul>
<h2 id="heading-containers">Containers</h2>
<p>Definitions:</p>
<ul>
<li>Container is a way to run application in a cloud-native way. Example container runtimes are Docker and containerd.</li>
<li>Orchestrator is a system that manages containers in a cluster. Examples are Kubernetes, ECS (Elastic Container Service).</li>
<li>VM stands for Virtual Machine, a way to isolate apps before containers came along.</li>
</ul>
<h3 id="heading-medium">Medium</h3>
<ul>
<li>(5) Understands containers and how they work. Can run an app in a container.</li>
<li>(3) Knows there is an orchestrator and what it does.</li>
<li>(3) Understands the difference between Docker container and image.</li>
<li>(1) Understands the difference between container and VM.</li>
</ul>
<h3 id="heading-senior">Senior</h3>
<ul>
<li>(5) Knows one container runtime in-depth (i.e. attaching volumes, config options, health-check).</li>
<li>(5) Can set up and manage an orchestrator in production.</li>
<li>(5) Understands the difference between Docker container and image.</li>
<li>(1) Understands the difference between container and VM.</li>
</ul>
<h3 id="heading-architect">Architect</h3>
<ul>
<li>(5) Can advise the dev team how containers should be built and run (i.e. 12factor app).</li>
<li>(3) Can design a platform based on orchestrator that is the best suited based on requirements.</li>
<li>(3) Understands that container is just a Linux process with some degree of isolation.</li>
</ul>
<h2 id="heading-operations">Operations</h2>
<p>Dictionary:</p>
<ul>
<li>Shell script is a Bash or PowerShell script. Used to automate actions.</li>
<li>System signals are a way to manage processes by the kernel. Examples are SIGTERM, SIGKILL, SIGHUP.</li>
</ul>
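<p>To make the signals entry more concrete, here is a minimal sketch of a Bash process that handles SIGTERM instead of dying mid-task. The <code>worker</code> function is hypothetical and sends the signal to itself only to keep the example self-contained; it assumes Bash 4 or newer because of <code>BASHPID</code>.</p>

```shell
#!/bin/bash
# Hypothetical worker: cleans up when it receives SIGTERM instead of dying mid-task.
worker() {
    trap 'echo "caught SIGTERM, cleaning up"; exit 0' TERM
    # Send SIGTERM to this very process to simulate an orchestrator stopping it.
    kill -TERM "$BASHPID"
    # The trap runs right after the kill command, so this line is never printed.
    echo "not reached"
}

# Run the worker in a subshell so the trap and the exit do not affect the caller.
( worker )
```

<p>SIGKILL, by contrast, cannot be trapped, which is why orchestrators usually send SIGTERM first and give the process some time to finish.</p>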
<h3 id="heading-medium">Medium</h3>
<ul>
<li>(5) Can write a simple shell script.</li>
<li>(5) Can check state of processes on the system.</li>
<li>(3) Knows system signals and how to use them.</li>
</ul>
<h3 id="heading-senior">Senior</h3>
<ul>
<li>(5) Can write significantly complex scripts (i.e. deployment pipeline), using conditionals and loops.</li>
<li>(3) Is able to debug process (i.e. logs, strace).</li>
<li>(3) Can advise dev team how to handle system signals in the app.</li>
</ul>
<h3 id="heading-architect">Architect</h3>
<ul>
<li>(3) Knows when shell script is not enough and proper programming language should be used.</li>
<li>(3) Can advise if distributed tracing solution would help.</li>
<li>(3) Is able to propose a fix in code based on signals not being handled.</li>
</ul>
<h2 id="heading-networks-and-cloud">Networks and cloud</h2>
<ul>
<li>DNS stands for Domain Name System and covers how name resolution works.</li>
<li>RBAC stands for Role-Based Access Control. Examples are Kubernetes RBAC or IAM Roles.</li>
<li>Proxy is software that passes network traffic. HTTP reverse proxy is a layer 7 proxy, it can act as a load balancer too. Layer 7 refers to ISO/OSI network model.</li>
<li>IaaS, PaaS and SaaS are Infrastructure / Platform / Software as a Service.</li>
</ul>
<h3 id="heading-medium">Medium</h3>
<ul>
<li>(5) Knows how DNS works, for the basic name resolution.</li>
<li>(5) Understands the concept of RBAC and can explain rules evaluation.</li>
<li>(3) Knows the concept of proxy (i.e. HTTP reverse proxy) and load balancer.</li>
<li>(3) Knows the difference between IaaS, PaaS and SaaS.</li>
</ul>
<h3 id="heading-senior">Senior</h3>
<ul>
<li>(5) Knows all the bits that are involved in DNS name resolution (i.e. in Linux).</li>
<li>(5) Can set up RBAC rules according to specification and good practices (i.e. the least possible permissions).</li>
<li>(5) Can set up a proxy server using some popular tools (i.e. Nginx).</li>
<li>(3) Knows the difference between layer 4 and layer 7 load balancer (or proxy).</li>
<li>(3) Knows various classes of cloud offerings (i.e. object storage, load balancer) and how to use them.</li>
</ul>
<h3 id="heading-architect">Architect</h3>
<ul>
<li>(3) Understands the performance implications of DNS in a large cluster.</li>
<li>(5) Is able to design RBAC setup (roles, policies, groups) for the whole system (i.e. cluster).</li>
<li>(3) Can design a cluster using both layer 4 and layer 7 proxies or load balancers where appropriate.</li>
<li>(1) Can leverage PaaS / SaaS offerings (i.e. load balancers) to achieve business goals (i.e. time-to-market, cost reduction).</li>
</ul>
<h2 id="heading-infrastructure-as-code">Infrastructure as Code</h2>
<p>Dictionary:</p>
<ul>
<li>Infrastructure as Code (IaC) is a concept to keep setup and configuration of infrastructure as code in git, so it can be developed just as applications are.</li>
<li>Provisioning is a way to reach the desired state of software and configuration on the server. Example tools are Ansible, Puppet, cloud-init. Those tools are not used in the era of container orchestration, but the concept is still relevant.</li>
<li>Configuration management is about providing config for app (in many ways), but also a system to deliver and distribute the configuration.</li>
<li>12factor is a set of good practices for modern apps https://12factor.net/</li>
</ul>
<h3 id="heading-medium">Medium</h3>
<ul>
<li>(5) Knows IaC concept and its benefits.</li>
<li>(5) Understands the server should be managed (spin up, provision) automatically, by some tool.</li>
<li>(3) Can use various ways to configure app (i.e. env vars, files).</li>
</ul>
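<p>As an illustration of configuring an app through env vars, here is a minimal sketch. The variable names <code>APP_PORT</code> and <code>APP_LOG_LEVEL</code> and their defaults are hypothetical, chosen only for this example.</p>

```shell
#!/bin/bash
# Hypothetical settings: read from the environment, with defaults for local runs.
print_config() {
    local port="${APP_PORT:-8080}"
    local log_level="${APP_LOG_LEVEL:-info}"
    echo "port=$port log_level=$log_level"
}

# Without any variables set, the defaults apply.
print_config
# A value from the environment overrides the default for that run only.
APP_PORT=9090 print_config
```

<p>The same artifact can then run unchanged on dev, test and prod, with only the environment being different.</p>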
<h3 id="heading-senior">Senior</h3>
<ul>
<li>(5) Can prevent common pitfalls of IaC setup using tools and processes (i.e. git branches, code review).</li>
<li>(5) Is able to manage server (spin up, provision) automatically, by some tool.</li>
<li>(5) Understands configuration is separate from app deliverable (12factor app).</li>
<li>(3) Can set up configuration delivery using an existing tool (i.e. orchestrator or some open-source or SaaS).</li>
</ul>
<h3 id="heading-architect">Architect</h3>
<ul>
<li>(5) Can advise on GitOps drawbacks and put guardrails into place.</li>
<li>(3) Understands the provisioning tools do not fit the cloud-native era. Can advise on how to migrate out of them.</li>
<li>(3) Can design configuration management system for the cluster.</li>
<li>(1) Can coach dev teams to adhere to config best practices (12factor app).</li>
</ul>
<h2 id="heading-others">Others</h2>
<ul>
<li>DORA metrics (DevOps Research &amp; Assessment) are 4 basic metrics of software delivery performance, proven to work using scientific approach.</li>
</ul>
<h3 id="heading-medium">Medium</h3>
<ul>
<li>(1) Understands DevOps role in SDLC.</li>
</ul>
<h3 id="heading-senior">Senior</h3>
<ul>
<li>(3) Understands how DevOps practices affect SDLC.</li>
</ul>
<h3 id="heading-architect">Architect</h3>
<ul>
<li>(1) Can use metrics (e.g. DORA) to lead and track SDLC optimization.</li>
</ul>
<h2 id="heading-tell-me-what-do-you-think">Tell me what you think</h2>
<p>All of the skills and the grades are highly subjective, I realise that. Also, as I mentioned, they reflect the environment my company operates in, so it (DevOps, skills, grades) might be completely different for you. Even so, I'd like to know your opinion and I'm open to feedback on what I'm missing, what is wrong or unclear. Feel free to leave a comment or hit me up on <a target="_blank" href="https://twitter.com/AdamBrodziak">Twitter</a> :)</p>
]]></content:encoded></item><item><title><![CDATA[Wasted 10 years with Bash]]></title><description><![CDATA[Originally I  published this 3 years ago on Medium, but keeping a copy here too.
When you're thinking of shell on any Linux system you probably think Bash (Bourne Again Shell). On Windows too actually, as Git Bash is quite popular (despite it has got...]]></description><link>https://adambrodziak.pl/wasted-10-years-with-bash</link><guid isPermaLink="true">https://adambrodziak.pl/wasted-10-years-with-bash</guid><category><![CDATA[Bash]]></category><category><![CDATA[zsh]]></category><category><![CDATA[shell]]></category><category><![CDATA[cli]]></category><category><![CDATA[terminal]]></category><dc:creator><![CDATA[Adam Brodziak]]></dc:creator><pubDate>Wed, 06 Oct 2021 15:09:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1633532754366/JMczU14dY.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Originally I  <a target="_blank" href="https://adambrodziak.medium.com/wasted-10-years-with-bash-a7f6cb480419">published this 3 years ago on Medium</a>, but keeping a copy here too.</em></p>
<p>When you're thinking of a shell on any Linux system you probably think of Bash (Bourne Again Shell). On Windows too actually, as Git Bash is quite popular (despite its ugly terminal emulator) and WSL (Windows Subsystem for Linux) uses Bash by default too. At least those have quite modern 4.x versions of Bash, while macOS ships an outdated 3.2 version and there are tutorials on how to <a target="_blank" href="https://itnext.io/upgrading-bash-on-macos-7138bd1066ba">upgrade Bash on Mac to 5.0</a>, which was released days ago.</p>
<p>The Bash 5.0 upgrade on Mac looks quite complicated and can yield unexpected errors, which raises the question: why not use a better shell instead?</p>
<h2 id="my-history-with-bash">My history with Bash</h2>
<p>My story with Linux started in 2007 and for 10 straight years I've been using Bash by default, as it came with Ubuntu. It's not <em>that bad</em>, as Ubuntu at least ships autocomplete configuration by default, so you can use Tab to complete a command or its params. Just try to type <code>cd /ho&lt;Tab&gt;</code> to see it in action. I learned that very quickly and blessed Ubuntu for <a target="_blank" href="https://www.tecmint.com/install-and-enable-bash-auto-completion-in-centos-rhel/">making it work for me</a>.</p>
<p>Over the years I've learned <a target="_blank" href="https://zwischenzugs.com/2018/01/06/ten-things-i-wish-id-known-about-bash/">some useful things</a>, like Ctrl+R to search history or Ctrl+K to clear the rest of the line. I did not know any better (unconscious incompetence), because for me shell==Bash. Even after finding out that there are other shells and that relying on <a target="_blank" href="https://linux.die.net/man/1/checkbashisms">Bash quirks in scripts is bad practice</a>, I didn't look for greener pastures. Somehow it did not even occur to me to alter Bash's stupid defaults (case sensitive file completion - why?!), besides maybe increasing the history buffer (huge productivity boost BTW). There is a <a target="_blank" href="https://github.com/mrzool/bash-sensible">sensible Bash configuration</a> that you should definitely check out!</p>
<h2 id="fish-the-user-friendly-command-line-shell">Fish - The user-friendly command line shell</h2>
<p>It was late 2017 when I stumbled upon <a target="_blank" href="https://fishshell.com/">Fish Shell</a>. The tipping point was the aptly named <a target="_blank" href="https://jvns.ca/blog/2017/04/23/the-fish-shell-is-awesome/">The fish shell is awesome</a> post by Julia Evans. Then I watched a few <a target="_blank" href="https://www.youtube.com/watch?v=g_HoW4iek2Q">videos</a> and was totally hooked. Fish comes with all the bells and whistles out of the box, set up for you. No need to configure anything, install plugins, etc.</p>
<p>As you can see Fish (on the right) has got <a target="_blank" href="https://www.youtube.com/watch?v=0NOAUogSSMo">the best tab-completion</a> compared to ZSH and Bash.</p>
<p>Of course Fish community created <a target="_blank" href="https://github.com/jorgebucaran/awesome-fish">plugins for more advanced features</a> - I can recommend the following:</p>
<ul>
<li><a target="_blank" href="https://github.com/rafaelrinaldi/pure">rafaelrinaldi/<strong>pure</strong></a> - Pure-fish port of <a target="_blank" href="https://github.com/sindresorhus/pure">sindresorhus/pure</a> prompt</li>
<li><a target="_blank" href="https://github.com/franciscolourenco/done">franciscolourenco/<strong>done</strong></a> - Automatically receive notifications when a long process finishes</li>
<li><a target="_blank" href="https://github.com/jethrokuan/z">jethrokuan/<strong>z</strong></a> - Pure-fish <a target="_blank" href="https://github.com/rupa/z">rupa/z</a>-like directory jumping</li>
</ul>
<p>Notice those plugins are <em>Pure-fish</em> implementations of 3rd party scripts. The reason is: Fish offers sane scripting (<a target="_blank" href="https://twitter.com/fishpkg/status/1087159872561414145">unlike other shells</a>, some say), but it's not POSIX compliant. This is probably the biggest drawback of using Fish. Over the years I've accumulated oneliners, scripts and habits from Bash (e.g. using <code>&amp;&amp;</code> to join commands), and most of them were POSIX-compliant. Which led me to ZSH...</p>
<p>Note: <a target="_blank" href="https://github.com/fish-shell/fish-shell/releases/tag/3.0.0">Fish 3.0.0</a>, released a month ago, added <code>&amp;&amp;</code> support. It is that important :)</p>
<h2 id="zsh-z-shell-designed-for-interactive-use">ZSH (Z shell) - designed for interactive use</h2>
<p>ZSH was recommended to me by my colleagues at work as a better Bash alternative, because it is largely POSIX-compatible. That means my Bash habits still work: <code>&amp;&amp;</code> to join commands, the way of exporting ENV variables, storing command output in a var using backticks, etc.</p>
<p>Since I had been using Fish before, I started to look for plugins that replicate features that Fish provides out of the box. To get some of the Fish coolness install the following:</p>
<ul>
<li><a target="_blank" href="https://github.com/zsh-users/zsh-autosuggestions">zsh-autosuggestions</a> - Fish-like fast/unobtrusive autosuggestions for zsh.</li>
<li><a target="_blank" href="https://github.com/zsh-users/zsh-syntax-highlighting">zsh-syntax-highlighting</a> - Fish shell-like syntax highlighting for Zsh.</li>
<li><a target="_blank" href="https://github.com/zsh-users/zsh-history-substring-search">zsh-history-substring-search</a> - This is a clean-room implementation of the Fish shell's history search feature</li>
</ul>
<p>There's plenty of <a target="_blank" href="https://github.com/unixorn/awesome-zsh-plugins">awesome ZSH plugins</a>, but I tend to use only a few, so as not to overload <a target="_blank" href="https://github.com/adambro/dotfiles/blob/master/home/.zshrc">my .zshrc</a> file. The reason: it is on me to make sure everything works well. I'm using the <a target="_blank" href="http://antigen.sharats.me/">Antigen plugin manager</a>, which supposedly solves plugin installation issues (see its motivation section), but problems still happen. To be honest I have never tried installing <a target="_blank" href="https://ohmyz.sh/">oh-my-zsh</a> directly, because it does not ship the Fish-like plugins listed above and <a target="_blank" href="https://joshldavis.com/2014/07/26/oh-my-zsh-is-a-disease-antigen-is-the-vaccine/">configuring custom plugins in OMZ is awful</a>.</p>
<h2 id="bash-as-a-scripts-runtime">Bash as a scripts runtime</h2>
<p>OK, so we've established that there are better interactive shells than Bash, but Bash is still useful for scripts. That's because of its ubiquity, obviously - it's available on (almost) every Linux/Unix system these days. However, Bash has its gotchas as a scripting language, so beware. Here are a few links to make your life easier with Bash scripts:</p>
<ul>
<li>http://redsymbol.net/articles/unofficial-bash-strict-mode/</li>
<li>https://zwischenzugs.com/2018/01/06/ten-things-i-wish-id-known-about-bash/</li>
<li>https://github.com/dylanaraps/pure-bash-bible</li>
</ul>
<h2 id="partying-words">Parting words</h2>
<p>If you're working with a shell, do yourself a favour and try something more modern than Bash. Do not waste a decade as I did. My suggestions are as follows:</p>
<ul>
<li>Try Fish as it's awesome out of the box. If you know something even cooler - let me know!</li>
<li>If you like to tinker or need a POSIX-compatible shell give ZSH a go. No idea which plugin manager to recommend though ;)</li>
<li>Use <a target="_blank" href="https://github.com/mrzool/bash-sensible">Bash sensible config</a> if you really need Bash, e.g. on a remote host you do not control.</li>
</ul>
<p>Let me know if I missed anything or there's an even better shell that I'm not aware of. Basically, raising awareness of Bash alternatives is my point here, so I'm open to learning.</p>
]]></content:encoded></item><item><title><![CDATA[How DevOps tools affect culture]]></title><description><![CDATA[First let me tell you a story.
Imagine you're an account manger trying to help customer solve their problem. It seems to be a bug in the software system. They use a version that is over year old (couple releases ago) with some custom feature and clie...]]></description><link>https://adambrodziak.pl/how-devops-tools-affect-culture</link><guid isPermaLink="true">https://adambrodziak.pl/how-devops-tools-affect-culture</guid><category><![CDATA[Devops]]></category><category><![CDATA[Culture]]></category><category><![CDATA[Company]]></category><category><![CDATA[challenge]]></category><category><![CDATA[infrastructure]]></category><dc:creator><![CDATA[Adam Brodziak]]></dc:creator><pubDate>Thu, 08 Jul 2021 15:25:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1625757651103/xNO41oxxQ.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>First let me tell you a story.</p>
<p>Imagine you're an account manager trying to help a customer solve their problem. It seems to be a bug in the software system. They use a version that is over a year old (a couple of releases ago) with some custom feature and client-specific configuration. So far the attempts to replicate the bug were futile.</p>
<p>The idea is to re-create the system setup the same way the customer has it, so we could reproduce the problem, debug the system and provide a fix. However, the person who installed it no longer works at the company, the 100-page operating manual is out of date and there's no record of what exactly has been customized.</p>
<p>Does it sound familiar?</p>
<p>Basically that was the situation our customer was dealing with. People tried to cope with that by producing those 100+ page operating manuals, writing down how they configured the system (if they still remembered what worked after trying for the 8th time), gathering fact sheets and diagrams that might even reflect how the system is set up.</p>
<p>Documentation is great, but who likes to write it? Not to mention keeping it up to date and verifying that what is written there works?</p>
<p>I'm going to explain how using DevOps tools changed people's approach to work and how they felt about it. Then I'm going to unveil how we did it.</p>
<h2 id="empowerment">Empowerment</h2>
<p>In the previous life our account manager would need to ask the people responsible for various components to assemble them the way this particular customer's setup looked. We assume he was lucky enough to actually find out the configuration and feature customization, of course. If someone was on holiday then our poor AM was out of luck and needed to wait.</p>
<p>By using DevOps tools our AM could do all of that himself. All versions of software components were stored in an artifact repository. Even if deleted, they could be re-created from source code exactly as they were, thanks to reproducible builds on the CI server. Configuration was stored alongside code, so the full setup was in the source code repository.</p>
<p>Assembling the customer-specific setup was as easy as checking out a branch with configuration code for this customer and running the release pipeline for this branch on one of the testing environments. Scripts would deploy the correct component versions and apply customer-specific configuration automatically.</p>
<p>Better yet, such a configuration branch can be turned into a pull request and given to someone for review. There's no need to check if the code and configuration are valid - automatic tools on the CI server already did that. It's more about asking others if those exact changes <em>should have been applied</em> for this customer use case in the first place.</p>
<p>At first it was quite overwhelming for people like account managers to use source control and coding tools. However, we learned fast that sometimes it's easier to read YAML configuration than to process a 100+ page manual to find one specific setting. Also, asking someone for a review was a great learning and collaboration experience.</p>
<h2 id="confidence">Confidence</h2>
<p>Hunting an esoteric bug for a customer makes a good story, but it's not everyday work. Normally we'd have much smaller features, fixes and improvements released on a day-to-day basis. The configuration of the system also evolves at a steady pace. Since the scope of those was small, the release was no longer a scary process.</p>
<p>Also the fact that it was done on a daily basis contributed to increased confidence. The typical flow of creating a pull request with changes, automatic validation by CI, review by peers, and finally approve, merge and deploy became routine.</p>
<p>By incorporating a review process we not only gain an additional verification before deploy. It is also a learning and collaboration event - both parties are able to contribute improvements. I saw improvements and simplifications introduced as a result of the change review process.</p>
<p>The great benefit of the DevOps approach is that if something is wrong you learn that early in the process. You no longer need to wait weeks to see if a change yields the expected result on the production setup. Personally, I used to forget why a change had been made before I got feedback, months later, on whether it worked or not. If not, the work had to be started over.</p>
<h2 id="feedback">Feedback</h2>
<p>A fast feedback loop is the essence of the DevOps approach.
A typo in configuration is caught by automatic validation in minutes. Misapplication of some setting is pointed out during review in a few hours. The automatic deploy routine tells you if the change worked the same day.</p>
<p>The change review process saved me many times from breaking the system or doing something stupid because I'd missed something. Code review comments are an invaluable way of learning about the system and about your craft too. The necessary bit is to hide your ego and take feedback as it is. Simple, but not easy.</p>
<p>In the DevOps flow, you (the owner of the change) are responsible for deploying it. This is how you learn how the system actually works or in which ways it breaks. Observing deployment progress and the behaviour of the system (e.g. how metrics and logs change right after that) becomes second nature after a while.</p>
<p>Observability is the new hot trend in DevOps world.
Observability without the observer is just an empty slogan.</p>
<h2 id="lead-time-reduced-from-months-to-days">Lead time reduced from months to days</h2>
<p>Such dramatic shift of delivery pace was the observed benefit of the approach we've taken.</p>
<p>Cultural change was enabled by DevOps practices and tools we've used.</p>
<p>Here are the tools we used, and how.</p>
<h2 id="infrastructure-stack">Infrastructure stack</h2>
<p>Our rule of thumb was that the whole infrastructure setup has to be automated, no exceptions. It started with virtual machines and networks in the AWS cloud that were spun up by Terraform - a cloud-agnostic infrastructure as code tool. We chose Terraform because customer infrastructure could be on various providers: in some cases it was cloud, in others it was on-premise dedicated or virtual machines.</p>
<p>Once the basic virtual machine was running, Ansible applied operating system configuration and installed necessary tooling. We kept that layer thin, having only a few small roles in Ansible. This decision improved maintainability, and also security by keeping the attack surface narrow.</p>
<p>The heavy lifting has been done by Docker Swarm orchestrator. Every application had a dedicated Docker image with all the runtime dependencies and Swarm managed a workload over a fleet of VM nodes.</p>
<p>Why Docker Swarm? Back then it was under heavy development at Docker Inc. Swarm was much simpler compared to Kubernetes, and back then K8s was not that fully-featured yet.
However in 2021 I'd <a target="_blank" href="https://www.future-processing.com/blog/does-docker-make-sense-in-2021/">discourage using Docker Swarm</a> for a greenfield project.</p>
<h2 id="infrastructure-as-code">Infrastructure as Code</h2>
<p>Having all the infrastructure as code in one big repository was a big enabler for collaboration and shared responsibility. Gone were the days of "this is my special machine, so don't touch it" approach. Transparency was a key value to fight knowledge silos that would otherwise happen.</p>
<p>An additional benefit was the ability to view and compare all the development and testing environments at once. It helped our teams wrap their heads around which feature is being deployed or tested at which stage. Our QA engineers developed their own tools to make it easier.</p>
<p>Even though the infrastructure was pretty big and complex, there was only one dedicated person to manage all of that. Well, that's not entirely true - the whole team was responsible for the software and the environment it ran on. What I'm trying to say is: thanks to efficient DevOps tools one high-class specialist was enough to manage it.</p>
<p>As DevOps states, there's no distinction between dev and ops, so everyone was encouraged to perform configuration changes, deploys and contribute to infrastructure setup. And this is what happened. I've personally added the ability to verify the artifact version before it gets deployed, because I missed that feature in the <code>deploy.sh</code> script.</p>
<h2 id="executable-documentation">Executable documentation</h2>
<p>People with various skills installed, configured and operated our system. That's why we focused on using tools that are easy to reason about. The majority of those were YAML configuration files for Docker Swarm. They have exactly the same format as Docker Compose - a tool that makes it very easy to install it on any machine running Docker.</p>
<p>It was a similar story with the <code>deploy.sh</code> Bash script. It basically encodes the steps an operator could re-type on the machine themselves. With additional comments it became executable documentation that was run every day - so we made sure it works. Gone are the days of re-typing commands from an operating manual only to find out they do not work in this version of the system.</p>
<p>Clear separation of the various layers (VM via Terraform, OS via Ansible, apps via Docker Swarm) made it easy for customers to pick and choose how much they wanted to use it. That was extremely important for some closed setups where public cloud was out of question.</p>
<p>The bonus point I wanted to mention was a script to generate release notes from code and source control metadata. That was yet another benefit of sticking to the rule that every commit message should reference a Jira ticket, so the changelog in release notes could be generated from that data. Also, installation instructions were a copy-paste of the scripts that we'd prepared.
Great <a target="_blank" href="https://adambrodziak.pl/ad-hoc-documentation">documentation with minimal effort</a>.</p>
<h2 id="is-devops-approach-worth-it">Is DevOps approach worth it?</h2>
<p>Well, it depends. Such extensive automation pays off for any medium project (with dozens of people involved). For large projects I'd dare to say it is a requirement: otherwise we lose so much time trying to do basic things that are repeated every day (doing manual deploys, reading operational docs, finding where things are, etc). On the other hand we've applied the same principles (without extensive tooling) on a small team (5-7 devs) and got most of the benefits too.</p>
<p>Despite technical and process advantages of DevOps I'd say: do it for the people! This way you gain a more engaged team that feels empowered and responsible for the product. In turn that leads to many learning experiences fed by honest feedback and increased confidence. I must admit it's a pleasure working in a team in such an environment.</p>
]]></content:encoded></item><item><title><![CDATA[Terraform is terrible]]></title><description><![CDATA[Here is my experience from running and upgrading a small Terraform project. As you might have guessed it was not great, but I'll try to focus on facts rather than opinions (even though some might sneak in). It will be mainly about the CLI client and ...]]></description><link>https://adambrodziak.pl/terraform-is-terrible</link><guid isPermaLink="true">https://adambrodziak.pl/terraform-is-terrible</guid><category><![CDATA[Terraform]]></category><category><![CDATA[infrastructure]]></category><category><![CDATA[Devops]]></category><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Adam Brodziak]]></dc:creator><pubDate>Thu, 27 May 2021 14:56:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1622127365200/GcJdu6Hs-.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here is my experience from running and upgrading a small Terraform project. As you might have guessed it was not great, but I'll try to focus on facts rather than opinions (even though some might sneak in). It will be mainly about the CLI client and its versioning scheme, but also some complaints about state management. I'm a big proponent of CI/CD and Infrastructure as Code and I will try to explain how Terraform does not fit the picture.</p>
<p>The project is small, but manages 8 clusters. Contrary to the typical case, it manages a SaaS: the Atlas service, which in our case offers managed MongoDB on AWS. Every project uses some of the 7 modules that represent Atlas resources with the necessary AWS bindings (e.g. secrets). When we started, 0.12 was the newest version, so the upgrade to 0.13 is part of the story.</p>
<p>Since the issues are mostly about the Terraform client, the IaaS or SaaS used is not that relevant. However, the Atlas plugin had to change its internal configuration structure, which only added insult to injury.</p>
<h2 id="state">State</h2>
<p>Let's start with state, which is a common pain point while working with Terraform. To some extent I understand the decision to use state, but it is inherently difficult to manage.</p>
<p>I dare say that remote state has all the disadvantages of a cache, but not many of the advantages. Sure, for a team working on a project remote state is a must. Actually I'd like a solution that would enforce using remote state, but there's none - we have to rely on the state config being copied over from an existing project.</p>
<h3 id="state-config-on-aws-using-s3-and-dynamodb">State config on AWS using S3 and DynamoDB</h3>
<p>State configuration is another weak point. Typically the S3 backend is used to store state, that's fine. But if you want to keep people from overwriting each other's changes by running <code>terraform apply</code> at the same time, you need additional configuration for locks (a mutex). You should definitely use that!</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://twitter.com/AdamBrodziak/status/1387764929286004745">https://twitter.com/AdamBrodziak/status/1387764929286004745</a></div>
<p>In the AWS realm DynamoDB is needed for locks. My guess is that S3 does not have an atomic operation to obtain a lock, so a key-value database is needed. What I don't understand is: why do I need both? <strong>Why not store state in DynamoDB directly?</strong> The state content is just JSON, right? If you happen to work on Terraform let me know, please. For me that's a big oversight.</p>
<p>Anyway, below is sample state config using S3 and DynamoDB - both are needed to make it safe. Feel free to copy :)</p>
<pre><code class="lang-hcl">terraform {
  backend "s3" {
    bucket          = "terraform-state-prod"
    key             = "mongo-atlas/resources/DEVELOPMENT-resources.state"
    region          = "eu-west-1"
    dynamodb_table  = "terraform-state-prod"
  }
}
</code></pre>
<p>Remember to use remote state (with locks!) if you're working on a team or using Terraform in a CI/CD pipeline. This is the lesser of two evils (the other being everyone having their own state!).</p>
<h3 id="discrepancies-of-state-contents">Discrepancies of state contents</h3>
<p>Using locks prevents one way state can be corrupted: running more than one <code>terraform apply</code> at the same time. Other issues arise from discrepancies between HCL code, state and the actual system. Let's look at those.</p>
<p>The first issue is the difference between Terraform state and the actual state of the system. Sometimes it's called <em>configuration drift</em>. One case is when someone adds something in the admin console, which is then invisible to Terraform. We had that for the IP access list, but since all the entries are independent there was no conflict. The only drawback was that we no longer had a single source of truth, so better avoid such practice and manage a given thing either manually or via Terraform code, not both.</p>
<p>A bigger problem is when someone <em>changes</em> something on the server manually, but the change is not reflected in Terraform. In such a case it will be visible during <code>terraform plan</code> as a change. You have to decide whether to override it via <code>terraform apply</code> or to reconcile the HCL code with it. In the latter case you'll be playing detective to find out who made the config change and why. My recommendation: avoid those cases at all cost and manage everything via Terraform!</p>
<h3 id="default-values-are-stored-in-state">Default values are stored in state</h3>
<p>Another interesting case is when state has items that are not in the Terraform code. I guess that happens because Terraform stores in state the <em>evaluated</em> state from the system (for us: the Atlas service), with the created resource IDs and values for default parameters. We were quite surprised when the plan said <code>bi_connector</code> would be removed. What puzzled me even more: <code>bi_connector</code> was not in the code!</p>
<p>What happened was: <code>bi_connector</code> is optional, so default values were stored in state. When the Atlas plugin was upgraded, <a target="_blank" href="https://github.com/mongodb/terraform-provider-mongodbatlas/pull/423">they removed the <code>bi_connector</code> attribute</a> and replaced it with <code>bi_connector_config</code> to adhere to the new HCL parser. That's an example of how breaking syntax changes affect you in a surprising way, but more on that later.</p>
<h3 id="other-cases-of-state-corruption">Other cases of state corruption</h3>
<p>Terraform stores the client version in the state too. That has interesting implications: if you change the state with a newer client (by running <code>terraform apply</code>) it will force others (e.g. a CI worker) to use the same version too. We've mitigated that by wrapping the Terraform CLI in a script that manages the version, so everyone uses exactly the same one.</p>
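<p>Besides a wrapper script, Terraform itself can refuse to run with an unexpected client through the <code>required_version</code> setting. This is not what we did, just a minimal sketch, assuming the whole team standardizes on the 0.13.x line (the exact constraint is an assumption, adjust it to the version you actually use):</p>
<pre><code class="lang-hcl">terraform {
  # Assumption: everyone (including CI workers) runs a 0.13.x client.
  # Any other client version fails early with an error.
  required_version = "~&gt; 0.13.0"
}
</code></pre>
<p>This does not download the right binary for you, so it complements a wrapper rather than replacing it.</p>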
<p>On top of all that, state can get messed up when the <code>terraform apply</code> process terminates (e.g. by hitting <code>Ctrl+C</code>). To be honest I haven't experienced that myself, I just heard about it at a Terraform training. However, I can guess why: there's no transactional support in applying changes by Terraform. I understand why, distributed transactions are hard, but still I don't like it.</p>
<h2 id="client">Client</h2>
<p>My biggest surprise was how bad the Terraform client is. Don't get me wrong: I love that it's a standalone Go binary, so it was easy to manage a specific version. The Makefile script downloads the Terraform binary, then runs <code>terraform init</code> and <code>terraform plan</code> for a specific cluster. On the other hand I don't want to create such wrappers, I'd rather just use whatever version is compatible. That is not possible due to the versioning policy and instability of the tool, but more on that later.</p>
<h3 id="managing-many-workspaces">Managing many workspaces</h3>
<p>As I mentioned we have 8 clusters, each having its resource configuration in its own directory. I needed to use the downloaded binary and run it in the cluster dir, easy. Looking at the help there's a <code>terraform apply [options] DIR</code> param, so I tried to use that. Unfortunately I got an error about missing param values, as if it did not read the <code>.tfvars</code> files from that directory.
Apparently the <code>DIR</code> at the end does not work as you'd expect; instead Hashicorp <a target="_blank" href="https://github.com/hashicorp/terraform/commit/efe78b2910c5b1f2292f0a16d990b2ad92352feb">added the <code>-chdir</code> global param</a> to handle such a use case. The <code>-chdir</code> param landed in version 0.14, but I was upgrading to 0.13, so the clever workaround of <code>cd cluster/DEVELOPMENT &amp;&amp; ../../terraform apply</code> was necessary.</p>
<p>Since Makefile is excellent at managing a dependency graph I thought: let's run <code>terraform init</code> only when necessary. That was quite easy, unless init fails for some reason and leaves the local workspace partially initialized. Then the only way to handle that in the Makefile was to remove the partially initialized workspace and start over. That is how the all too familiar Makefile <code>cleanup</code> target was born, which nukes the <code>.terraform</code> dir for every workspace and removes the binary too. I wish I better understood how the local Terraform workspace is created, because I sense there's room for improvement :)</p>
<h3 id="running-apply-in-cicd-pipeline">Running apply in CI/CD pipeline</h3>
<p>The Terraform client has been made for interactive use from the start. That's why it forces you to type <code>yes</code> to confirm <code>terraform apply</code> changes and prompts for missing parameter values (instead of just exiting with an error). To overcome that you have to resort to CLI params like <code>-auto-approve</code> or <code>-var 'foo=bar'</code>. Those are necessary for any CI/CD pipeline, but such UX design looks like an afterthought. The UX with <code>-auto-approve</code> on CI is broken in another way: <code>terraform apply -auto-approve</code> does not present the changes that are going to be made. Why!? How do you verify what kind of changes were made by looking at the pipeline console then?</p>
<p>The solution is to run <code>terraform plan</code> just before apply, so we'll have a record of changes to the infra. There's another problem with that: the state of the infra might change between the <code>plan</code> and <code>apply</code> actions. Let's imagine a pull request workflow, where <code>plan</code> is run for a branch to verify that the code diff has the desired effect on the infra and <code>apply</code> is run after the merge to the main branch. There could be hours, even days, between the change plan and its actual application.</p>
<p>One solution to that problem would be to run the <code>plan</code> command again on the main branch (after the merge), just before <code>apply</code>. Ideally there should be an option to prevent <code>apply</code> in case the change plan takes an undesired direction.
A second solution could be the <code>plan -out</code> parameter, which saves the change plan to a file, so <code>apply</code> can pick up the exact change plan that was generated before. To be honest I haven't tried that yet, but I keep wondering why <code>-out</code> is not the default setting and why apply does not require a change plan as input. Such design decisions keep me puzzled.</p>
<h2 id="versioning">Versioning</h2>
<p>Our project started around August 2020, so Terraform 0.12 was the most recent version. I was happy about that, because 0.12 introduced syntax changes to how params should be quoted. In reality many of the existing solutions used the old syntax, which still worked in version 0.12. As a result the code became a mix of new and old syntax, which was not a problem until the upgrade to 0.13 started throwing deprecation warnings.
Of course I soon learned about the <code>terraform 0.12upgrade</code> command, but it kept throwing syntax errors on projects that had a mix of old and new syntax. Our options were either to downgrade everything to 0.11 (which sounds silly) or to upgrade the syntax manually. I went with the latter, which meant quite a lot of going back and forth, because deprecation warnings do not show you all the occurrences of the problem, but just the first one and a "there are 55 more" message. Not useful.</p>
<p>Of course <code>terraform 0.12upgrade</code> is only available in version 0.12, so even though I wanted 0.13, I first had to download the older one to try to upgrade the code. That was the reason that led me to create a Makefile wrapper for the Terraform binary.
Why bother with the 0.13 upgrade after all? Well, we needed the <a target="_blank" href="https://www.terraform.io/docs/language/meta-arguments/for_each.html"><code>for_each</code> syntax feature</a>, which is available for modules only since version 0.13. A side note: is it only me, or does adding new syntax for some cases in version <code>0.12.6</code> feel weird? The <code>for_each</code> feature was required to manage many Atlas user roles in a dynamic attribute.</p>
<p>The biggest problem is that even patch versions (x.y.Z) can introduce new features, so it might not be enough to have any 0.13.x version, but rather bind to the specific <code>0.13.7</code> for all clients. In addition the client version is stored in state, so using a newer one by accident can force an upgrade on everyone.
Managing Terraform versions is really cumbersome, to the extent that tools like <a target="_blank" href="https://github.com/warrensbox/terraform-switcher">terraform-switcher</a> exist. I did not want yet another interactive CLI tool for our CI server, that's why I went with a Makefile that can be used on CI and locally. In my opinion Makefile is a vastly misunderstood (and hence underused) tool, but that's a story for another time.</p>
<p>Since the start of our project (August 2020) Hashicorp released 3 new major versions. The newest one is <code>0.15.4</code> and we started on <code>0.12.5</code> back then. It means that over the last 6 months 3 breaking change releases have been made, one every second month! That is a rapid rate of change for something that should be stable and boring like infrastructure. The other surprising fact is that the initial Terraform release was in 2014, so the project is almost 7 years old.</p>
<h2 id="parting-thoughts">Parting thoughts</h2>
<p>So far it was mostly about facts and my experiences around using and upgrading Terraform. Now it's time for a little opinion and thoughts about the project. My experience with managing Terraform covers just the last few months; previously I mostly used Terraform set up by someone else.</p>
<p>Based on the rate of breaking changes in the last 6 months I'm worried about the stability of the product. In my opinion it should have a big BETA badge to warn about that, despite being 7 years in development. Sure, the <code>0.x.y</code> versioning scheme might indicate that it's not ready for prime time and API breaking changes will happen. I understand that; even SemVer allows breaking changes in a minor version bump (0.x.0) as long as it's not <code>1.0.0</code> yet. To me it looks like a lazy policy on the Hashicorp side: after years of development they are still open to breaking changes. I thought even Facebook dropped the "move fast and break things" attitude by now...</p>
<p>Even though development seems to be dynamic, legacy seems to be creeping in already. Just look at the <code>DIR</code> command line parameter that is going to be replaced by <code>-chdir</code>, which is clearly stated in the commit message. Even for the Terraform documentation it is going to be quite a lot of work. What about all the solutions and workarounds existing in the wild, e.g. on Stack Overflow or in blog posts? This is going to have a similar impact as using old syntax (0.11 and older) in a new project, because someone found a solution somewhere. Without a clearly communicated versioning policy it will never get in order.</p>
<p>The biggest surprise to me is that Terraform has ADOPT status on <a target="_blank" href="https://www.thoughtworks.com/radar/tools/terraform">TechRadar April 2019</a> from ThoughtWorks. Maybe the timing plays a role here, since it was shortly before the major syntax change in version 0.12 shipped in mid-2019? I wonder if Terraform is still state of the art, or whether there are better Infrastructure as Code solutions recommended by ThoughtWorks or others?</p>
]]></content:encoded></item><item><title><![CDATA[DNS performance issues in Kubernetes cluster]]></title><description><![CDATA[One day we've been noticing a lot of ERROR getaddrinfo EAI_AGAIN log events in our Kubernetes cluster. All NodeJS apps have been having this problem from time to time, because NodeJS runtime does not cache getaddrinfo() function results. Unlike JVM t...]]></description><link>https://adambrodziak.pl/dns-performance-issues-in-kubernetes-cluster</link><guid isPermaLink="true">https://adambrodziak.pl/dns-performance-issues-in-kubernetes-cluster</guid><category><![CDATA[dns]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[performance]]></category><category><![CDATA[Devops]]></category><category><![CDATA[networking]]></category><dc:creator><![CDATA[Adam Brodziak]]></dc:creator><pubDate>Tue, 27 Apr 2021 15:27:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1619537065710/IlachNy0I.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One day we noticed a lot of <code>ERROR getaddrinfo EAI_AGAIN</code> log events in our Kubernetes cluster. All NodeJS apps had been having this problem from time to time, because the NodeJS runtime does not cache <code>getaddrinfo()</code> function results. The JVM does cache them, so Java apps were fairly silent.</p>
<p>That gave a clear indication that the problem was on the DNS server. Soon after I noticed that 1 out of 3 <code>kube-dns</code> pods was failing, so we were running at 2/3 capacity. Restarting would be enough of a fix, but being an "SRE wannabe" I wanted to make sure we improve the situation for the future.</p>
<h2 id="googling-for-the-problem">Googling for the problem</h2>
<p>Soon I found a <a target="_blank" href="https://tech.findmypast.com/k8s-dns-lookup/">post listing potential causes for the issue</a>, among others:</p>
<ul>
<li>NodeJS performance issues with <code>dns.lookup()</code> internal implementation (yeah, but I can't change that).</li>
<li>CPU throttling in K8s (unlikely, but very hard to pin down). </li>
<li>Linux networking race conditions in DNAT, fixed in 5.x kernel (we run an older version, so it was a probable cause).</li>
</ul>
<h3 id="dns-cache-in-app-not-for-apps-in-kubernetes">DNS cache in app? Not for apps in Kubernetes</h3>
<p>The above post gave two solutions. One was to install an NPM package in the Node app that would cache the DNS entries. Not a solution I'm particularly fond of, as I prefer such a simple thing as domain name resolution to be handled by the cluster. Also, DNS serves as the service discovery mechanism in a Kubernetes cluster, which makes it even more important to keep the records up-to-date.</p>
<h3 id="nodelocal-dnscache-in-kubernetes-cluster">NodeLocal DNSCache in Kubernetes cluster</h3>
<p>A better solution was to use <a target="_blank" href="https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/">NodeLocal DNSCache in Kubernetes cluster</a>. Essentially it runs a DNS cache on every cluster node as a DaemonSet. Definitely the way to go for most cases, because it improves both performance and resilience for very little cost. Unfortunately it requires K8s version 1.18, which we did not have :(</p>
<h2 id="i-dont-know-how-domain-name-resolution-works">I don't know how domain name resolution works!</h2>
<p>Something about this issue kept bugging me though, I thought I was missing something.
Our Kube apps work in microservices fashion, so they communicate with many other services a lot. One of the main page components connects to 13 other services, and that is not unusual. All of those links are full URLs, with domains set up to point at public ELB servers. Still you'd expect that <code>kube-dns</code> caches those names, so resolution is fast. Well yes, but actually no.</p>
<p>Enlightenment came with a <a target="_blank" href="https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html">post about <code>options ndots</code> setting in <code>/etc/resolv.conf</code></a> file. In there Marco Pracucci explains how DNS resolving works for non-qualified domain names and how <code>options ndots:5</code> affects this. I encourage you to read it through (with comments!), but here's the gist and some corrections.</p>
<h3 id="dns-for-kubernetes-pod-and-service">DNS for Kubernetes Pod and Service</h3>
<p>Kubernetes creates <a target="_blank" href="https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/">internal domain names for Pod and Service</a> objects for the purpose of service discovery. On top of that the namespace is added to the domain as well, so you can have a <code>data</code> service in the <code>prod</code> namespace. If a pod in the <code>test</code> namespace tries to connect to the <code>data</code> host, DNS will not resolve it, but <code>data.prod</code> would be fine. However that allows adding a <code>data</code> service to the <code>test</code> namespace, so <code>data</code> would have a different IP depending on whether the DNS query is fired from the <code>test</code> or <code>prod</code> namespace.</p>
<p>My guess is that this dynamic nature and flexibility is the reason why Kubernetes injects the following into <code>/etc/resolv.conf</code> for every pod:</p>
<pre><code><span class="hljs-selector-tag">nameserver</span> 10<span class="hljs-selector-class">.32</span><span class="hljs-selector-class">.0</span><span class="hljs-selector-class">.10</span>
<span class="hljs-selector-tag">search</span> &lt;<span class="hljs-selector-tag">namespace</span>&gt;<span class="hljs-selector-class">.svc</span><span class="hljs-selector-class">.cluster</span><span class="hljs-selector-class">.local</span> <span class="hljs-selector-tag">svc</span><span class="hljs-selector-class">.cluster</span><span class="hljs-selector-class">.local</span> <span class="hljs-selector-tag">cluster</span><span class="hljs-selector-class">.local</span>
<span class="hljs-selector-tag">options</span> <span class="hljs-selector-tag">ndots</span><span class="hljs-selector-pseudo">:5</span>
</code></pre><h3 id="why-ndots5-affects-name-resolution-performance">Why <code>ndots:5</code> affects name resolution performance?</h3>
<p>With the <code>ndots:5</code> setting, <a target="_blank" href="https://www.man7.org/linux/man-pages/man5/resolv.conf.5.html">according to docs</a>, every name that has fewer than 5 dots will not be looked up as given first; instead each item from the <code>search</code> config list is appended to it and tried first. So in most cases 3 lookups of cluster-local names will be attempted before the actual name is queried! For more on why it happens in this particular order read an excellent <a target="_blank" href="https://jameshfisher.com/2018/02/03/what-does-getaddrinfo-do/">post on glibc <code>getaddrinfo()</code></a> function internals.</p>
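<p>To make the ordering concrete, here is a minimal sketch of the lookup order a glibc-style resolver would use. It is only a model based on my reading of the <code>resolv.conf</code> man page, not the actual resolver code; the <code>candidateNames</code> function and the <code>api.example.com</code> host are made up for illustration, and the <code>prod</code> namespace is borrowed from the example above.</p>

```javascript
// Rough model of the order in which a glibc-style resolver tries names.
// `search` and `ndots` mirror the /etc/resolv.conf entries injected by Kubernetes.
function candidateNames(name, search, ndots) {
    // A trailing dot marks the name as fully qualified: the search list is skipped.
    if (name.endsWith('.')) {
        return [name];
    }
    const dots = name.split('.').length - 1;
    const expanded = search.map((domain) => `${name}.${domain}.`);
    const asGiven = `${name}.`;
    // Fewer dots than ndots: search domains first, the name as given last.
    // Otherwise: the name as given first, search domains only as a fallback.
    return dots < ndots ? [...expanded, asGiven] : [asGiven, ...expanded];
}

// Hypothetical pod in the `prod` namespace calling a public host.
const search = ['prod.svc.cluster.local', 'svc.cluster.local', 'cluster.local'];
console.log(candidateNames('api.example.com', search, 5));
console.log(candidateNames('api.example.com.', search, 5));
```

<p>The first call lists three cluster-local candidates ahead of the real name, while the second one (note the trailing dot) contains just the name itself. The resolver stops at the first candidate that resolves, so for a public domain all the cluster-local attempts have to fail first.</p>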
<h3 id="solutions-proposed-and-my-comment">Solutions proposed and my comment</h3>
<p>First: Switching to a Fully Qualified Domain Name (FQDN) for public domains is generally good advice. It will not only make name resolution faster, but also prevent the security issue explained in <a target="_blank" href="https://tools.ietf.org/html/rfc1535">RFC1535</a> (quite short for an RFC!). I can't see any drawback, even though it looks like a quick and dirty solution.</p>
<p>Second: Customize <code>ndots</code> with the <code>dnsConfig</code> setting. That might make sense for specific pods that mostly connect to public domains. You'd have to be careful to pick an <code>ndots</code> value that speeds things up, but does not mess with the Kube DNS setup for service discovery. In other words: there might be dragons.</p>
<h2 id="what-is-the-ultimate-solution-then">What is the ultimate solution then?</h2>
<p>As I've tried to explain, domain name resolution is a very nuanced problem, much more awkward than I initially anticipated. Keeping in mind that DNS should be managed on the cluster, I'd approach solutions in this particular order:</p>
<ol>
<li>Setup NodeLocal DNSCache on the cluster.</li>
<li>Use Fully Qualified Domain Name (FQDN) for specific apps.</li>
<li>Set <code>ndots</code> to lower value for specific pods.</li>
<li>Try DNS cache in language runtime (JVM, NodeJS) or in code.</li>
</ol>
<p>In the case of the failure I described at the beginning, bringing all 3 <code>kube-dns</code> pods back up was enough. We probably still suffer from a lot of extra search-list lookups due to the <code>ndots:5</code> setting. It would be nice to know if switching to FQDN made the applications faster, but that requires much more granular metrics. Maybe next time ;)</p>
]]></content:encoded></item><item><title><![CDATA[Ad-hoc documentation]]></title><description><![CDATA[The promise: With just little bit more effort you can create an ad-hoc documentation that is searchable and useful.
Writing documentation is tedious process, that's why in Agile we don't write documentation, right? Wrong! The truth is as software dev...]]></description><link>https://adambrodziak.pl/ad-hoc-documentation</link><guid isPermaLink="true">https://adambrodziak.pl/ad-hoc-documentation</guid><category><![CDATA[documentation]]></category><category><![CDATA[Technical writing ]]></category><category><![CDATA[writing]]></category><category><![CDATA[communication]]></category><dc:creator><![CDATA[Adam Brodziak]]></dc:creator><pubDate>Tue, 20 Apr 2021 20:57:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1618952052094/4LXcLvedU.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The promise: With just a little bit more effort you can create ad-hoc documentation that is searchable and useful.</p>
<p>Writing documentation is a tedious process, that's why in Agile we don't write documentation, right? Wrong! The truth is that as software developers we're typing quite a lot of docs already, be it code comments, git messages, pull request comments, chats, mails, Jira comments, etc. By doing those in a more thoughtful way we can make them useful for our future selves and for our colleagues.</p>
<h2 id="work-on-the-message">Work on the message</h2>
<p>Essentially it comes down to switching to <a target="_blank" href="https://en.wikipedia.org/wiki/High-context_and_low-context_cultures">low-context (vs high-context)</a> communication style. Practical tips below.</p>
<h3 id="provide-more-details">Provide more details</h3>
<p>When you're in the middle of doing something the context is fresh in your mind, so it makes sense to use shortcuts like:</p>
<blockquote>
<p>Restarted and it's working.</p>
</blockquote>
<p>That's a completely valid message. But do you know what it is about? Will you know 6 months from now? How about this example:</p>
<blockquote>
<p>Deleted the agent pod, so it got restarted and it's responding to HTTP requests again.</p>
</blockquote>
<p>Now that makes sense even to you, dear reader, because it includes some context:</p>
<ul>
<li>What exactly has been done: Deleted the pod.</li>
<li>Specific subject is mentioned: <em>agent</em> instead of <em>it</em> pronoun.</li>
<li>The effect is described: Restarted, responding.</li>
<li>How it was verified: <em>Responding to HTTP</em> instead of <em>working</em>.</li>
</ul>
<p>Little effort, great effect.</p>
<h3 id="summarize-often">Summarize often</h3>
<p>Chats are usually in-the-moment, high-context conversations like:</p>
<pre><code><span class="hljs-attr">Al:</span> <span class="hljs-string">kubectl</span> <span class="hljs-string">not</span> <span class="hljs-string">working</span>
<span class="hljs-attr">Bob:</span> <span class="hljs-string">is</span> <span class="hljs-string">it</span> <span class="hljs-string">in</span> <span class="hljs-string">/usr/bin?</span>
<span class="hljs-attr">Al:</span> <span class="hljs-string">yeah</span>
<span class="hljs-attr">Bob:</span> <span class="hljs-string">what</span> <span class="hljs-string">is</span> <span class="hljs-string">the</span> <span class="hljs-string">error?</span>
<span class="hljs-attr">Al:</span> <span class="hljs-string">`permission</span> <span class="hljs-attr">denied:</span> <span class="hljs-string">kubectl`</span>
<span class="hljs-attr">Bob:</span> <span class="hljs-string">is</span> <span class="hljs-string">it</span> <span class="hljs-string">executable?</span>
<span class="hljs-attr">Al:</span> <span class="hljs-string">dunno</span>
<span class="hljs-attr">Bob:</span> <span class="hljs-string">try</span> <span class="hljs-string">`chmod</span> <span class="hljs-string">+x</span> <span class="hljs-string">/usr/bin/kubectl`</span>
<span class="hljs-attr">Al:</span> <span class="hljs-string">another</span> <span class="hljs-string">error</span>
<span class="hljs-attr">Bob:</span> <span class="hljs-string">use</span> <span class="hljs-string">the</span> <span class="hljs-string">`sudo`</span> <span class="hljs-string">Luke!</span>
<span class="hljs-attr">Al:</span> <span class="hljs-string">worked</span> <span class="hljs-string">&lt;3</span>
</code></pre><p>This conversation has the information needed to fix the problem, but will you be able to find it in 6 months? Even if you find it, how much time are you going to spend trying to read it all and get the gist? Maybe write a quick summary:</p>
<blockquote>
<p>After downloading <code>kubectl</code> put it in <code>/usr/bin</code> dir and make it executable using <code>chmod +x /usr/bin/kubectl</code>. Both require root privileges (use <code>sudo</code>). Otherwise you'd see <code>permission denied</code> error.</p>
</blockquote>
<p>Some effort and you're becoming the owner of the solution.</p>
<h3 id="rephrase-what-other-people-said">Rephrase what other people said</h3>
<p>In addition to the summary you can also rephrase some of the points. It serves two purposes:</p>
<ul>
<li>You make sure both sides understand the same.</li>
<li>You add additional keywords to match future search terms.</li>
</ul>
<p>This effort will show you as a good communicator.</p>
<h2 id="make-it-searchable">Make it searchable</h2>
<p>Information is useful only if it can be found. This is what made Google a giant. Similar story with Slack, which is an acronym for Searchable Log of All Conversation and Knowledge.</p>
<h3 id="funnel-comments-into-slack">Funnel comments into Slack</h3>
<p>Slack has pretty good search capabilities and a lot of integrations with other services. We've got Jira comments being syndicated to Slack automatically. The same can be done with other sources of docs like: code review comments, git messages and other temporal texts.</p>
<p>I guess notes, mails and documents can be copy&amp;pasted to Slack as well. It might be good to keep a reference to the original document, but having a snapshot copy in Slack might be useful too.</p>
<h3 id="code-comments-and-git-messages">Code comments and git messages</h3>
<p>GitHub has got pretty good search across many projects. GitLab can do search only within single project, unfortunately. BitBucket has limited cross-project search capabilities. Any of that can be syndicated to Slack.</p>
<p><a target="_blank" href="https://about.sourcegraph.com/">Sourcegraph</a> has semantic code search, because it actually understands the code being indexed. It has some powers of IDE for search. The limitation is it can't search code and git at the same time.</p>
<h3 id="use-task-numbers-and-tags-in-the-message">Use task numbers and tags in the message</h3>
<p>Quite obvious, but sometimes forgotten, is to put the task number (e.g. a Jira identifier) in the git commit, code comment, Slack message, etc. Regular tags work the same way, especially if we have a shared vocabulary of such tags, e.g. #versionBump indicating it's just a version increment.</p>
<h2 id="summary">Summary</h2>
<p>With little additional effort you can start building institutional knowledge. Such documentation is contextual and temporal, so it made sense at that time and under the given circumstances. It won't replace permanent documentation (e.g. specifications, reference manuals), but it is a light and agile addition that is almost free.</p>
<p>Every time when you write a comment stop for a second and think about your future self while writing. You'll thank me later.</p>
]]></content:encoded></item><item><title><![CDATA[Dockerfile good practices for Node and NPM]]></title><description><![CDATA[The goal is to produce minimal image to keep the size low and reduce attack surface. Also we want to make the docker build process fast by removing unnecessary steps and using practices outlined below to leverage internal build cache.
Besides pure Do...]]></description><link>https://adambrodziak.pl/dockerfile-good-practices-for-node-and-npm</link><guid isPermaLink="true">https://adambrodziak.pl/dockerfile-good-practices-for-node-and-npm</guid><category><![CDATA[Docker]]></category><category><![CDATA[docker images]]></category><category><![CDATA[Node.js]]></category><category><![CDATA[npm]]></category><dc:creator><![CDATA[Adam Brodziak]]></dc:creator><pubDate>Wed, 27 Jan 2021 16:24:32 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1611764593683/1wg753A5P.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The goal is to produce a minimal image, to keep the size low and reduce the attack surface. We also want to make the <code>docker build</code> process fast by removing unnecessary steps and using the practices outlined below to leverage the internal build cache.</p>
<p>Besides pure Docker I'll present <code>docker-compose</code>, a tool to start the many Docker containers required to run the application, e.g. frontend server, backend server, database.</p>
<h2 id="heading-nodejs-and-npm-examples">NodeJS and NPM examples</h2>
<p>Here I'll be using NodeJS and NPM in examples, but most of those patterns can be applied to other runtimes as well.</p>
<h3 id="heading-laverage-non-root-user">Leverage non-root user</h3>
<p>Default NodeJS images ship with a <code>node</code> user, but you have to switch to it explicitly. The best option is to do that before any NPM dependencies or code are added.</p>
<pre><code class="lang-Dockerfile"><span class="hljs-comment"># Copy files as a non-root user. The `node` user is built in the Node image.</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /usr/src/app</span>
<span class="hljs-keyword">RUN</span><span class="bash"> chown node:node ./</span>
<span class="hljs-keyword">USER</span> node
</code></pre>
<p>The Node process no longer runs with <code>root</code> privileges. With such a simple change you've increased the security of the image a lot.</p>
<h3 id="heading-set-nodeenvproduction-by-default">Set NODE_ENV=production by default</h3>
<p>This is the most important one, as it affects NPM described below. In short <code>NODE_ENV=production</code> switches middlewares and dependencies to the efficient code path and makes NPM install only packages listed in <code>dependencies</code>. Packages in <code>devDependencies</code> and <code>peerDependencies</code> are ignored.</p>
<pre><code class="lang-Dockerfile"><span class="hljs-comment"># Defaults to production, docker-compose overrides this to development on build and run.</span>
<span class="hljs-keyword">ARG</span> NODE_ENV=production
<span class="hljs-keyword">ENV</span> NODE_ENV=$NODE_ENV
</code></pre>
<p>For local development we can override its value. Here's an example <code>docker-compose.yml</code> file that builds and runs our Docker image in development mode:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3'</span>
<span class="hljs-attr">services:</span>
  <span class="hljs-attr">myapp:</span>
    <span class="hljs-attr">build:</span>
      <span class="hljs-attr">args:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">NODE_ENV=development</span>
      <span class="hljs-attr">context:</span> <span class="hljs-string">./</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">NODE_ENV=development</span>
</code></pre>
<p>To start the application just type <code>docker-compose up</code> and it will build an image on first start and then run the container(s) defined in YAML.</p>
<h3 id="heading-install-npm-dependencies-before-adding-code">Install NPM dependencies before adding code</h3>
<p>The reason is simple: dependencies change way less often than code, so we can leverage build cache. The biggest difference can be seen if you have any C++ modules that require compiling during install.</p>
<pre><code class="lang-Dockerfile"><span class="hljs-comment"># Install dependencies first, as they change less often than code.</span>
<span class="hljs-keyword">COPY</span><span class="bash"> package.json package-lock.json* ./</span>
<span class="hljs-keyword">RUN</span><span class="bash"> npm ci &amp;&amp; npm cache clean --force</span>
<span class="hljs-keyword">COPY</span><span class="bash"> ./src ./src</span>
</code></pre>
<p>The <code>npm ci</code> command installs exactly the packages from the lock file, for reproducible builds on a CI server. I recommend using it by default. Have a read on how it differs from <code>npm install</code> in the official docs.</p>
<p>The magic happens in <code>&amp;&amp;</code> which will execute two commands in one run producing one Docker image layer. This layer will be then cached, so subsequent run of the same command (with the same <code>package*.json</code>) will use the cache.</p>
<p>Since build uses Docker image cache the NPM cache is not needed, so we can clean downloaded packages cache. This way resulting image is smaller.</p>
<pre><code class="lang-plaintext">$ docker build .
Sending build context to Docker daemon
Step 2/5 : COPY package.json package-lock.json* ./
 ---&gt; Using cache
 ---&gt; 6fb28308975d
Step 3/5 : RUN npm ci &amp;&amp; npm cache clean --force
 ---&gt; Using cache
 ---&gt; 0a6bd71d2c2d
</code></pre>
<p>While we're at it, I recommend adding a <code>node_modules</code> line to the <code>.dockerignore</code> file in order to avoid adding the local version of modules to the resulting image. While <code>npm ci</code> would remove any existing <code>node_modules</code> directory, there's no point in increasing the size of the image layer.</p>
<h3 id="heading-use-node-not-npm-to-start-the-server">Use node (not NPM) to start the server</h3>
<p>Last, but not least, is to avoid <code>npm start</code> as the command to start the application in a container. Using NPM seems reasonable, because this is how you're used to running the application locally. However with Docker and Kubernetes it's a bit more complicated.</p>
<p>The main problem with <code>npm start</code> is that NPM does not pass <code>SIGTERM</code> OS signal to Node process. Because of that Node is not able to do cleanup before exit. Docker and Kubernetes send <code>SIGTERM</code> to container process when they want to stop it.</p>
<p>This can lead to many issues from hanging database connections to open file descriptors. Notice that it's not only your application code that might react to <code>SIGTERM</code>, but it might be the framework or some libraries.</p>
<p>The good practice is to simply call Node directly.</p>
<pre><code class="lang-Dockerfile"><span class="hljs-comment"># Execute NodeJS (not NPM script) to handle SIGTERM and SIGINT signals.</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"node"</span>, <span class="hljs-string">"./src/index.js"</span>]</span>
</code></pre>
<p>Notice that we've used square brackets to denote the exec form of the CMD instruction. If a plain string had been used instead, the container would start <code>sh -c</code> as the main process and OS signals would be lost again.</p>
<p>Having <code>node</code> as main PID 1 process is also not ideal, but at least <code>SIGTERM</code> and other signals could be handled in application code. You can test it yourself using the simplest NodeJS server code:</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> http = <span class="hljs-built_in">require</span>(<span class="hljs-string">'http'</span>);
<span class="hljs-keyword">const</span> port = process.env.PORT || <span class="hljs-number">8000</span>;

http.createServer(<span class="hljs-function"><span class="hljs-keyword">function</span> (<span class="hljs-params">req, res</span>) </span>{
    res.end(req.url);
}).listen(port);
<span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Server running at http://localhost:<span class="hljs-subst">${port}</span>/ ...`</span>);

<span class="hljs-comment">// Signal handling</span>
process.on(<span class="hljs-string">'SIGTERM'</span>, <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params"></span>) </span>{
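    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'SIGTERM: shutting down...'</span>);
    <span class="hljs-comment">// A custom handler replaces the default exit on SIGTERM, so exit explicitly.</span>
    process.exit(<span class="hljs-number">0</span>);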
});
</code></pre>
<p>Now try to execute <code>docker container stop</code> against the newly created container. Then change the CMD line to use NPM and see that <code>SIGTERM</code> is not caught.</p>
<p>Such an event handler is the place where you want to clean up all the resources created or opened by the application.</p>
<p>In NestJS for example add <code>app.enableShutdownHooks()</code> call in bootstrap according to <a target="_blank" href="https://docs.nestjs.com/fundamentals/lifecycle-events#application-shutdown">Nest docs</a>.</p>
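<p>For a plain Node HTTP server, the cleanup could look roughly like the sketch below. Treat it as an illustration under my own assumptions rather than a definitive implementation: <code>registerGracefulShutdown</code> is a made-up helper name, and the optional <code>proc</code> argument exists only so the function can be exercised without a real process.</p>

```javascript
// Hypothetical helper: stop accepting connections on SIGTERM/SIGINT, then exit.
// `proc` defaults to the real process; it is a parameter only for illustration.
function registerGracefulShutdown(server, proc = process) {
    const shutdown = (signal) => {
        console.log(`${signal}: closing HTTP server...`);
        // close() stops accepting new connections and calls back
        // once the existing ones have ended.
        server.close(() => {
            console.log('HTTP server closed, exiting.');
            proc.exit(0);
        });
    };
    proc.on('SIGTERM', () => shutdown('SIGTERM'));
    proc.on('SIGINT', () => shutdown('SIGINT'));
}
```

<p>With the test server from above you would pass in the object returned by <code>http.createServer(...).listen(port)</code>. Database pools, message queue consumers and similar resources can be closed in the same callback before exiting.</p>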
<h2 id="heading-builder-pattern">Builder pattern</h2>
<p>Let's say your use case is to turn SASS/SCSS into plain CSS using the Ruby Compass compiler. It has a different stack than the rest of the Node app, so we will need a separate Docker image. Here's how to use such a separate temporary image for the compilation step.</p>
<p>Modern Docker versions allow the use of <a target="_blank" href="https://docs.docker.com/develop/develop-images/multistage-build/">multi-stage builds</a>. Essentially you can have many <code>FROM</code> clauses in a Dockerfile, but only the last <code>FROM</code> will be used as the base for our image. It means that all the layers of the other stages will be discarded, so the resulting image is going to be small.</p>
<pre><code class="lang-Dockerfile"><span class="hljs-keyword">FROM</span> rubygem/compass AS builder
<span class="hljs-keyword">COPY</span><span class="bash"> ./src/public /dist</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /dist</span>
<span class="hljs-keyword">RUN</span><span class="bash"> compass compile</span>
<span class="hljs-comment"># Output: css/app.css</span>
</code></pre>
<p>The Docker build engine saves the resulting files in a temporary image that can be referenced in a <code>COPY</code> instruction for our final image:</p>
<pre><code class="lang-Dockerfile"><span class="hljs-comment"># Copy compiled CSS styles from builder image.</span>
<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /dist/css ./dist/css</span>
</code></pre>
<p>This instruction copies files from the <code>/dist/css</code> folder of the builder stage, in our case <code>css/app.css</code> only. All the other layers of the builder image are discarded from the final image.</p>
<p>The same pattern can be used for any other compilation or transpilation tool, like Babel, Webpack, TypeScript, etc. In fact, it makes sense whenever we have to install a development tool that should not be part of the production build. The same applies to installing git, a C++ compiler, or development versions of packages (packages with the <code>-dev</code> suffix).</p>
<p>For some JavaScript projects you might notice that <code>npm install</code> or <code>npm ci</code> is run twice: in the builder and in the final image. It could mean that you mix frontend (e.g. React.js) and backend (e.g. Express.js) libraries in a single <code>package.json</code> file. My advice is to separate the frontend and backend dependencies, but going through the exact strategies deserves another blog post. Let me know if you're interested.</p>
<h2 id="heading-putting-it-all-together">Putting it all together</h2>
<p>Here's an example Dockerfile that you can copy&amp;paste into your project. It covers all the good practices we've discussed earlier.</p>
<pre><code class="lang-Dockerfile"><span class="hljs-comment"># Separate builder stage to compile SASS, so we can copy just the resulting CSS files.</span>
<span class="hljs-keyword">FROM</span> rubygem/compass AS builder
<span class="hljs-keyword">COPY</span><span class="bash"> ./src/public /dist</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /dist</span>
<span class="hljs-keyword">RUN</span><span class="bash"> compass compile</span>
<span class="hljs-comment"># Output: css/app.css</span>

<span class="hljs-comment"># Use NodeJS server for the app.</span>
<span class="hljs-keyword">FROM</span> node:<span class="hljs-number">12</span>

<span class="hljs-comment"># Copy files as a non-root user. The `node` user is built in the Node image.</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /usr/src/app</span>
<span class="hljs-keyword">RUN</span><span class="bash"> chown node:node ./</span>
<span class="hljs-keyword">USER</span> node

<span class="hljs-comment"># Defaults to production, docker-compose overrides this to development on build and run.</span>
<span class="hljs-keyword">ARG</span> NODE_ENV=production
<span class="hljs-keyword">ENV</span> NODE_ENV $NODE_ENV

<span class="hljs-comment"># Install dependencies first, as they change less often than code.</span>
<span class="hljs-keyword">COPY</span><span class="bash"> package.json package-lock.json* ./</span>
<span class="hljs-keyword">RUN</span><span class="bash"> npm ci &amp;&amp; npm cache clean --force</span>
<span class="hljs-keyword">COPY</span><span class="bash"> ./src ./src</span>

<span class="hljs-comment"># Copy compiled CSS styles from builder image.</span>
<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /dist/css ./dist/css</span>

<span class="hljs-comment"># Execute NodeJS (not NPM script) to handle SIGTERM and SIGINT signals.</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"node"</span>, <span class="hljs-string">"./src/index.js"</span>]</span>
</code></pre>
<p>The <code>Dockerfile</code> above contains all the essential good practices for a JavaScript project (either a NodeJS server or some frontend). If you're interested in more advanced optimizations, check out the repository documenting more good defaults for Node on Docker: https://github.com/BretFisher/node-docker-good-defaults</p>
<p>Spread the knowledge about good practices in Dockerfile creation.</p>
]]></content:encoded></item><item><title><![CDATA[Does Docker make sense in 2021?]]></title><description><![CDATA[In early December 2020 the news broke that Kubernetes 1.20 "deprecates Docker". For now this only means that Kubernetes will display a warning. Strictly speaking, "deprecates Docker" refers to dockershim, which I explain in more detail below.
Only in ...]]></description><link>https://adambrodziak.pl/czy-docker-ma-sens-w-2021-roku</link><guid isPermaLink="true">https://adambrodziak.pl/czy-docker-ma-sens-w-2021-roku</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Docker]]></category><category><![CDATA[containers]]></category><dc:creator><![CDATA[Adam Brodziak]]></dc:creator><pubDate>Wed, 06 Jan 2021 13:02:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1609934649873/PkD-xj19s.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In early December 2020 the news broke that Kubernetes 1.20 "deprecates Docker". For now this only means that Kubernetes will display a warning. Strictly speaking, "deprecates Docker" refers to <code>dockershim</code>, which I explain in more detail below.</p>
<p>Docker support is only scheduled to be removed in version 1.22, which is planned for the second half of 2021. That is exactly why I think 2021 is the beginning of the end for Docker.</p>
<h2 id="co-to-docker-i-kubernetes">What are Docker and Kubernetes?</h2>
<p>Docker lets us package our application (e.g. a JAR file with compiled Java code) together with its runtime environment (e.g. the OpenJRE JVM) into a single image, from which containers are created. In effect, all the operating system dependencies are added to the Docker image. This makes it possible to use the same image on a developer's laptop, in the test environment, and in production. In theory.</p>
<p>Kubernetes is an orchestrator, which means it manages many containers and allocates them resources (CPU, RAM, storage) from many machines in a cluster. It is also responsible for the container lifecycle and for joining several containers into one unit (a Pod). So it works one level above Docker, managing many containers on many machines.</p>
<p>If a Docker container is the equivalent of what a virtual machine used to be, then Kubernetes is, in the container world, the equivalent of what hosting or cloud providers used to be.
Docker (or rather Docker Compose) lets us run different processes, connect them into a network, and assign storage within a single computer.
Kubernetes allows the same within a cluster made up of many computers.</p>
<p>Kubernetes reduced Docker to the level of a component that merely runs containers. Thanks to the CRI (Container Runtime Interface) standard these components are interchangeable. Currently only <code>containerd</code> and <code>cri-o</code> are CRI-compliant. Docker requires the <code>dockershim</code> adapter, which the Kubernetes maintainers now want to get rid of.</p>
<h3 id="dlaczego-docker-jest-wazny">Why is Docker important?</h3>
<p>Docker is a milestone when it comes to popularizing containerization.
When I first heard about it in 2013 on the <a target="_blank" href="https://coder.show/66">Coder Radio</a> podcast, from the founders of dotCloud (later Docker Inc), I saw the potential.
Barely a year later Docker let me run a complicated legacy system on my own computer, and that is when I knew a breakthrough had happened.</p>
<p>Over the last few years Docker grew from a side project at dotCloud into a business worth billions of dollars. Despite USD 280 million in funding from venture capital funds, Docker Inc did not do well commercially and its Docker Enterprise business was bought by Mirantis. The acquisition price was not made public, which is interesting. My guess is that it was a bargain ;)</p>
<p>The main product of Mirantis is Kubernetes-as-a-service, where they compete with VMware and, of course, the cloud providers. Kubernetes matters to them so much that they initially wanted to maintain Docker Swarm for only 2 years, but they quickly backed away from that declaration, probably under pressure from existing customers. I personally know a company that runs a large Docker Swarm installation, and migrating to another solution is no easy task.</p>
<h3 id="co-to-jest-docker-swarm">What is Docker Swarm?</h3>
<p>Docker Swarm is an orchestrator built into the Docker distribution. You could say it is a sort of Kubernetes that is as simple to operate as plain Docker. Of course you also have to manage nodes, replicas, and networks, but it is still a much simplified view of a cluster compared to Kubernetes.</p>
<p>So can we say that Mirantis bought a competitor to its flagship product? Sort of. By now Docker Swarm has outgrown its teething problems (e.g. the bug that assigned duplicate IP addresses), so it looks like a stable product for small teams. The problem is that small teams and small clusters bring in little $$$.</p>
<p>Besides that, Docker Swarm is simply too simple. In our team one person was able to create and operate a Docker Swarm cluster. Apart from the bugs mentioned above there is not much work involved. Major updates arrive together with Docker and not very often, so that is one more worry gone.</p>
<h2 id="jaki-interes-ma-mirantis-wlasciciel-docker-enterprise">What is in it for Mirantis (the owner of Docker Enterprise)?</h2>
<p>You are probably wondering: if Mirantis makes money on Kubernetes-as-a-service for big players, and Kubernetes is removing Docker support, where is the sense in that? Exactly.
From my perspective, a company that makes money on Kubernetes has no reason to invest in Docker once Kubernetes stops supporting it.</p>
<p>Let's consider what options Mirantis has regarding Docker. I see several possible directions, but all of them bode poorly for Docker:</p>
<ol>
<li>Bet on <code>containerd</code>, but then they would demote themselves to a supplier of a component close to the Linux kernel. It will be hard to make money on that, especially if the current contracts cover support on operating systems other than Linux.</li>
<li>Develop Docker Swarm further. The problem is that Swarm would have to become as complex as Kubernetes, and what would its advantage be then? For now Swarm is suitable for small projects, but that is small money.</li>
<li>Turn Docker Engine into the best tool for developing applications for Kubernetes. Something like https://skaffold.dev perhaps? But then both the Docker name and Docker's technical debt (more on that later) would be a burden.</li>
</ol>
<h3 id="moze-sprzedawac-docker-jako-narzedzie-dla-programistow">Maybe sell Docker as a tool for developers?</h3>
<p>The last option is interesting and could save Docker as we know it: a great tool that lets developers quickly stand up a complicated system in a controlled local environment that closely resembles production. Unfortunately, selling tools to developers is a hard business and usually not a lucrative one.</p>
<p>The aforementioned VMware learned this lesson with its acquisition of Spring Source. In short, Spring Source tried to sell Spring Framework to J2EE (Java) developers as a better application development framework. That turned out to be a very hard business.</p>
<p>The turning point came when Spring started to be sold to support and IT departments as a J2EE-compliant platform. That is where the real money is in the enterprise software world ;)
I recommend watching what <a target="_blank" href="https://www.infoq.com/presentations/Things-I-Wish-I-d-Known/">Rod Johnson</a> (until recently the Spring Source CEO) has to say on the subject.</p>
<p>On the other hand, moving towards building developer tools would make Mirantis a competitor to Docker Inc, or rather to what is left of that company. I assume the agreements signed during the acquisition forbid the companies from entering each other's markets, so Docker Inc will stick to supporting developers and building tools for them, while Mirantis will work with enterprise customers, selling them deployment and support services.</p>
<h3 id="mirantis-dba-o-obecnych-klientow-enterprise">Mirantis takes care of its existing enterprise customers</h3>
<p>A few days after the deprecation announcement, <a target="_blank" href="https://www.mirantis.com/blog/mirantis-to-take-over-support-of-kubernetes-dockershim-2/">Mirantis issued a statement</a> that it will maintain <code>dockershim</code> (the adapter between Docker and the CRI interface) together with Docker Inc. The reason they give is their existing customers with more complex Kubernetes installations that depend on things specific to Docker Engine. What does that change? The situation looks very similar to the one with Docker Swarm. Mirantis will have even more technical debt to maintain (more on that below).</p>
<p>I must stress that the above is only my speculation. I have no insight into the agreements between Docker Inc and Mirantis, nor into their strategy. I rely solely on official press releases and on observing the market. Trying to work out what a large company might do, based on your own experience, is an interesting mental exercise. It lets you take a step back and look at the companies behind the technology we use. I recommend it.</p>
<p>If Mirantis has caught your interest, it appears to have an office in Poznań and is hiring for technical and sales roles:
https://www.mirantis.com/careers/</p>
<h2 id="kwestie-dlugu-technicznego-w-docker">Technical debt in Docker</h2>
<p>The market situation alone is reason enough not to invest further in Docker as a tool for solving business problems. Unfortunately there is also the technical debt that Docker has accumulated over the years, despite several strategic refactorings along the way (including splitting out <code>runc</code> and <code>containerd</code>).</p>
<h3 id="problemy-z-union-file-system">Problems with the union file system</h3>
<p>Docker already has over 7 years of history and that legacy is starting to weigh it down.
In the beginning Docker was an interface to Linux kernel features such as namespaces, union file systems (union FS), and control groups (cgroups). Over time the existing union file system solutions, such as AUFS, were no longer enough. Docker Inc decided to add the <code>overlay</code> storage driver, built on the kernel's OverlayFS. It turned out to be so flawed that <code>overlay2</code> appeared very quickly and was soon marked as the recommended one.</p>
<p>Despite Docker's recommendation, for years I avoided <code>overlay</code> and <code>overlay2</code> like the plague because of frequent frustration with bugs and data loss. Of course the bugs were fixed over time, but in the days of Ubuntu 14.04 or 16.04 LTS kernel updates were not that frequent. Adding support for <code>btrfs</code> (which has union FS capabilities) did not improve the situation either: I still remember a failure on the only machine in the cluster that used <code>btrfs</code> as its file system.</p>
<p>I recently heard someone describe the whole construction of the container file system and the use of union FS as an "elegant hack", and frankly that is a very good summary.</p>
<h3 id="problemy-z-kontem-root-i-zaleznosciami">Problems with the root account and dependencies</h3>
<p>Another unfortunate decision was running the Docker daemon as the <code>root</code> user, i.e. the administrator who can do anything on a given machine. This makes a "container breakout" attack far more dangerous than it would be if the daemon ran with fewer privileges. For that matter, the very use of a daemon is legacy too, because Podman (Red Hat's Docker alternative) does not need any daemon to run containers.</p>
<p>By default the process inside a container also runs with <code>root</code> privileges. This makes a "privilege escalation" attack easier, along with taking control not only of the application but of the whole machine the container runs on. Today this is considered bad practice and creating a user with limited privileges is recommended, but not all images are configured that way.</p>
<p>The next problem follows from the ones above and from treating a container as a "lightweight virtual machine". Namely, every container carries the full userspace of some Linux distribution. Even though the host server runs CentOS, one container layers all the Ubuntu directories on top of it, and another the Debian ones. This design significantly enlarges the attack surface of a given container. What is worse, it is often a different attack surface than that of the host system (CentOS vs Ubuntu).</p>
<p>In other words: even if our container is a simple, statically compiled microservice written in Go, it "drags along" the whole of Ubuntu (for example). Using small Alpine Linux images is not the solution, unless we also use Alpine on the host machine. A better solution is <a target="_blank" href="https://github.com/GoogleContainerTools/distroless">distroless images</a>, which get rid of most of the unneeded Debian libraries.</p>
<h3 id="jak-to-wyglada-na-innych-systemach-operacyjnych">What does it look like on other operating systems?</h3>
<p>So far I have been talking about Docker on Linux, because that is Docker's native operating system. It is worth remembering that Docker started out as a simple wrapper around mechanisms provided by the Linux kernel, such as namespaces, cgroups (control groups), and union file systems. Today <code>runc</code> handles this layer, but it is an integral part of Docker Engine.</p>
<p>Personally I have no experience with Docker on systems other than Linux. As far as I know, on those systems it relies, one way or another, on virtualizing the Linux kernel. In the early years I used to run TinyCore Linux (it takes up barely a few MB) on VirtualBox myself to test features that were not yet available in a stable kernel release. Years have passed, but the principle remains the same.</p>
<p>As for Windows 10, Microsoft invests a lot in "developer experience". In practice WSL and WSL2 pull in almost entire Linux distributions, which become an integral part of the system, like any other application. As a result Docker should work well on Windows, because Microsoft has an interest in attracting developers.</p>
<p>Let's also remember that most systems in the Microsoft Azure cloud run on Linux. So it makes sense for the same tools to work in the cloud and on a developer's laptop. Does that also mean Microsoft will invest in native containers on Windows so that they run efficiently in Azure? Honestly I have no idea, but I would be happy to hear about production use of Docker on Windows.</p>
<p>As for Apple, I don't see them being interested in the developer market, even though a few years ago a MacBook was a common sight at programming conferences. So far I have heard that <a target="_blank" href="https://webmastah.pl/docker-na-maca-ssie-przyspieszamy-synchronizacje-plikow-mutagenem/">Docker on macOS sucks</a>, mainly because of file synchronization latency. In the era of ARM-based M1 processors there is the added problem of cross-compiling for x86 and ARM. I am very curious how this will develop.</p>
<h3 id="jakie-sa-alternatywy-dla-docker">What are the alternatives to Docker?</h3>
<p>As I mentioned, Docker is a milestone because it popularized the notion of containerization. It was, however, neither the first such technology (jails on FreeBSD and chroot on Linux already existed) nor the only one. Quite a few competitors appeared; some of those technologies have already been forgotten, others were acquired and folded into a larger solution.</p>
<p>In reality the competing container technologies are whole ecosystems managed by technology companies or foundations:</p>
<ul>
<li>rkt from CoreOS (deprecated), acquired by Red Hat</li>
<li>cri-o from Red Hat, which together with Podman + buildah forms a new generation of tools</li>
<li>containerd, "pulled out" of the Docker codebase and governed by the CNCF (Cloud Native Computing Foundation)</li>
</ul>
<p>It is worth mentioning <code>kaniko</code> from Google, which makes it possible to build Docker images without root (similarly to <code>buildah</code> from Red Hat). Since Kubernetes also originated at Google, I would not be surprised if Google released a CRI-compliant alternative to <code>containerd</code>.</p>
<p>These really are separate ecosystems.
You can see that big players like Red Hat or Google have a lot to gain from the Kubernetes deployment pie. Mirantis, in turn, has only Docker.
Why use Docker when you can swap it for newer, lighter components?</p>
<h2 id="co-to-oznacza-dla-mnie">What does it mean for me?</h2>
<p>IMHO it depends on your role and on how deeply you are invested in Docker as such. I mean above all using Docker to the limits of its capabilities, often against good practices, because those were only taking shape while the technology was spreading on the one hand and maturing on the other.</p>
<p>Over the years Docker Engine grew and evolved into a modular architecture. Different implementations of components such as logging emerged. This allowed for standardization and a simpler application architecture. It was enough for every application to log to the Linux <code>stdout</code> and <code>stderr</code> streams, and Docker took care of collecting the logs and storing them locally. It also provided an interface for reading those logs in the form of the <code>docker logs</code> command.</p>
<p>For developers it is very convenient to have a single tool for browsing logs from applications written in Java, Node, PHP, or other languages. For the people who operate systems (IT Ops), on the other hand, other things matter as well: a guarantee that logs will not be lost, their retention, and how quickly logs will fill up the disk. That is a completely different set of problems, which Docker does not solve.</p>
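<p>On the application side, logging to the standard streams takes very little code. The sketch below is only an illustration for a Node app: the <code>formatLogLine</code> and <code>log</code> helpers and the JSON line format are assumptions of mine, not something Docker requires.</p>

```javascript
// Hypothetical helper: formats one log entry as a single JSON line.
function formatLogLine(level, message, now = new Date()) {
    return JSON.stringify({ time: now.toISOString(), level, message });
}

// Errors go to stderr, everything else to stdout; Docker captures both streams.
// `streams` defaults to the real process object and can be replaced in tests.
function log(level, message, streams = process) {
    const stream = level === 'error' ? streams.stderr : streams.stdout;
    stream.write(formatLogLine(level, message) + '\n');
}
```

<p>The application never opens a log file here, so collecting, storing, and reading the output (for example with <code>docker logs</code>) stays entirely on the Docker side.</p>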
<h3 id="jedna-maszyna-kontra-klaster">A single machine versus a cluster</h3>
<p>What works great on a single machine does not necessarily work in a cluster. One example is <code>docker service logs</code>, the equivalent of a log viewer for Docker Swarm (the cluster orchestrator built into Docker). Unfortunately in this case we often see logs out of order, which is probably caused by clock differences between the individual machines in the cluster.</p>
<p>Using NTP mitigates the clock difference problem to some extent, but it is not a cure. For logs and their ordering it is better to use a central aggregator, which can assign a timestamp at the moment a log is received. That solution, however, is of a completely different caliber, although it is usually necessary in a distributed system such as a cluster.</p>
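<p>The idea of stamping logs on receipt can be shown with a toy, in-memory aggregator. This is only a sketch under my own assumptions: the <code>LogAggregator</code> class and its methods are hypothetical, and a real aggregator would of course receive entries over the network and persist them.</p>

```javascript
// Hypothetical in-memory aggregator: stamps each entry on receipt, using a single clock.
class LogAggregator {
    constructor(clock = () => Date.now()) {
        this.clock = clock;
        this.entries = [];
    }

    // Ignores any timestamp set by the sending host; ordering relies on receivedAt only.
    receive(host, message) {
        const entry = { receivedAt: this.clock(), host, message };
        this.entries.push(entry);
        return entry;
    }

    // Returns a copy of the entries sorted by the aggregator's own clock.
    ordered() {
        return [...this.entries].sort((a, b) => a.receivedAt - b.receivedAt);
    }
}
```

<p>Because only one clock is involved, differences between the clocks of the individual machines no longer affect the order in which logs are shown.</p>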
<p>To sum up: it is great that Docker simplifies and standardizes many things. Unfortunately these simplifications only hold up when we work on a single machine (as with logs). Once we move to the cluster level things get much more complicated and the same simplifications start to chafe.</p>
<p>To explain the mismatch of Docker's solutions I will borrow the vocabulary of the Cynefin framework.
We have a distributed system problem, which is Complex by nature, and we try to apply solutions designed for systems that are Complicated by nature.</p>
<p>In other words: the solutions chosen by Docker that work well on a single machine are not necessarily good when we operate in the context of a cluster with many machines.</p>
<h3 id="programista">Developer</h3>
<p>From a developer's perspective little will change technically. We will keep building Docker images, because they conform to the OCI (Open Container Initiative) standard. This means that any CRI-compliant runtime will be able to run those images, whether locally or in a cluster.</p>
<p>The most important changes, in my opinion, concern how we think about containers. It is time to move away from the analogy of a container as a "lightweight virtual machine" and take the assumptions of Cloud-Native applications into account. The key points are that one container means one process, and that resources such as files are ephemeral.</p>
<p>It is time to start thinking of a container as an instance of an application launched somewhere in the cloud, and to take into account that there will be many copies of that application and that you never know which machine a given container will land on. The consequence is that you cannot rely on local files, because a new container instance will not have access to the ones written by the previous instance.</p>
<p>The second point is one process per container. Managing the container lifetime and load balancing between instances should be left to the orchestrator (such as Kubernetes). I witnessed a problem during a database migration where some containers did not pick up the new database address, because instead of running Node directly the container ran PM2 (a process manager for Node), and restarting the container did not have the desired effect.</p>
<p>If the target deployment environment is Kubernetes, I also recommend looking into solutions that make it convenient to run applications on a local Kubernetes cluster. I mean tools like Skaffold (from Google), Draft (Microsoft), Tilt, or KubeVela.</p>
<p>Docker Compose would be fine if the target environment is Docker Swarm, because they use the same YAML file format. Kubernetes supposedly uses YAML too, but that is a completely different story.
This is a very dynamically developing market, in which I will be looking for something for myself.</p>
<h3 id="sre-it-ops">SRE / IT Ops</h3>
<p>For SRE, IT Ops (or whatever you call the people who maintain infrastructure) the matter is more complicated if Docker Engine is used in Kubernetes. Perhaps it is enough to use <code>containerd</code> as the CRI (Container Runtime Interface) implementation and be done with it. In that case a lot depends on how many dependencies on Docker Engine have "leaked" into the infrastructure.</p>
<p>Take Docker-in-Docker (DinD) used to build images on a Continuous Integration (CI) server as an example. As early as <a target="_blank" href="https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/">2015, DinD on CI</a> was recognized as bad practice, but before that knowledge had time to sink in, practice showed that DinD was a so-called quick win in a clustered CI environment.</p>
<p>Of course Mirantis will be more than happy to listen to our concerns about a complicated CI system depending on Docker-in-Docker or other legacy. After all, it has committed to maintaining <code>dockershim</code> and that is exactly what it makes money on. I am just curious how much it costs to have that problem taken off your hands.</p>
<p>I happen to have a personal stake in this, because I work on a complex CI setup that uses DinD to build Docker images. I realize that 2021 is the time to prepare the transition, whether to another CRI runtime or to <code>dockershim</code>. I am afraid the latter option would mean putting ourselves at the mercy of Mirantis, which may be a strategic-level risk. We will see.</p>
<h3 id="konsultant">Consultant</h3>
<p>For consultants such a transition is great news. After years of carefree Docker use in development teams, the time has come for cleaning up, teaching, and applying good practices. All of it to ease the move to more restrictive environments such as Kubernetes.</p>
<p>Personally, after 6 years of using Docker (including 3 years of Swarm), I am learning new runtimes, orchestrators, and cluster management. These are trends that are only just beginning to appear on the radar of corporations, tech giants aside.</p>
<p>On the other hand, companies such as Mirantis or VMware have a vital interest in deploying and maintaining clusters while charging a hefty fee for it. The same goes for all the cloud providers offering hosted Kubernetes: AWS, Azure, GCP. Suffice it to say that the independent provider Linode has offered Linode Kubernetes Engine (LKE) since 2019.</p>
<h2 id="podsumowanie">Summary</h2>
<p>So is Kubernetes moving away from Docker something to worry about? Yes and no. It seems to me that the cloud business will keep growing and more and more companies will migrate to the cloud. In such an environment containerizing applications and scaling horizontally (across many machines) is the natural direction. We may simply have to live with the complications that come with using clusters.</p>
<p>If horizontal scaling is essential, Kubernetes simplifies many things, even though it seems complicated in itself. It is indeed extensive, because the problem it solves (allocating resources in a cluster) is complex by nature (so-called essential complexity). Here Kubernetes gives us the basic tools and vocabulary to cope with that complexity.</p>
<p>It is only a matter of time before solutions that simplify Kubernetes appear. All the cloud providers already offer a managed Kubernetes cluster service, which takes a lot of operational work off the shoulders of IT Ops. Because Kubernetes configuration takes the form of declarative YAML files, it will be possible to build tools that let you "click together" a cluster. I would venture to say that Kubernetes YAML will be for clusters what HTML was for the World Wide Web.</p>
]]></content:encoded></item></channel></rss>