Self-Service Cloud Guide

Part of the Urban Dynamics series of guides that help decision makers understand technical topics.

Urban Dynamics guides are meant to serve as in-depth material that can be read in ~10-20 minutes and easily referenced by those who want to seriously understand a topic.

Introduction

Self-service clouds are a solution that empowers all parts of a business to respond quickly to changing needs while ensuring the proper top-level guard rails and visibility exist. We at Urban Dynamics are strong believers that a tool is only as good as it is useful, and making a cloud self-service is how to make it the most useful to the most people. It benefits a business by adding efficiency, agility, security, and standardization to its cloud environments.

TL;DR

Your customer success team wants to use a new cloud data analytics product to crunch usage details about your newest SaaS product. The team manager goes to an internal web UI, logs in, and creates a dedicated cloud environment for this. It has all the security, data, and compliance controls baked in -- such as preventing data from being stored in other countries. They then request to connect this with the cloud environment holding the usage data for the new SaaS product. The owner of that data approves it. This takes less than an hour, takes zero meetings across your multi-thousand-person company, is fully visible to your security team, and results in zero people being mad about anything. That's the self-service cloud experience.

Business Specific and Cloud Agnostic

A cloud being self-service isn't dictated by the cloud you use. AWS, GCP, Azure, DigitalOcean, OpenStack, VMware, something else, or all the above -- it doesn't matter. What matters is making it tailored to your business. A self-service cloud for a global SaaS company will look very different from one for a European bank. Often the key value a self-service cloud adds is stitching things together in a standard way.

A couple of obvious examples of this are multi-cloud and hybrid cloud needs. Both scenarios require solving authentication, networking, and other integration needs. These are problems that must be solved in a standard, well-engineered way to avoid operational, security, and compliance issues. Making this self-service means it is automated and can be deployed at the click of a button.

Taking Infrastructure-as-Code to the Next Level

A standard building block used in self-service cloud environments is an infrastructure-as-code (IaC) solution such as Terraform, OpenTofu, or Pulumi. These allow everything you build out in the cloud or other infrastructure to be automated. For a self-service experience where anyone can request infrastructure, this is a must-have. It is how an engineer can define a security policy or anything else for a cloud system once and allow everyone else to use it. The setup is captured not as knowledge or human steps but as code (hence "infrastructure as code"). While there is no "right answer" for which IaC solution to use, it should be platform agnostic (so no AWS CloudFormation), as it is used to drive the self-service experience across infrastructure as it evolves over time.
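As a minimal sketch of "define it once and allow everyone else to use it", the Python below models a pattern as plain data plus one guard rail. It is not Terraform, OpenTofu, or Pulumi code. The names (StorageBucketPattern, ALLOWED_REGIONS, render) and the region list are hypothetical and only illustrate that the policy lives in code rather than in someone's head.

```python
from dataclasses import dataclass

# Hypothetical company policy: data may only be stored in these regions.
ALLOWED_REGIONS = frozenset({"us-east", "us-west"})


@dataclass(frozen=True)
class StorageBucketPattern:
    """A reusable pattern: teams fill in the blanks, policy is baked in."""

    name: str
    region: str
    owner_team: str

    def __post_init__(self) -> None:
        # The guard rail is written once here and applies to every request.
        if self.region not in ALLOWED_REGIONS:
            raise ValueError(
                f"region {self.region!r} is not allowed; "
                f"choose one of {sorted(ALLOWED_REGIONS)}"
            )

    def render(self) -> dict:
        """Return the settings an IaC tool would be asked to apply."""
        return {
            "name": f"{self.owner_team}-{self.name}",
            "region": self.region,
            # Security defaults that requesters cannot override.
            "encrypted": True,
            "public_access": False,
            "tags": {"owner": self.owner_team},
        }


if __name__ == "__main__":
    bucket = StorageBucketPattern("usage-reports", "us-east", "customer-success")
    print(bucket.render())
    try:
        StorageBucketPattern("usage-reports", "eu-central", "customer-success")
    except ValueError as err:
        print("rejected:", err)
```

A requester only supplies a name, a region, and an owning team. Encryption, public access, and the region restriction are decided by whoever wrote the pattern, which is the property a real IaC module would provide.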

Sometimes a boring UX is the best UX

We must remember that a tool only adds value if people use it. A self-service cloud system should be easy to use by as many people as possible, which means a boring user experience (UX) via an easy-to-use web UI. This user experience should be tailored to your business's needs and have clear goals for both what we want users to do and what we want them not to do (anti-goals).

Anti-goals can be just as informative as goals, if not more so, because the business controls required to standardize, secure, etc. are often accomplished by preventing people from doing everything except the few things they're supposed to. But since you are reading this guide, you are probably looking for something more concrete than that, which is why the sections that follow dive into what this looks like.

System Qualities for Max Leverage

Let's be honest here: there is no standard turnkey solution for a self-service cloud. It is connecting a lot of massive components and complex platforms to the unique needs, people, and processes that are your business. So the definition of success is business leverage. The best architecture is the architecture that benefits your business the most.

That being said, there are a few key qualities we would expect any implementation to have:

  1. Easy to use web UI: As covered above, make it easy and boring so that the most people can get the most value from it.

  2. Integration with standard SSO: All staff log in the same way they log in to anything else at work.

  3. Access Control: There's a standard way we control who can do what.

  4. Approval Flows: There's a standard way people can request things and have them approved.

  5. Automated Deployments: Once internal people with the right permissions have logged in to the web UI and gotten approval (if needed) on the requested infrastructure, the system should immediately build, configure, deploy, change, or otherwise execute that infrastructure for that person in a way they can easily track. The infrastructure-as-code (IaC) system we mentioned above is the heart of this.

  6. Deployment Queuing: Making changes to infrastructure is complicated, and a lot of times you really don't want multiple things making changes to the same infrastructure at the same time. To solve this, deployment queuing should be implemented. Without it, my changes to the core network and your changes to the core network could run at the same time and leave the core network in a broken state.

  7. Centralized Pattern Catalog: This is often the driving force for wanting self-service infrastructure in the first place. A lot of smart people spent a good amount of time figuring out how to deploy networks, VMs, containers, databases, etc. in a way that checked every box needed by the entire company. By having a centralized catalog for this we let everyone leverage these patterns and save themselves the time of inventing their own solutions.

  8. Auditable: You can see and query the history of everything done by everyone. This is not often the motivating need for doing self-service infrastructure but is a very nice quality that falls out of it. One example use is that incident response can rapidly check for correlations between the production issues they're handling and infrastructure changes. If a firewall rule was updated at 4:51pm and a feature of your app stopped working at 4:51pm, there's a pretty good chance they're related, and the response team can see and take action on that in seconds as opposed to DM'ing half the company asking if anyone made a change at that time.

  9. No Circular Dependency: This is a technical way of saying "the self-service infrastructure system shouldn't depend on itself to work". Engineers will be tempted to use it to manage the infrastructure it runs on. The problem is that when they inevitably make an error and break something (we are all human) they can't recover it, because the way to recover it was to use it... but it's broken, which is why we are trying to recover it. See the circular dependency? Some of the best companies in the world have been bitten by circular infrastructure dependencies. Remember that time in 2021 when Facebook reportedly had to use power tools to get into its own data centers to manually recover core network components that were dependencies for nearly everything across the company, including the physical building security and the network that could have been used to recover them? Not a good time.
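Several of the qualities above (access control, approval flows, deployment queuing, and auditability) can be sketched together in a few dozen lines. This is a toy model under stated assumptions, not a reference implementation: the class SelfServiceDeployer, its methods, and the people and targets in the example are all hypothetical, and a real system would call the IaC tool where the comment says so.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ChangeRequest:
    requester: str
    target: str  # the infrastructure being changed, e.g. "core-network"
    description: str
    approved_by: Optional[str] = None


class SelfServiceDeployer:
    """Toy model of approvals, per-target queuing, and an audit log."""

    def __init__(self, approvers):
        self.approvers = approvers        # target -> set of people who may approve
        self.queues = defaultdict(deque)  # target -> approved changes waiting to run
        self.audit_log = []               # (timestamp, actor, action, target)

    def _record(self, when, actor, action, target):
        self.audit_log.append((when, actor, action, target))

    def approve(self, request, approver, when):
        # Access control: only listed approvers may approve changes to a target.
        if approver not in self.approvers.get(request.target, set()):
            self._record(when, approver, "approval denied", request.target)
            raise PermissionError(f"{approver} cannot approve {request.target}")
        request.approved_by = approver
        # Queuing: approved changes line up per target instead of running at once.
        self.queues[request.target].append(request)
        self._record(when, approver, "approved", request.target)

    def deploy_next(self, target, when):
        """Run the oldest queued change for a target, one at a time."""
        if not self.queues[target]:
            return None
        request = self.queues[target].popleft()
        # A real system would invoke the IaC tool here.
        self._record(when, request.requester, f"deployed: {request.description}", target)
        return request

    def changes_between(self, start, end):
        """Audit query: every recorded action inside a time window."""
        return [entry for entry in self.audit_log if start <= entry[0] <= end]


if __name__ == "__main__":
    deployer = SelfServiceDeployer({"core-network": {"carol"}})
    day = datetime(2024, 1, 15)  # arbitrary example date
    mine = ChangeRequest("alice", "core-network", "update firewall rule")
    yours = ChangeRequest("bob", "core-network", "add subnet")
    deployer.approve(mine, "carol", day.replace(hour=16, minute=40))
    deployer.approve(yours, "carol", day.replace(hour=16, minute=45))
    # The two core-network changes run one after the other, never together.
    deployer.deploy_next("core-network", day.replace(hour=16, minute=51))
    deployer.deploy_next("core-network", day.replace(hour=17, minute=5))

    # Incident response: what changed within five minutes of 4:51pm?
    incident = day.replace(hour=16, minute=51)
    window = timedelta(minutes=5)
    for entry in deployer.changes_between(incident - window, incident + window):
        print(entry)
```

The design choice worth noting is that the queue is keyed by target, so unrelated environments never wait on each other while two changes to the same core network are always serialized.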

Many Environments with Small Scopes

A common mistake Urban Dynamics has seen when clients attempt to solve this problem is having one "mega environment". This is effectively taking the old-school IT design philosophy you would have seen at a bank, manufacturing plant, or college ~20 years ago and implementing it for your self-service cloud. This results in things like a single giant network with a VPN and tons of firewalls and networking rules and Excel sheets and nightmares of those Excel sheets. Do not do this.

Instead, what we want are many small environments. At scale, this means there will be hundreds or thousands of them. This is better, we promise. A 19-year-old intern is inevitably going to create a new cloud staging environment they shouldn't have been able to create, because their manager (who is losing their mind managing six teenage interns) set up their permissions wrong. When this happens, and said intern starts doing dumb things in it and breaks the network routing table, it will be fine because that environment is completely disconnected from the hundreds of real environments actually powering your business.

A key quality we want from these isolated environments is that each is set up the same way. Each will be isolated. Each will have the same security controls. A response team can navigate each roughly the same way. And when they do respond to a fire in one, it should be contained to that one. Obviously, things in different environments will need to talk to one another, but this should be the exception (only things that need this connectivity have it enabled) rather than the default (everything can talk to everything and you have to work to prevent most of it).
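The "exception over default" rule can be sketched as a default-deny lookup: a connection exists only if the owner of the destination environment approved it. The class and environment names below (EnvironmentRegistry, "saas-usage-data", and so on) are hypothetical and stand in for whatever your cloud's networking and identity controls actually enforce.

```python
class EnvironmentRegistry:
    """Tracks isolated environments and the few approved links between them."""

    def __init__(self):
        self.owners = {}      # environment name -> owning team
        self.allowed = set()  # approved (source, destination) pairs

    def create(self, name, owner):
        # New environments start with no connections at all.
        self.owners[name] = owner

    def approve_connection(self, source, destination, approver):
        # Only the owner of the destination may open it up to another environment.
        if self.owners[destination] != approver:
            raise PermissionError(f"{approver} does not own {destination}")
        self.allowed.add((source, destination))

    def can_connect(self, source, destination):
        # Default deny: anything not explicitly approved is blocked.
        return (source, destination) in self.allowed


if __name__ == "__main__":
    registry = EnvironmentRegistry()
    registry.create("saas-usage-data", owner="product-team")
    registry.create("cs-analytics", owner="customer-success")
    registry.create("intern-sandbox", owner="platform-team")

    # The data owner approves the one connection that is actually needed.
    registry.approve_connection("cs-analytics", "saas-usage-data", approver="product-team")

    print(registry.can_connect("cs-analytics", "saas-usage-data"))
    print(registry.can_connect("intern-sandbox", "saas-usage-data"))
```

In this sketch the analytics environment can reach the usage data because its owner said yes, while the intern's sandbox can reach nothing, so breaking it harms nothing else.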

Architecture Templates

Everything above has been high level, and some readers are surely wondering by this point what this looks like in practice. For those readers, here are the two "templates" Urban Dynamics usually starts with, chosen based on which best aligns with our client's needs. These are still going to be heavily customized for a particular implementation but should help give an idea.

  1. Simple Engineer Only Solution: Do you have fewer than 100 people who need to use a self-service cloud and they're all technical people? Then building on top of an existing platform that provides SSO, approval workflows, etc. is going to be your best bang for your buck. This will probably be an engineering-specific platform (such as GitHub) but that's ok because your goal is letting your engineers maximize their time and avoid overhead on the infrastructure they use.

  2. Bespoke Enterprise Solution: Do you have hundreds or thousands of people who need to interact with infrastructure? Do they have varying levels of technical skill, from senior cloud engineers to business analysts? Do you have complex enterprise needs for approval management, cost management, compliance, audit, etc.? Then you need a bespoke enterprise solution that does all of this for your particular scenarios. It will require a small dedicated senior team to build, own, and evolve. It will be a lot of software code and automation. It will be deployed and run by your business. This is not "cheap" but it is the price of creating the leverage that makes hundreds or thousands of people 5% more efficient at their jobs and of reaping the security, compliance, and other benefits of standardized infrastructure across your company. You will also never hit the "we can't do business initiative X because our vendor solution doesn't support it" wall, a corner many enterprises paint themselves into via hastily purchased solutions.

Business Buy In

We have experience implementing self-service infrastructure for clouds and the payoffs are very real. However, business understanding and buy-in is key. Creating a self-service infrastructure system has very direct, centralized hard costs and very indirect, widespread benefits. Without proper understanding and buy-in, a self-service infrastructure effort (cloud or otherwise) risks being prematurely killed because the business will clearly see the costs and be blind to the benefits. Going into this there should be some quantification, even if loose, of the following:

  1. The amount of resources going into creating and managing infrastructure across the business.

  2. The resource inefficiencies from poorly implemented infrastructure by groups who do not have this as a core skillset.

  3. The cost of making business policy changes to infrastructure, such as adding or removing a cloud provider, data platform, networking service, etc.

  4. The risk carried from operations having to deal with many bespoke environments across different teams and units in the business.

  5. The risk carried from a lack of centralized security standards and controls across all company infrastructure.

With these estimates in hand, the value of self-service infrastructure can be measured, and the cost of pursuing it can be weighed against the benefits in the list above.
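As a loose illustration of that weighing, the arithmetic below compares an efficiency benefit against the cost of the owning team. It covers only the efficiency items in the list; the risk items need their own estimates. The function names are hypothetical, and every input in the example is a placeholder to be replaced with your own figures, not data from any client.

```python
def annual_benefit(staff, loaded_cost_per_person, efficiency_gain_percent):
    """Yearly value of making `staff` people a few percent more efficient."""
    return staff * loaded_cost_per_person * efficiency_gain_percent / 100


def break_even_staff(platform_cost, loaded_cost_per_person, efficiency_gain_percent):
    """Smallest number of users at which the benefit covers the platform cost."""
    value_per_person_x100 = loaded_cost_per_person * efficiency_gain_percent
    # Ceiling division on integers avoids floating point rounding surprises.
    return -(-platform_cost * 100 // value_per_person_x100)


if __name__ == "__main__":
    # Placeholder inputs for illustration only; replace with your own estimates.
    staff = 1_000
    loaded_cost_per_person = 150_000
    platform_cost = 1_500_000
    gain_percent = 5  # the "5% more efficient" figure used earlier in this guide

    benefit = annual_benefit(staff, loaded_cost_per_person, gain_percent)
    print("estimated annual benefit:", benefit)
    print("net of platform cost:", benefit - platform_cost)
    print("break-even users:",
          break_even_staff(platform_cost, loaded_cost_per_person, gain_percent))
```

Even a rough version of this calculation gives the business something concrete to set against the very visible cost of the central team.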

Managing Evolving Standards

As your self-service cloud grows, gains users, and increases in scope there will inevitably be the need to make "breaking changes". A couple of examples of breaking changes would be "we used to allow you to use a certain cloud service but due to new terms of service we are ending our usage of it" or "we are no longer doing business in this country and therefore will be removing it from available regions to run cloud services".

While breaking changes should not occur regularly, when they do it's important you have a clear pattern in place for people to follow. Who is impacted? How long do they have to make the change? What's the clear migration plan? Will there be additional resources provided to assist with this? Questions like these must be answered up front to avoid large amounts of friction, because executing the change is going to be owned by the team that owns the self-service infrastructure. The more friction there is, the more time and cost that central team incurs and the fewer resources they can spend on their standard road map.

Breaking changes are inevitable as the standards that govern the self-service cloud evolve. Having a clear pattern for what you do when that occurs is key. Have this be something the team that owns the self-service infrastructure is well prepared to handle.
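Because every environment is created through the same system, "who is impacted?" and "how long do they have?" can be answered with a query rather than a survey. The sketch below assumes a simple inventory of environments; the record types, the function plan_region_removal, and the example data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Environment:
    name: str
    owner: str
    region: str


@dataclass(frozen=True)
class MigrationNotice:
    environment: str
    owner: str
    reason: str
    deadline: date


def plan_region_removal(environments, region, announced_on, notice_days, reason):
    """List who is impacted by removing a region and when they must be done."""
    deadline = announced_on + timedelta(days=notice_days)
    return [
        MigrationNotice(env.name, env.owner, reason, deadline)
        for env in environments
        if env.region == region
    ]


if __name__ == "__main__":
    # Hypothetical inventory; a real one would come from the platform's records.
    inventory = [
        Environment("billing-prod", "finance-eng", "region-a"),
        Environment("cs-analytics", "customer-success", "region-b"),
        Environment("search-staging", "search-team", "region-a"),
    ]
    notices = plan_region_removal(
        inventory,
        region="region-a",
        announced_on=date(2024, 1, 15),  # arbitrary example date
        notice_days=90,
        reason="no longer doing business in this country",
    )
    for notice in notices:
        print(notice)
```

Each notice names an owner and a deadline, which gives the central team a concrete list to communicate with and track against.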

Documentation is Key

Remember the goal of a self-service cloud is that people can "just use it themselves" and that mentality should extend to how people find it and are onboarded. People can't use tools they don't know about or don't know how to use. Good documentation is the key to letting people "help themselves" through this. With good documentation, people can find your self-service cloud solution, go through a getting-started tutorial (for both managers and engineers), follow guides for standard use cases, learn how to ask for help, etc.

If this is a topic that interests you, feel free to check out our blog post Writing to Make Things Real: How to get people to value documentation.

Part of Self-Service Infrastructure

A self-service cloud is generally the first step in self-service infrastructure. For major enterprises, like Fortune 500 companies, the need for self-service infrastructure will push past a single cloud and instead span numerous platforms. This means a single self-service system where teams can utilize many clouds (like AWS, GCP, & Azure), many network solutions (like Cloudflare, Akamai, & Megaport), internal resources (such as a global backbone, resources in company data centers, etc.), and other platforms the enterprise uses.

Conclusion

This completes the Urban Dynamics guide on self-service clouds! We covered a lot in roughly 20 minutes of reading: the business case for self-service infrastructure, how it's powered by infrastructure-as-code, the qualities it needs to have, using it to power many small environments, business buy-in, and more. If you found this useful then share it, post it, and send it to others. Self-service clouds are a key solution many enterprises need to maximize their velocity while satisfying business-wide requirements at scale.

This guide is meant to flesh out self-service infrastructure and allow readers to understand and talk about it. However, implementing a system like this is the real work that must be done to experience the benefits of self-service infrastructure. If this is of interest to you, please reach out using the button below to talk with us. We're always happy to help those facing these complex needs.

References

Want to read more on this topic? Here are a few links to get started.

  1. Terraform by HashiCorp, an infrastructure-as-code tool: https://www.terraform.io

  2. OpenTofu, an open source fork of Terraform: https://opentofu.org

  3. Pulumi, an infrastructure-as-code tool: https://www.pulumi.com

  4. AWS cloud platform: https://aws.amazon.com

  5. Google Cloud platform: https://cloud.google.com/

  6. Azure cloud platform: https://azure.microsoft.com/en-us

  7. Megaport network platform: https://www.megaport.com