IBM UrbanCode Deploy standardizes and automates deployment processes, speeding up and simplifying how you deploy application components both internally and externally. However, as you rely more on these components and use them in production environments, you become more dependent on the IBM UrbanCode Deploy server’s availability and ability to perform. This article discusses how to design and configure a scalable server with high-availability failover and disaster recovery capabilities.
- High availability versus disaster recovery
- Server/agent communication
- Disaster recovery
- Real-world deployments
- Knowing when to scale
One key distinction to draw is between high availability and disaster recovery. In this context, we use the term high availability to refer to horizontal scaling across multiple servers and data centers. High availability both increases capacity and adds fault tolerance when one or more servers are out of service or failing. By contrast, disaster recovery deals with a catastrophic failure in which the entire server or cluster of servers is unavailable. Most organizations require a disaster recovery plan as part of a business continuity plan. Even with the level of data center reliability and redundancy available today, a disaster recovery plan is still necessary to ensure that production applications can meet service-level agreements for uptime and availability.
In this article, I’ll cover the subsystems of an IBM UrbanCode Deploy installation and how to set each up for high availability. Then, I’ll cover how to plan and prepare for disaster recovery.
To start this discussion, here is a high-level architectural picture to set some context. The key point is that this solution is purposely built for large-scale deployments, on the order of thousands of deployment targets, and has been designed accordingly.
The following architectural diagram shows the major parts of the IBM UrbanCode Deploy server. The major subsystems are the configuration web tier, the workflow engine, and server/agent communication. These subsystems access the database and file storage through the artifact and file management subsystems. We'll take a quick look at how each of these can be scaled up.
The configuration subsystem is the user interface of the server and what clients access to configure and trigger deployments. This tier of the application exposes a set of RESTful APIs that are shared by the web UI, the command-line client, and API users for interacting with the system.
Because this subsystem is web-based, you can scale it horizontally across clustered servers to handle more concurrent access and improve throughput. An HTTP load balancer distributes traffic among the servers.
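As an illustration, here is a minimal sketch of the kind of health probe an HTTP load balancer could run against each clustered node before routing traffic to it. The host names and port are placeholders, not part of the product, and real load balancers have their own built-in health checks; this just shows the idea.

```python
import urllib.request
import urllib.error

# Hypothetical host names for the clustered UrbanCode Deploy servers.
SERVERS = ["ucd-node1.example.com:8443", "ucd-node2.example.com:8443"]

def is_healthy(host, timeout=5):
    """Probe a server over HTTPS; any HTTP response means the node is up."""
    try:
        urllib.request.urlopen(f"https://{host}/", timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # The server answered, even if with an error status.
    except OSError:
        return False  # Connection refused, timeout, DNS failure, etc.

for host in SERVERS:
    print(f"{host}: {'UP' if is_healthy(host) else 'DOWN'}")
```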
The following diagram shows a cluster of servers that are set up in this way. This solution requires that you configure the servers for clustering. The performance of the cluster is dependent on the performance of the underlying shared database and file system.
The workflow subsystem is responsible for orchestrating deployments across all the deployment targets, including distributing tasks and processing updates from the agents as they perform the deployment steps.
As the number of managed agents and concurrent deployments increases, so does the amount of processing time and bookkeeping. A single server can handle hundreds of these transactions concurrently, and you can keep adding processing capability to one server, but at some point it is more practical to share the load across multiple servers.
The following diagram shows how scaling the workflow engine subsystem spreads the load across the cluster to increase throughput. In this case, JMS mesh technology shares messaging data across the servers. One caveat: we cannot simply point a load balancer at a shared URL as in the HTTP case. Because the connections from agent to server are persistent, a round-robin DNS-style solution is the best approach.
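To make the round-robin DNS approach concrete, here is a small sketch showing how a single DNS name with one A record per clustered server spreads persistent agent connections across the cluster. The host name is a placeholder, and the port number is only illustrative; use your installation's configured JMS port.

```python
import random
import socket

# Placeholder DNS name with one A record per clustered server.
JMS_NAME = "ucd-jms.example.com"
JMS_PORT = 7916  # Illustrative; check your installation's JMS port.

# Resolve every A record behind the name.
records = socket.getaddrinfo(JMS_NAME, JMS_PORT, proto=socket.IPPROTO_TCP)
addresses = sorted({rec[4][0] for rec in records})
print("Cluster members:", addresses)

# Because the connection is persistent, each agent stays pinned to the
# member it resolved; a new connection may land on a different member.
chosen = random.choice(addresses)
print("This agent would hold its persistent JMS connection to:", chosen)
```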
Of course, an IBM UrbanCode Deploy environment contains far more agents than servers, and the number of incoming persistent connections to each server can get large. To reduce the strain on the server, you install agent relays, which give agents a local connection endpoint to the servers. Agent relays eliminate the need for each agent to have a direct connection to the server, which reduces the number of persistent connections and threads held open on the server. Having fewer agents directly connected to the server can also simplify security rules and is analogous to the jump servers that many organizations use to perform deployments to a DMZ or other untrusted environments.
The following diagram shows a cluster of servers that uses agent relays to moderate connections from agents. The agents connect to the agent relays, which connect to the servers through the JMS mesh. The red line represents a firewall; one agent relay makes connections through the firewall, which means that the agents that use this agent relay do not need to open their own connections through the firewall.
By sharing persistent connections between servers and agent relays in this way, a single server or a cluster of servers can manage many thousands of agents.
The artifact subsystem is the final key area we'll cover here. This subsystem handles the versioning and storage of deployable artifacts. Its performance depends heavily on the performance of the shared file system, which is also a crucial part of a disaster recovery solution. Because this file system is crucial to scaling the solution in a clustered environment, it is best to put it on fast SAN storage to minimize the latency of serving plugins and artifacts to agents. Naturally, the file system should also be reliable and backed up regularly.
IBM UrbanCode Deploy versions 6.1 and later include a feature that can improve file system performance and reduce latency in server-agent communication: agent relays cache artifacts and plugins and serve these resources to agents instead of retrieving them from the file system each time. As a result, agents get resources faster and load on the server is reduced. For information about artifact caching, see http://www-01.ibm.com/support/knowledgecenter/SS4GSP_6.2.0/com.ibm.udeploy.install.doc/topics/t_agent_relay_cache_setup.html
Note: Artifact caching adds a new connectivity requirement for agents that communicate with agent relays. The server and agents already use JMS on port 7916 and usually the HTTP proxy port (20080); artifact caching adds one more port on the agent relay, the HTTP proxy port plus one (20081).
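Because getting these ports open is easy to miss, here is a short sketch that checks TCP reachability from an agent host to its relay. The relay host name is a placeholder; the port numbers are the defaults mentioned above, so adjust them if your installation differs.

```python
import socket

RELAY_HOST = "ucd-relay.example.com"  # Placeholder relay host name.
PORTS = {
    7916: "JMS",
    20080: "HTTP proxy",
    20081: "artifact cache (HTTP proxy port + 1)",
}

for port, purpose in PORTS.items():
    try:
        with socket.create_connection((RELAY_HOST, port), timeout=5):
            print(f"{RELAY_HOST}:{port} ({purpose}): reachable")
    except OSError as err:
        print(f"{RELAY_HOST}:{port} ({purpose}): BLOCKED ({err})")
```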
To prepare for the worst possible case, we assume that nothing from the original server has survived a serious event. Of course, we always minimize the risk of this scenario by spreading the primary cluster, database, and shared file system over two or more data centers. However, it's also necessary to have a disaster recovery plan.
When a DR event occurs, what we need to bring the server back online and functioning is a relatively short list (a verification sketch follows the list):
- The database
- The asset repository file system
- The configuration directory of the IBM UrbanCode Deploy installation
- A new server or cluster to host the server
- Security rules in place for traffic (HTTP/HTTPS, JMS, JDBC, licensing)
- A DNS switchover
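As a sketch, a periodic job along these lines can confirm that the pieces on this list are actually in place at the DR site before you need them. All host names and paths here are placeholders for your own environment, and the checks are deliberately simple: TCP reachability and file presence, not full restore tests.

```python
import os
import socket

# Placeholder values; substitute your DR site's details.
DB_HOST, DB_PORT = "dr-db.example.com", 1521      # JDBC endpoint
REPO_PATH = "/shared/ucd/var/repository"          # Replicated artifact repository
CONF_PATH = "/shared/backups/ucd/conf"            # Backed-up configuration directory
LICENSE_PATH = "/shared/backups/ucd/license"      # DR copy of production licenses

def tcp_open(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

checks = {
    "database reachable": tcp_open(DB_HOST, DB_PORT),
    "artifact repository present": os.path.isdir(REPO_PATH),
    "server configuration backed up": os.path.isdir(CONF_PATH),
    "license backup present": os.path.exists(LICENSE_PATH),
}

for name, ok in checks.items():
    print(f"{'OK  ' if ok else 'FAIL'} {name}")
```

Running a check like this continuously turns the list above into something you verify ahead of time rather than discover during an outage.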
In practice, production systems should take at least nightly database backups and keep transaction logs so the database can be rolled forward to the point of failure. The file system should be synced or duplicated as close to real time as possible; most SAN devices already provide tools for duplication. When it's time to start new servers, they can be pre-built VM copies of the production servers or built ad hoc; in either case, the backed-up server configuration enables quick setup, including items like the correct SSL keys for client communication and for decrypting encrypted properties in the database. One thing I have seen overlooked too many times is a license to run the server; our best plans to bring up the server are in vain without one, so be sure that your backups include a DR copy of the production licenses.
Security rules can be a sticking point as well, so ensure that you have the rules required to get from your various endpoints to both the production server and the DR site server. Finally, some kind of name switchover using your organization's name management solution is necessary to move traffic from the disabled server to the new server.
Testing your disaster recovery plan is a good idea. You can test it in a disconnected network segment, which lets you simulate a real DNS cutover and ensure that services do in fact reconnect without your having to update every agent and agent relay. If you test the DR plan on the production network instead, there are some caveats. For one thing, you won't really be able to test the global DNS switchover. Also, the server knows its own URL, and depending on what actions you are testing, it may respond with this URL and inadvertently redirect tests to the production server. In this case, go to Systems > Settings and ensure that the URLs are correct for your testing environment.
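If you want to check those URLs programmatically rather than through the UI, a sketch along these lines can help. It assumes the /cli/systemConfiguration REST endpoint and the externalUserUrl and externalAgentUrl field names; treat both as assumptions and verify them against the REST documentation for your version. The server URL and credentials are placeholders.

```python
import base64
import json
import urllib.request

# Placeholder test-server URL and credentials.
SERVER = "https://ucd-dr-test.example.com:8443"
USER, PASSWORD = "admin", "changeme"

# Assumed endpoint and field names; confirm against your version's REST docs.
# A self-signed certificate may also require a custom SSL context.
req = urllib.request.Request(f"{SERVER}/cli/systemConfiguration")
token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
req.add_header("Authorization", f"Basic {token}")

config = json.load(urllib.request.urlopen(req, timeout=10))

# Confirm these URLs point at the test environment, not production,
# so test traffic is not inadvertently redirected.
for key in ("externalUserUrl", "externalAgentUrl"):
    print(f"{key}: {config.get(key, '<not present>')}")
```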
The following diagram shows how an IBM UrbanCode Deploy architecture could look fully scaled up. This diagram includes each of the high-availability tactics mentioned in this article, including clustered servers, agent relays, and artifact caching.
One note on using round-robin DNS versus a load balancer: in most cases, a load balancer handles the HTTP traffic to the servers, and round-robin DNS handles the JMS traffic. It is possible to do both with only a load balancer, but you must understand how to enable persistent round-robin connections, and you cannot attempt any kind of SSL off-loading. For that reason, using a load balancer for JMS traffic is officially unsupported, although it is possible to make round-robin behavior work with your choice of load balancer.
This diagram (example deployment #1) shows a typical large-scale deployment. As part of a disaster recovery plan, it includes a cold standby server that takes over if the production cluster fails.
This diagram (example deployment #2) shows a large-scale deployment that uses more high-availability features, including clustered servers, redundant agent relays, and artifact caching. This type of high-availability deployment is becoming more common as enterprise customers depend more on their servers.
A single IBM UrbanCode Deploy server and a few agent relays can support hundreds of target servers with dozens of daily deployments, with little to no tuning required. However, achieving enterprise scale, performance, and reliability ultimately relies on clustering. The decision to cluster should be based on two factors: first, a vision or plan, and second, the current state of the system.
First, plan to ensure that you have the capacity to meet your deployment needs, along with the availability, performance, and reliability requirements in your business continuity plans. For example, if you have availability requirements for your production web portal, then the IBM UrbanCode Deploy server should have the same requirements. This planning should cover expected usage today, expected growth over the next 6 months, and expected growth over the next year. I recommend making this part of an at least yearly reconciliation activity to ensure that you are staying on target; most organizations do this anyway as part of yearly budgeting.
The second part is reacting to a changing environment: this is a DevOps world, and your solution today could look dramatically different in 3 months if you start producing a range of new products or shift workloads across different technologies. In this dimension, standard application monitoring is crucial. You can start with simple metrics like CPU utilization, RAM utilization, disk utilization, disk I/O, network I/O, database growth, and database CPU utilization. The load characteristics of your deployment depend on the deployment workflows that you build and use in your organization, so your mileage may vary. Still, standard server high-water marks, combined with an understanding of the component subsystems discussed above, can serve as a roadmap to identify the bottlenecks and scaling issues you may be facing.
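As a starting point, here is a sketch that samples some of those metrics and tracks their high-water marks. It uses the third-party psutil package, and the thresholds are generic placeholders that you would tune to your own baseline rather than recommended values.

```python
import time
import psutil  # Third-party: pip install psutil

# Placeholder thresholds; tune these to your own baseline.
LIMITS = {"cpu": 80.0, "ram": 85.0, "disk": 90.0}
high_water = {"cpu": 0.0, "ram": 0.0, "disk": 0.0}

for _ in range(12):  # Sample for one minute at 5-second intervals.
    sample = {
        "cpu": psutil.cpu_percent(interval=None),
        "ram": psutil.virtual_memory().percent,
        "disk": psutil.disk_usage("/").percent,
    }
    for key, value in sample.items():
        high_water[key] = max(high_water[key], value)
    time.sleep(5)

io, net = psutil.disk_io_counters(), psutil.net_io_counters()
print("High-water marks:", high_water)
print(f"Disk I/O: {io.read_bytes} bytes read / {io.write_bytes} bytes written")
print(f"Net I/O: {net.bytes_recv} bytes received / {net.bytes_sent} bytes sent")

for key, limit in LIMITS.items():
    if high_water[key] > limit:
        print(f"WARNING: {key} peaked at {high_water[key]:.1f}% (limit {limit}%)")
```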
Understanding how to grow your IBM UrbanCode Deploy solution while keeping your end users satisfied is key to a successful deployment. Out-of-the-box clustering support, connectivity to clustered databases, and a solution built on best practices and proven components combine to make IBM UrbanCode Deploy a real heavyweight in enterprise-level deployment automation. Keeping the server running and performing well is only part of the puzzle, but it is an important one that deserves a solid plan.