In a blog post a few months ago, I announced that my team had released the new “global” IBM Cloud Console (formerly Bluemix Console), allowing all public regions of the IBM Cloud platform to be managed from a single location. This took us from four addresses (one for each of the four public IBM Cloud regions at the time) to a single geo load-balanced address. Users now always get the UI served from the geographically closest healthy deployment, which improves performance and enables failover to greatly increase availability. In this post, I’ll dig a bit deeper so that you can gain insight into what it would take for you to build similar solutions with your own IBM Cloud apps. In particular, I’ll discuss:

  • High-level features of our architecture and the enabling third-party products we used
  • Things to think about while building a health check to determine when failover should occur
  • Considerations for coding apps so they are enabled to run in multiple data centers
  • Using the architecture to smooth the transition to new technology

Hybrid Architecture With Akamai and Dyn

The IBM Cloud Console has long been fronted by two offerings from Akamai:

  • Akamai Kona for web application firewall (WAF) and distributed denial of service (DDoS) protection.
  • Akamai Ion for serving static resources via a content delivery network (CDN), finding optimal network routes, reusing SSL connections, etc.

With the implementation of the global console, we added the Dyn Traffic Director (TD) product to the mix. All requests to the console are sent to the Akamai network, and then Akamai acts as a proxy. Akamai does a DNS lookup to determine the IP address to forward the request to, using a host name that has been configured with a Dyn TD. The Dyn configuration is set up to spread traffic based on geolocation to one of six IBM Cloud data centers. This is shown in the diagram below.

Global Console High-level Architecture

Since Akamai also offers its Global Traffic Management (GTM) product for load balancing, it may seem a bit strange to be using this hybrid solution. But we had an existing contract with Dyn, and we simply decided to leverage that instead of adding GTM to our Akamai contract. This has worked quite well for us.

Importance of Health Check

Every 60 seconds or so, Dyn checks the health of each console deployment by probing a health check API written by my team. If the health check cannot be reached or if it responds with a 4xx or 5xx error, then Dyn marks the associated data center as unhealthy and takes it out of the rotation. This is what is meant by a “failover.” At this point, requests that would go to the unhealthy deployment are instead sent to the next closest deployment. In this way, the user never knows there was an issue and continues working as if nothing has happened. Eventually, when the health of the failed deployment recovers, Dyn will put it back in the rotation and route traffic to it.

In the diagram below, the Tokyo deployment of the console is unhealthy, and traffic that would normally go there starts flowing to Sydney.

Global Console High-level Architecture With Failed Data Center
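
To make the failover behavior concrete, here is a minimal Python sketch of the decision described above. It is not Dyn's actual algorithm or configuration: the `Deployment` class, the `distance_km` field, and the distances are hypothetical stand-ins for whatever geo logic the traffic director really applies. It only illustrates the two rules from the text: an unreachable health check or an error response marks a deployment unhealthy, and traffic goes to the closest deployment that is still healthy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployment:
    name: str
    distance_km: float  # hypothetical distance from the user to the data center
    healthy: bool = True


def is_healthy(status_code: Optional[int]) -> bool:
    """Interpret a probe result; None means the health check was unreachable."""
    return status_code is not None and status_code < 400


def pick_deployment(deployments: List[Deployment]) -> Optional[Deployment]:
    """Return the closest healthy deployment, or None if all are unhealthy."""
    in_rotation = [d for d in deployments if d.healthy]
    if not in_rotation:
        return None
    return min(in_rotation, key=lambda d: d.distance_km)


# A user near Tokyo; the distances are made up for illustration.
deployments = [
    Deployment("Tokyo", 50),
    Deployment("Sydney", 7800),
    Deployment("Dallas", 10300),
]

# Simulate the Tokyo health check returning a 503.
deployments[0].healthy = is_healthy(503)
chosen = pick_deployment(deployments)
print(chosen.name if chosen else "no healthy deployment")
```

When the Tokyo probe succeeds again, setting `healthy` back to `True` puts it back in the rotation, which mirrors the recovery step described earlier.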

Clearly, the algorithm used by the health check plays a very important role in the overall success of the architecture. So, when building your own health checks, you should think carefully about the key components that influence the health of your deployments. For example, Redis is one of our absolutely critical components because we use it to store session state. Without Redis, we cannot maintain things like a user’s auth token. So, if one of our deployments can no longer connect to its local Redis, then we need to fail over.

On the flip side, there may be other dependencies that are not nearly as critical. For example, if our Dallas console deployment cannot connect to the API for Cloud Foundry (CF) in Dallas, the majority of the console functionality will continue to work. Other console deployments probably can’t connect to the API either, so there is probably not much point in failing over.

Finally, the health check can be very helpful for making proactive failovers easy. For example, we made our health check configurable so we can force it to return an error code. We have made use of this on several occasions such as when we knew reboots were required while patches for Meltdown and Spectre were being deployed in SoftLayer. We elected to take console deployments in those data centers out of the rotation until we knew those data centers (and our deployments within them) were back online.
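
The three ideas above (critical dependencies, non-critical dependencies, and a manual override) can be sketched in a few lines of Python. This is an illustration under assumptions, not our actual health check: the function names, the probe names `redis` and `cf_api`, and the `force_unhealthy` flag are all hypothetical. Each probe is just a callable that returns `True` when its dependency is reachable.

```python
from typing import Callable, Dict, Iterable, Tuple


def run_probe(probe: Callable[[], bool]) -> bool:
    """Run one dependency probe; any exception counts as a failure."""
    try:
        return bool(probe())
    except Exception:
        return False


def health_check(
    probes: Dict[str, Callable[[], bool]],
    critical: Iterable[str],
    force_unhealthy: bool = False,
) -> Tuple[int, Dict[str, bool]]:
    """Return an HTTP status code plus per-dependency results.

    Only failures of dependencies listed in `critical` produce an error
    status. `force_unhealthy` is a hypothetical switch for taking a
    deployment out of the rotation on purpose.
    """
    results = {name: run_probe(probe) for name, probe in probes.items()}
    if force_unhealthy:
        return 503, results
    if any(not results.get(name, False) for name in critical):
        return 503, results
    return 200, results


# Hypothetical probes: Redis is reachable, the CF API is not.
probes = {"redis": lambda: True, "cf_api": lambda: False}
status, details = health_check(probes, critical=["redis"])
print(status, details)
```

Because `cf_api` is not in the critical list, its failure is reported in the details but does not trigger a failover. Keeping the critical list short is the design choice that matters here: every entry is something that can pull a whole data center out of the rotation.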

Impact to Microservice Implementation

As described in a previous post, each console deployment contains a set of roughly 40 microservices running behind a reverse proxy. In our original implementation, our microservices tended to be tied to APIs in the region they were deployed to. For example, our Dallas deployment could only manage CF resources in Dallas, our London deployment only CF resources in London, and so on. This is illustrated in the pre-Dyn diagram below where microservices in the three data centers only talk to the “backend” within the same region.

Microservices Tied to One Backend

This worked fine for us when we had a separate URL for each console deployment and users knew they had to go to the London console URL to manage their London resources. However, this architecture was not conducive to the goals of the global console, where we wanted the UI to be served from the geographically nearest data center and to remain accessible even if all but one deployment failed. To accomplish this, we needed to decouple the microservices from any one specific region and enable them to communicate with equivalent APIs in any of the other regions based on what the user was requesting. This is shown in the diagram below.

Microservices Communicate With Different Backends
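
As a rough illustration of that decoupling, the Python sketch below picks a backend from the region the user is asking about rather than from the region the microservice happens to run in. The region keys and the `example.com` host names are placeholders, not real IBM Cloud endpoints, and `backend_for` is a hypothetical helper.

```python
from typing import Dict

# Placeholder endpoints; real host names would come from configuration.
BACKENDS: Dict[str, str] = {
    "dallas": "https://api.dallas.example.com",
    "london": "https://api.london.example.com",
    "sydney": "https://api.sydney.example.com",
}


def backend_for(requested_region: str) -> str:
    """Resolve the backend API for the region the user wants to manage.

    The lookup deliberately ignores where this microservice is deployed,
    so any console deployment can serve requests for any region.
    """
    key = requested_region.strip().lower()
    if key not in BACKENDS:
        raise ValueError(f"unknown region: {requested_region}")
    return BACKENDS[key]


# A microservice running in Dallas handling a request for London resources.
print(backend_for("London"))
```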

Of course, an astute reader might point out we’d be even better off if all of the backend APIs provided their own globally load-balanced endpoints. Then a console microservice would always be able to point at the same host names no matter where it is deployed. And, indeed, we do have many APIs in the IBM Cloud ecosystem that are moving in that direction.

Smoothing Migration from Cloud Foundry to Kubernetes

This architectural update has been great for us in many ways, and has given us much more flexibility in determining where to deploy the console throughout the world. It has also had the added benefit of making it easy for us to roll out deployments running on different technologies without end users ever knowing.

Historically, the console has run on Cloud Foundry on the IBM Cloud, but we are nearly done with a migration to Kubernetes (managed by the IBM Cloud Container Service). We have been able to add Kubernetes deployments into the rotation simply by updating our Dyn configuration. This has allowed us to vet Kubernetes fully before turning off our CF deployments entirely. This is represented in the diagram below showing Dyn load balancing between two CF deployments and three Kubernetes deployments.

Load Balancing Between CF and Kubernetes Deployments


We’re excited by the improvements in performance and reliability we’ve been able to provide our customers with the global console. I hope some of the lessons and insights that my team has gained in the process will help your efforts as well.

Related Resources

For additional material on this subject, please see Configure and run a multiregion Bluemix application with IBM Cloudant and Dyn by colleague Lee Surprenant.


In June, I had the honor of attending the Cloud Foundry Summit Silicon Valley 2017 conference in Santa Clara, CA. My two submissions related to Bluemix UI architecture were selected, and I got the chance to present them as part of the conference’s Cloud Native Node.js track. In this post, I’ll briefly describe my talks as well as share some general takeaways from the conference.

Topic 1: Microservices Architecture of the Bluemix UI

The full title of my first topic was To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservices on Cloud Foundry. The intent of the talk was to trace my team’s journey migrating the Bluemix UI from a monolithic app to a microservices architecture.

The Bluemix UI (which runs on Cloud Foundry) is the front-end to Bluemix, IBM’s open cloud hosting platform. The original implementation as a single-page, monolithic Java web app brought with it many demons, such as poor performance, lack of scalability, inability to push small updates, and difficulty for other teams to contribute code. Over the last 2 years, the team has been on a mission to slay these demons by embracing cloud native principles and splitting the monolith into smaller Node.js microservices. The effort to migrate to a more modern and scalable architecture has paid large dividends, but has also left behind a few battle scars from wrestling with the added complexity cloud native can bring. The team had to tackle problems in a wide variety of areas, including: large-scale deployments, continuous integration, monitoring, problem determination, high availability, and security.

In the talk, I went on to discuss the advantages of microservice architectures, ways that Node.js has increased developer productivity, approaches to phasing microservices into a live product, and real-life lessons learned in the deployment and management of Node.js microservices across multiple Cloud Foundry environments.

If you’d like to see the full presentation, check out the slide deck below:

Or, if you prefer video, you can watch the talk on YouTube:

Topic 2: Monitoring Node.js Microservices

My second topic was called Monitoring Node.js Microservices on Cloud Foundry with Open Source Tools and a Shoestring Budget. During the migration described in my first talk, we learned that while microservice architectures offer lots of great benefits, there’s also a downside. Perhaps most notably, there is an increased complexity in monitoring the overall reliability and performance of the system. In addition, when problems are identified, finding a root cause can be a challenge. To ease these pains in managing the Bluemix UI, we’ve built a lightweight system using Node.js and other open source tools to capture key metrics for all microservices (such as memory usage, CPU usage, speed and response codes for all inbound/outbound requests, etc.).

In this approach, each microservice publishes lightweight messages (using MQTT) for all measurable events while a separate monitoring microservice subscribes to these messages. When the monitoring microservice receives a message, it stores the data in a time series DB (InfluxDB) and sends notifications if thresholds are violated. Once the data is stored, it can be visualized in Grafana to identify trends and bottlenecks.
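
The publish/subscribe flow described above can be sketched without any external services. In the Python example below, a tiny in-memory `Broker` stands in for MQTT and a plain list stands in for InfluxDB. The real system was written in Node.js, so every class name, topic name, and message field here is a hypothetical simplification meant only to show the shape of the data flow.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class Broker:
    """In-memory stand-in for an MQTT broker (illustration only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(message)


class Monitor:
    """Stores every metric it receives and records threshold violations."""

    def __init__(self, broker: Broker, thresholds: Dict[str, float]) -> None:
        self.points: List[dict] = []  # stand-in for a time series DB
        self.alerts: List[str] = []   # stand-in for sent notifications
        self.thresholds = thresholds
        broker.subscribe("metrics", self.on_message)

    def on_message(self, message: dict) -> None:
        self.points.append(message)
        limit = self.thresholds.get(message["metric"])
        if limit is not None and message["value"] > limit:
            self.alerts.append(
                f'{message["service"]}: {message["metric"]}='
                f'{message["value"]} exceeds {limit}'
            )


broker = Broker()
monitor = Monitor(broker, thresholds={"response_ms": 500})

# A microservice publishes one lightweight message per measurable event.
broker.publish("metrics", {"service": "catalog", "metric": "response_ms", "value": 120})
broker.publish("metrics", {"service": "catalog", "metric": "response_ms", "value": 900})
print(len(monitor.points), monitor.alerts)
```

The point of the indirection is that the microservices never talk to the database or the alerting logic directly. They only publish, so the monitoring side can change without touching them.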

In the presentation, I described the details of the Node.js implementation, real-world examples of how this system has been used to keep the Bluemix UI running smoothly without spending a lot of money, and how the system has acted as a “canary in the mine shaft” to find problems in non-UI subsystems before the relevant teams even knew there was an issue!

The slide deck for the presentation is available below:

And, you can also watch it on YouTube:

Takeaways from the Conference

This was my second trip to CF Summit, and in both cases it was a great experience. On my first trip in 2015, I gave a talk with Brian Martin when my team was basically just getting started on our journey to microservices. Then, I was a little naive about what we were getting into, but this time around I was far more battle-hardened and had more in-depth knowledge and experiences to share.

One thing I noticed in the questions afterward this time is that more people came up to me and asked questions specific to their own journeys re-architecting monoliths. This tells me there are a lot of organizations struggling with what to do with their legacy code bases and that they are hungry for guidance. Of course, post-talk questions are far from scientific. But I found it interesting nonetheless.

One thing I emphasized to these folks was to not underestimate the need for robust monitoring as they build out their own microservices. As I discussed in far more detail in my second talk, I think this was the biggest mistake we made when we started the Bluemix UI migration.

Oh, yeah… Wally World

I stayed at the Hilton across the street from the Santa Clara Convention Center where the conference was held. From my room, I had a great view of Levi’s Stadium and California’s Great America.

Every morning I’d look out from my window and see the vast parking lots for both facilities sitting empty:

Empty Parking at Great America

And, each day I kept hoping the Griswolds would come driving up in their family truckster and see the park was closed, just like Wally World was in 1983.

We have been listening to your feedback on the Bluemix UI and have used that to design a brand new user experience (UX) that we believe will streamline your workflows. The new experience is now live for your immediate use. When you visit the Bluemix UI, you can choose to opt in to the new experience via a “Try the new Bluemix” link in the header bar:

Try New Bluemix Screenshot

In this blog, we’ll walk you through the new taxonomy organizing your resources, the redesigned catalog, the updated flows for creating new compute resources, the reorganized app details page, and more!

All Category Cards Screenshot

Original blog post co-authored with Amod Bhise.

Bluemix Updates: First Anniversary Celebration!

It’s hard to believe it’s already been a year since we announced the general availability of Bluemix. But, in honor of our first anniversary, we’ve got some big news to share. The exciting updates that went live late last week include:

  • Official release of IBM Containers making it easier to deliver production applications across hybrid environments
  • Addition of service keys to facilitate connecting to services from outside of Bluemix
  • Usability improvements to the Bluemix UI’s header and dashboard
  • Enhancements to documentation.

Bluemix Updates: IBM Containers in Catalog

Bluemix Updates: Cinco de Mayo!

It’s been a couple months since my last Bluemix Updates blog. But, the team has kept working, and on the eve of Cinco de Mayo I thought it was time to share some great new features and functions that have recently gone live. These include:

  • Unveiling of a new and improved Pricing page.
  • Introduction of a Labs section in the Bluemix Catalog.
  • Overhauled Bluemix Docs, including the ability to leave inline feedback.
  • Improvements to SSL certificate support for custom domains.
  • Ability to communicate with live IBM representatives via text chat and video chat.
  • Addition of Korean to the list of translations for the Bluemix UI and the Bluemix Docs.
  • Enhancements to catalog services (e.g., API Management and IBM Insights for Twitter) and boilerplates (e.g., Node-RED).
  • Updates to IBM Eclipse Tools for Bluemix.
  • Miscellaneous usability improvements and fixed defects.

Bluemix Update: New Pricing Page