Boyle’s 4th Law - Response Time Matters!

In today’s rapidly changing technology environment, where the customer portals of insurance and financial services companies are becoming more and more alike, the one thing that differentiates them most, and can draw a potential customer in, is response time. It is simultaneously the least measured and most important factor in all online transactions. Response time measurement should be easy, but it’s really not. In a world where technical architectures are the byproduct of 30-plus years of acquisitions, integrations and mergers, it is very difficult to understand the pain your customers sometimes go through when they access your website.

All of this is complicated by the explosion of hybrid computing environments, marked by the virtualization of servers and desktops through products like Citrix and VMware. This has led to a large number of private clouds, public clouds and companies building server farms with tens of thousands of machines on the back end and, in some cases, hundreds of thousands of users on the front end. That is a lot more complex than when we had a mainframe and some 3270 green screens.

See also: Boyle’s 3rd Law: Put Your Customers First and They Will Do the Same for You

The advent of these new age environments is helping to foster a new age of creativity and an explosion of new companies; we talked about this type of computing in one of my prior entries.

In that article, we talked about the power that was being given to people to start new companies through hardware and software environments being offered by Amazon, Google, EMC, Cisco, IBM and many others.

While these technologies have enabled inventive programmers to make the most of their hardware from a capacity management perspective, they also add complexity to the process of figuring out what is going wrong when a problem occurs. They often mask where the problems are occurring, which increases the mean time to resolve (MTTR). That means more downtime and worse response time. And for those of us who have managed data centers or large application portfolios, few sentences strike more fear into the heart of a CIO than “We don’t know where the problem is!” It is something that is happening more and more, and that’s scary.

But help is on the way. I’d like to take you through a situation that happened recently at a major financial services firm. The problem seemed almost insurmountable, yet at the end of the firm’s transition, the results were remarkable.

One day, someone in data center operations noticed that one of the servers was starting to spike from a performance perspective: the memory available to run critical software was rapidly vanishing, and the problem was spreading to other servers. Long story short, the entire Web portal application environment ended up frozen. This affected applications serving branch offices, the back office, agent and customer portals, and even the servers supporting the firm’s mobile technology platforms. What a disaster.

They held a huge “blame storming” call that lasted several hours. They consulted more than 50 different monitoring tools in their production environment. None of them revealed what was causing the problem. After several hours of madness, they decided to reboot the affected servers and, magically, the problems disappeared. They were heroes, for about a month, until the whole situation repeated itself… once a month for two years! They went from heroes to zeroes because they could not diagnose the problem with all of the great tools and people they had. This seriously damaged the credibility of their IT organization.

After more than two years of having these mysterious problems essentially disable their enterprise, they installed a new, lightweight agent that identified and diagnosed the root problem within two hours of completing the installation. It pointed squarely to a piece of software from a vendor who had been protesting their innocence for over a year. The vendor fixed their misconfigured server and the problem disappeared.

After several months of having the new software in production, they also found a significant side benefit: their MTTR had dropped by more than 75 percent. They now solve production issues much faster and, in some cases, even before users notice any problem whatsoever. What a difference from where they were previously.

What was the mystery software they put in place to figure it out? It turns out that this company looks at problems from an entirely different perspective: it looks at what is happening to response time, end to end, across the entire application topology and across all of the hybrid infrastructure. The software is from a company called AppEnsure, which is at the vanguard of a new series of companies creating software to help manage our increasingly complex environments. Companies like Splunk, Moogsoft, BlueStripe, Aternity, AppDynamics and a variety of others are taking advantage of disruptive technologies to provide fast and nimble monitoring.

The company using AppEnsure’s technology says that, since installing the toolset, it has saved more than half a million dollars in six months by seeing where things are starting to go wrong before users are affected, solving problems faster, and doing it all without the huge conference calls. In the end, the ROI on the purchase was more than 1,200 percent.

Today’s monitoring environments have been pulled together from bits and pieces of dozens of different tools that measure hundreds of different things, but those tools do not always tell you what is wrong.

AppEnsure has developed an entirely new way to define the construct for monitoring. How it does so is simple and yet exceptionally complex at the same time. The software reads header information from network traffic at layers 2 through 7 and builds a self-defining map of all endpoints in the environment. That means there is no need to configure the system to understand your network and application topology; it is self-configuring. Most tools of this nature require weeks, months or years of extensive configuration, with the associated expense, and then ongoing maintenance. Even then, the configuration management database (CMDB) often does not match what is actually in the environment, because changes are frequently made without being captured in the CMDB. AppEnsure can tell you what should be in your CMDB and can give you a complete listing of all of your components so that you can compare it with what you think you have.
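AppEnsure’s exact mechanism is proprietary, but the self-discovery idea can be sketched in a few lines: a monitoring agent derives the application topology purely from the source/destination pairs it observes in traffic, with no manual configuration. The host names and ports below are hypothetical, for illustration only:

```python
from collections import defaultdict

def build_topology(observed_flows):
    """Build a map of application endpoints from observed traffic alone.

    observed_flows: iterable of (src, dst) pairs taken from packet
    headers, e.g. ("web-01:443", "app-03:8080").
    """
    topology = defaultdict(set)
    for src, dst in observed_flows:
        topology[src].add(dst)  # record each hop the traffic reveals
    return {node: sorted(peers) for node, peers in topology.items()}

# Hypothetical flows observed on the wire
flows = [
    ("web-01:443", "app-03:8080"),
    ("app-03:8080", "db-02:5432"),
    ("web-01:443", "app-04:8080"),
]
print(build_topology(flows))
```

Because the map is rebuilt from live traffic, it stays current even when changes bypass the CMDB, which is exactly the gap described above.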

To make it even better, as this technology captures information about the environment, it builds a baseline of response-time performance from point to point. When there is a significant deviation from that baseline, it sends alarms to the application and development operations teams, letting them know that an issue is developing and where the problem is, and then it suggests causes and potential fixes.
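One common way to implement that kind of baseline-and-deviation alerting (a generic sketch, not AppEnsure’s actual algorithm) is to learn the normal latency for each hop and alarm when a new sample lands far outside it, for example more than three standard deviations above the mean:

```python
import statistics

def check_response_time(baseline_samples, latest_ms, threshold_sigma=3.0):
    """Alert when the latest response time deviates significantly from
    the learned baseline (more than threshold_sigma standard deviations
    above the baseline mean)."""
    mean = statistics.mean(baseline_samples)
    stdev = statistics.stdev(baseline_samples)
    if latest_ms > mean + threshold_sigma * stdev:
        return f"ALERT: {latest_ms} ms vs baseline {mean:.0f} ms"
    return None  # within normal range, no alarm

baseline = [102, 98, 105, 99, 101, 103, 97, 100]  # learned hop latencies, ms
print(check_response_time(baseline, 180))  # well above baseline: alert
print(check_response_time(baseline, 104))  # normal: None
```

A real product would also baseline per time of day and suppress duplicate alarms, but the core logic is this simple comparison applied to every point-to-point hop.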

This is certainly a tool to meet the complexities that our distributed computing environment dishes out every day. When major financial services organizations are adding roles called director of stability, you know that this is a serious issue. The increase in components in our computing environments has made it much more difficult to find and kill problems.

Mike Boyle is CEO at Perseus Technical Strategies LLC.


This blog was exclusively written for Insurance Networking News. It may not be reposted or reused without permission from Insurance Networking News.

The opinions of bloggers on www.insurancenetworking.com do not necessarily reflect those of Insurance Networking News.
