Unbelievable – over a year since my last post. It has been a busy time. I was the VP of product strategy at Nolio, which was acquired by CA about 6 months ago, and I have spent the last 6 months at CA. Nolio was a good opportunity to apply case management thinking to a specific business problem (Release Automation), and build the tools for it. So DevOps is now my hobby, along with adaptive case management – but besides that I am working on something completely new and different – LectureMonkey.
LectureMonkey makes it easy to use an iPhone to capture lectures for later review. All lecture content (audio, whiteboards, and presentations) is easily and automatically captured and stored in your personal Dropbox. The lecture is stored as a series of images with synched audio – no expensive video. Our format ensures that lectures are compact and lecturers don’t have to worry about their lectures being on YouTube™ or Facebook™ without their permission.
The whole class can join in to enhance lectures through shared bookmarks and comments. Lectures are shared with classmates only through LectureMonkey’s managed share capability.
Our message to students is simple – less stress, help friends, better grades. Once you start using LectureMonkey you’ll see that you can concentrate on the lecture, studying for exams is less stressful and classmates can help each other prepare. Our goal is to help you and your classmates get better grades with less stress.
It also works very well in a corporate setting – a way to document and review all those planning and brainstorming sessions. We think that we now have a minimum viable product (MVP in lean startup speak). Please download the app and try it – I’d really like to know what you think. It is going to be fun.
There is a whole set of new operational pressures on IT operations at the application layer. Businesses are betting more and more on their applications, users with always-available platforms (i.e. mobile) mean that applications really do need to work 24×7, virtualization is making the underlying infrastructure elastic and easily available, and of course agile development enables features to be developed faster and in smaller increments.
All these are putting new pressures on IT operations at the application layer. DevOps is one growing trend that is starting to address these issues. DevOps is related to AppOps but it isn’t AppOps – nor does it replace the need for AppOps. DevOps is the process of streamlining the dev to ops lifecycle for applications, but AppOps is specifically the operational side of application management. I think that we’ll be seeing a lot more companies starting to use the term AppOps as a way to describe their IT operations at the application layer – since there doesn’t seem to be any better term around. AppOps has two separate, but related, components:
- Application Release Operations – all the operational work needed to make sure production applications and features are deployed in a timely and robust fashion. This goes way beyond just release automation – since the automation component is just a small part of the operations surrounding a release. Release Operations also includes all the remediation and maintenance operations for making sure imperfect applications and unexpected events don’t cause catastrophic application failure.
- Application Monitoring Operations – the monitoring needed to make sure that application problems are discovered as quickly as possible.
In many cases it is up to the AppOps folks to notice a problem and then (working with dev) figure out a quick workaround (CAPA) to make sure things continue to run. It is then up to dev to come up with a longer term fix, which gets deployed in the next release.
There has been very little focus on Application Services CAPA (Corrective Action\Preventative Action) – or maybe a better term would be COPO (Corrective Operations\Preventative Operations). I am not sure why Application Services CAPA gets so little attention – maybe because it is the unglamorous daily work of ensuring everything works correctly, as opposed to getting a new release deployed.
It is pretty clear from the data in my previous post that hardware efficiency isn’t the reason for moving to the cloud. In fact there seems to be more hardware “waste” in the cloud than in data centers. So why is the move to the cloud considered inevitable?
The reason that companies are adopting cloud computing (and have already adopted its precursor – virtualization) is to enhance the speed and agility of their IT departments, not hardware cost reduction (Speed, Agility, Not Cost Reduction, Drive Cloud by Esther Shein in Network Computing). The same is also apparent if you look at where cloud computing is being adopted first – scalable websites and test\dev.
Scalable websites tend to be built on stateless applications that don’t have rigorous consistency requirements, but do need massive scale. These types of applications are a perfect fit for today’s cloud. Add to that the fact that the actual scale needed by these websites is very elastic and you have a perfect cloud use case. Dev\Test is also a good fit – it needs the ability to quickly build and tear down environments, where agility and elasticity take precedence over performance and hardening.
The challenge for the cloud will be traditional production and transactional applications. In these applications the application layer has a set of requirements that make achieving the elasticity of the cloud much more of a challenge. The only way the cloud will make real inroads into these environments is when the agility and elasticity of the stateless cloud can be achieved for traditional applications.
I have been reading a really interesting cloud blog by Huan Liu where he uses various techniques to measure different aspects of public clouds (especially Amazon).
In his posting “Host server CPU utilization in Amazon EC2 cloud” he has found that Amazon utilizes only a small percentage of the CPU on their servers (his findings point at a 7.3% CPU utilization rate). As he points out, this is a lot lower than what most data centers achieve. The reason is that in order to try and solve the “noisy neighbors” problem they don’t over-commit CPU or memory, which means that there is a tendency to reserve CPU for the worst case scenario for each instance hosted on the server.
On the other hand, many production applications have a general behavioral profile like the one he shows:
and it is clear that they need peak CPU for only a limited period every day (and usage has a pattern).
So the dilemma is – over-commit resources and possibly hurt your customers, or under-commit resources and make less profit. I believe that one answer to that dilemma is application behavior analysis.
An application behavior profile would benefit cloud providers in two ways. The first is that the algorithm that assigns virtual machines to physical machines could use a behavior profile to try and allocate anticorrelated applications to the same physical machine. The second is to use an application’s behavioral profile to enable it to “return” CPU when not needed, and use the behavioral profile to “lock in” CPU when needed.
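The first idea above, placing anticorrelated applications on the same physical machine, could be sketched roughly as follows. This is only an illustration: the profile data, the application names, and the greedy pairing strategy are all my own assumptions, not something a cloud provider is known to use.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equally long CPU usage series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / sqrt(var_a * var_b)


def pair_anticorrelated(profiles):
    """Greedily pair applications, most anticorrelated pairs first.

    Returns (pairs, unpaired). Each pair is a candidate for sharing a host.
    """
    remaining = set(profiles)
    candidates = sorted(
        combinations(sorted(profiles), 2),
        key=lambda p: pearson(profiles[p[0]], profiles[p[1]]),
    )
    pairs = []
    for a, b in candidates:
        if a in remaining and b in remaining:
            pairs.append((a, b))
            remaining -= {a, b}
    return pairs, sorted(remaining)


# Hypothetical CPU usage (%) sampled at six points in a day.
profiles = {
    "day_web": [10, 10, 80, 90, 80, 10],
    "night_batch": [90, 80, 10, 10, 10, 85],
    "day_crm": [15, 10, 70, 85, 75, 15],
    "night_backup": [80, 90, 15, 10, 15, 80],
}

pairs, unpaired = pair_anticorrelated(profiles)
for a, b in pairs:
    print(f"{a} + {b} could share a physical machine")
```

With these made-up profiles every pair ends up combining a daytime application with a nighttime one, so the peaks of the two neighbors do not coincide. A real placement algorithm would also have to respect memory, I/O and the capacity of each host.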
I work a lot with emerging enterprise software companies. I have come to believe that every emerging enterprise software company needs to have a “free” version of their product (which is an anathema to most enterprise software companies – no matter what the size). My reasoning is that a free version of a product makes sure that the company knows what it takes to make a product that they can sell:
Value: Value doesn’t just mean that you have a product that solves a problem; it means that you have a product that solves somebody’s problem. In other words there is a specific person that will benefit from using your software, and it provides them with enough value that they are willing to do what it takes to obtain and use your product. A free product makes sure you really understand who benefits from your product – because if people won’t use it for free, do you really believe that they will pay for it? A free version enables you to validate whether:
- You provide enough value relative to the effort involved in getting the product to work (see Ease-of-Use below).
- You understand who really benefits from the product, and you are trying to convince the right people to use it.
It used to be that a proof-of-concept was enough to demonstrate value, but the consumer internet has changed people’s expectations.
Ease-of-Use: It may be that your product does provide real value to somebody, but the effort to achieve that value is just too great. If they need a services engagement to install and configure the product before they can derive any real benefit in their job – you are in trouble. It is OK to rely on services for a complete enterprise wide rollout, but it isn’t OK that no one benefits before that. A free product ensures that you really know that someone is benefiting enough from your product (not to mention invaluable, direct product feedback).
Joy-of-Use: This is the nirvana of software. I don’t think that in an enterprise setting you can achieve Apple’s level of joy-of-use, i.e. where people play with their iPhone just because it is fun. For enterprise software I see this as an appropriate combination of Value and Ease-of-Use, where a product provides enough direct value to someone’s work that they will spend the effort needed to obtain and use your product. That is a good enough level of Joy-of-Use for “enterprise work”.
Value, Ease-of-Use, Joy-of-Use – It isn’t easy (and I have probably heard almost every reason in the book about why it can’t\shouldn’t be done), but if you can’t figure out a free version of your product that delivers all three, you should be worried about whether the paid version of your product can actually make it.
My previous post was about the frontal lobe of cloud operations – the monitor that notifies when an application is not behaving correctly or as expected (i.e. an anomaly is detected). This post is about what needs to be done when an acute anomaly happens (usually meaning that either users or key resources will be affected by the problem) – and some real-time action needs to be taken to fix the problem.
In the fast paced world of cloud operations you essentially have one of two high level decisions to make – incrementally deploy more infrastructure to solve the problem or roll back to a previous version of the application. Either decision has an impact on the business – rolling back means that your customers will lose functionality or features, and deploying more infrastructure means extra costs for the business.
There needs to be an additional “brain” that can both synthesize information from different systems and make (and act on) the decision about rollback vs. additional resources. This is part of the “SLA cost awareness” that I mentioned in my previous post – it needs to weigh the cost of rollback vs. extra infrastructure and also make some decisions about the efficacy of either course of action – whether to initiate a “flight” or “fight” response.
Once the response is decided, there needs to be a mechanism for implementing the decision. If the decision is “flight” (aka rollback) – there needs to be a well defined process that enables rollback in a timely, non-disruptive fashion. If the decision is “fight” (aka deploy additional resources) – there needs to be a way to define exactly what resources need to be applied, where they should be applied and how to apply them.
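To make the “fight” vs. “flight” weighing a little more concrete, here is a minimal sketch of how such a brain might compare the two options. The cost model (cost of the action plus the expected cost of the incident continuing if the action fails) and every number in it are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Response:
    name: str
    cost_per_hour: float        # assumed business cost of taking this action
    success_probability: float  # assumed chance the action resolves the anomaly


def expected_cost(response, incident_cost_per_hour, hours=1.0):
    """Cost of the action plus the expected cost of the incident if it fails."""
    action_cost = response.cost_per_hour * hours
    residual = (1 - response.success_probability) * incident_cost_per_hour * hours
    return action_cost + residual


def choose_response(flight, fight, incident_cost_per_hour, hours=1.0):
    """Pick whichever response has the lower expected cost."""
    return min(
        (flight, fight),
        key=lambda r: expected_cost(r, incident_cost_per_hour, hours),
    )


# Hypothetical estimates: rollback loses feature revenue but almost always
# works; extra servers are cheap but may not fix the underlying problem.
flight = Response("flight (rollback)", cost_per_hour=200.0, success_probability=0.95)
fight = Response("fight (add resources)", cost_per_hour=50.0, success_probability=0.6)

severe = choose_response(flight, fight, incident_cost_per_hour=1000.0)
mild = choose_response(flight, fight, incident_cost_per_hour=100.0)
print(severe.name, "|", mild.name)
```

Under these invented numbers a costly incident favors rollback, while a mild one favors adding resources. The hard part in practice is obtaining trustworthy estimates for the probabilities and costs, which is where the behavioral analysis comes in.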
Actually this additional brain isn’t only for emergency situations. It needs to provide the same type of capability in any stressful situation – whether caused by an application anomaly found by your APM, or because a new feature is being released. New feature releases and upgrades are the mundane, but more frequent, cause of stress in the world of cloud applications, and handling them well is the key to success in cloud applications. More on that in my next post.
In my previous post on cloud operations, the image has a sense\respond loop between the APM (Application Performance Monitor) and the rest of the system. This is the frontal lobe of cloud operations – its job is to analyze the information coming in from the APM and translate it into an appropriate action. This is one key area that still needs a lot of work (and invention) – but will be one key differentiator between cloud based applications and traditional applications.
The reason is the inherent elasticity of the cloud. You can always get more – more capacity, more storage – but it will cost you. If you have ever been part of an IT performance war-room then you know that capacity is the magic elixir that fixes everything. The cloud makes that elixir so simple to obtain, it can get transformed into a panacea. Sure you can go and allocate another dozen web\app servers if the current systems aren’t keeping up with demand, but once you do that you’ll need to pay for the extra capacity. It becomes an immediate additional expense, so just because you can do it doesn’t mean you should. These cost aware decisions will be a new role for operations, and will require a new type of SLA management – “cost aware SLA management”. Currently most SLAs focus on downtime (e.g. 99.9% availability), and some focus on performance (x second response time) – but ignore the costs associated with maintaining the SLA. Once costs become more immediate and visible someone is going to have to manage them, and operations will be tagged with the job.
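As a back-of-the-envelope illustration of what a cost aware SLA has to put side by side, the sketch below computes the downtime a 99.9% availability target allows and the bill for a dozen extra servers. The server price and the one-month period are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Downtime budget implied by an availability target over a period."""
    return (1 - availability_pct / 100) * period_hours


def extra_capacity_cost(extra_servers, price_per_server_hour, hours):
    """What the 'magic elixir' of extra capacity costs over a period."""
    return extra_servers * price_per_server_hour * hours


downtime_budget = allowed_downtime_hours(99.9)
# Hypothetical price: 0.50 per server hour, kept running for 30 days.
monthly_bill = extra_capacity_cost(12, 0.50, 30 * 24)

print(f"99.9% allows about {downtime_budget:.2f} hours of downtime per year")
print(f"A dozen extra servers for a month costs {monthly_bill:.2f}")
```

A traditional SLA only tracks the first number. A cost aware one would ask whether the second number is a reasonable price for protecting it.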
The problem is that APMs provide just too much information for humans to manage. There will need to be some sort of intelligent analysis that distills the raw data coming from the various APM systems into actionable information. I believe the only way to achieve that is through behavioral analysis of applications and predictive analytics (I have been writing about this here). That is the only way to obtain the benefits that the cloud can provide: through intelligent systems that can make some decisions on their own (e.g. increase the number of servers to meet demand, within a predefined policy), and provide distilled, actionable information for operations when they can’t.
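The “decide on its own within a predefined policy, otherwise hand over to operations” idea might look something like the sketch below. The policy fields, the capacity figure, and the `decide` function are hypothetical names I made up for illustration, not the API of any real APM or cloud product.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class ScalingPolicy:
    min_servers: int
    max_servers: int
    max_hourly_budget: float     # spending ceiling operations has approved
    cost_per_server_hour: float


def decide(predicted_load, capacity_per_server, policy):
    """Return ("scale", n) if the need fits the policy, else ("escalate", n).

    On escalation, n is the most the policy allows; a human decides the rest.
    """
    needed = max(policy.min_servers, ceil(predicted_load / capacity_per_server))
    budget_cap = int(policy.max_hourly_budget // policy.cost_per_server_hour)
    allowed = min(policy.max_servers, budget_cap)
    if needed <= allowed:
        return ("scale", needed)
    return ("escalate", allowed)


policy = ScalingPolicy(
    min_servers=2, max_servers=10,
    max_hourly_budget=4.0, cost_per_server_hour=0.5,
)

# Predicted load in requests per second, assuming 500 req/s per server.
normal_peak = decide(3000, 500, policy)
runaway_peak = decide(6000, 500, policy)
print(normal_peak, runaway_peak)
```

Here the normal peak is handled automatically, while the runaway peak exceeds what the budget permits and is passed to operations together with the largest configuration the policy allows.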
I have been spending a lot of time lately looking at the cloud from an operational perspective w.r.t. applications – I guess it would fall under the banner that some analysts would call DevOps and others would call AppOps. I see the difference between the two as this: one looks at applications from the perspective of everything that needs to be done before you can deploy an app, while the other looks more at everything that needs to be done to deploy an app and monitor it afterwards. The line between the two is really blurry – and as the cloud becomes more production oriented it will become even blurrier.
What I found is that actual enterprise production applications (not SaaS applications) are few and far between – so not a lot of attention has been paid to the lifecycle issues of managing enterprise production applications in the cloud. Dev and QA are the kings of cloud usage in the enterprise at the moment. I also found that as opposed to the NIST definitions of cloud computing – IaaS (Infrastructure as a Service), PaaS (Platform as a Service) and SaaS (Software as a Service), which seem to describe a nice progression of functionality for the cloud – the real world is much messier. SaaS came first and most SaaS providers didn’t build their applications on IaaS or PaaS; they built their own homebrew “Private PaaS” tailored to their specific application using a mixture of bespoke and off-the-shelf tooling. I think that enterprise production applications will look very similar – just that they will use off-the-shelf IaaS for infrastructure provisioning, and off-the-shelf PaaS for specific components in their application stack.
As I was learning all this I think I finally understood why the cloud matters – way beyond its value as a cheaper delivery model, or a way to save on infrastructure costs. Cloud will enable IT to work like an agile production line from dev to delivery. I use the term production line, but the cloud actually holds the promise of being able to provide much more than a physical production line – product lifecycles of days or hours, not months or years. I think as this picture becomes whole – it will drastically change the way we think about applications.
In my depiction below, I clumped together Infrastructure and Platforms, not because they aren’t important but because I wanted to focus on what most people are ignoring at the moment – what happens after the app is assumed to be ready for deployment. Using “classic” application delivery metaphors, that means understanding what happens after dev has finished and the app has moved into the realm of operations.
In my next few blogs I am going to spend more time describing this picture.
I know there is a lot of contention over what exactly cloud computing means. Some people use the metaphor of “compute as a utility” (like electricity) – which seems to be more of a long term “grand challenge” to me. I found a short description that I believe is right on the mark by James Urquhart – “Cloud computing is an application-centric operations model.” So for IT, applications become king – and the whole of IT will be focused around serving applications and application users. Even though that sounds almost intuitive to most business folks (I mean what else is IT except a way to get applications to users?) it isn’t how most IT departments operate. In most IT departments – infrastructure is king, not applications. You want to deploy a new application or you need more resources for an existing application – well then wait a few months for the infrastructure folks to requisition, provision, integrate and provide you those resources. The cloud will play havoc with that model. There is a reason for this mismatch in paradigms. It is because for the infrastructure folks, cloud or not, compute and storage resources are not flexible or infinite – even though they may appear that way to the application folks. Infrastructure is a physical resource – and therefore not unlimited.
It is clear that virtualization is the mechanism that most cloud providers will use to try to create the illusion of “unlimited available resources” at the application layer. Virtualization isn’t new to the data center (mainframes have been doing it forever, and VMware has been around for quite a few years now). What is new in public cloud infrastructure is that the organization using the cloud is completely blind to the physical infrastructure and its topology – all you get to see is your VMs and the stack above those VMs.
That blindness means that you don’t know if your apps are running on a machine with 10 other apps, or if your VM has just been migrated to another physical server. Or maybe your cloud provider has skimped a bit on infrastructure, and now physical machines need to host a few more virtual machines to meet the needs of some peak period. In a perfect world that wouldn’t matter – but the world isn’t perfect. Your apps will be affected by their physical neighborhood – for example take the “noisy neighbors” problem that I mentioned in “Noisy Neighbors, Amazon Cloud and the Mainframe”. All of a sudden, through no fault of your own, your production applications may start acting erratically. So now you’ll need to understand that your performance and SLA problems may be caused by things that you can’t see, and can’t access – and will never be able to access.
Going back to the “compute as a utility” metaphor – virtualization makes the cloud very different from, say, an electric utility. You wouldn’t be very happy if your refrigerator was running 5 degrees warmer because the guy down the block turned on his air-conditioner – but that is what can happen when virtualization is used in the cloud. Just like with the refrigerator – you won’t know about the problem until it is too late – the food spoils, irate users are on the phone.
So as the cloud matures – the issues brought about by mass virtualization will need to be addressed – or we may find that virtualization killed the cloud.