As with most application development, performance tends to be conveniently
ignored in every phase of the development life cycle. Although it is a key
non-functional requirement, it mostly remains undocumented. The problem is
compounded because the development, test and UAT environments rarely represent
real-world production usage of the application, so some performance problems
cannot be spotted early. Even if the application is load tested, certain
factors in the production environment, such as data growth and user load, may
lead to performance degradation over a period of time.
While most performance problems can be spotted and resolved easily, some are a
challenge and may cost sleepless nights. A structured approach helps resolve
such issues within a reasonably short time frame. Here is a step-by-step
approach which should work in most cases.
1. Understand the production environment
It is important to understand the production environment thoroughly, so as to
identify the hardware and networking resources and the middleware components
involved in delivering the application. In a typical n-tiered application, a
request may pass through multiple appliances and servers and get processed
before the response goes back to the user. Also find out which of these
components can collect logs or metrics, or can be monitored in real time.
2. Understand the specific feedback from the end users
Gather details such as who noticed the performance degradation, at what time,
and whether it recurs in a pattern or is pulling the system down altogether.
Also find out whether the entire application is slowing down or only some
specific application components are not performing. Try to experience the
problem first hand, either by sitting alongside an end user or, if possible, by
using appropriate user credentials. The ‘who’ also matters: in certain
circumstances the slowdown may affect only users associated with a specific
role, as the amount of data to be processed and transmitted may differ by user
role.
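
One way to check for a pattern in the user feedback is to tally the reports by hour of day and by user role. The following is only a minimal sketch; the report data, the timestamp format and the role names are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical slowness reports gathered from end users: (timestamp, user role).
reports = [
    ("2024-03-04 09:05", "approver"),
    ("2024-03-04 09:40", "approver"),
    ("2024-03-05 09:15", "approver"),
    ("2024-03-05 14:30", "clerk"),
    ("2024-03-06 09:50", "approver"),
]

def summarize(reports):
    """Count reports per hour of day and per user role."""
    by_hour = Counter()
    by_role = Counter()
    for stamp, role in reports:
        hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M").hour
        by_hour[hour] += 1
        by_role[role] += 1
    return by_hour, by_role

by_hour, by_role = summarize(reports)
print("Reports by hour:", by_hour.most_common())
print("Reports by role:", by_role.most_common())
```

If most reports cluster around one hour or one role, as they do in this sample, that narrows down where to look first.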
3. Review available logs and metrics
Gather the logs and metrics collected by the various hardware and software
components and look for information relevant to the specific application, or
more specifically to the set of requests that demonstrate the performance
issue. As logging itself can be a performance overhead, production systems are
often configured with logging switched off or set to collect only minimal logs.
If that is the case, make the necessary configuration or code change to achieve
an appropriate level of logging, and then collect the required details by
re-deploying the application to a production-equivalent environment.
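
To illustrate the kind of log review described above, here is a small sketch that ranks request paths by their 95th percentile duration. The log format ("time method path status duration in ms") and the sample lines are assumptions for illustration; real logs will need their own parsing.

```python
import math
from collections import defaultdict

# Hypothetical access-log lines: "<time> <method> <path> <status> <duration in ms>".
LOG_LINES = [
    "10:00:01 GET /orders 200 120",
    "10:00:02 GET /orders 200 135",
    "10:00:02 GET /reports/monthly 200 4200",
    "10:00:03 GET /orders 200 110",
    "10:00:05 GET /reports/monthly 200 3900",
    "10:00:06 POST /login 200 80",
]

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def slowest_paths(lines, pct=95):
    """Return (path, percentile duration in ms) pairs, slowest first."""
    durations = defaultdict(list)
    for line in lines:
        _time, _method, path, _status, duration_ms = line.split()
        durations[path].append(int(duration_ms))
    ranked = [(path, percentile(vals, pct)) for path, vals in durations.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

for path, p95 in slowest_paths(LOG_LINES):
    print(f"{path}: p95 = {p95} ms")
```

A percentile is used here rather than an average because a few very slow requests can hide behind a healthy-looking mean.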
4. Isolate the problem area
This step is very important and can be very challenging too. Take the help of
developers and of performance and load testing tools to simulate the problem,
and meanwhile monitor key measurement data as the request and response pass
through the various hardware and software components.
By analyzing the data gathered from the end users or from first-hand
experience, together with the available logs and metrics, try to isolate the
issue to a specific hardware or software component. This is best done step by
step, as follows:
a. Trace the request from the UI to the final destination, which typically is
the database.
b. If the request reaches the final destination, measure the time it takes to
cross the various physical and logical layers and look for any information
that could explain the slowdown. If a hardware resource is over-utilized,
requests may be queued up or rejected after a timeout. Look for such
information in the logs.
c. Then review the response cycle and try to spot delays in the return path.
d. Try the elimination technique, whereby the components involved are cleared
of performance bottlenecks one after the other, starting from the bottom.
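
The measurement in step b can be sketched with a small timing helper that records how long a request spends in each layer. This is only an illustration: the layer names are hypothetical and the sleeps stand in for real work.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(layer):
    """Record the wall-clock time spent inside the block under the given layer name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[layer] = timings.get(layer, 0.0) + time.perf_counter() - start

def handle_request():
    # The sleeps stand in for real work in each hypothetical layer.
    with timed("web"):
        time.sleep(0.01)
    with timed("app"):
        time.sleep(0.02)
    with timed("database"):
        time.sleep(0.08)

handle_request()
slowest = max(timings, key=timings.get)
for layer, seconds in timings.items():
    print(f"{layer}: {seconds * 1000:.1f} ms")
print("Slowest layer:", slowest)
```

In a real system the same idea is usually implemented with timestamps in the logs of each tier or with a tracing tool, rather than with in-process timers.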
Experience and expertise in the application and the infrastructure
architecture come in handy to spot the problem area quickly. There may be
multiple problems, some of which do not contribute to the problem on hand. This
can shift the focus to other areas and lengthen the time taken to resolve the
problem. It is important to stay focused and keep proceeding in the right
direction.
5. Simulate the problem in the Test / UAT environment
Make sure that the findings are correct by simulating the problem multiple
times. This will reveal much more data and help characterize the problem
better.
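
A repeated simulation can be as simple as firing the same request many times from a pool of threads and recording the latencies. The sketch below assumes a placeholder function, simulated_request, in place of a real call to the test environment; a dedicated load testing tool would normally do this job.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def simulated_request(_):
    """Stand-in for one call to the application; returns elapsed seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # replace with a real request against the test environment
    return time.perf_counter() - start

def run_load(total_requests=40, concurrent_users=8):
    """Issue the requests from a pool of worker threads and collect latencies."""
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        return list(pool.map(simulated_request, range(total_requests)))

latencies = run_load()
print(f"requests: {len(latencies)}")
print(f"median:   {statistics.median(latencies) * 1000:.1f} ms")
print(f"max:      {max(latencies) * 1000:.1f} ms")
```

Running it several times, and with different values for the number of concurrent users, shows whether the slowdown is consistent or depends on load.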
6. Perform reviews
If the problem area has already been isolated in any of the steps above, narrow
the scope of the review to the components involved in that area. If not, the
scope of the review is a little wider: look for problem areas in every
component involved in the request-response cycle. Code reviews to debug
performance issues require unique skills. For instance, looping blocks, disk
usage and processor-intensive operations are candidates for a detailed review.
Similarly, in a distributed application, look for excessive back-and-forth
calls to different physical tiers, which can easily contribute to performance
problems. Good knowledge of the third-party components and operating system
APIs consumed by the application may sometimes be helpful.
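
The back-and-forth calls mentioned above are easy to picture with a toy example. In this sketch, FakeRemoteService is a made-up stand-in for a remote tier that simply counts round trips; it compares one call per item with a single batched call.

```python
class FakeRemoteService:
    """Hypothetical remote tier that counts how many round trips it receives."""

    def __init__(self):
        self.round_trips = 0
        self._prices = {1: 10.0, 2: 25.0, 3: 7.5}

    def get_price(self, item_id):
        self.round_trips += 1
        return self._prices[item_id]

    def get_prices(self, item_ids):
        self.round_trips += 1
        return {item_id: self._prices[item_id] for item_id in item_ids}

def total_chatty(service, item_ids):
    """One remote call per item."""
    return sum(service.get_price(item_id) for item_id in item_ids)

def total_batched(service, item_ids):
    """One remote call for the whole list."""
    return sum(service.get_prices(item_ids).values())

chatty, batched = FakeRemoteService(), FakeRemoteService()
cart = [1, 2, 3]
print(total_chatty(chatty, cart), "in", chatty.round_trips, "round trips")
print(total_batched(batched, cart), "in", batched.round_trips, "round trip")
```

Both versions return the same total, but the chatty one pays the network latency once per item, which adds up quickly as the list grows.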
When the problem is isolated to a server and the application components seem
to have no issues, other services or components running on the same server may
be loading the server resources, thereby impacting the application under
review. If the problem is isolated to the database server, look for deadlocks,
missing or inappropriate indexes, etc. Sometimes the lack of archival / data
retention policies results in database tables growing at a much faster pace,
leading to performance degradation.
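
As an illustration of checking for a missing index, the sketch below uses SQLite (through Python's built-in sqlite3 module) and its EXPLAIN QUERY PLAN statement to compare the plan for the same query before and after an index is created. The orders table and the index name are made up for the example; other databases have their own equivalents of the query plan.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(n % 100, n * 1.5) for n in range(1000)],
)

sql = "SELECT * FROM orders WHERE customer_id = ?"
plan_before = query_plan(conn, sql, (42,))
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))

print("Without index:", plan_before)
print("With index:   ", plan_after)
```

Without the index the plan reports a scan of the whole table; with it, the plan reports a search using idx_orders_customer. The exact wording of the plan varies between SQLite versions.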
7. Identify the root cause
By now one should have identified the specific application procedure or
function that is likely to be the cause of the problem on hand. Validate it by
running more simulations and tests in environments equivalent to production.
8. Come up with a solution
It is not over yet, as root cause identification should be followed by a
solution. Sometimes the solution may require a change in the architecture and
have a larger impact on the entire application. An ideal solution should
prevent the problem from recurring, should not introduce new problems and
should require minimal effort. If the ideal solution is not possible under the
given constraints, offer a break-fix solution so that the business continues,
and plan to implement the ideal solution in the longer term.
I hope this is a useful read for those of you in production support. Feel free
to share your thoughts on this subject in the comments.