Sandro Zamboni: Techniques for troubleshooting

Over the years in which I have worked directly with IT environments and professionals, both in the classroom and in day-to-day work, I have noticed that infrastructure analysts are often confused about how to troubleshoot problems. I don't consider myself a master of the subject, even though it is one of the foundations of my work as a consultant, but I have come to see that this deficiency stems from a lack of vision of the troubleshooting flow: the analyst does not know where to begin and therefore often skips critical steps in the process, carrying out procedures aimlessly and unnecessarily, and often causing new problems beyond those already identified.

Resolving problems in production environments is critical because the downtime of a service or resource (applications, assets, users, etc.) directly impacts the business. For this reason, the analyst must solve a problem as quickly and effectively as possible, to avoid a considerable impact on users' productivity. Nowadays many companies invest in monitoring products, methodologies, processes, specialized training, and various other measures so that this impact is reduced or eliminated.

A vital concept for troubleshooting is the distinction between incident and problem, so commonly discussed in ITIL and other methodologies. For our purposes, an incident is an isolated event, something neither widespread nor recurring (e.g. a user has opened a ticket for a configuration problem in his Outlook), while a problem is something widespread or recurring (e.g. multiple users open tickets for the same problem in their Outlook, or one user keeps reporting the same error). So let's get straight to the point!

1. Investigating the occurrence. It is essential to gather information about the problem, starting with the classic questions: When did the problem occur? What changed on the computer or in the environment just before that time that could have caused the error (a software installation or update, a change to a system or network setting, etc.)? In short, gather as much information as possible. When we are "interrogating" a user, it is essential to use terms he understands, because technical jargon will only confuse him and keep us from getting the information we need.

2. Checking the breadth of the error. At this point we can determine whether this is an incident or a problem. Whether the error is on a client, the network, or a server, we must check whether it occurs with a single user, with a set of users (if it occurs with more than one user, check what they have in common: an application, a VLAN, a group of switches, an OU in AD, etc.), or with a specific application or network service. With that we can assess where and how to act to resolve the error.
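The breadth check in step 2 can be sketched in a few lines of Python. This is only an illustration under assumptions: the report structure, the attribute names (`application`, `vlan`, `ou`) and the user names are hypothetical, and the incident/problem rule is the simplified one used in this post, not a formal ITIL definition.

```python
from collections import Counter


def shared_attributes(reports):
    """Return the attribute/value pairs common to every affected user."""
    if not reports:
        return {}
    common = dict(reports[0]["attributes"])
    for report in reports[1:]:
        attrs = report["attributes"]
        # Keep only the pairs that this report also has with the same value.
        common = {k: v for k, v in common.items() if attrs.get(k) == v}
    return common


def classify(reports):
    """Incident: one user, one occurrence. Problem: several users or a repeat."""
    tickets_per_user = Counter(r["user"] for r in reports)
    repeated = any(count > 1 for count in tickets_per_user.values())
    if len(tickets_per_user) > 1 or repeated:
        return "problem"
    return "incident"


# Hypothetical tickets collected in step 1.
reports = [
    {"user": "ana", "attributes": {"application": "Outlook", "vlan": "10", "ou": "Sales"}},
    {"user": "bruno", "attributes": {"application": "Outlook", "vlan": "20", "ou": "Sales"}},
    {"user": "carla", "attributes": {"application": "Outlook", "vlan": "10", "ou": "Finance"}},
]

print(classify(reports))
print(shared_attributes(reports))
```

With this sample data the only thing all three users share is the application, which points the investigation at the application or its server rather than at a network segment.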

3. Technical survey of the error. In this step we look into the messages and characteristics of the error in question (error pop-ups, logs of the application concerned, system logs, etc.), as well as the impact it generates. This matters because we need to cross-check this information against what was collected earlier and against the data on the breadth of the problem. The result is essential for finding the correct resolution to the incident or problem (e.g. a user's Outlook shows error XYZ, but the error may be due to an Outlook setting, a failure on a server in the organization, etc.). At this stage we usually use diagnostic tools provided by the manufacturer or by third parties.
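As a rough illustration of cross-checking logs against what the user reported in step 1, the sketch below pulls the error lines logged close to the reported time. The log layout (`YYYY-MM-DD HH:MM:SS LEVEL message`), the sample lines and the 30-minute window are assumptions made for the example; real application and system logs will need their own parsing.

```python
from datetime import datetime, timedelta


def errors_near(log_lines, reported_at, window_minutes=30):
    """Return (timestamp, message) for ERROR lines close to the reported time."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        parts = line.split(" ", 3)
        if len(parts) < 4:
            continue  # line does not match the assumed layout
        date, time, level, message = parts
        try:
            stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # no parsable timestamp
        if level == "ERROR" and abs(stamp - reported_at) <= window:
            hits.append((stamp, message))
    return hits


# Hypothetical log excerpt and the time the user said the error appeared.
log_lines = [
    "2010-11-20 08:15:02 INFO Service started",
    "2010-11-20 09:58:40 ERROR Mailbox connection failed (code XYZ)",
    "2010-11-20 10:03:11 ERROR Mailbox connection failed (code XYZ)",
    "2010-11-20 14:30:00 ERROR Disk quota exceeded",
]
reported_at = datetime(2010, 11, 20, 10, 0)

hits = errors_near(log_lines, reported_at)
for stamp, message in hits:
    print(stamp, message)
```

Here the two mailbox errors fall inside the window and the afternoon disk error does not, so the unrelated message is not mixed into the diagnosis.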

4. Applying the solution. Now we come to a crucial point in resolving a problem: applying the solution. Before actually resolving a problem or incident, we must verify some important points:

a. What is the impact of applying the solution? It is essential to map the impact of applying each possible solution to the problem or incident; some of them may involve rebooting a server, restoring a backup, rewriting a parameter of an application or network service, changing a user's permissions, etc. This mapping is important so that we can plan when the solution can be applied (e.g. a solution that involves rebooting a production server may have to be applied outside business hours). We should wait to correct an error when the impact of the correction process at that moment is greater than the impact of the error itself.

b. Which solution to implement. Depending on the issue, we may have more than one solution available (some definitive, others palliative). There is always a best solution, but depending on the situation we are forced to apply a stopgap or simply a solution that is not the best. This must be assessed by weighing the impact of each solution against the urgency (SLA) of resolving the problem. If it is not possible to apply the best solution at the time, schedule its application so that a workaround does not become the definitive solution.

c. How to apply the solution. The process of implementing a solution is very important: it must be clean and free of errors. The best approach is to look for a step-by-step procedure (preferably from the manufacturer of the asset, application, or solution showing the error) so that no mistakes are made during execution. Depending on the criticality of the environment, it is worth testing the fix in a staging environment before applying it in production. We should not generalize in order to simplify the application (e.g. the documentation says that to solve a problem the user needs a specific permission on A FILE, and to "simplify" you assign the permission on the ENTIRE FOLDER); doing so can significantly increase the impact caused by the fix and create problems related to security settings, best practices, etc.
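The trade-off described in points a and b (impact of each solution versus urgency) can be written down as a small decision rule. The sketch below is one possible reading, not a standard method: the 0 to 10 impact scale, the field names and the two sample fixes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fix:
    name: str
    impact: int            # disruption caused by applying it: 0 (none) to 10 (outage)
    hours_to_apply: float
    definitive: bool       # False means it is only a workaround


def plan(fixes, error_impact, hours_left_in_sla):
    """Return (fix to apply now, definitive fix to schedule for later)."""
    in_time = [f for f in fixes if f.hours_to_apply <= hours_left_in_sla]
    # Point a: do not apply now a fix that hurts more than the error itself.
    acceptable = [f for f in in_time if f.impact <= error_impact]
    # Point b: prefer a definitive fix, then the least disruptive one.
    acceptable.sort(key=lambda f: (not f.definitive, f.impact))
    apply_now = acceptable[0] if acceptable else None

    schedule_later = None
    if apply_now is None or not apply_now.definitive:
        # A workaround (or nothing) goes in now, so book the definitive fix.
        definitive = [f for f in fixes if f.definitive]
        schedule_later = min(definitive, key=lambda f: f.impact, default=None)
    return apply_now, schedule_later


# Hypothetical options for an error currently rated 4 out of 10.
fixes = [
    Fix("reboot production server", impact=8, hours_to_apply=1.0, definitive=True),
    Fix("recreate user profile", impact=2, hours_to_apply=0.5, definitive=False),
]
apply_now, schedule_later = plan(fixes, error_impact=4, hours_left_in_sla=2)
print(apply_now.name)
print(schedule_later.name)
```

In this example the workaround is applied immediately because the reboot would be more disruptive than the error, and the reboot is returned as the item to schedule, so the workaround does not quietly become permanent.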

5. Verifying the results. After applying the fix, we should check the following points:

a. Whether the symptoms of the problem or incident have disappeared. We can verify this by checking the event logs (application, system, etc.), confirming that the error pop-ups no longer appear, using the resource that was not working before, etc.

b. Monitoring the solution. Check whether the error recurs after the solution is applied, check the performance of the application, solution, or service that was showing errors, and collect feedback from the users and analysts involved.
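Part of the verification in step 5 can be automated by looking for the error signature in the logs after the moment the fix was applied. The following is a minimal sketch that assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp; the sample lines and the signature `code XYZ` are made up for the example, and it does not replace checking performance or asking users for feedback.

```python
from datetime import datetime


def recurrences(log_lines, signature, fixed_at):
    """Return the log lines containing the signature that were logged after the fix."""
    found = []
    for line in log_lines:
        try:
            # The first 19 characters are assumed to hold the timestamp.
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if stamp > fixed_at and signature in line:
            found.append(line)
    return found


# Hypothetical log excerpt; the fix was applied at 11:00 on the first day.
log_lines = [
    "2010-11-20 09:58:40 ERROR Mailbox connection failed (code XYZ)",
    "2010-11-20 11:00:00 INFO Fix applied",
    "2010-11-20 11:20:05 INFO Mailbox connected",
    "2010-11-21 08:02:13 ERROR Mailbox connection failed (code XYZ)",
]
fixed_at = datetime(2010, 11, 20, 11, 0, 0)

again = recurrences(log_lines, "code XYZ", fixed_at)
print("recurred" if again else "no recurrence so far")
```

An empty result only means the signature has not shown up again yet, which is why the check is worth repeating over the following days.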

Above I have described some basic points that should be followed to resolve the problems and incidents that arise in our work and at our clients. Of course, each situation can be quite particular, and resolution involves several other strands such as disaster recovery, contingency, etc., but these activities must be carried out in an organized and safe way, backed by a good technical background in the technologies involved.

Sandro Zamboni

Saturday, November 20, 2010
