Using Machine Learning to Find Root Cause of App Failure Changes Everything

It is inevitable that a website or app will fail or encounter problems from time to time, ranging from broken functionality to performance issues or even complete outages. Development cycles are too fast, conditions too dynamic, and infrastructure and code too complex to expect flawless operations all the time. When a problem does occur, it creates a high-pressure urgency that sends teams scurrying to find a solution. The root cause of most problems can usually be found somewhere among millions (or even billions) of log events from a large number of different sources. The ensuing investigation is usually slow and painful and can take away valuable hours from already busy engineering teams. It also involves handoffs between experts in different aspects or components of the app, particularly with the use of interconnected microservices and third-party services which can cause a wide range of failure permutations.

Finding the root cause and solution takes both time and experience. At the same time, development teams are usually quite short-staffed and overworked, so the urgent “fire drill” of dropping everything to find the cause of an app problem stalls other important development work. Using observability tools, such as APM, tracing, monitoring, and log management solutions, helps team productivity, but it's not enough. These tools still require knowing what to look for and significant time to interpret the results that are uncovered.