What are some of the valuable yet lesser-known practices that R&D teams use when building their applications?
Distributed systems make it hard to debug an application in production. Errors coming from one service could be causing trouble elsewhere, and the pursuit of answers in application logs can take hours or days, if not weeks, particularly when you’re correlating information across logs while also trying to pin down the problematic code and application state. R&D teams typically want to find and fix issues quickly and move on to developing new features.
To get a closer look at how modern development teams find a balance, we’ve decided to get in touch with Chris Creel and Mykel Alvis from the Cotiviti R&D team. In this post, Chris and Mykel share their experience from a healthcare industry perspective. Let’s take a look at how they’re bringing issue resolution times down to a minimum.
The Underlying Architecture: Scala, Graphs and Distributed Systems
Cotiviti chose to use Scala for building the next version of their pattern detection solution, which is developed on top of a cloud computing service with an open-source database. The graph database is distributed and allows fast processing of queries on a large scale. It is written in Scala and supports the queries at the heart of Cotiviti’s core business of identifying inconsistencies in healthcare data, reporting back not only whether certain pieces of data fit a pattern, but also the reason they fit the pattern.
Since each piece of healthcare data can comprise many different features, providing an accurate result in real time can be extremely complex. The graph database Cotiviti R&D is using simplifies this process.
From Weeks to Minutes: The New Issue Resolution Workflow
To support their need to maintain a fast-paced development cycle, the Cotiviti R&D team has established an issue resolution workflow with OverOps. Chris Creel, VP R&D, said, “From the outside, OverOps’ solution looks a little bit like magic. Just today I explained what we do with OverOps to somebody and the person was startled by the capabilities. When I explained that OverOps’ solution traps exceptions and gives context around those exceptions, including environment, state, variables, and even the code, the person just couldn’t believe that it was possible.”
Chris continued, “The ability to support our clients through a system that detects not only errors within our code, but also errors in the third-party runtime engine of our code, has the potential to reinforce client confidence in our solutions. The ability to detect, aggregate and report on this type of information is something that most people who operate complex systems don’t even know exists.”
The OverOps dashboard – Live analysis of exceptions and log errors on production JVMs
Let’s take a closer look at what this development workflow looks like.
1. New code is deployed to production
Developers commit new code containing new features and bug fixes. This triggers a request to the CI/CD system to release that code to the respective repositories. As part of the process through which Cotiviti deploys a solution upgrade, a new version of the server is then built with the latest and best version of the software.
2. OverOps is hooked up to the new servers
The new servers go online on machines where the CI/CD system has installed OverOps, and each server is started with the OverOps native JVM agent argument. Once a JVM is hooked up to OverOps, it starts to track and analyze all the exceptions and log errors that occur in that environment, even if the environment spans multiple machines. In addition, multiple instances of the same event are aggregated to provide a high-level understanding of any issues.
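To make the aggregation idea concrete, here is a minimal sketch of how repeated instances of the same event could be folded into a single entry. It is not how OverOps does it internally; the grouping key (exception type plus code location), the `AggregatedEvent` class and the sample machine names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AggregatedEvent:
    """One logical event, however many times and wherever it occurred (hypothetical model)."""
    exception_type: str
    location: str
    count: int = 0
    machines: set = field(default_factory=set)


def aggregate(raw_events):
    """Group (machine, exception_type, location) tuples by (exception_type, location)."""
    grouped = {}
    for machine, exception_type, location in raw_events:
        key = (exception_type, location)
        event = grouped.setdefault(key, AggregatedEvent(exception_type, location))
        event.count += 1
        event.machines.add(machine)
    return grouped


# Made-up sample data: the same exception seen on two machines, plus one other event.
raw_events = [
    ("app-1", "NullPointerException", "ClaimParser.parse"),
    ("app-2", "NullPointerException", "ClaimParser.parse"),
    ("app-1", "TimeoutException", "GraphClient.query"),
]
summary = aggregate(raw_events)
```

Grouping by type and location means three raw occurrences collapse into two entries, each carrying its own count and the set of machines it was seen on.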
3. When a new event pops up, it is reported in Slack
Using OverOps’ Slack integration, new events are automatically reported to a designated Slack channel. Each event notification includes the machine and the JVM where the event occurred, its stack trace and a link to its analysis inside OverOps. This effectively eliminates the time lag between when an error occurs and when someone is assigned to handle it. In addition, each alert also contains all the data the assigned developer needs to find a solution.
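As a rough illustration of what such a notification carries, the sketch below assembles a message with the fields listed above into the simple JSON body (`{"text": ...}`) that Slack incoming webhooks accept. The OverOps integration builds its own messages; the function name, the field layout and the example URL here are hypothetical, and nothing is actually sent.

```python
import json


def build_slack_payload(machine, jvm, stack_trace, analysis_url):
    """Return a JSON body for a Slack incoming webhook (hypothetical message layout)."""
    top_frames = "\n".join(stack_trace[:3])  # keep the alert short: top frames only
    text = (
        f"New event on {machine} (JVM: {jvm})\n"
        f"{top_frames}\n"
        f"Analysis: {analysis_url}"
    )
    return json.dumps({"text": text})


# Made-up event details; the URL is a placeholder, not a real OverOps link.
payload = build_slack_payload(
    machine="app-1",
    jvm="claims-service",
    stack_trace=[
        "java.lang.NullPointerException",
        "  at ClaimParser.parse(ClaimParser.scala:42)",
        "  at ClaimService.handle(ClaimService.scala:17)",
        "  at Server.dispatch(Server.scala:88)",
    ],
    analysis_url="https://example.invalid/events/123",
)
```

Posting `payload` to a webhook URL with any HTTP client would deliver the alert to the channel tied to that webhook.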
Chris stated, “Because of the fidelity of the data, and the amount of content that is delivered to us when we receive an alert from OverOps, our developers are able to very quickly address any issues. As a result of the ability to act quickly, the Cotiviti R&D team’s impact is much higher.”
OverOps’ Slack integration: providing direct access to new events from Slack
4. Zooming in on the real root cause
Going beyond the reporting, each event is associated with its full context, including the stack trace with the actual code at every frame in the stack, and the variable values at the moment of error. All of these details are available with no need to reference log files. In addition, each event has multiple recorded instances that let you see the data around the first occurrence of the situation and the number of subsequent occurrences.
The OverOps event analysis view
5. Fix the issue and continue with the next cycle
When an issue is fixed, you can mark it as resolved in OverOps. If the same issue recurs following the next deployment, OverOps will alert you that the error has resurfaced. “Because our application is under heavy development, under various circumstances there might be a lot of information to sort through. We use OverOps primarily in the development stream because it helps minimize the number of exceptions we get from the application,” said Chris.
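The resolve-and-resurface cycle can be pictured with a small sketch like the one below. It assumes that deployments are numbered and that an event is identified by a signature of exception type and location; the `IssueTracker` class and its method names are ours for illustration, not part of OverOps.

```python
class IssueTracker:
    """Tracks which event signatures are new, known, or resurfaced (hypothetical model)."""

    def __init__(self):
        self._seen = set()       # signatures observed at least once
        self._resolved_in = {}   # signature -> deployment in which it was marked resolved

    def mark_resolved(self, signature, deployment):
        self._resolved_in[signature] = deployment

    def classify(self, signature, deployment):
        resolved_at = self._resolved_in.get(signature)
        if resolved_at is not None and deployment > resolved_at:
            # Seen again in a later deployment than the fix: reopen and alert.
            del self._resolved_in[signature]
            self._seen.add(signature)
            return "resurfaced"
        if signature in self._seen:
            return "known"
        self._seen.add(signature)
        return "new"


tracker = IssueTracker()
signature = ("NullPointerException", "ClaimParser.parse")

first = tracker.classify(signature, deployment=1)   # first sighting
second = tracker.classify(signature, deployment=1)  # same event again
tracker.mark_resolved(signature, deployment=1)      # developer ships a fix
third = tracker.classify(signature, deployment=2)   # it comes back after the next deploy
```

Only the "new" and "resurfaced" outcomes would need to raise an alert; "known" events simply add to the existing count.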
No sensitive user data ever leaves the server
“OverOps has so far provided us with great functionality. We can filter out sensitive personal data to comply with our HIPAA obligations. OverOps makes security and privacy a top priority. We can trust that our data is safe, minimizing the worry about whether or not we have exposed our clients to a bad event,” Chris commented.
Chris continued, “As someone who has worked in the healthcare information sector for a number of years, I am always concerned about ensuring compliance. Making sure that my developers can continue the amazing work they do, while minimizing the potential of exposure of customers’ PHI (Protected Health Information), is of utmost importance.”
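To give a feel for what filtering sensitive data before it leaves the server can mean in practice, here is a minimal sketch that redacts captured variable values. It is an illustration only: the variable-name patterns, the SSN-like regular expression and the `redact` function are assumptions, not OverOps’ actual filtering rules, and a real PHI policy would need far more than this.

```python
import re

# Hypothetical rules: variable names that suggest PHI, and values that look like a US SSN.
SENSITIVE_NAMES = re.compile(r"(ssn|dob|patient|member_id)", re.IGNORECASE)
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(variables):
    """Return a copy of the captured variables with sensitive content masked."""
    cleaned = {}
    for name, value in variables.items():
        if SENSITIVE_NAMES.search(name):
            # The name itself signals sensitive content: drop the whole value.
            cleaned[name] = "[REDACTED]"
        else:
            # Otherwise keep the value as text, masking SSN-like substrings.
            cleaned[name] = SSN_LIKE.sub("[REDACTED]", str(value))
    return cleaned


# Made-up variable state captured at the moment of an error.
captured = {
    "patientName": "Jane Doe",
    "claimCount": 3,
    "note": "SSN 123-45-6789 on file",
}
safe = redact(captured)
```

Running the redaction on the server, before anything is transmitted, is what keeps the raw values from ever leaving the machine.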
It is rare to have the opportunity to see up close how specific workflows are handled in other companies, particularly the types of workflows that dramatically improve issue resolution times. When every failure can have a significant impact, using these workflows and new technologies allows Cotiviti to stay on the leading edge of what is expected from modern applications. We would like to thank Chris Creel and Mykel Alvis for giving us an inside view into the work of their team.
The Cotiviti R&D Group
Cotiviti is a leading provider of analytics-driven payment accuracy solutions. The Cotiviti R&D team is exploring the next generation of Cotiviti’s platform: reliably supporting rapid pattern detection on a larger scale and introducing new capabilities into the platform. The team is currently hiring Scala developers in Atlanta.