Java logs data crunch: How GitHub’s top Java projects use logs?
One of the things we like most at OverOps is crunching data and learning new things. Yes, we’re that fun at parties. If you feel the same, then you’ve reached the right place.
Inspired by Google’s developer advocate Felipe Hoffa, and his tabs vs spaces post, we used Google BigQuery and GitHub’s database for this post. The top 400,000 repositories by number of stars they were given in 2016, with some SQL on top.
A huge shout out goes to Tzofia, Chen and Niv from our R&D team for their help in making this data crunch happen.
[This blog post is included as chapter 1 of our free Guide to Java Logging in Production. Download the full eBook here.]
First, Some Background
The starting point for this research is the GitHub archive, and its datasets on Google BigQuery. We wanted to focus on qualified Java projects, excluding android, sample projects, and simple testers. A natural choice was to look at the most starred projects, taking in the database of the top 400,000 repositories.
We ended up with 15,797 repositories with Java source files, 4% of the initial dataset. But it didn’t stop there. Looking at the number of logging statements, we decided to only focus on projects with at least a 100 different statement, and only those who use the standard Logback / Log4j / Log4j2 / SLF4J levels: TRACE, INFO,DEBUG, WARN, ERROR and FATAL. JUL, java.util.logging, was ignored for the purpose of this post. More details on that later.
This left us with 1,313 Java project data vectors to play with. We believe this to be a fairly representative sample of what we were trying to achieve.
A more in depth walkthrough, with the actual queries and the raw data is available at the end of this post.
If you like where this is going, be sure to check out some of our previous data oriented posts, like The Top 100 Java Libraries in 2016 – After Analyzing 47,251 Dependencies, and The Top 10 Exception Types in Production Java Applications – Based on 1B Events.
The Average Java Log Level Distribution
And the winner is… INFO, making up 29.7% of the logging statements in the average project. Followed by DEBUG with 28.1%, and ERROR with 22.6%.
Closing the list are WARN (14%), TRACE (5.2%), and FATAL (0.4%).
Check your own logging distribution, and let us know how your logging statements look compared to this average. We’re curious to see how this distribution looks like in heavy duty Java applications who are not necessarily on GitHub.
On our own codebase, we got TRACE (0.55%), INFO (33.42%), DEBUG (7.92%), WARN (10.85%), and ERROR (47.26%).
Production vs. Development Logging
For the average Java application, there are 35.5% unique logging statements that have the potential to be activated in production, and 64.5% statements that are only activated in development. That’s almost DOUBLE. And not only that, naturally, log levels below WARN happen much more often, throw this into production and the difference would be much bigger.
Excessive logging can produce a lot of overhead in terms of both storage, and performance, which quickly translates to lots of money – since log management tools would charge you by the log’s volume. In one of our previous research posts, we also found out that 3% of the top unique log events, produce on average 97% of the total log size.
btw, regardless of which logging levels you’re using in production, you can get the last 250 DEBUG, TRACE and INFO statements for any issue in production with our tool.
Actual Log Levels in Use
Another cool thing the data allowed us to look at was the popularity of different log levels. How often is TRACE used compared to other levels? We’re not big fans of TRACE but looks like 55% of the projects use it:
Another interesting insight is that FATAL is only used by 20% of the projects. INFO, DEBUG, WARN, and ERROR are kings.
Examining the Data
Looking at the results from the GitHub data crunch on Google BigQuery, we first ended up with 15,797 repositories with Java source files. 60% of those, didn’t use logging at all, so they weren’t relevant to this research. These repositories mostly included test projects, small experiments, utilities, learning materials, etc.
We broke those down to projects by number of logging statements:
Another criteria for the research was that we only use projects with a considerable amount of logging baked in. Based on this breakdown, we decided to focus on projects with 100+ logging statements.
Digging in further, we looked into java.util.logging levels versus Logback / Log4j / SLF4J and decided to focus on the latter:
The 1,313 Java projects we used for the research were those that had at least a 100 logging statements, excluding those that were pure JUL, java.util.logging (FINE, FINER, FINEST, SEVERE, etc.).
How Did We Get the Data? Google BigQuery
To reach the first set of Java projects and the breakdown of their logging, we completely relied on Google BigQuery and the GitHub archive database.
First stop, getting the Java projects out of GitHub’s top 400,000 projects by stars:
This basically gave us all repos with their respective Java source files.
And… only took 11.2s, to process 134 GB. W00t. BigQuery magic right there.
Next step, getting the contents of those source files, excluding ones with android packages and android repo names. We noticed there were some Arduino projects, so we excluded those as well:
7.2s, for 12.8 GB.
Hold on tight, the final query is a bit crazy. Including all the regex to extract the count for the different logging levels:
16.9s, for 18.1 GB.
The final spreadsheet with all the data and some additional calculations is available right here. There’s lots of data to look at, and more conclusions to be drawn. We’ll probably do another follow up post with additional insights, so follow @overopshq on twitter and stay tuned for additional updates!
If you find more interesting insights, please let us know in the comments section below, and we’ll publish those in the follow up post.
We’ve grown accustomed to digging in log files to find out how our applications behave in production. But it doesn’t have to be like this. Generally speaking, log files suck. Tons of unstructured text, sometimes the information you’re looking for wasn’t even logged, and then there’s the debugging paradox, adding logging statements and hoping the error that sent you on that chase would happen AGAIN. But… Java debugging doesn’t have to look like this:
There’s another way.