How to Implement Enterprise User Management with Java Single Sign-On SAML Support (and Stay Alive)

6th Sep 2017

How we added Single Sign-On SAML support and debugged it in production

Teams may use dozens of services and applications to complete their tasks. Each service requires users to sign up and remember a complicated password, and the internal owners of each service end up handling login credentials and account settings manually.
That’s where SAML comes in. It aims to give companies and services a better authentication process, without users having to sign up manually for each and every service. The following post covers how you can implement SAML Single Sign-On in your company, and the benefits you’ll gain from it.

What’s SAML and what is it good for?

SAML, Security Assertion Markup Language, is an open standard data format for exchanging authentication and authorization data between identity providers and service providers. It’s a security protocol in the same family as OpenID, OAuth (which we have also written about), Kerberos and others.
One of the main use cases SAML addresses is Single Sign-On (SSO) across services, which offers a simple login experience through another service. A familiar example of that experience is using a Google account to sign up for Stack Overflow. By applying SSO, companies can use the protocol for access control.
This method provides great value for both the user and the company. Users don’t have to manage and remember passwords for various applications, yet they still get the information and services they need. Companies get a simple way to identify internal users and give them the data they need. Also, when an employee leaves the company, it’s easier to revoke one main account than to search for and remove a separate account in every service that user had access to.

Implementing SAML

SAML defines two parties in the authentication process:

  • Service provider – The service the user wants to log in to
  • Identity provider – The system that authenticates the user and vouches for their identity to the service

These two providers establish trust by passing XML metadata files from one to the other. This is a one-time configuration step. Afterwards, when a user tries to access the service, the two providers exchange SAMLRequest and SAMLResponse messages, which are XML strings that perform the actual authentication. The user goes through this process once, and is then signed in to the desired application through SSO.

It’s not the SAM(L)E for everyone

This process may sound easy, but that’s not always the case. While most companies work with a single Identity Provider (IdP) which can be tested easily, the service provider might need to support a variety of different vendors.
That’s one of the issues we came across here at OverOps when implementing SAML support for the tool that we’re building. We have different customers, ranging from small startups with 5 employees to enterprise companies with thousands of users. While we support logging in with a password and with OAuth through Google or GitHub, we wanted companies with existing SAML solutions to be able to log in seamlessly to our service. And so we needed to support multiple Identity Providers, such as Okta, PingOne, SSOCircle, OneLogin and others.
That’s why we decided to use the OpenSAML project, which should support all of them. SAML is a protocol with many details and many possible vendors, and each company can configure it differently with its own quirks and nuances. We want to support all of them.

How did we do it? Testing in production

When we first implemented the new SAML feature, we rolled it out in beta and tried to test it with as many vendors as we could. We knew that we might come across some differences during this process, and wanted to see which issues might arise.
For most companies this process worked flawlessly, but we did come across some issues. Some companies used a different encoding for their strings, other companies didn’t pass the standard attributes (first name, last name, etc.) with the standard attribute names, and some companies had custom implementations.
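One way to cope with non-standard attribute names is to map the names each vendor sends onto the names the application expects. The following is a minimal sketch of that idea, not the code we shipped: the class `AttributeNormalizer`, the canonical names `firstName` and `lastName`, and the alias list are all assumptions chosen for illustration, and a real IdP may send names that are not listed here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps vendor-specific attribute names to canonical ones.
final class AttributeNormalizer {
    private static final Map<String, String> ALIASES = new HashMap<>();

    static {
        // Illustrative aliases only (keys are lower-cased); extend per vendor as needed.
        ALIASES.put("firstname", "firstName");
        ALIASES.put("first_name", "firstName");
        ALIASES.put("givenname", "firstName");
        ALIASES.put("urn:oid:2.5.4.42", "firstName");
        ALIASES.put("http://schemas.xmlsoap.org/ws/2005/05/identity/claims/givenname", "firstName");
        ALIASES.put("lastname", "lastName");
        ALIASES.put("last_name", "lastName");
        ALIASES.put("surname", "lastName");
        ALIASES.put("sn", "lastName");
        ALIASES.put("urn:oid:2.5.4.4", "lastName");
        ALIASES.put("http://schemas.xmlsoap.org/ws/2005/05/identity/claims/surname", "lastName");
    }

    // Returns a copy of the attributes with known aliases renamed.
    static Map<String, String> normalize(Map<String, String> raw) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String key = entry.getKey().trim().toLowerCase(Locale.ROOT);
            String canonical = ALIASES.get(key);
            // Unknown attributes are kept under their original name.
            result.put(canonical != null ? canonical : entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

Keeping unknown attributes under their original name means nothing is silently dropped, so an unexpected name from a new vendor still shows up when debugging.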
Since we use OverOps for our own production environment, we were able to easily see the SAMLResponse messages that were failing for some customers and what caused them to fail. That made reproducing the errors in our development environment pretty straightforward, and we created unit tests from the captured input.
This is what we mean when we say that OverOps provides a proactive approach. Not only did we know that some of our customers were experiencing issues with the new feature, we saw the error’s root cause before they even contacted us to report that an error happened in the first place.

In this image we can see the SAMLResponse that we were able to analyze, along with a graph showing that the error trend has subsided.

With OverOps, we were able to filter and monitor trends to deduce new insights. For example, we were able to see the trend of the exception associated with this login failure, and make sure that it didn’t resurface after we deployed a fix.

Implementing SAML in OverOps

To get started we used Dead Simple SAML 2.0 Client as a reference, since the first vendor we wanted to support was Okta. The author of this repo, Martin Laporte, had implemented a special case for handling it and also provided a dead simple use of SAML that worked. We quickly created a customized version to fit our own needs.
Also, since SAML solutions don’t usually need to support multiple types of vendors, it is common to put the IdP metadata in an XML file and just load it. But since in our case we wanted to support the option of configuring new IdP metadata online for every company, along with loading the right metadata for the right company, we stored it in an internal database where the company name is the key, and the metadata is the value.
So our code, roughly speaking, starts by looking up the metadata XML for the company that is logging in.
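The original snippet is not reproduced here, so the following is only a minimal sketch of the lookup described above. It assumes an in-memory map in place of the internal database, and the class name `IdpMetadataStore`, its methods, and the case-insensitive key handling are hypothetical choices made for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the database table: company name -> IdP metadata XML.
final class IdpMetadataStore {
    private final Map<String, String> metadataByCompany = new ConcurrentHashMap<>();

    // Stores (or replaces) the metadata a company configured online.
    void save(String company, String metadataXml) {
        metadataByCompany.put(key(company), metadataXml);
    }

    // Returns the metadata for the company, or empty if SAML is not configured for it.
    Optional<String> find(String company) {
        return Optional.ofNullable(metadataByCompany.get(key(company)));
    }

    // Assumption: company names are matched case-insensitively.
    private static String key(String company) {
        return company.trim().toLowerCase(Locale.ROOT);
    }
}
```

Returning an `Optional` makes the "company has no SAML configuration" case explicit, so the caller can fall back to password or OAuth login.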

Then we wrap the stored metadata string in a StringReader and pass that reader to SamlClient.fromMetadata.
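The call into the SAML client library is not shown here. As a rough illustration of why a `StringReader` is convenient, the sketch below feeds metadata held in a string to a consumer that only accepts a `Reader`. The class `MetadataReaderSketch` and the method `readEntityId` are hypothetical and use only the JDK XML parser; they stand in for the library and read nothing beyond the `entityID` attribute of the root element.

```java
import java.io.Reader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical stand-in for a metadata consumer that takes a Reader.
final class MetadataReaderSketch {
    // Reads the entityID attribute from the root element of the IdP metadata.
    static String readEntityId(Reader metadata) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations, since the metadata is supplied by a third party.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(metadata));
            return doc.getDocumentElement().getAttribute("entityID");
        } catch (Exception e) {
            throw new IllegalArgumentException("IdP metadata could not be parsed", e);
        }
    }
}
```

With the metadata loaded from storage as a string, the call would look like `MetadataReaderSketch.readEntityId(new StringReader(metadataXml))`, with no temporary file needed.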

And finally, we call decodeAndValidateSamlResponse with the encoded SAMLResponse that we talked about earlier.

Under the hood, decodeAndValidateSamlResponse performs base64 decoding on the response, parses it, and runs validation steps that check whether it adheres to the standard SAMLResponse format.
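To make those steps concrete, here is a simplified sketch of the decode, parse and check sequence using only the JDK. It is not the library's implementation: the class `SamlResponseSteps` and its methods are hypothetical, and the check is structural only. It does not verify the XML signature, the audience, or the validity timestamps, all of which a real validation must cover.

```java
import java.io.ByteArrayInputStream;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration of the three steps; NOT a complete SAML validation.
final class SamlResponseSteps {
    static final String PROTOCOL_NS = "urn:oasis:names:tc:SAML:2.0:protocol";
    static final String SUCCESS = "urn:oasis:names:tc:SAML:2.0:status:Success";

    // Step 1: the posted form field carries the XML as Base64 text (line breaks tolerated).
    static byte[] decode(String encodedResponse) {
        return Base64.getMimeDecoder().decode(encodedResponse);
    }

    // Step 2: parse the XML with DOCTYPE declarations disabled.
    static Document parse(byte[] xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalArgumentException("SAMLResponse is not well-formed XML", e);
        }
    }

    // Step 3: structural check only: root is samlp:Response and the status is Success.
    static boolean isSuccessfulResponse(Document doc) {
        Element root = doc.getDocumentElement();
        if (!PROTOCOL_NS.equals(root.getNamespaceURI()) || !"Response".equals(root.getLocalName())) {
            return false;
        }
        NodeList codes = root.getElementsByTagNameNS(PROTOCOL_NS, "StatusCode");
        return codes.getLength() > 0
                && SUCCESS.equals(((Element) codes.item(0)).getAttribute("Value"));
    }
}
```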
Each of these steps might fail, since the response can come from different vendors with different setups. This is why we wrapped the call in a try block with multiple catch clauses, one for each relevant exception type.
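The shape of that error handling can be sketched as follows. This is an illustration under assumptions rather than our production code: the class `SamlFailureClassifier` and the `Outcome` labels are hypothetical, and the steps inside the try block are simplified JDK-only stand-ins for the library call. The point is that each step fails with a different exception type, so each catch clause can report a distinct cause.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

final class SamlFailureClassifier {
    // Hypothetical outcome labels, used here only to make each failure mode visible.
    enum Outcome { OK, NOT_BASE64, MALFORMED_XML, NOT_A_SAML_RESPONSE, INTERNAL_ERROR }

    static Outcome classify(String encodedResponse) {
        try {
            // The strict decoder throws IllegalArgumentException on invalid input.
            byte[] xml = Base64.getDecoder().decode(encodedResponse.trim());
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Element root = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml))
                    .getDocumentElement();
            if (!"urn:oasis:names:tc:SAML:2.0:protocol".equals(root.getNamespaceURI())
                    || !"Response".equals(root.getLocalName())) {
                throw new IllegalStateException("root element is not samlp:Response");
            }
            return Outcome.OK;
        } catch (IllegalArgumentException e) {
            return Outcome.NOT_BASE64;          // decoding step failed
        } catch (SAXException e) {
            return Outcome.MALFORMED_XML;       // parsing step failed
        } catch (IllegalStateException e) {
            return Outcome.NOT_A_SAML_RESPONSE; // structural check failed
        } catch (ParserConfigurationException | IOException e) {
            return Outcome.INTERNAL_ERROR;      // parser setup or I/O problem on our side
        }
    }
}
```

Separating the causes this way makes it easier to tell a vendor-specific quirk (for example, an unexpected encoding) from a problem on the service side.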

Not everything is perfect

These integrations include A LOT of technical details, and recording each session would add huge amounts of noise to your logs. In some cases, you can’t even know what you should log, since the errors are completely unexpected and can only surface in production. This is why OverOps was essential for solving the errors we encountered in a timely fashion. To see how it works, check out this live demo.
Also, while our case is specific to SAML, it’s representative of challenges companies have with developing large systems that have many different moving parts. We as a company want to support as many tools, plugins, integrations and other abilities as we can, and while the customer is always right, their input can be wrong. Or at least unexpected.

Final thoughts

Implementing SAML is just a small step in the ever-growing ecosystem of elements companies want, need or should integrate with. We can spend time trying to anticipate the different edge cases, but that’s not practical, and in most cases you can’t foresee the future. That’s why you need tools that will help you cope as your products grow, detecting errors and identifying their root cause when they happen, before your customers complain about them.

Shimon is a software engineer at OverOps working on the microagent team. When he's not busy coding, you will hear him singing in perfect pitch - most likely something from the Hamilton soundtrack.
