Apache Ranger and AWS EMR Automated Installation and Integration Series (2): OpenLDAP + EMR-Native Ranger

In the first article of this series, we took a broad look at the available EMR and Ranger integration solutions. From here on, we will introduce the concrete solutions one by one. This article covers “Scenario 1: OpenLDAP + EMR-Native Ranger.” We will introduce the architecture of the solution, describe the installation steps in detail, and verify the installed environment.

1. Solution Overview 

1.1 Architecture

In this solution, OpenLDAP acts as the authentication provider and stores all user account data, while Ranger acts as the authorization controller. Because we chose the EMR-native Ranger solution, which depends heavily on Kerberos, a Kerberos KDC is required. We recommend a cluster-dedicated KDC created by EMR rather than an external KDC, because this saves us the work of installing Kerberos ourselves. If you already have a KDC, this solution supports that as well.
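To make the cluster-dedicated KDC choice more concrete, here is a minimal sketch of the Kerberos portion of an EMR security configuration, built as a Python dictionary. The function name `build_security_configuration` and the 24-hour ticket lifetime are assumptions chosen for illustration, not values taken from this series. A complete EMR-native Ranger setup would also need an authorization section, which is omitted here.

```python
import json


def build_security_configuration(ticket_lifetime_hours: int = 24) -> dict:
    """Return the Kerberos part of an EMR security configuration.

    Hypothetical helper for illustration only. The ticket lifetime
    default is an assumed value; adjust it to your own policy.
    """
    return {
        "AuthenticationConfiguration": {
            "KerberosConfiguration": {
                # Ask EMR to create a KDC dedicated to this cluster,
                # instead of pointing at an external KDC.
                "Provider": "ClusterDedicatedKdc",
                "ClusterDedicatedKdcConfiguration": {
                    "TicketLifetimeInHours": ticket_lifetime_hours
                },
            }
        }
    }


if __name__ == "__main__":
    # EMR expects the security configuration as a JSON string.
    print(json.dumps(build_security_configuration(), indent=2))
```

The resulting JSON string is the kind of document that would be supplied when creating an EMR security configuration, for example through the console, the CLI, or an SDK.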

Apache Ranger and AWS EMR Automated Installation and Integration Series (1): Solutions Overview

System security usually includes two core topics: authentication and authorization. The first answers the question “Who is s/he?” and the second answers “Does s/he have permission to perform an operation?” In the big data area, Apache Ranger is one of the most popular choices for authorization; it supports all mainstream big data components, including HDFS, Hive, HBase, and so on. As Amazon EMR rolls out native Ranger (plugin) features, users can manage authorization for EMRFS (S3), Spark, Hive, and Trino all together. For authentication, an organization usually has its own centralized authentication infrastructure, e.g., Windows AD or OpenLDAP. However, for most big data components, Kerberos is the only supported authentication mechanism, so users usually need to integrate Windows AD/OpenLDAP with Kerberos to unify authentication.

We will focus on how to implement automated installation and integration for Amazon EMR and Apache Ranger. This series is composed of four articles. Each article introduces a complete solution for a different technology stack.