Apache Ranger is an open-source authorization solution that provides access control and audit capabilities for Apache Hadoop and related big data environments. It is an integral part of on-prem Hadoop installations, of commercial distributions such as Cloudera Data Platform, and of managed cloud Hadoop offerings such as Amazon EMR, GCP Cloud Dataproc, and Azure HDInsight.
Its plugin-based architecture enables support for Hive, HBase, Kafka, Impala, Kylin, NiFi, Spark, Presto, and more. The plugin model also extends Ranger beyond the Hadoop ecosystem, for example to secure AWS S3 objects. Yet there are no open-source connectors for modern data platforms such as Databricks, Snowflake, BigQuery, Azure Synapse, or Redshift.
I have taken an interest in data governance for engineers and have observed a lot of disruption to data access when migrating from Hadoop platforms to modern cloud data architectures such as Databricks, Synapse, or Snowflake. Data access patterns have become more complex as the cloud unlocks more use cases and workloads.
Ranger is an open-source project for the Hadoop ecosystem, created by Hortonworks back in 2014, that provides policy management around access to a resource (file, folder, database, table, column, etc.) for a particular set of users and/or groups. To contrast the impacts of data access governance in the cloud, I created a set of policy scenarios from use cases and acceptance criteria collected and generalized from cloud expansion projects involving sensitive data environments. This article explores the different access scenarios and showcases the implementation steps in Apache Ranger, to understand how cloud adoption readiness impacts this project.
Policy scenarios use the TPC-DS dataset. TPC-DS models the decision support functions of a retail product supplier. The supporting schema contains vital business information, such as customer, order, and product data that is intended to mimic much of the complexity of a real retail data warehouse.
The TPC-DS schema models the sales and sales returns process for an organization that employs three primary sales channels: stores, catalogs, and the Internet. The schema includes seven fact tables:
In addition, the schema includes 17 dimension tables that are associated with all sales channels.
TPC-DS is created and maintained by the Transaction Processing Performance Council. More details may be found at tpc.org.
My objective is to show policy maintenance as a measure of scale. For each scenario, I count the following aspects:
Acquire an EC2 instance from Amazon via https://portal.aws.amazon.com/
Select Installation Type: Docker - Follow the instructions to install Docker on the EC2 instance
Deploy HDP to Docker container
Other Policy Scenarios | Result |
---|---|
3a - Grant permission based on AND Logic | Unable to create policy |
3b - Minimization Policies | Unable to create policy |
3c - De-identification | Unable to create policy |
3d - Policies with Purpose | Unable to create policy |
The cumulative number of policy changes, climbing to 733 across the first 11 policy scenarios reflective of cloud adoption, was higher than I expected, and a number of scenarios were not possible at all, which will limit data access for those requirements.
From my perspective, the number of policy changes is dangerously high and poses the following risks:
1. Misconfiguration
According to Verizon’s 2021 Data Breach Investigations Report, misconfiguration is one of the top threats, accounting for over 70% of all security errors. The high number of Ranger policy changes leaves too much room for misconfiguration and human error.
2. Time and effort needed
Making policy changes, collecting approvals, validating policies, and running periodic reviews are time-consuming tasks, which can become a bottleneck for companies with a large number of data assets and/or a complex organizational structure.
3. Engineers (de)motivation
Jeff Magnusson wrote a brilliant piece on why Engineers Shouldn’t Write ETL. I agree with the author that everybody wants to be the “thinker” and deliver business value. I haven’t yet met an engineer who enjoys working on access policy changes.
4. The fear of saying “no” to the business
4 of the 15 policy scenarios were not possible to implement in Ranger. Clever engineers will always find ways to overcome solution limitations, and there are ways “to hack” Ranger to meet all the requirements. However, it has to be a conscious decision what type of customization is allowed for enterprise security and access management, and how one validates and certifies it afterwards.
The central Total Protection Co. data science team has just created their new retail data warehouse mart, entitled TPC-DS. Now it’s time to start getting it into the hands of employees across the company.
24 tables from the TPC-DS dataset
All users are in the existing Active Directory group:
As good stewards of their data, the team wants to mask any data across all tables that is classified as Personally Identifiable Information (PII) by an automated service.
Only individuals who have been manually approved by the admin team should be able to access PII data.
All tables that contain PII. According to Google’s Data Loss Prevention service, this includes the customer and call_center tables.
All data suspected to be PII should be made null.
Users with override permission are manually approved and do not belong to an AD group.
Because no group exists, the Ranger policy must hardcode the exempted users from the PII requirement. A recommended option would be to create a dedicated group for users with PII access.
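A sketch of what such a policy looks like as a payload for Ranger’s public v2 REST API. The service name (`hive_prod`), the exempt user names, and the specific PII columns are all assumptions for illustration; only the field names (`policyType`, `resources`, `dataMaskPolicyItems`) follow Ranger’s API:

```python
import json

def build_pii_mask_policy(exempt_users):
    """Mask PII columns to NULL for everyone except hardcoded exempt users."""
    return {
        "service": "hive_prod",          # hypothetical Hive service name
        "name": "tpcds-pii-mask-null",
        "policyType": 1,                 # 1 = data masking policy in Ranger
        "resources": {
            "database": {"values": ["tpcds"]},
            "table": {"values": ["customer", "call_center"]},
            "column": {"values": ["c_first_name", "c_last_name"]},  # assumed PII columns
        },
        # Mask items are evaluated in order: the exemption must come first,
        # otherwise the catch-all MASK_NULL item would match everyone.
        "dataMaskPolicyItems": [
            {
                "dataMaskInfo": {"dataMaskType": "MASK_NONE"},
                "users": exempt_users,   # hardcoded: no AD group exists
                "accesses": [{"type": "select", "isAllowed": True}],
            },
            {
                "dataMaskInfo": {"dataMaskType": "MASK_NULL"},
                "groups": ["public"],
                "accesses": [{"type": "select", "isAllowed": True}],
            },
        ],
    }

payload = json.dumps(build_pii_mask_policy(["alice", "bob"]))
# POST payload to <ranger-host>/service/public/v2/api/policy to create it.
```

Note that every change to the exempt list means editing this hardcoded `users` array, which is exactly the maintenance burden counted in the scenario tallies.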
After reviewing the policy, the team has decided some information would be valuable to allow through the masking policy.
All tables that contain an email address. There are 2 instances of this in the TPC-DS tables: the customer and promotion tables.
The same logic as prior scenarios should be applied.
Authorized users should be able to see email domains only, without the username.
Ranger does not allow nesting of columns and instead relies on the “ordering” of policy rules.
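Since columns cannot be nested, keeping only the domain has to be expressed as a custom mask over the whole email column. The Python below merely illustrates the intended transform; in Ranger this would be a “Custom” mask type whose HiveQL expression (for example a `regexp_extract` over `c_email_address`, an assumption on my part) produces the same output:

```python
import re

def mask_email_domain_only(email):
    """Strip the username, keeping only the domain of an email address."""
    return re.sub(r"^[^@]+@", "", email)

print(mask_email_domain_only("jane.doe@example.com"))  # example.com
```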
Two new users have been assigned to the team and need to be authorized to access all PII data.
The tpcds database tables with PII.
The users are not part of an AD group that contains them.
Two new users should be able to see all PII data and email domains only, without the username.
For each of the masking policies created, add the users to the “Allow” and “Exclude from Deny” condition selection box. A recommended option would be to create a dedicated group for users with PII access.
Scenario | 1a | 1b | 1c | 1d |
---|---|---|---|---|
Policies created | 1 | 2 | 2 | 0 |
Policies edited | 1 | 0 | 1 | 5 |
User attributes created | 1 | 4 | 4 | 0 |
User attributes edited | 0 | 0 | 2 | 7 |
Tags created | 0 | 1 | 1 | 0 |
Tags applied | 0 | 1 | 2 | 2 |
A retailer wants to share a dataset of store performance, broken out by store. However, they do not want store managers to see the performance of their peers.
Each group has an existing Active Directory group:
group | access | condition |
---|---|---|
store-1 | select | ss_store_sk = <store_1_sk> |
store-2 | select | ss_store_sk = <store_2_sk> |
store-3 | select | ss_store_sk = <store_3_sk> |
… + 9 | ||
central-office | select | < no conditions > |
Create an Access Policy to Allow 12 Store Groups to Access tpcds database:
Create an Access Policy to Allow 12 Store Groups to Access store_sales tables:
Create a Row Level Filter Policy to Allow Each Store Group to Access to their store_sales table, respectively:
It is required to “unpack” user groups, because the AD groups do not map exactly to the values registered in the database. In Ranger, the user needs to hardcode the mapping of an individual group to a “where clause”, resulting in a policy with 51 conditions (and potentially many more).
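The “unpacking” above can be sketched as generating one row-filter item per group. The group-to-key mapping is hypothetical, and in Ranger each resulting pair must still be entered by hand; the `rowFilterInfo`/`filterExpr` field names follow Ranger’s public v2 API:

```python
# Hypothetical mapping of AD group names to the surrogate keys stored in
# store_sales; AD groups carry no attribute linking them to these keys,
# which is why the mapping must be hardcoded in the policy.
GROUP_TO_STORE_SK = {"store-1": 1, "store-2": 2, "store-3": 3}  # ... + 9 more

def row_filter_items(mapping, column="ss_store_sk"):
    """Build one (group, filter expression) pair per store group."""
    return [
        {"groups": [group], "rowFilterInfo": {"filterExpr": f"{column} = {sk}"}}
        for group, sk in mapping.items()
    ]

for item in row_filter_items(GROUP_TO_STORE_SK):
    print(item["groups"][0], "->", item["rowFilterInfo"]["filterExpr"])
```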
A store manager has recently departed from the company, and her store will be managed by another store manager in the interim period.
Same data from Scenario 2a.
Same attributes from Scenario 2a.
In addition to the Scenario 2a requirements, the interim store manager should be able to see both the original and new region, without changing the dataset.
The updated Row Level policy would also entitle the original store manager from store-1 to access the store-2 sales data.
The data team has added a second page to the store dashboard showing employee satisfaction. As in Scenario 2a, each manager should only have access to their store or region’s records.
Besides the “store_sales” table, we want to apply the same policy to all tables containing the “store_sk” key (which have different column names in each table).
Same as prior policy scenario.
Requirements are identical to Scenarios 2a but applied to all 3 tables.
You have to create new, separate policies for each table (they cannot even be duplicated, because the keyed-on column name changes).
The organization has decided to undergo an organizational restructuring, and they are adding in a new layer of “regional managers” with authorization into multiple stores.
Same as scenarios 2a-2c.
The organization has added another layer of groups, so that they now have:
group | access | condition |
---|---|---|
region-1 | select | ss_store_sk in (<store_1_sk>, <store_2_sk>, …) |
… + 5 more |
Update the Access to tpcds Database Policy by adding the 6 Region AD Groups
Update the Access to 3 Store Tables Policies by adding the 6 Region AD Groups
Update the Store Sales Row Level Filter Policy by adding the 6 Region AD Groups
Update the Store Returns Row Level Filter Policy by adding the 6 Region AD Groups
Update the Store Metadata Row Level Filter Policy by adding the 6 Region AD Groups
Once again, we are tripling the amount of policy work that needs to be done in Ranger by having to update multiple policies. We are also continuing to embed user authorizations into a SQL table, rather than manifesting them as user attributes in our policy management system, which reduces visibility into why a certain user may or may not have the appropriate access.
Your company has tripled in size and is expanding into new countries. A new internal policy has been passed that prohibits users from seeing record-level information on individuals in countries outside of their own country (unless explicitly authorized).
This new regulation applies to all tables in the schema that contain a customer reference. This includes six transaction fact tables and the customer dimension table:
A new AD group has been added to individual users indicating their country.
The Ranger approach will require the creation of 7 new policies, each uniquely tailored to the column name(s) of the data source.
group | access | condition |
---|---|---|
country-usa | select | c_birth_country = ‘USA’ |
country-germany | select | c_birth_country = ‘Germany’ |
country-mexico | select | c_birth_country = ‘Mexico’ |
… + 5 | ||
central-office | select | < no conditions > |
group | access | condition |
---|---|---|
country-usa | select | <XX_customer_sk> in (select c_customer_sk from db.schema.customer where c_birth_country = ‘USA’) |
… + 7 | ||
central-office | select | < no conditions > |
sr_customer_sk in (select c_customer_sk from customer where c_birth_country='UNITED STATES')
sr_customer_sk in (select c_customer_sk from customer where c_birth_country='JAPAN')
ss_customer_sk in (select c_customer_sk from customer where c_birth_country='UNITED STATES')
ss_customer_sk in (select c_customer_sk from customer where c_birth_country='JAPAN')
group | access | condition |
---|---|---|
country-usa | select | <XX_col1_customer_sk> in (select c_customer_sk from db.schema.customer where c_birth_country = ‘USA’) AND <XX_col2_customer_sk> in (select c_customer_sk from db.schema.customer where c_birth_country = ‘USA’) |
… + 7 | ||
central-office | select | < no conditions > |
cr_refunded_customer_sk in (select c_customer_sk from customer where c_birth_country='UNITED STATES') AND cr_returning_customer_sk in (select c_customer_sk from customer where c_birth_country='UNITED STATES')
cr_refunded_customer_sk in (select c_customer_sk from customer where c_birth_country='JAPAN') AND cr_returning_customer_sk in (select c_customer_sk from customer where c_birth_country='JAPAN')
cs_bill_customer_sk in (select c_customer_sk from customer where c_birth_country='xxx') AND cs_ship_customer_sk in (select c_customer_sk from customer where c_birth_country='xxx')
wr_refunded_customer_sk in (select c_customer_sk from customer where c_birth_country='xxx') AND wr_returning_customer_sk in (select c_customer_sk from customer where c_birth_country='xxx')
ws_bill_customer_sk in (select c_customer_sk from customer where c_birth_country='xxx') AND ws_ship_customer_sk in (select c_customer_sk from customer where c_birth_country='xxx')
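Because the customer-key column names differ per fact table, each of the 7 policies needs its own hand-written filter expression. As a sketch, the expressions above could all be generated from one table-to-columns mapping (the column lists come from the TPC-DS schema; the helper itself is hypothetical, since Ranger offers no such templating):

```python
# Customer-key columns per TPC-DS fact table; note the per-table names.
TABLE_CUSTOMER_COLS = {
    "store_sales": ["ss_customer_sk"],
    "store_returns": ["sr_customer_sk"],
    "catalog_sales": ["cs_bill_customer_sk", "cs_ship_customer_sk"],
    "catalog_returns": ["cr_refunded_customer_sk", "cr_returning_customer_sk"],
    "web_sales": ["ws_bill_customer_sk", "ws_ship_customer_sk"],
    "web_returns": ["wr_refunded_customer_sk", "wr_returning_customer_sk"],
}

def country_filter(table, country):
    """AND together one subquery per customer-key column of the table."""
    return " AND ".join(
        f"{col} in (select c_customer_sk from customer "
        f"where c_birth_country='{country}')"
        for col in TABLE_CUSTOMER_COLS[table]
    )

print(country_filter("catalog_returns", "JAPAN"))
```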
Mismatches between the AD group name (country-mexico) and the actual value in the backing table (“MEXICO” vs. “MX” vs. “Mexico”) could cause issues.
New data teams have been established in the global office, and all policies should be updated to allow these teams to access all store records and all customer records.
The data is the same as in previous scenarios.
There is a new AD group:
Users on the analytics team should have access to all records by default.
Ranger’s approach does not seem like a scalable solution; it requires reworking a dozen policies.
With an analytics team in place with broad access to data, they have created dozens of derivative datasets based on the original data products. As they plan to re-expose these new tables into data products for store management, all policies need to be reapplied.
All users/groups in previous scenarios are present.
Existing policies should be applied on derived datasets.
As a dataset scales, Ranger requires new policies to be created, and the complexity keeps increasing.
Scenario | 2a | 2b | 2c | 2d | 2e | 2f | 2g |
---|---|---|---|---|---|---|---|
Policies created | 3 | 6 | 15 | ||||
Policies edited | 1 | 3 | 5 | 4 | 14 | 2 | |
User attributes created | 37 | 28 | 30 | 81 | 140 | 310 | |
User attributes edited | 1 | 14 | |||||
Tags created | |||||||
Tags applied | 2 |
The data team has uploaded human resources data which is considered Personal AND Business Confidential. Existing groups for “May access personal data” and “May access sensitive data” already exist.
Schema:
Size: 10,000 records Tags: [Personal, Business Confidential]
Only users in both these groups should have access to the new human resources data.
To limit expensive queries that rapidly escalate compute costs in the cloud, the finance team requires a policy limiting data access to 25% of the data in a table for all users, since more consumers are accessing data with different tools and there is limited control over what queries are generated.
Create a policy to enforce this for all users unless they are part of the analytics team.
Apache Ranger cannot limit data access by the volume of data consumed from a table. A suggestion is to create a tool or application that stores the table in a cache and captures/updates the real-time data consumption rate. A plugin similar to Ranger’s Hive plugin would be required to integrate the cached usage data into Apache Ranger. A policy could then be created in Apache Ranger to enforce table access for all users against the predefined consumption limit (e.g., 25% of the data), unless they are part of the analytics team.
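The companion-service idea above can be sketched as a simple per-user usage ledger. Everything here (the class, its methods, the exemption flag) is hypothetical scaffolding for such a tool, not a Ranger feature:

```python
class UsageQuota:
    """Toy ledger: rows served per (user, table), capped at a fraction."""

    def __init__(self, limit_fraction=0.25):
        self.limit_fraction = limit_fraction
        self.consumed = {}          # (user, table) -> rows served so far

    def allow(self, user, table, table_rows, requested_rows, exempt=False):
        """True if the request stays under the cap (analytics team exempt)."""
        if exempt:
            return True
        key = (user, table)
        used = self.consumed.get(key, 0)
        if used + requested_rows > table_rows * self.limit_fraction:
            return False            # would exceed 25% of the table
        self.consumed[key] = used + requested_rows
        return True
```

A Ranger-style plugin would then consult this ledger on each query, which is exactly the kind of custom integration the suggestion above calls for.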
The legal team has concerns about insider attacks on customer data, requires guarantees against linkage attacks, and needs to see sufficient proof that you have demonstrated this. For any combination of quasi-identifiers (QIs), such as c_birth_country and c_birth_year in the [customer] table, how would you mask those values using k-anonymity such that the least populated cohort is greater than 4? The k-anonymity of a data source is defined as the number of records within the least populated cohort, which means that the QIs of any single record cannot be distinguished from those of at least k − 1 other records. In this way, a record with QIs cannot be uniquely associated with any one individual in a data source, provided k is greater than 1.
Masking policies in Apache Ranger do not support calculating and enforcing k-anonymity for a given value of k. I would suggest creating a User Defined Function (UDF) in the Hive database to calculate the k-anonymity over the c_birth_country and c_birth_year columns in the customer table. A masking policy can then be created against the customer table, invoking the UDF in the policy’s conditions to apply the k-anonymity threshold and provide guarantees against linkage attacks.
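To make the requirement concrete, here is a minimal sketch of what such a UDF would have to compute, in Python rather than HiveQL: k is the size of the smallest QI cohort, and cohorts smaller than the threshold get their QI values suppressed. (A single suppression pass like this is a simplification; real anonymization iterates or generalizes values instead of just nulling them.)

```python
from collections import Counter

def k_anonymity(rows, qi_cols):
    """k = size of the least populated quasi-identifier cohort."""
    cohorts = Counter(tuple(row[c] for c in qi_cols) for row in rows)
    return min(cohorts.values())

def suppress_small_cohorts(rows, qi_cols, k=5):
    """Null out QI values for records in cohorts smaller than k."""
    cohorts = Counter(tuple(row[c] for c in qi_cols) for row in rows)
    return [
        {**row, **{c: None for c in qi_cols}}
        if cohorts[tuple(row[c] for c in qi_cols)] < k
        else row
        for row in rows
    ]
```

With k=5, the least populated surviving cohort is greater than 4, matching the legal team’s requirement.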
Customer data can only be used by the analytics team for the purpose of fraud detection for [orders] shipped to less common countries. The policy requires that purpose must be acknowledged by the user prior to accessing data.
Apache Ranger resource-based access control policies cannot support an acknowledgement step prior to accessing data. I would suggest creating a User Defined Function (UDF) in the Hive database that redirects users to a web site or a pop-up message box to acknowledge the purpose, and allows access to the table once the acknowledgement is made. A Hive view with the embedded UDF over the customer table would be required. The data access policy would then be applied to the Hive view, and access would be granted only to the analytics team. When analytics team members query the view, it redirects them to the web site or pop-up for acknowledgement before returning the data.
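A sketch of the acknowledgement gate such a UDF/view workaround would implement; the class and method names are hypothetical, and in practice this state would have to live somewhere the UDF can reach:

```python
class PurposeGate:
    """Track which users have acknowledged the approved purpose."""

    def __init__(self, required_purpose):
        self.required_purpose = required_purpose
        self.acknowledged = set()   # users who confirmed the purpose

    def acknowledge(self, user, purpose):
        """Record an acknowledgement (e.g. submitted via the redirect page)."""
        if purpose == self.required_purpose:
            self.acknowledged.add(user)

    def can_access(self, user, in_analytics_team):
        """Grant access only to analytics-team members who acknowledged."""
        return in_analytics_team and user in self.acknowledged
```

The point of the sketch is that both conditions (team membership and a recorded acknowledgement) must hold, and only the first of them is expressible in a Ranger policy.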
Apache Ranger played a key role in Hadoop adoption across various industries; companies used Ranger to protect data assets and ensure compliance. As showcased in the scenarios above, Ranger could meet some of the basic requirements, but other scenarios required unnecessary repetition and overengineering, or were not possible to implement at all.
Yet the world and its technology do not stand still:
These trends put a lot of pressure on data security, privacy, and access management. Apache Ranger was a powerful tool for Hadoop, yet its overly complex access policy management makes me question its place in the modern data stack.
Other projects