IBM BigInsights Big SQL v4.2 introduces a new feature called Impersonation. Impersonation suits some use cases but not others. A thorough review of the entire data lifecycle, from production to consumption, is needed before modifying the security model for accessing the data. We explore some of these use cases here.
In Hive versions 0.12 and older, there was very little support for database-like security, where the table data in HDFS could be owned by a service owner like the hive user, and grant/revoke could be securely used to control authorizations for tables. This support came only in later versions of Hive, which are included in IBM BigInsights v4.x.
For IBM BigInsights v3.x, the only viable option for setting up authorization controls for access via Hive is to enable Impersonation by setting the Hive configuration property hive.server2.enable.doAs to true. This is the default setting when installing IBM BigInsights v3.x and v4.x on-prem.
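As a rough illustration of what this setting looks like in a Hadoop-style configuration file, the following Python sketch renders a property entry in the usual `<configuration>`/`<property>` XML layout. The helper names `hadoop_property` and `render_configuration` are made up for this example, and on a managed cluster the value would normally be changed through the cluster's administration tooling rather than generated by hand.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name, value):
    """Build one <property> element with <name> and <value> children."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


def render_configuration(settings):
    """Wrap each name/value pair in a <configuration> root and return XML text."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        root.append(hadoop_property(name, value))
    return ET.tostring(root, encoding="unicode")


# The property name comes from the text above; "true" enables Impersonation in Hive.
snippet = render_configuration({"hive.server2.enable.doAs": "true"})
print(snippet)
```

The printed fragment is only meant to show the shape of the entry; it is not a complete hive-site.xml.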
In Big SQL versions prior to 4.2, the security model is very much like that of a secure RDBMS, where all the data belongs to the database server and all authorization control happens in the database. Under this model, it is recommended to make the bigsql user the owner of the Hive warehouse directory in HDFS, /apps/hive/warehouse, and to have Hive inherit permissions by setting the configuration property hive.warehouse.subdir.inherit.perms to true.
This mode of operation suits RDBMS-like usage. If all the data is produced and consumed by Big SQL, the non-Impersonation security model is secure and works well. If the tables in Big SQL will also be used in Hive, this security model still works, provided the Hive warehouse directory is readable, writable, and executable by the hadoop group that the bigsql and hive service users belong to. This is covered in detail at https://developer.ibm.com/hadoop/docs/biginsights-value-add/big-sql/maximize-hadoop-data-security-ibm-infosphere-biginsights/.
In Hadoop systems where data is shared among various services like Spark, Oozie, Hive, HBase, Big SQL, etc., it becomes difficult to adhere to an RDBMS-like security model, since the authorizations now need to be set up in all of those services. That brings the security setup down to the lowest common denominator, which is the underlying HDFS storage layer. In such cases, data could be produced by one service, consumed by another, and so on. One could also set up a chain of services, each taking as input the data produced by the previous one. For example, an ETL job could bring in data from outside Hadoop, Spark could cleanse it, and the Big SQL service could then be used for advanced analytics.
For such use cases, it makes sense to set up all the services to impersonate the end users and to set up all authorization controls in HDFS. Since the actual read/write operations are performed by the services impersonating the end users, HDFS becomes the one central place where all the authorization control happens.
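To make the difference concrete, here is a minimal Python sketch of how the identity presented to the storage layer changes the outcome of a permission check. It is a simplified model, not HDFS code: it covers only the owner/group/other read bits and ignores ACLs, superusers, and directory traversal. The users, groups, and directory metadata are hypothetical.

```python
from dataclasses import dataclass

READ = 0b100  # the "r" bit within one rwx triple


@dataclass(frozen=True)
class PathMeta:
    owner: str
    group: str
    mode: int  # POSIX-style bits, e.g. 0o750 = rwxr-x---


def can_read(user, groups, meta):
    """Pick the owner, group, or other triple, then test its read bit."""
    if user == meta.owner:
        bits = (meta.mode >> 6) & 0b111
    elif meta.group in groups:
        bits = (meta.mode >> 3) & 0b111
    else:
        bits = meta.mode & 0b111
    return bool(bits & READ)


def effective_user(service_user, end_user, impersonation):
    """With Impersonation the storage layer sees the end user, otherwise the service user."""
    return end_user if impersonation else service_user


# Hypothetical group memberships and table directory.
GROUPS = {"bigsql": {"hadoop"}, "bob": {"analysts"}, "carol": {"interns"}}
table_dir = PathMeta(owner="etl", group="analysts", mode=0o750)


def query_allowed(end_user, impersonation):
    user = effective_user("bigsql", end_user, impersonation)
    return can_read(user, GROUPS[user], table_dir)


print(query_allowed("bob", impersonation=True))    # True: bob is in the analysts group
print(query_allowed("carol", impersonation=True))  # False: carol only gets the "other" bits
print(query_allowed("bob", impersonation=False))   # False: the check is made for bigsql
```

In this model, turning Impersonation on moves the decision from the service user's permissions to each end user's permissions, which is why the same file permissions can then govern every service that reads the data.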
Refer to best practices for setting up permissions in HDFS with Impersonation enabled in Big SQL.
HDFS has a POSIX-style permissions model, including the ability to set up ACLs for rwx at the user or group level. By relinquishing authorization control to HDFS through Impersonation, you are limited to file- or directory-level granularity, which at best maps to the partition level of a table. You therefore lose the finer-grained security controls that Big SQL provides, such as row- and column-level security, label-based security, and view-based security.
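The granularity limit can be pictured with a small Python sketch. It assumes a hypothetical table named sales, partitioned by a region column and stored as one directory per partition under the warehouse path, with simplified per-user ACL entries. The function `readable_partitions` is illustrative only and does not call any HDFS API.

```python
# Hypothetical per-user ACL entries on each partition directory.
PARTITION_ACLS = {
    "/apps/hive/warehouse/sales/region=emea": {"alice": "r-x", "bob": "r-x"},
    "/apps/hive/warehouse/sales/region=apac": {"alice": "r-x"},
}


def readable_partitions(user, acls):
    """Return the partition directories whose ACL entry grants the user read access."""
    return sorted(
        path for path, entries in acls.items()
        if "r" in entries.get(user, "---")
    )


print(readable_partitions("alice", PARTITION_ACLS))  # both partitions
print(readable_partitions("bob", PARTITION_ACLS))    # only region=emea
```

A user either sees every row and column in a partition directory or none of them. There is no place in this model to express a rule such as hiding one column or filtering rows within a partition.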