Upgrade to Big SQL 4.2 with Impersonation
This article explains the steps involved in enabling Impersonation behavior in Big SQL 4.2 with schema and tables created without impersonation. The process is applicable while switching Impersonation on as well as on upgrade from a lower version where schema and tables would have been created without impersonation.
Why Upgrade to 4.2?
IBM BigInsights v4.2 has quite a few features that make it a treat to move to the latest and greatest. It has support for Express Upgrade from v4.1, which is the procedure used to bring all the services down for a brief period, make the upgrade from Ambari, and follow some steps for each value-add package to finish the quick upgrade.
One of the new features introduced in IBM BigInsights Big SQL v4.2 is Impersonation. It allows Big SQL to securely access data in HDFS and Hive on behalf of the connected user.
What is Impersonation in Big SQL?
When users connect to Big SQL, any operation they perform is actually performed by the bigsql user, which is the Big SQL service owner. If Impersonation is enabled, then Big SQL performs the operations in HDFS and Hive as the connected user. This allows for authorization control on the underlying HDFS directories to be restricted to the end users and not having to additionally include rwx permissions for bigsql user.
Make sure to have considered the requirements for and implications of enabling Impersonation from this article and other information online.
How to enable Impersonation after Upgrade?
Follow the steps for upgrading to v4.2.
After the upgrade is successful, edit the $BIGSQL_HOME/conf/bigsql-conf.xml file on the Big SQL Head Node. Update the property bigsql.alltables.io.doAs to true. This enables Impersonation behavior in Big SQL on the next restart.
There is also a property called bigsql.impersonation.create.table.grant.public, which can be set to true in the $BIGSQL_HOME/conf/bigsql-conf.xml file. This is to enable automatic grants to public while creating hadoop tables for insert, delete, update, and select operations from the Big SQL perspective, so that these operations are authorized only in HDFS.
It is very important to note that after switching Impersonation on in Big SQL, in order to allow for impersonated access to any existing tables, we need to make some changes to Hive and HDFS. We first need to identify which schema and tables reside in the cluster and then identify which permissions are granted to which users/groups and then replicate those in HDFS and Hive to make sure that the switch to Impersonation behavior is seamless.
After making the changes as described below, Big SQL should be restarted and all the tables should be synchronized using the HCAT_SYNC_OBJECTS procedure as described in ‚ÄúHow to enable Big SQL impersonation or switch impersonation on or off‚ÄĚ.
Setup Hive Warehouse Permissions
The Impersonation documentation lists a couple of best practices to set up permissions for the warehouse directory. Depending on your situation, update permissions on the Hive warehouse directory (typically hdfs://apps/hive/warehouse), so that users allowed to create schemas in Big SQL (and Hive) will be able to create corresponding directories under the warehouse directory.
If Big SQL Impersonation is enabled, it is also recommended that Hive Impersonation be enabled.
The first step would be to identify the schemas in Big SQL. With the non-impersonated Big SQL, connect as the bigsql user and run this SQL to list the schemas with their HDFS locations –
SELECT SCHEMANAME, HIVESCHEMA, LOCATION FROM SYSHADOOP.HCAT_SCHEMAS;
You might want to chown the schema location in HDFS to reflect the owner of the schema:
hdfs dfs -chown salesadmin /apps/hive/warehouse/sales.db
hdfs dfs -chown marketingadmin /apps/hive/warehouse/marketing.db
If Hive is also enabled to use Impersonation, then Hive Metastore authorization would need to be setup to use Storage Based Authorization. This would let Hive look at the HDFS permissions to identify the schema owner.
Setup Schema Permissions
If CREATEIN, ALTERIN, and DROPIN permissions have been granted on schemas to users/groups, then for the same to work with Impersonation, the corresponding read/write privileges need to be given in HDFS to the schema location. For this, you can either set the group memberships or add HDFS ACLs, or a combination of both.
Update group membership:
hdfs dfs -chgrp sales /apps/hive/warehouse/sales.db
or combine with the chown operation above:
hdfs dfs -chown salesadmin:sales /apps/hive/warehouse/sales.db
Schema could be setup to only allow read/execute permissions at schema level in HDFS:
hdfs dfs -chmod 750 /apps/hive/warehouse/sales.db
Then add ACLs for those that CREATEIN and ALTERIN have been granted:
hdfs dfs -setfacl -m ‚Äúuser:salesadmin:rwx‚ÄĚ /apps/hive/warehouse/sales.db
The next step would be to identify tables in Big SQL. With the non-impersonated Big SQL, connect as the bigsql user and run the following SQL to list the relevant details:
select TABSCHEMA, TABNAME, HIVESCHEMA, HIVETAB, OWNER, LOCATION from SYSHADOOP.HCAT_TABLES
This would list the schema and table names in Big SQL and Hive, the tables owners, and HDFS locations. For each of the tables, you might want to chown the table location and its contents in HDFS to reflect the table owner:
hdfs dfs -chown -R sales1 /apps/hive/warehouse/sales.db/region1
hdfs dfs -chown -R sales2 /apps/hive/warehouse/sales.db/region2
The -R option above will change owners for all partitions and data files under the table location.
Setup Table Permissions
If you have performed grants in Big SQL to allow other users/groups to perform certain operations on the tables, then for the same to work with Impersonation, the corresponding read/write privileges need to be given in HDFS to the table location and all its partitions and data files. For this, you can either set the group memberships or add HDFS ACLs, or a combination of both.
Note: If there are roles and grants done to roles, they cannot be directly mapped to HDFS ACLs. In such case, ACLs will need to be added for all the members of the role.
To identify grants in Big SQL, connect as user bigsql and execute this SQL:
select * from sysibmadm.privileges order by objectschema, objectname
Or, for each schema/table that you want to collect the grants in Big SQL, execute the following:
select * from sysibmadm.privileges where objectschema = ‚ÄėSALES‚Äô and objectname = ‚ÄėREGION1‚Äô
Update group membership:
hdfs dfs -chgrp -R sales /apps/hive/warehouse/sales.db/region1
or combine with the chown operation mentioned earlier:
hdfs dfs -chown -R sales1:sales /apps/hive/warehouse/sales.db/region1
Depending on the operations allowed, use chmod to allow read/write permissions in HDFS:
hdfs dfs -chmod -R 750 /apps/hive/warehouse/sales.db/region1
And add ACLs:
hdfs dfs -setfacl -R -m ‚Äúgroup:marketing:r-x‚ÄĚ /apps/hive/warehouse/sales.db
hdfs dfs -setfacl -R -m ‚Äúgroup:marketing:r-x‚ÄĚ /apps/hive/warehouse/sales.db/region1
Tables that were created from Big SQL before the upgrade when Impersonation was not enabled, are owned by the bigsql user. In order to enable any ADMIN operations in Hive for the real owner in Big SQL and any other users /groups that may have a CONTROL authority, they need to be given the ADMIN privilege in Hive.
Note about Views and Stored Procedures
Views could have been used to provide an additional layer of security on top of tables. Under non-impersonation mode, you could create a view and grant select on it to other users, instead of granting select on the underlying table to all the users. This allows for controlling permissions at a granular level, while still protecting the entirety of the data in the table.
Under Impersonation mode, we will perform the read operation as the connected user. So even if you have select privilege on the view, the connected user also needs the appropriate read permissions on the data located in HDFS.
The same also applies to stored procedures. Any operation that a stored procedure does will additionally have authorization checked by HDFS during the read/write action.
Thanks to Jyoti Jalvi for testing the commands listed above and other inputs for this blog article.