Introduction

Data security is not a new concept, and companies have always invested heavily in strong security controls to protect enterprise data. However, with the advent of Big Data technologies, where we deal with an endless flow of massive amounts of varied data, security concerns pose a greater challenge, because Hadoop systems were not originally conceived for trusted environments and security was not a design factor. Dealing with Big Data is a challenge in itself, and when we think of securing this data, our concerns multiply.

Security infrastructure around Big Data is evolving rapidly to address both internal and external threats, and Hadoop has evolved some general capabilities across different distributions to support security protocols such as Kerberos and LDAP, as well as basic access control methods within Hive, MapReduce, and HDFS. However, the level of security control that is needed to secure your Hadoop environment varies across industry sectors, and there are definite gaps in this area, whether the need is for RDBMS-like fine-grained access control, data encryption, or security compliance features.

IBM Big SQL offers a rich set of enterprise security features to protect your Hadoop data. In this article, we look at different ways of enhancing the security of your Hadoop cluster hosted on the IBM BigInsights platform with Big SQL.

This article is intended for system administrators who want to maximize security controls for their BigInsights Hadoop cluster by using appropriate authentication and authorization controls that are based on the principle of least privilege.

The principle of least privilege requires that a user be able to access only the information that is necessary for a specific and legitimate purpose. This article shows you how to do this for IBM Big SQL.

Authentication

End-user authentication: On any distributed system, authentication is the basic level of security control that validates a user’s identity before allowing access to the system. By default, Big SQL uses the underlying operating system security, or other security systems configured on the cluster, to perform end-user authentication. InfoSphere BigInsights leverages pluggable authentication modules (PAMs) and provides support for the following authentication types through PAM: LDAP, Kerberos, or flat-file authentication based on operating system security mechanisms.

Because most clients already have an LDAP server for authentication needs within their enterprise, you can easily configure BigInsights to use an existing corporate LDAP server to verify user attributes and user and group memberships within a BigInsights cluster. Alternatively, you can set up a new LDAP server for end-user authentication.

Service-to-service authentication: The Kerberos network protocol is designed to provide reliable authentication over open and insecure networks, where communication among hosts has a high probability of being intercepted. InfoSphere BigInsights has support for service-to-service authentication through Kerberos. If you plan to enable Kerberos, you need to have a working Kerberos key distribution center (KDC) set up before installation, and then you can use the installation wizard to configure Kerberos service-to-service authentication for your cluster.

Planning your BigInsights installation: As with any other product, it is important to define users, groups, and roles according to your organization structure and job functions prior to installation. The following sections will help you to understand the user and group definitions that you get from a default BigInsights installation from a security perspective, and include some useful tips on restricting access to follow the principle of least privilege. Differences in user and group definitions between the InfoSphere BigInsights 3.x and 4.x releases are outlined below.

Restricting HDFS superuser access

Table 1 lists the different service users that are created as part of BigInsights installation. You might notice that all the service users in a default installation of BigInsights 3.x are created with the BigInsights administration group biadmin as the primary group. The biadmin group is also configured as the HDFS/GPFS superuser group by default. Thus, all BigInsights service users hold HDFS superuser privileges by default and have direct file system access to Hadoop data files. They can read from or write to any file in the HDFS/GPFS. From a security perspective, it would be advisable to restrict superuser group access to just those users who genuinely need this level of file system access.

Table 1: System-reserved user names for services that are installed as part of InfoSphere BigInsights 3.0

Service owner    Group      UID    GID
biadmin          biadmin    200    123
hdfs             biadmin    201    123
mapred           biadmin    202    123
zookeeper        biadmin    203    123
hbase            biadmin    204    123
bigsql           biadmin           123
hive             biadmin    205    123
oozie            biadmin    206    123
monitoring       biadmin    220    123
httpfs           biadmin    221    123
console          biadmin    223    123
catalog          biadmin    224    123
alert            biadmin    225    123
orchestrator     biadmin    226    123

Ideally, you should determine beforehand which users and groups will be required to perform HDFS administration, BigInsights administration, or Big SQL administration tasks and configure accordingly during installation. However, if you choose to install with default options, you can still remove the biadmin group from the dfs.permissions.superusergroup property in the hdfs-site.xml file to revoke HDFS superuser privileges from all the BigInsights service users, including user bigsql. You can create a separate user group to handle superuser activities and add this group to the dfs.permissions.superusergroup property in the hdfs-site.xml file.
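For example, assuming that you created a dedicated group named hdfsadmin (a hypothetical name) for HDFS superuser activities, you would set the following property in the hdfs-site.xml file and restart the HDFS service:

dfs.permissions.superusergroup=hdfsadmin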

If you have GPFS set up, there is a different way of configuring the superuser group. GPFS defines the superuser group as the group that owns the /var/mmfs/bi directory on your BigInsights installation. To prevent all service users, including the bigsql service owner, from being cluster superusers, do not let biadmin own the /var/mmfs/bi directory.

These considerations do not apply to BigInsights 4.0 and higher, because there is no concept of a biadmin group in those releases, and service users do not have superuser access. Note that due to a current BigSheets restriction in BigInsights 3.x, the service user console must remain part of the Hadoop superuser group.

Restricting Big SQL Admin access

On a default installation of BigInsights 3.x, the bigsql user is also created with biadmin as the primary group. The primary group of the bigsql user is configured by default as the Big SQL administrator group (SYSADM_GRP) in the Big SQL database manager configuration file. This means that all users in the biadmin group automatically have SYSADM authority, which is the highest level of authority in the Big SQL database. Therefore, all service users also hold Big SQL administrator privileges by default, which is neither necessary nor desirable from a security perspective. It is important to restrict this access to a select number of users whose job function requires them to perform Big SQL administration tasks. It is therefore a good idea to create the Big SQL user, with passwordless SSH configured, prior to BigInsights installation. That user can be part of a different primary OS group, say bigsql, made up of only the Big SQL user and any other users who need to perform Big SQL administration activities. You would then set this group as the SYSADM_GRP in the Big SQL database manager configuration file.

Although the bigsql user would now have a different primary group, it’s still recommended that the bigsql user have biadmin group membership as well. This is because Big SQL shares the Hive warehouse directory with the hive service user, and all the files and directories within this Hive warehouse directory are owned by the biadmin group by default. The Big SQL and Hive service users need to share this common group that owns the files and directories under the common workspace (that is, the Hive warehouse directory) to be able to seamlessly operate on the data files there.

The following command sets the new group as the Big SQL administrator group:

db2 update dbm cfg using SYSADM_GRP 'bigsql'

By making the bigsql user part of a separate group (bigsql), you restrict the set of users who can perform Big SQL administration tasks. Ideally, bigsql should be the only user in the bigsql group, unless you want to configure multiple users with Big SQL instance owner capabilities.

With InfoSphere BigInsights 4.0 and higher, there is no biadmin group anymore, and service users are not part of the Big SQL administrator group by default.

Authorization

The next level of security pertains to objects and functions within the Big SQL database. There are several layers of authorization within Big SQL, each of which brings in a deeper level of security control. Authorities, privileges, and roles are a familiar way of providing access control through grant and revoke operations in the relational database world, and you can use this approach in Big SQL too. Big SQL provides a sophisticated and mature security infrastructure, and if Big SQL is your only access method, you can configure it to perform all access control. If you have users who also need to access your Hadoop data through Hive or other interfaces, then you might need to configure those interfaces accordingly. Hive security authorization is disabled by default, and on BigInsights 3.x it is recommended that you keep it disabled (see the discussion of impersonation later in this article).

Let’s look at securing data access through Big SQL. It is important to remember that Big SQL uses the Hive warehouse directory (/biginsights/hive/warehouse/) as the default workspace for all its operations (as configured during BigInsights installation), and it holds the necessary Hadoop file system-level permissions to seamlessly operate in this directory. It creates all the schemas and data files under this directory. If you choose to create a separate Big SQL administrator group instead of the biadmin group, as recommended earlier, it is important to ensure that the Big SQL administrator group has appropriate HDFS/GPFS privileges to operate in this particular path. The default permissions for the Hive warehouse directory are read and write permissions for the owner and the group, and read permission for all others. This means that regular users can also work in this directory space; therefore, by default, access control to the Hive warehouse directory is not very secure. It is recommended that you restrict Hadoop permissions and ownership of this directory as appropriate for your environment.
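For example, the following command removes all access for other users from the warehouse directory (a minimal sketch; choose the mode, and whether to apply it recursively, to suit your environment):

hadoop fs -chmod 770 /biginsights/hive/warehouse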

Authorities determine what particular type of action can be taken against the entire database; for example, connecting to the Big SQL database, which requires CONNECT authority, or creating tables in a particular database, which requires CREATETAB authority.

Privileges are permissions that are associated with specific database objects such as tables or views. For example, you need the SELECT privilege on a table to access data from that table.

Big SQL also provides database roles for the simplified management of authorities and privileges in Big SQL. A database role is a database object that groups together one or more privileges or database authorities and that can be granted to users, groups, or PUBLIC. Role membership is managed by a special authority level named SECADM, and this authority can be delegated to others by using the WITH ADMIN OPTION clause on the GRANT (role) statement. Authorities, privileges, and roles can all be efficiently administered by using GRANT and REVOKE statements.
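As a brief sketch of this pattern (the role, table, and user names are hypothetical), a security administrator could bundle a privilege into a role and delegate administration of its membership as follows:

db2 "CREATE ROLE ANALYSTS"
db2 "GRANT SELECT ON HADOOP.SALES TO ROLE ANALYSTS"
db2 "GRANT ROLE ANALYSTS TO USER newton WITH ADMIN OPTION"

User newton now holds SELECT on HADOOP.SALES through the role and can grant the ANALYSTS role to other users.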

PUBLIC is a special group within Big SQL that can be thought of as a group for all users (both present and future) on the server. By default, PUBLIC holds the following authorities and privileges on a Big SQL database:

  • CONNECT
  • CREATETAB
  • IMPLICIT_SCHEMA
  • EXECUTE WITH GRANT OPTION on all functions (SYSFUN and SYSHADOOP)
  • BIND and BINDADD

What Big SQL database authorities and privileges do users get by default?

A Big SQL database is set up in a nonrestrictive mode with many authorities granted implicitly to PUBLIC during installation. This means that any user who can successfully authenticate to the system inherits some authorities and privileges by default, and that user can do things such as connect to the database, create a new schema, create new objects, execute different system functions, and so on. In this way, any user on your cluster can quickly start exploring and using Big SQL soon after installation, without having to wait for explicitly granted ability to connect and work with the database. From a security perspective, you would want to restrict this access, and you can easily revoke these authorities from PUBLIC by using the REVOKE statement. In the following example, users are prevented from being able to connect to the database. After the CONNECT authority has been revoked from PUBLIC, no user can connect to the database unless the user is explicitly granted the CONNECT authority or belongs to another user group that holds this or a higher authority.

db2 "REVOKE CONNECT ON DATABASE FROM PUBLIC"

Any user who can connect to a Big SQL database also has IMPLICIT_SCHEMA authority by default. With this authority, a connected user can create an object by specifying a non-existing schema, and the schema (owned by SYSIBM) is implicitly created as part of object creation. PUBLIC can create new database objects within any such schema.

If you want to control the creation of schemas and restrict the creation of table objects to certain schemas, it is recommended that you revoke the IMPLICIT_SCHEMA authority from PUBLIC.
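For example:

db2 "REVOKE IMPLICIT_SCHEMA ON DATABASE FROM PUBLIC"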

When you revoke the IMPLICIT_SCHEMA authority from PUBLIC, there are only three ways that a schema object can be created.

  • Any user can create a schema by using their own authorization name on a CREATE SCHEMA statement.
  • Any user with DBADM authority can explicitly create any schema that does not already exist, and can optionally specify another user as the owner of that schema.
  • Any user with DBADM authority has IMPLICIT_SCHEMA database authority, so that they can implicitly create a schema with any name when they are creating other database objects. SYSIBM becomes the owner of the implicitly created schema and PUBLIC can create objects in that schema.

Any user who can successfully authenticate also holds CREATETAB authority as a member of the PUBLIC group, and the table creator automatically gets the CONTROL privilege on the table. As a result, any user can create a table and then insert into, select from, or even drop it. You can revoke CREATETAB from PUBLIC and control table creation by restricting it to users with explicitly granted CREATETAB authority.
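For example:

db2 "REVOKE CREATETAB ON DATABASE FROM PUBLIC"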

As mentioned, PUBLIC also holds the EXECUTE WITH GRANT OPTION privilege by default, which means that any user who can connect to a Big SQL database can also execute system functions and query data from various system views, including those in the SYSHADOOP schema, which are specific to Big SQL databases.

You can obtain a consolidated list of privileges that are held implicitly by PUBLIC by running the following SQL statement:

$ db2 "SELECT PRIVILEGE, cast(OBJECTNAME as varchar(25)) as OBJECTNAME,

 	 	 cast(OBJECTSCHEMA as varchar(15)) as OBJECTSCHEMA, OBJECTTYPE

 	 	 from SYSIBMADM.PRIVILEGES where AUTHID='PUBLIC' and OBJECTSCHEMA='SYSHADOOP'

 	 	 ORDER BY OBJECTNAME"
PRIVILEGE OBJECTNAME OBJECTSCHEMA OBJECTTYPE

 	 	----------- ------------------------- --------------- ------------------------

 	 	SELECT APPLICATIONS SYSHADOOP VIEW

 	 	EXECUTE BIGSQL_DDL SYSHADOOP PROCEDURE

 	 	EXECUTE BIGSQL_DDL_OLD SYSHADOOP PROCEDURE

 	 	EXECUTE BIGSQL_ENV SYSHADOOP FUNCTION

 	 	SELECT BIGSQL_PROPERTIES SYSHADOOP VIEW

 	 	EXECUTE BIGSQL_SELECT SYSHADOOP PROCEDURE

 	 	READ CATALOG_SYNC_MODE SYSHADOOP GLOBAL VARIABLE

 	 	WRITE CATALOG_SYNC_MODE SYSHADOOP GLOBAL VARIABLE

 	 	READ COMPATIBILITY_MODE SYSHADOOP GLOBAL VARIABLE

 	 	WRITE COMPATIBILITY_MODE SYSHADOOP GLOBAL VARIABLE

 	 	READ FEATURE_VECTOR SYSHADOOP GLOBAL VARIABLE

 	 	WRITE FEATURE_VECTOR SYSHADOOP GLOBAL VARIABLE

 	 	EXECUTE HCAT_CACHE_SYNC SYSHADOOP PROCEDURE

 	 	SELECT HCAT_COLUMNS SYSHADOOP VIEW

 	 	EXECUTE HCAT_DESCRIBESCHEMA SYSHADOOP FUNCTION

 	 	EXECUTE HCAT_DESCRIBETAB SYSHADOOP FUNCTION

 	 	SELECT HCAT_SCHEMAPROPS SYSHADOOP VIEW

 	 	SELECT HCAT_SCHEMAS SYSHADOOP VIEW

 	 	EXECUTE HCAT_SYNC_OBJECTS SYSHADOOP PROCEDURE

 	 	SELECT HCAT_TABLEPARTS SYSHADOOP VIEW

 	 	SELECT HCAT_TABLEPROPS SYSHADOOP VIEW

 	 	SELECT HCAT_TABLES SYSHADOOP VIEW

 	 	SELECT HCAT_TABLESERDEPROPS SYSHADOOP VIEW

 	 	EXECUTE HMETA_SCHEMA SYSHADOOP FUNCTION

 	 	EXECUTE HMETA_SCHEMAPROPS SYSHADOOP FUNCTION

 	 	EXECUTE HMETA_TAB SYSHADOOP FUNCTION

 	 	EXECUTE HMETA_TABCOLS SYSHADOOP FUNCTION

 	 	EXECUTE HMETA_TABPARTS SYSHADOOP FUNCTION

 	 	EXECUTE HMETA_TABPROPS SYSHADOOP FUNCTION

 	 	EXECUTE HMETA_TABSERDEPROPS SYSHADOOP FUNCTION

 	 	READ PROPERTY SYSHADOOP GLOBAL VARIABLE

 	 	WRITE PROPERTY SYSHADOOP GLOBAL VARIABLE

 	 	EXECUTE SQL150611000715418 SYSHADOOP FUNCTION

 	 	EXECUTE SQL150611000715619 SYSHADOOP FUNCTION

 	 	 34 record(s) selected.

The HMETA_*, SQL*, BIGSQL_DDL, and BIGSQL_DDL_OLD functions and FEATURE_VECTOR and PROPERTY global variables that are listed in this output are for internal use by Big SQL only.

It is recommended that you restrict access to the HCAT_SYNC_OBJECTS and HCAT_CACHE_SYNC functions by revoking the EXECUTE privilege on these objects from PUBLIC and granting it only to users or groups that require it. For example:

REVOKE EXECUTE ON PROCEDURE SYSHADOOP.HCAT_SYNC_OBJECTS FROM PUBLIC RESTRICT;
GRANT EXECUTE ON PROCEDURE SYSHADOOP.HCAT_SYNC_OBJECTS TO GROUP DATAADMINS;

User impersonation for authorization control

User impersonation means that a user can perform activities on behalf of another user. A simple example is the sudo command in Linux. This is not a new concept, and it is used widely across different products.

Big SQL does not use impersonation for authorization. This means that all HDFS or GPFS access within Big SQL, including all read and write operations, is performed by the bigsql user and not by the connected user. This currently applies to both system-managed and external tables.

Unlike Big SQL, Hive impersonation is enabled by default in the BigInsights 3.0 environment, so Hive can run queries as the connected user and create files or directories on HDFS or GPFS that are owned by end users, based on user privileges. There are definite advantages and disadvantages associated with impersonation. On BigInsights 3.x releases (which include Hive version 0.12), if the underlying data needs to be accessed by both Big SQL and Hive, it is recommended that you keep this setting as is; in other words, run with Hive impersonation enabled so that permission checking happens at the HDFS or GPFS level.

BigInsights 4.0 and higher comes with Hive 0.14, which has better authorization controls, so it is no longer necessary to rely on HDFS for authorization checks. Because both services, Hive and Big SQL, can perform authorization themselves, authorization controls should be set up at the service level. This way, the underlying data can be made accessible only to the service owners (hive and bigsql), and each service owner makes it available only to the users who are authorized to access the data. Any service that performs real authorization should turn off impersonation. Hence, the recommendation is to turn off Hive impersonation on BigInsights 4.0 or higher by setting the hive.server2.enable.doAs parameter to false.
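For example, in the hive-site.xml file:

hive.server2.enable.doAs=false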

When you disable impersonation, also make sure that you enable SQL standard-based Hive authorization by setting the following parameters in hive-site.xml:

hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory

hive.security.authorization.enabled=true

hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator

Separation of duties with Big SQL built-in authorities

Separation of duties is a classic method of enforcing the principle of least privilege by grouping users according to job function and granting only the minimum required privileges. When separation of duties is done well, it can really help to safeguard your cluster from various kinds of security breaches, especially from insider threats.

Big SQL can effectively perform separation of duties by using a set of built-in authorities that simplify the assignment of different roles to users based on their job requirements.

Sample DBA use case

Traditionally, the database administrator (DBA) role is granted all sorts of authorities and privileges so that the DBA can do whatever is required to fix any type of database issue. Consequently, the DBA often has access to sensitive data and can grant privileges or authorities to other users. DBAs can even grant certain privileges or authorities to themselves. This can lead to compromised data security, because anyone with the DBA role could misuse the privileges or authorities that they hold.

As mentioned earlier, Big SQL is backed by mature database technology that can efficiently handle the separation of duties. For example, DBADM is an administrative authority for the Big SQL database that can be granted or revoked only by the security administrator; that is, someone who holds SECADM authority. This authority cannot be granted to PUBLIC.

For example, a security administrator can grant DBADM authority as follows:

GRANT DBADM ON DATABASE TO USER mydba;

When you grant DBADM authority, DATAACCESS and ACCESSCTRL authorities are also granted to the user. This means that the user now has access to all of the data in your database tables and can grant or revoke permissions for other users as well. If your DBA is expected to adhere to database administration functions only, you can revoke the DATAACCESS and ACCESSCTRL authorities that are implicitly granted with DBADM, as shown in the following example:

REVOKE DATAACCESS, ACCESSCTRL ON DATABASE FROM USER mydba;

Alternatively, you can grant DBADM authority without the DATAACCESS and ACCESSCTRL authorities in a single statement, as shown in the following example. (Using two separate GRANT DBADM statements, one with WITHOUT DATAACCESS and one with WITHOUT ACCESSCTRL, would not achieve this, because each statement would implicitly grant the authority that the other omits.)

GRANT DBADM WITHOUT DATAACCESS WITHOUT ACCESSCTRL ON DATABASE TO USER mydba;

Similarly, every role should be created to match job responsibilities, with privileges or authorities that are assigned only to users or groups that have a genuine need for them.

There are both instance-level and database-level authorities, but given that InfoSphere BigInsights comes with a single Big SQL instance, we will focus on database-level authorities. The following section describes the built-in authorities; apart from these, you can easily create your own custom roles with specific privileges or authorities to cater to specific requirements.

Built-in database authorities

  • SECADM is for users who perform security administration for the Big SQL database.
  • DBADM is for users who administer a Big SQL database. This authority cannot be granted to PUBLIC.
  • ACCESSCTRL is for users who need to grant or revoke authorities and privileges except for SECADM, DBADM, ACCESSCTRL, and DATAACCESS authority, which can be granted or revoked by SECADM only.
  • DATAACCESS is for users who need to access any data in the Big SQL database.
  • SQLADM is for users who monitor and tune Big SQL queries.
  • WLMADM is for users who manage Big SQL workloads.
  • EXPLAIN is for users who need to explain query plans. EXPLAIN authority does not give access to the data itself.
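For example, to let a performance analyst (the user name is hypothetical) examine query plans without being able to read the data itself:

db2 "GRANT EXPLAIN ON DATABASE TO USER perfuser"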

View-based access control

Traditionally, database views have been used to control access to data by hiding or masking sensitive columns or providing a filtered set of rows. A view and its underlying table can have totally different access control policies. This provides a static way to control access and a primitive form of row- and column-level access control. You can also exploit views in Big SQL as a simple method of enforcing fine-grained access control.
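As a minimal sketch (the table, view, column, and group names are hypothetical), the following statements hide a sensitive SALARY column by granting access to a view rather than to the underlying table:

db2 "CREATE VIEW EMP_GENERAL AS SELECT EMP_NO, FIRST_NAME FROM HADOOP.EMP"
db2 "GRANT SELECT ON EMP_GENERAL TO GROUP STAFF"

Members of the STAFF group can query EMP_GENERAL but, as long as they hold no privileges on HADOOP.EMP itself, they never see the SALARY column.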

However, there are certain drawbacks with using views in this way:

  • Queries must be modified to use view names instead of table names.
  • Views cannot be parameterized.
  • You might have to create many views to cater to different scenarios.

Big SQL provides a completely new way to control data access at the row or column level that is built into the product in the form of row- and column-based access control (RCAC), sometimes referred to as fine-grained access control (FGAC).

Row- and column-based access control (RCAC)

Big SQL provides a state-of-the-art, built-in fine-grained access control method called RCAC, which complements the table privileges model. RCAC is an additional layer of data security that controls access at the row level, the column level, or both, and it is defined and managed by the security administrator (SECADM). With RCAC, you can be assured that users can access only the data that is required for their work. RCAC also overcomes the drawbacks that are associated with views, discussed earlier: defined RCAC rules apply to all users of the database, irrespective of their authority levels, and no matter how you access the data, be it through query tools, applications, or reporting tools, everything is subject to RCAC rules enforcement. Unlike views, RCAC rules are declarative, and no application changes are required to take advantage of this additional security layer.

The following items are examples of row-level access control based on customer rules:

  • A doctor can see the medical records of only his or her patients.
  • A manager can see only the rows that represent his or her employees.

The following items are examples of column-level access control (data masking) based on customer rules:

  • A teller can see only the last 4 digits of a credit card number.
  • A customer service representative can see only the last 4 digits of a social security number.

Example 1: Row-based access control

This example is based on the following data:

SELECT "*" FROM BRANCH_TBL


 	 	EMP_NO FIRST_NAME BRANCH_NAME

 	 	------- ------------ -----------

 	 	1 Steve Branch_B

 	 	2 Chris Branch_A

 	 	3 Paula Branch_A

 	 	4 Craig Branch_B

 	 	5 Pete Branch_A

 	 	6 Stephanie Branch_B

 	 	7 Julie Branch_B

 	 	8 Chrissie Branch_A

 	 	

Note: Steps 1, 2, and 3 are performed by a user with SECADM authority.

  1. Create and grant access and roles:
     CREATE ROLE BRANCH_A_ROLE
     GRANT ROLE BRANCH_A_ROLE TO USER newton
     GRANT SELECT ON BRANCH_TBL TO USER newton
  2. Create permissions:
     CREATE PERMISSION BRANCH_A_ACCESS ON BRANCH_TBL
     FOR ROWS WHERE (VERIFY_ROLE_FOR_USER (SESSION_USER,'BRANCH_A_ROLE') = 1
     AND
     BRANCH_TBL.BRANCH_NAME = 'Branch_A')
     ENFORCED FOR ALL ACCESS
     ENABLE
  3. Enable access control:
     ALTER TABLE BRANCH_TBL ACTIVATE ROW ACCESS CONTROL
  4. Select as a Branch_A user:
     CONNECT TO TESTDB USER newton
     SELECT * FROM BRANCH_TBL

     EMP_NO FIRST_NAME BRANCH_NAME
     ------ ---------- -----------
     2      Chris      Branch_A
     3      Paula      Branch_A
     5      Pete       Branch_A
     8      Chrissie   Branch_A

     4 record(s) selected.

Example 2: Column-based access control

This example is based on the following data:



 	 	SELECT "*" FROM SAL_TBL


 	 	EMP_NO FIRST_NAME SALARY

 	 	------ ------------ -----------

 	 	1 Steve 250000

 	 	2 Chris 200000

 	 	3 Paula 1000000

 	 	
  1. Create and grant access and roles:
     CREATE ROLE MANAGER
     CREATE ROLE EMPLOYEE
     GRANT SELECT ON HADOOP.SAL_TBL TO USER socrates
     GRANT SELECT ON HADOOP.SAL_TBL TO USER newton
     GRANT ROLE MANAGER TO USER socrates
     GRANT ROLE EMPLOYEE TO USER newton
  2. Create the column mask:
     CREATE MASK SALARY_MASK ON HADOOP.SAL_TBL
     FOR COLUMN SALARY RETURN
     CASE WHEN VERIFY_ROLE_FOR_USER (SESSION_USER,'MANAGER') = 1
       THEN SALARY
       ELSE 0.00
     END
     ENABLE
  3. Enable access control:
     ALTER TABLE HADOOP.SAL_TBL ACTIVATE COLUMN ACCESS CONTROL
  4. Select as an EMPLOYEE:
     CONNECT TO TESTDB USER newton
     SELECT * FROM HADOOP.SAL_TBL

     EMP_NO FIRST_NAME SALARY
     ------ ---------- -------
     1      Steve      0
     2      Chris      0
     3      Paula      0

     3 record(s) selected.

     Select as a MANAGER:
     CONNECT TO TESTDB USER socrates
     SELECT * FROM HADOOP.SAL_TBL

     EMP_NO FIRST_NAME SALARY
     ------ ---------- -------
     1      Steve      250000
     2      Chris      200000
     3      Paula      1000000

     3 record(s) selected.

Additional configuration parameters that can be used to increase Big SQL security

Table 2: Database and database manager configuration parameters and their effects on Big SQL security

Action: Set the database manager configuration parameter discover to DISABLE.
Effect: Prevents Big SQL clients from making discovery requests to detect the Big SQL instance.

Action: Set the database manager configuration parameter discover_inst to DISABLE.
Effect: Prevents anyone who is using DB2 Discovery from detecting the Big SQL instance.

Action: Set the database configuration parameter discover_db to DISABLE.
Effect: Prevents Big SQL information from being returned to a client when a discovery request is received at the server.

Action: Set the database manager configuration parameter catalog_noauth to NO.
Effect: Prevents users who do not hold SYSADM authority from cataloging or uncataloging the Big SQL database or nodes.

Action: Set the database manager configuration parameter federated to NO.
Effect: Disables support for applications submitting distributed requests for data that is managed by Big SQL.

Action: Set the database manager configuration parameter fed_noauth to NO.
Effect: Disables the bypassing of federated authentication at the Big SQL instance.
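As a sketch, these settings can be applied with standard db2 update commands (this assumes your Big SQL database is named bigsql; substitute your own database name):

db2 update dbm cfg using DISCOVER DISABLE DISCOVER_INST DISABLE
db2 update dbm cfg using CATALOG_NOAUTH NO FEDERATED NO FED_NOAUTH NO
db2 update db cfg for bigsql using DISCOVER_DB DISABLE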

Securing access to external data

Big SQL, like Hive, has the concept of external tables. External tables in Big SQL are tables whose data is managed outside of Big SQL. The data can be anywhere on the Hadoop file system outside of the Big SQL workspace in the Hive warehouse directory, and it can be owned by multiple owners, such as Hive, Big SQL, or even HBase. During external table creation, Big SQL creates a reference to this raw data in its catalog entries (by using the LOCATION clause).

When you drop an external table, only the definition in the catalog is dropped, not the actual raw data on the underlying file system. This can create several data integrity issues, because such tables are not managed by Big SQL like the tables that are created in the Hive warehouse directory. Additional security controls are needed to prevent users from creating external tables on top of data that they are not authorized to access.

It is important to understand that access to Big SQL objects and functions within the Hive warehouse directory is enforced by Big SQL GRANT and REVOKE security policies. However, for external tables, access is enforced through a combination of Big SQL Security Administration and file system security management. Let’s look at the Big SQL statements that deal with external tables and are governed by both Big SQL GRANT or REVOKE security policies and file system-level access control. These additional security policies for external tables are controlled by a set of configuration properties that you can enable or disable as needed.

Steps to improve security for external tables by using the LOCATION clause

The following properties are added to the bigsql-defaults.xml configuration file during installation, and you have the option to change the default values.

      • bigsql.validate.createtab.location.permissions: This property applies to the creation of external tables, which refer to existing tables elsewhere on the Hadoop system. By default, this property is set to TRUE. The connected user needs to have either read+write+execute permissions on the specified HDFS location or hold DATAACCESS authority on the database. If this property is set to FALSE, a CREATE EXTERNAL table statement that specifies the LOCATION clause will not have this additional validation and proceeds strictly on the basis of database-level privileges to create the table. The same applies to ALTER TABLE statements that specify the LOCATION clause.
        • CREATE EXTERNAL HADOOP TABLE…LOCATION
        • ALTER TABLE…SET LOCATION
        • ALTER TABLE…ADD PARTITION
      • bigsql.allow.createtab.managed.location: This property applies to the creation of managed (not external) tables. When you specify a LOCATION clause with a managed table, Big SQL allows you to create the managed table in any location outside of the default Big SQL working directory. By default, this property is set to TRUE. If this property is set to FALSE, you cannot create a managed table with the LOCATION clause. Avoid creating multiple managed tables in the same location, because this greatly increases the likelihood of data loss during a drop operation. Unless you have specific use cases that require the creation of managed tables outside of the Hive warehouse directory, set this property to FALSE.
        • CREATE HADOOP TABLE…LOCATION
      • bigsql.validate.droptab.location.permissions: This property applies to drop operations on managed tables within Big SQL. If this property is set to TRUE, the connected user needs to have either write+execute permissions on the table location and its subdirectories and files or hold DATAACCESS authority on the database. If this property is set to FALSE, dropping the managed table proceeds without this additional validation and the table data is deleted.
        • DROP TABLE
      • bigsql.validate.load.srcfile.permissions: This property applies to load operations that use the FILE option. By default, this property is set to TRUE, which means that the invoking user needs to have either read+execute permissions on the specified file location and its subdirectories and files or hold DATAACCESS authority on the database. If this property is set to FALSE, the load operation proceeds without this additional validation.
        • LOAD HADOOP USING FILE…
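For example, to disable the creation of managed tables outside of the default Big SQL working directory, as suggested above, you would set the following in the bigsql-defaults.xml configuration file:

bigsql.allow.createtab.managed.location=false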

If the bigsql user does not have Hadoop superuser group membership, you need to ensure that the user has read+write+execute permissions on all external locations that are in use on your cluster.

Sticky bit considerations for HDFS

You can set the sticky bit on HDFS directories, which prevents any user other than a superuser, the directory owner, or the file owner from deleting, moving, renaming, or altering the files in an HDFS directory. Big SQL honors the sticky bit setting on HDFS directories.
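For example, the following command sets the sticky bit on a shared directory (the path is hypothetical):

hadoop fs -chmod +t /user/shared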

Security compliance with Big SQL native auditing

You can use a variety of authentication and access control mechanisms to establish rules and controls for acceptable data access. However, data security needs do not end there. You also need to protect against unknown or unacceptable access behaviors, and for that you need to monitor data access for potential malicious activity.

The native auditing facility that is built into Big SQL 3.0 enables you to monitor and audit your Big SQL activities and generate audit records on who did what, where, when, and how.

      • Who: Authorization or user ID tracking
      • What: SQL statement text tracking
      • Where: Application and IP address tracking
      • When: Event timestamp tracking
      • How: Authorization checks tracking

The native auditing facility helps you comply with SOX requirements by auditing DDL, DML, and DCL changes, in addition to reporting any security exceptions.
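As a sketch (the policy name is hypothetical, and this assumes the standard DB2 audit facility on which Big SQL is built), a security administrator could create and activate an audit policy as follows:

db2 "CREATE AUDIT POLICY BIGSQLAUDIT CATEGORIES EXECUTE STATUS BOTH, CHECKING STATUS BOTH ERROR TYPE NORMAL"
db2 "AUDIT DATABASE USING POLICY BIGSQLAUDIT"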

You can use the native auditing facility to enforce primary security compliance and governance measures or consider the InfoSphere Guardium Data Security product if you are looking for a complete security compliance solution.

Data encryption

Encryption is another effective option for achieving data security, one that is especially important for Big Data systems that deal with structured, semi-structured, and unstructured data formats. To protect sensitive data, it is critical to encrypt both data in transit and data on disk. With encryption, you need a secret key or a password to decrypt the information and access it in readable form. The security of sensitive data can easily be threatened by intruders “sniffing” a vulnerable network. SSL is a popular way of providing secure network data transmission, and Big SQL provides encryption for data in transit with SSL support. It is recommended that you configure SSL to protect the confidentiality and integrity of data moving between a client and the Big SQL server.
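As a sketch of the server-side SSL setup (the key database paths, port, and instance name are hypothetical), SSL is enabled through the database manager configuration and the DB2COMM registry variable:

db2 update dbm cfg using SSL_SVR_KEYDB /path/to/server.kdb SSL_SVR_STASH /path/to/server.sth SSL_SVCENAME 52001
db2set -i bigsql DB2COMM=SSL,TCPIP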

You might also want to consider options for encrypting sensitive data on disk. InfoSphere Guardium offers an on-disk data encryption capability for both HDFS and GPFS, and it is fully supported with Big SQL 3.0 and higher.

General security guidelines for Big SQL

Security has always been about protecting different layers of access to your enterprise infrastructure, and this is equally true in the Hadoop world. We have discussed ways to implement different layers of security for your Hadoop data hosted on InfoSphere BigInsights, and these guidelines are summarized in the following list.

      • Begin with the following actions:
        • Assess your system security needs.
        • Classify different users based on their job roles and determine required authorities or privileges.
        • Understand the security threats that apply to your environment at different layers of access.
        • Compile a set of security measures that are needed to counter identified threats.
        • Define your organization’s security policies and business rules by following the principle of least privilege.
      • Secure the root account.
      • Edit the /etc/sudoers file to include only required users.
      • Restrict access to the Hadoop superuser group.
      • Restrict access to the BigInsights administrative group (biadmin) if using BigInsights 3.0.
      • Restrict access to the Big SQL administrator group.
      • Restrict access to the Big SQL log files.
      • Set up end-user authentication by using LDAP or Kerberos, and additionally exploit service-to-service authentication by using Kerberos.
      • Ensure that the Big SQL user has full access to the Hive warehouse directory (/biginsights/hive/warehouse) and any other external location that might hold data for Big SQL tables.
      • Ensure that Hadoop ownership and permissions for Big SQL data files under the Hive warehouse directory and any other external location are appropriately restricted.
      • Consider having non-default service ports and non-default user names to ward off casual security threats. You can choose to change the ports during installation.
      • Revoke implicit privileges on the Big SQL database from PUBLIC.
      • Leverage built-in Big SQL authorities, privileges, and roles to implement efficient separation of duties according to your organizational structure.
      • Explore the use of traditional view-based access control to restrict access to sensitive data.
      • Implement fine-grained access control at the row and column level with RCAC.
      • Protect your cluster by defining security policies for monitoring your database activities with the native Big SQL auditing feature.
      • Use SSL to move data securely across the network.
      • Consider InfoSphere Guardium to implement a total security compliance and governance solution.

Acknowledgements

The author would like to thank Bert Van Der Linden and Uday B. Kale for assistance with content and technical review, and Mick Legare for the RCAC example.

Useful links