Constraints are a very powerful feature that can greatly improve the performance of SQL queries. In Big SQL, all constraints that are associated with external tables (HADOOP or HBASE tables) are considered to be informational only (i.e. they are defined as NOT ENFORCED). If the data is guaranteed to adhere to the defined constraints, usage of informational constraints is highly recommended since the performance benefits of using constraints can be quite significant.
Type of Constraints
Constraints are associated with tables, and are either defined as part of the table creation process (using the CREATE TABLE statement) or are added to a table’s definition after the table has been created (using the ALTER TABLE statement). You can also use the ALTER TABLE statement to modify constraints. In addition to the constraint definition, attributes are associated with the constraint to specify if the constraint is enforced and whether the constraint is to be used for query optimization or not. Constraints can be dropped at any time; this action does not affect the table’s structure or the data stored in it.
There are several types of constraints:
- A NOT NULL constraint is a rule that prevents null values from being entered into one or more columns within a table.
- A unique constraint (also referred to as a unique key constraint) is a rule that forbids duplicate values in one or more columns within a table. For example, a unique constraint can be defined on the supplier identifier in the supplier table to ensure that the same supplier identifier is not given to two suppliers.
- A primary key constraint is a column or combination of columns that has the same properties as a unique constraint. You can use a primary key and foreign key constraints to define relationships between tables.
- A foreign key constraint (also referred to as a referential constraint or a referential integrity constraint) is a logical rule about values in one or more columns in one or more tables. For example, given a part and supplier table, where each part in the part table contains a column identifying the supplier id of that part, a foreign key constraint can be defined which link the supplier id in the parts table to the supplier id in the supplier table.
- A (table) check constraint (simply called a check constraint) sets restrictions on data added to a specific table. For example, a table check constraint can ensure that the salary level for an employee is at least $20,000 whenever salary data is added or updated in a table containing personnel information.
- A functional dependency constraint defines a functional dependency between columns and can be defined using the following clause:
(column-name,...) DETERMINED BY (column-name,...)
A functional dependency is not enforced by the database manager during normal operations such as insert, update, delete, or set integrity. The functional dependency might be used during SQL query compilation to optimize queries.
Constraints on external tables (HADOOP and HBASE)
All constraints that are associated with external tables (HADOOP or HBASE tables) are considered to be informational only (NOT ENFORCED). An informational constraint is a constraint attribute that can be used by the SQL compiler to improve the access to data. Informational constraints are not enforced, and are not used for additional verification of data; rather, they are used to improve query performance. In some cases, the performance benefits can be quite significant
IMPORTANT: Since the constraints are not enforced, if the table data violates the not-enforced constraint, incorrect results can occur, so it is critical for the data to adhere to the constraint definition.
The following example illustrates the use of informational constraints and how they work. This simple table contains information on applicants’ age and gender:
CREATE HADOOP TABLE APPLICANTS ( AP_NO INTEGER NOT NULL, GENDER CHAR(1) NOT NULL, CONSTRAINT GENDEROK CHECK (GENDER IN ('M', 'F')) NOT ENFORCED ENABLE QUERY OPTIMIZATION, AGE INTEGER NOT NULL, CONSTRAINT AGEOK CHECK (AGE BETWEEN 1 AND 65) NOT ENFORCED ENABLE QUERY OPTIMIZATION, );
This example contains two clauses that change the behavior of the column constraints. The first option is NOT ENFORCED, which instructs the database manager not to enforce the checking of this column when data is inserted or updated. The second option is ENABLE QUERY OPTIMIZATION which is used by the database manager when optimizing the SQL and can greatly improve the performance of SQL queries referencing this table.
If the table contains the NOT ENFORCED option, the behavior of insert statements may appear odd. The following SQL will not result in any errors when run against the APPLICANTS table:
INSERT INTO APPLICANTS VALUES (1, 'M', 54), (2, 'F', 38), (3, 'M', 21), (4, 'F', 89), (5, 'C', 10), (6, 'S', 70);
Applicant number five has a gender (C), for child. Also, applicant number six has both an unusual gender (S) for senior and has a value for age that exceeds the defined constraint. In both cases, the database manager will allow the insert to occur since the constraints are NOT ENFORCED. The result of a select statement against the table is shown below:
SELECT * FROM APPLICANTS WHERE GENDER = â€˜Câ€™; +-------+--------+-----+ | AP_NO | GENDER | AGE | +-------+--------+-----+ +-------+--------+-----+ 0 rows in results(total: 0.041s)
The database manager returned an unexpected answer to the query!!! Even though the value ‘C’ is found within the table, the constraint on this column tells the database manager that the only valid values are either ‘M’ or ‘F’. The ENABLE QUERY OPTIMIZATION keyword allowed the database manager to use this constraint information when optimizing the statement.
For Big SQL, the validation of the data is normally done by either an ETL tool or the data is sourced from another RDBMs which has the same constraints defined. If the data is not guaranteed to be consistent with the defined constraint, then there is a risk of unexpected results like above.
If there is concern that a query is returning incorrect results due to an informational constraint, the constraint can be changed to disable the QUERY OPTIMIZATION attribute using the ALTER TABLE statement, as shown in the following example:
ALTER TABLE APPLICANTS ALTER CHECK GENDEROK DISABLE QUERY OPTIMIZATION
Note: Another option is to drop the constraint. Dropping the constraint is the only option for primary key and unique constraints.
If the example query is re-issued, the database manager will return the following correct results:
SELECT * FROM APPLICANTS WHERE GENDER = 'C'; +-------+--------+-----+ | AP_NO | GENDER | AGE | +-------+--------+-----+ | 5 | C | 10 | +-------+--------+-----+ 1 row in results(first row: 0.400s; total: 0.401s)
In this situation, either the data should be changed to match the defined constraint or the constraint should be updated to ensure the data does not violate the constraint.
Another commonly used constraint is a unique or primary key constraint. For example,
CREATE HADOOP TABLE KEYS ( KEY INTEGER NOT NULL, CONSTRAINT KEY_CONSTRAINT UNIQUE(KEY) NOT ENFORCED ENABLE QUERY OPTIMIZATION, VALUE VARCHAR(10) );
If a query requires only distinct values from a column, then if a primary key or unique constraint is defined on that column, the Big SQL optimizer knows that the column only contains unique values and no operations are necessary in order to ensure this uniqueness. However, if a unique constraint was not defined on the column, then the optimizer would have to sort and aggregate the data during query execution in order to guarantee uniqueness. The performance difference between these two scenarios can be quite remarkable.
The best scenario for using informational constraints occurs when you can guarantee that the data adheres to the constraints specified. If the application already checks all of the information beforehand (like gender and age) then using informational constraints can result in significantly improved SQL query performance.
The database catalog tables can be used to help identify the constraints and the type of constraint that is defined on tables.
Given the examples used above, the following query indicates that for the two tables referenced, there are 3 constraints defined; two CHECK CONSTRAINTS (K) on the APPLICANTS table, and one UNIQUE CONSTRAINT defined on the KEYS table:
select varchar(constname,20) as constraint_name, varchar(tabname,20)as table_name, type from syscat.tabconst where tabschema='MCKNIGHT'; +-----------------+------------+------+ | CONSTRAINT_NAME | TABLE_NAME | TYPE | +-----------------+------------+------+ | AGEOK | APPLICANTS | K | | GENDEROK | APPLICANTS | K | | KEY_CONSTRAINT | KEYS | U | +-----------------+------------+------+ 3 rows in results(first row: 0.007s; total: 0.008s)
To get additional information about the definition of the CHECK CONSTRAINTS, the following select can be used:
select varchar(constname,20), varchar(tabname,20), varchar(text,25) from syscat.checks where tabschema='MCKNIGHT'; +----------+------------+-------------------------+ | 1 | 2 | 3 | +----------+------------+-------------------------+ | GENDEROK | APPLICANTS | GENDER IN ( 'M' , 'F' ) | | AGEOK | APPLICANTS | AGE BETWEEN 1 AND 100 | +----------+------------+-------------------------+ 2 rows in results(first row: 0.053s; total: 0.056s)
The following table describes some of the database catalog tables that contain information about constraints.
|Catalog views||Description||Query example|
|SYSCAT.CHECKS||Contains a row for each table check constraint||select constname, tabname, text from syscat.checks|
|SYSCAT.COLCHECKS||Contains a row for each column that is referenced by a table check constraint||select constname, tabname, colname, usage from syscat.colchecks|
|SYSCAT.KEYCOLUSE||Contains a row for each column that participates in a key defined by a unique, primary key, or foreign key constraint||select constname, tabname, colname, colseq from syscat.keycoluse|
|SYSCAT.REFERENCES||Contains a row for each referential constraint||select constname, tabname, refkeyname, reftabname, colcount, deleterule, updaterule from syscat.references|
|SYSCAT.TABCONST||Contains a row for each unique (U), primary key (P), foreign key (F), or table check (K) constraint||select constname, tabname, type from syscat.tabconst|
Constraints are a very powerful feature that can greatly improve the performance of queries, but extreme care must be taken when using informational constraints to ensure the data adheres to the definition of the constraint.
For more information on performance tips for Big SQL, see the Top 6 Big SQL v4.2 Performance Tips blog.