IBM Support

Big SQL v4.2: User-defined aggregate functions - Hadoop Dev

Technical Blog Post


Abstract

Big SQL v4.2: User-defined aggregate functions - Hadoop Dev

Body

With Big SQL v4.2, you can now create your own user-defined aggregate (UDA) functions in cases where a built-in aggregate function does not exist. Defining your own UDA consists of defining 4 component routines (INITIALIZE, ACCUMULATE, MERGE, and FINALIZE) along with an aggregate interface function.

The aggregate interface function is the entry point to your UDA and you define it by specifying the names of the component routines, the return data type of the UDA, and state variables. The state variables are used to store intermediate values during the processing of the aggregation.

The INITIALIZE component routine has the state variables as output parameters. It is responsible for initializing the state variables.

The ACCUMULATE component routine has a single row of the result set we are aggregating over as input parameters and the state variables as input/output parameters. It is responsible for accumulating the single row’s values into the state variables.

The MERGE component routine has one set of state variable values as input parameters and another set of state variable values as input/output parameters. It is responsible for merging the 2 sets of state variable values into a single set of state variable values.

The FINALIZE component routine has the final set state variable values as input parameters and the final result as the output parameter. It is responsible for computing the final result value based on the final set of state variable values.

An example of how to create a UDA to compute the sum of squares follows. The sum of squares is used is various contexts in statistics, such as variance, standard deviation, mean squared error, and residual sum of squares. The sum of squares is computed using the following formula:
formula
where n is the number of values.

The DDL for the 4 component routines and the aggregate interface function are shown below:

  CREATE OR REPLACE PROCEDURE sumsq_initialize(OUT sum DOUBLE, OUT sumofsquare DOUBLE, OUT count BIGINT)  LANGUAGE C  SPECIFIC sumsq_initialize  PARAMETER STYLE SQL  NOT FENCED  NO SQL  EXTERNAL NAME 'mylib!sumsq_initialize'    CREATE OR REPLACE PROCEDURE sumsq_accumulate(IN input DOUBLE, INOUT sum DOUBLE, INOUT sumofsquare DOUBLE, INOUT count BIGINT)  LANGUAGE C  SPECIFIC sumsq_accumulate  PARAMETER STYLE SQL  NOT FENCED  NO SQL  EXTERNAL NAME 'mylib!sumsq_accumulate';    CREATE OR REPLACE PROCEDURE sumsq_merge(IN sum DOUBLE, IN sumofsquare DOUBLE, IN count BIGINT, INOUT mergesum DOUBLE, INOUT mergesumofsquare DOUBLE, INOUT mergecount BIGINT)  LANGUAGE C  SPECIFIC sumsq_merge  PARAMETER STYLE SQL  NOT FENCED  NO SQL  EXTERNAL NAME 'mylib!sumsq_merge';    CREATE OR REPLACE FUNCTION sumsq_finalize(sum DOUBLE, sumofsquare DOUBLE, count BIGINT)  RETURNS DOUBLE  LANGUAGE C  SPECIFIC sumsq_finalize  PARAMETER STYLE SQL  NOT FENCED  NO SQL  CALLED ON NULL INPUT  EXTERNAL NAME 'mylib!sumsq_finalize';    CREATE OR REPLACE FUNCTION sumsq(DOUBLE)  RETURNS DOUBLE  SPECIFIC sumsq  AGGREGATE WITH (sum DOUBLE, sumofsquare DOUBLE, count BIGINT)  USING  INITIALIZE PROCEDURE sumsq_initialize  ACCUMULATE PROCEDURE sumsq_accumulate  MERGE PROCEDURE sumsq_merge  FINALIZE FUNCTION sumsq_finalize;    

The state variable is declared with the AGGREGATE WITH clause.

The following is the C code implementation of the component routines. In this example, the C code is in a file named mylib.c.

  #include <memory.h>  #include <sql.h>  #include <sqlca.h>  #include <sqlda.h>  #include <sqludf.h>  #include <string.h>    #ifdef __cplusplus  extern "C"  #endif  SQL_API_RC SQL_API_FN sumsq_initialize  (    SQLUDF_DOUBLE *outSum,    SQLUDF_DOUBLE *outSumofsquare,    SQLUDF_BIGINT *outCount,    SQLUDF_NULLIND *outSumNullInd,    SQLUDF_NULLIND *outSumofsquareNullInd,    SQLUDF_NULLIND *outCountNullInd,    SQLUDF_TRAIL_ARGS )  {    /* Initialize state variables */    *outSum = 0.0;    *outSumofsquare = 0.0;    *outCount = 0;      /* Set each state variables' null indicator to not null */    *outSumNullInd = 0;    *outSumofsquareNullInd = 0;    *outCountNullInd = 0;      return 0;  }    #ifdef __cplusplus  extern "C"  #endif  SQL_API_RC SQL_API_FN sumsq_accumulate  (    SQLUDF_DOUBLE *inInput,    SQLUDF_DOUBLE *inoutSum,    SQLUDF_DOUBLE *inoutSumofsquare,    SQLUDF_BIGINT *inoutCount,    SQLUDF_NULLIND *inInputNullInd,    SQLUDF_NULLIND *inoutSumNullInd,    SQLUDF_NULLIND *inoutSumofsquareNullInd,    SQLUDF_NULLIND *inoutCountNullInd,    SQLUDF_TRAIL_ARGS )  {    if (SQLUDF_NULL(inInputNullInd))    {      return 0;    }      *inoutSum += *inInput;    *inoutSumofsquare += (*inInput * *inInput);    ++(*inoutCount);    return 0;  }      #ifdef __cplusplus  extern "C"  #endif  SQL_API_RC SQL_API_FN sumsq_merge  (    SQLUDF_DOUBLE *inSum,    SQLUDF_DOUBLE *inSumofsquare,    SQLUDF_BIGINT *inCount,    SQLUDF_DOUBLE *inoutSum,    SQLUDF_DOUBLE *inoutSumofsquare,    SQLUDF_BIGINT *inoutCount,    SQLUDF_NULLIND *inSumNullInd,    SQLUDF_NULLIND *inSumofsquareNullInd,    SQLUDF_NULLIND *inCountNullInd,    SQLUDF_NULLIND *inoutSumNullInd,    SQLUDF_NULLIND *inoutSumofsquareNullInd,    SQLUDF_NULLIND *inoutCountNullInd,    SQLUDF_TRAIL_ARGS )  {    *inoutSum += *inSum;    *inoutSumofsquare += *inSumofsquare;    *inoutCount += *inCount;    return 0;  }    #ifdef __cplusplus  extern "C"  #endif  SQL_API_RC SQL_API_FN sumsq_finalize  (    SQLUDF_DOUBLE *inSum,    SQLUDF_DOUBLE *inSumofsquare,    SQLUDF_BIGINT *inCount,    SQLUDF_DOUBLE *outResult,    SQLUDF_NULLIND *inSumNullInd,    SQLUDF_NULLIND *inSumofsquareNullInd,    SQLUDF_NULLIND *inCountNullInd,    SQLUDF_NULLIND *outResultNullInd,    SQLUDF_TRAIL_ARGS )  {    *outResultNullInd = 0;    *outResult = *inSumofsquare - (*inSum * *inSum / *inCount);    return (0);  }  

It is compiled into a binary library using the following commands:

  gcc -m64 -fpic -I/home/bigsql/sqllib/include -c mylib.c -D_REENTRANT  gcc -m64 -fpic -shared -o mylib mylib.o -Wl,-rpath,/home/bigsql/sqllib/lib64 -L/home/bigsql/sqllib/lib64 -ldb2 -lpthread  

You need to place the binary library mylib into the directory /home/bigsql/sqllib/function on all nodes. You can also use Java to implement the component routines, but in general C routines will perform better.

Here is a simple query showing the UDA in action:

  select * from tab1    C1                        ------------------------    +1.00000000000000E+000    +2.00000000000000E+000    +3.00000000000000E+000    +4.00000000000000E+000      4 record(s) selected.      select sumsq(c1) from tab1    1                         ------------------------    +5.00000000000000E+000      1 record(s) selected.  

The following figure shows the roles that each of the component routines plays toward computing the final result.
uda_blog_drawing

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16259889