This blog is co-authored with Roy Cecil, IBM.

Introduction

The Hive service in IBM Open Platform with Apache Hadoop (IOP) is configured to include HiveServer2 (also known as HS2), which provides support for open API clients such as JDBC/ODBC. HS2, which is based on the Apache Thrift project, is an improvement over the original HiveServer. In an IOP installation, the Spark service comes with its own Thrift Server, which is in turn based on HiveServer2. The two share the same hive-site.xml in the installation (i.e., /usr/iop/current/spark-thriftserver/conf/hive-site.xml), and the Spark Thrift Server provides JDBC/ODBC services for accessing the Spark SQL engine.

Spark SQL is a module in Spark that serves as a distributed SQL engine. It leverages YARN to manage memory and CPU cores in your cluster, and it lets end users query existing Hive databases and other datasets. The Thrift Server allows multiple JDBC clients to submit SQL statements to a shared Spark engine via a single Spark SQL context, so your application can leverage a managed connection pool (if implemented) and exploit cached results for better performance.
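For example, because all JDBC sessions share one Spark SQL context, a table cached from one session can be reused by later sessions. Here is a minimal sketch using the Beeline client described later in this post (the host name, user, database, and table names are the same illustrative ones used in the examples below):

# Session 1: cache the table once.
beeline -n spark -p sparkpasswd \
  -u "jdbc:hive2://host460.ibm.com:10001/mydb;transportMode=http;httpPath=cliservice" \
  -e "CACHE TABLE call_center"

# Session 2 (a separate JDBC connection): queries against call_center can now
# hit the cached data instead of rereading it from disk.
beeline -n spark -p sparkpasswd \
  -u "jdbc:hive2://host460.ibm.com:10001/mydb;transportMode=http;httpPath=cliservice" \
  -e "SELECT count(*) FROM call_center"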

Managing Thrift Server in IOP

The default configuration of the Thrift Server, found under the "Spark" component, is a good starting point for testing your JDBC connection. We recommend increasing its memory and CPU core allocation if you need to support concurrent JDBC connections and large datasets. The Thrift Server accepts the same resource-configuration parameters as the spark-submit script that comes with a Spark distribution; the key ones are:

$SPARK_HOME/sbin/start-thriftserver.sh --help
Usage: ./sbin/start-thriftserver [options] [thrift server options]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)
 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.

By default, the Thrift Server does not use much memory or many CPU cores, which is good enough for handling simple SQL queries from a JDBC client against a small dataset, e.g., “select count(*) from call_center;” on a few GB of data. If you want to run serious JDBC applications, i.e., multiple concurrent users running complex queries on large datasets, we recommend you increase the memory and CPU allocation. To do so:

1. Stop the Spark Thrift Server from the Ambari console, then start it manually with custom settings as shown below.

2a. If using default Spark in IOP

 cd /usr/iop/current/spark-thriftserver/sbin
 ./start-thriftserver.sh \
 --driver-memory 10g \
 --verbose \
 --master yarn \
 --executor-memory 16g \
 --num-executors 20 \
 --executor-cores 8 \
 --conf spark.hadoop.yarn.timeline-service.enabled=false \
 --conf spark.yarn.executor.memoryOverhead=5120 \
 --conf spark.driver.maxResultSize=5g \
 --conf spark.sql.crossJoin.enabled=true \
 --conf spark.rpc.askTimeout=700 \
 --conf spark.sql.broadcastTimeout=800 \
 --conf spark.kryoserializer.buffer.max=768m

2b. If using a custom Spark build

 cd $SPARK_HOME/sbin
 ./start-thriftserver.sh \
 --driver-memory 10g \
 --verbose \
 --master yarn \
 --executor-memory 16g \
 --num-executors 20 \
 --executor-cores 8 \
 --conf spark.hadoop.yarn.timeline-service.enabled=false \
 --conf spark.yarn.executor.memoryOverhead=5120 \
 --conf spark.driver.maxResultSize=5g \
 --conf spark.sql.crossJoin.enabled=true \
 --conf spark.rpc.askTimeout=700 \
 --conf spark.sql.broadcastTimeout=800 \
 --conf spark.kryoserializer.buffer.max=768m

The above commands are based on a 4-datanode IOP cluster that has the following specs:

    2 x 12-core Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors
    128 GB RAM
    10 x 1.9-TB SATA Disks

We allocate 90% of memory and CPU cores to YARN. So across the 4 data nodes, a total of 460GB of memory (4 x 90% x 128GB) is available to YARN containers, with the remaining 52GB (4 x 13GB) left for OS overhead and other IOP components such as Hive, ZooKeeper, the YARN NodeManager, Ambari, etc. YARN also gets to manage 84 physical CPU cores (hyper-threaded to 168 vCores, which Spark assigns to executors). With that, we came up with the following Spark configuration:

    20 Spark executors with 21GB each (16GB Spark memory and 5120MB YARN container overhead) = 420GB
    20 Spark executors with 8 vCores each = 160 vCores

And the demand fits within the supply. You will need to adjust these values for your own cluster; a rough sketch of the arithmetic follows.
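A back-of-the-envelope version of this sizing calculation, using the specs above as inputs (substitute your own):

# Rough sizing arithmetic for the example cluster; adjust the inputs for yours.
NODES=4; RAM_PER_NODE_GB=128; YARN_PERCENT=90     # percent of RAM given to YARN

YARN_MEM_GB=$(( NODES * RAM_PER_NODE_GB * YARN_PERCENT / 100 ))   # 460 GB supply
EXECUTORS=20; EXEC_MEM_GB=16; OVERHEAD_GB=5                       # 5120 MB overhead per executor
DEMAND_GB=$(( EXECUTORS * (EXEC_MEM_GB + OVERHEAD_GB) ))          # 420 GB demand

echo "YARN memory supply: ${YARN_MEM_GB} GB, executor demand: ${DEMAND_GB} GB"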

The other parameters, e.g., spark.rpc.askTimeout, spark.sql.broadcastTimeout, and spark.kryoserializer.buffer.max, are set to larger-than-default values in order to handle complex queries. You can start with these values and adjust them to suit your SQL workloads.

3. Check the Thrift Server log to confirm that it started successfully:

Log location:
tail -f /var/log/spark/spark-spark-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-host460.ibm.com.out

On successful start you should see in the log:
16/07/04 22:45:18 INFO thriftserver.HiveThriftServer2: HiveThriftServer2 started
16/07/04 22:45:18 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3130ca39{/sqlserver,null,AVAILABLE}
16/07/04 22:45:18 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@184bb873{/sqlserver/json,null,AVAILABLE}
16/07/04 22:45:18 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@58ad0586{/sqlserver/session,null,AVAILABLE}
16/07/04 22:45:18 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@45635ae1{/sqlserver/session/json,null,AVAILABLE}
16/07/04 22:45:18 INFO server.Server: jetty-9.2.z-SNAPSHOT
16/07/04 22:45:18 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7c7f7882{/,null,AVAILABLE}
16/07/04 22:45:18 INFO server.ServerConnector: Started ServerConnector@170ea60{HTTP/1.1}{0.0.0.0:10001}
16/07/04 22:45:18 INFO server.Server: Started @19152ms
16/07/04 22:45:18 INFO thrift.ThriftCLIService: Started ThriftHttpCLIService in http mode on port 10001 path=/cliservice/* with 5...500 worker threads
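A quick way to confirm the server came up, assuming the default log location and the ports used in this post:

# Look for the success message in the Thrift Server log.
grep "HiveThriftServer2 started" \
  /var/log/spark/spark-spark-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-*.out

# Confirm the listener is up (10001 for http mode, 10015 for binary mode).
ss -ltn | grep -E ':(10001|10015)'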

Thrift Server transport mode

The Thrift Server can handle JDBC connections in two transport modes: binary and http. Both modes can be secure or non-secure (we will leave that topic for another blog). In this blog, we use the non-secure mode to demonstrate its operation.

To change the transport mode of the Thrift Server, modify the following property in hive-site.xml:

<property>
  <name>hive.server2.transport.mode</name>
  <value>http</value>
</property>

Note: The default value is binary, and the default port for http is 10001; for binary, 10015.
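The start-thriftserver.sh script also accepts --hiveconf overrides, so as an alternative to editing hive-site.xml you can set the mode (and port) at launch time. A sketch, reusing the launch options shown earlier:

./start-thriftserver.sh \
  --master yarn \
  --hiveconf hive.server2.transport.mode=http \
  --hiveconf hive.server2.thrift.http.port=10001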

When using a custom Spark build, copy the hive-site.xml to $SPARK_HOME/conf (where $SPARK_HOME is the location of the custom Spark build on the node where the Spark Thrift Server is configured to run in the cluster), and change the hive.server2.transport.mode setting as above.
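For example, using the IOP path mentioned earlier in this post:

# Reuse the IOP-managed Hive configuration for the custom Spark build,
# then edit hive.server2.transport.mode in the copied file if needed.
cp /usr/iop/current/spark-thriftserver/conf/hive-site.xml $SPARK_HOME/conf/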

Different components in IOP may require one mode or the other to connect via the Thrift Server; for example, Knox requires http mode. This typically depends on how the application is written. Spark distributions, for example, supply a JDBC client tool called Beeline, which lets you run SQL queries in either mode.

Direct – Binary Transport Mode

    beeline -n spark -p sparkpasswd -u "jdbc:hive2://host460.ibm.com:10015/mydb"

Direct – HTTP Transport Mode

    beeline -n spark -p sparkpasswd -u "jdbc:hive2://host460.ibm.com:10001/mydb;transportMode=http;httpPath=cliservice"

Running SQL queries using JDBC

Now that your Thrift Server is tuned and running, you can use the Beeline client to run queries right from a command line.

Locate the Beeline client tool under $SPARK_HOME/bin/beeline, and run command

$SPARK_HOME/bin/beeline -u "jdbc:hive2://host460.ibm.com:10001/mydb;transportMode=http;httpPath=cliservice" \
-i common.sql -n spark --force=true -f myquery1.sql

You should see output similar to the following:

Connecting to jdbc:hive2://host460.ibm.com:10001/mydb;transportMode=http;httpPath=cliservice
16/07/19 16:13:01 INFO jdbc.Utils: Supplied authorities: host460.ibm.com:10001
16/07/19 16:13:01 INFO jdbc.Utils: Resolved authority: host460.ibm.com:10001
Connected to: Spark SQL (version 2.0.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
No rows selected (0.251 seconds)
100 rows selected (7.489 seconds)
Closing: 0: jdbc:hive2://host460.ibm.com:10001/mydb

The -i option takes an initialization file containing common SQL that applies to a whole set of statements, for example, a “use schema_name;” statement to select which database to use for varying scale factors. Putting these statements in a template (common.sql in the example above) runs them before myquery1.sql is invoked.
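As an illustration, a hypothetical pair of files might look like this (the column name cc_name is made up for the example; the database and table names follow the earlier examples):

# common.sql: statements run once per connection, before the query file.
cat > common.sql <<'EOF'
use mydb;
EOF

# myquery1.sql: the actual query to run.
cat > myquery1.sql <<'EOF'
select cc_name, count(*) from call_center group by cc_name;
EOF

$SPARK_HOME/bin/beeline \
  -u "jdbc:hive2://host460.ibm.com:10001/mydb;transportMode=http;httpPath=cliservice" \
  -i common.sql -n spark --force=true -f myquery1.sql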

7 comments on "How to: Run Queries on Spark SQL using JDBC via Thrift Server"

  1. Please put screenshots and steps to run these commands on Big SQL Technology Sandbox https://my.imdemocloud.com/ , so that we can try these out. None of the ports and commands you mentioned work, even on the IBM Analytics Demo Cloud.

  2. Vik,
    You need to look at the port in your installation. From my demo cloud account this is what I can find.
    spark-env.sh:export HIVE_SERVER2_THRIFT_PORT=10002

    /usr/iop/current/spark-client/bin/beeline -u "jdbc:hive2://iop-bi-master.imdemocloud.com:10002/bigsql;transportMode=http;httpPath=cliservice"
    scan complete in 1ms
    Connecting to jdbc:hive2://iop-bi-master.imdemocloud.com:10002/bigsql;transportMode=http;httpPath=cliservice
    Connected to: Spark SQL (version 1.4.1)
    Driver: Spark Project Core (version 1.4.1_IBM_2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 1.4.1_IBM_2 by Apache Hive
    0: jdbc:hive2://iop-bi-master.imdemocloud.com>

  3. […] Configurations changes — Spark 2.0.0 configurations are divided.  New advanced, custom spark2-hive-site-override sections have been added. They allow you to declare properties independent from those inherited from the Hive’s hive-site.xml. You can now configure the Thrift server’s properties with two new sections: spark2-thrift-fairscheduler and spark2-thrift-sparkconf.  This gives you more flexibility in configuring the thrift server and using different properties than defined in the spark-defaults.  For tuning suggestions for the thrift server, refer to the blog post How to: Run Queries on Spark SQL using JDBC via Thrift Server. […]

  4. Hi Jesse, Would you share your insight on concurrent connections to thrift server? for example, will the requests be handled concurrently or queued?

    Thanks in advance.

    Kevin

  5. Hi Jesse,

    Can we use Spark SQL CLI for directly running queries across a cluster ? I find very little documentation on Spark SQL CLI .

    Thanks

    Satish

  6. Hello Sir,

    I am new to Apache spark. Installed hadoop 2.8.1 on ubuntu and then installed spark-2.2.0-bin-hadoop2.7 on it. I used spark-shell and beeline both to create databases and tables. I was not able to see the tables created through spark-shell via beeline and vice-versa.
    Why spark-shell and beeline showing me different databases? I think these should show me the databases same from both the CLI? Can you please help me?

  7. Jaykumar Prajapati October 27, 2017

    Hello,
    I am a new to apache spark can you tell me that how much memory and CPU cores are required for 7 GB of database?
