Watch how to build SQL Queries in a Scala notebook using Apache Spark in IBM Data Science Experience.

It can be painful to query your enterprise Relational Database Management System (RDBMS) for useful information. You write lengthy Java code to create a database connection, send a SQL query, retrieve rows from the database tables, and convert data types. That's a lot of steps, and they all take time while users are waiting for answers. Plus, most relational databases are stuck on servers that live somewhere inside the walls of your organization, inaccessible to cloud-based apps and services.

There's a more efficient and faster way to get the answers you need. You can use Apache® Spark™, the high-speed, in-memory analytics engine, to query your database instead. Not only does Spark's SQL API provide lightning-fast analytics, it also lets you access the database schema and data with only a few simple lines of code. How efficient and elegant.
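
To give you a taste of how little code that takes, here is a minimal sketch of the Spark shell (Scala) calls this tutorial builds up to; the host, port, database name, credentials, and table name below are placeholders you'll fill in later:

  // Load a PostgreSQL table into a Spark DataFrame over JDBC (Spark 1.3 API).
  val jdbcDF = sqlContext.load("jdbc", Map(
    "url" -> "jdbc:postgresql://<host>:<port>/<database>?user=<user>&password=<password>",
    "dbtable" -> "weather"))
  jdbcDF.show()          // print the rows
  jdbcDF.printSchema()   // print the column names and types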

What you’ll learn

This tutorial shows you how to use Spark to query a relational database. First, we'll set up a PostgreSQL database to serve as our relational database (either on the cloud-based Compose PostgreSQL service or in a local instance). Next, you'll learn how to connect and run Spark SQL commands through the Spark shell and then through IPython Notebook.

Why Spark?

Apache® Spark™ is an open-source cluster-computing framework with in-memory processing, which enables analytic applications to run up to 100 times faster than other technologies on the market today. It helps developers be more productive and frees them to write less code. I’m so glad that IBM is committed to the Apache Spark project, investing in design and education programs to promote open source innovation. We’re working hard to help developers leverage Spark to create smart, fast apps that use and deliver data wisely. Learn more.

Set up your PostgreSQL database

You can use a cloud-based Compose PostgreSQL instance (the faster, easier option), or install PostgreSQL locally and open external network access to its port.

Option 1: Set up a cloud-based Compose PostgreSQL database

This online option gives you the availability and flexibility of a cloud-based service and some neat browser-based tools. If you prefer to work locally, skip down to Option 2.

  1. Download psql.

    With Compose, your PostgreSQL database will live in the cloud, but you need to install the psql command line tool. Go to http://www.postgresql.org/download/ and download PostgreSQL, accepting all default installation settings.

  2. Sign up for a Compose PostgreSQL account.

    Go to https://app.compose.io/signup/, select the PostgreSQL database option, and enter your account information. Compose asks for a credit card upon sign-up, but you get a free 30-day trial.

  3. Click the Deployments button.

  4. Click the deployment link to open it.
  5. Click the Reveal your credentials link.

    You see your username and password, which you’ll use in a minute.

  6. Locate the Command Line, copy its contents, and keep this Compose browser window open.
  7. Populate the database.
    1. Open your terminal/command window and go to the psql directory by typing the command:

      cd /Library/PostgreSQL/9.4/bin

      If your directory/version is different, locate it first, then cd to the correct directory.

    2. Connect with the following command:
      Type ./psql then, within quotation marks, paste in the command line you just copied, and press Enter. It will look something like this:

      ./psql "sslmode=require host=haproxy429.aws-us-east-1-portal.3.dblayer.com port=10429 dbname=compose user=admin"
    3. When prompted, enter your Compose PostgreSQL password and press Enter. You'll see:

      Password:
      
      psql (9.4.4)
      
      SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
      
      Type "help" for help.
      
    4. Copy, paste, and run the following SQL commands:
      CREATE TABLE weather (
         city            varchar(80),
         temp_lo         int,           -- low temperature
         temp_hi         int,           -- high temperature
         prcp            real,          -- precipitation
         date            date
      );
      
      INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');
      INSERT INTO weather VALUES ('San Francisco', 43, 57, 0.0, '1994-11-29');
      INSERT INTO weather VALUES ('Hayward', 54, 37, 0.25, '1994-11-29');

      Shell output from psql will look like:

      compose=> CREATE TABLE weather (
      compose(>     city            varchar(80),
      compose(>     temp_lo         int,           -- low temperature
      compose(>     temp_hi         int,           -- high temperature
      compose(>     prcp            real,          -- precipitation
      compose(>     date            date
      compose(> );
      CREATE TABLE
      compose=> SELECT * FROM WEATHER;
       city | temp_lo | temp_hi | prcp | date
      ------+---------+---------+------+------
      (0 rows)
      
      compose=> INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');
      INSERT 0 1
      compose=> INSERT INTO weather VALUES ('San Francisco', 43, 57, 0.0, '1994-11-29');
      INSERT 0 1
      compose=> INSERT INTO weather VALUES ('Hayward', 54, 37, 0.25, '1994-11-29');
      INSERT 0 1
      compose=> SELECT * FROM WEATHER;
           city      | temp_lo | temp_hi | prcp |    date
      ---------------+---------+---------+------+------------
       San Francisco |      46 |      50 | 0.25 | 1994-11-27
       San Francisco |      43 |      57 |    0 | 1994-11-29
       Hayward       |      54 |      37 | 0.25 | 1994-11-29
      (3 rows)
      
  8. Return to Compose and from the menu on the left, choose Browser.
  9. Click the compose database. You should see your new weather table. Click it to see the values you just added.

  10. Open your deployment again (see Steps 3-4) and copy the Public hostname/port. You'll use it in a minute.

Option 2: Set up a local PostgreSQL database

If you prefer to work with a locally installed PostgreSQL database, follow the steps below. The local option requires a few additional steps, like opening external network access to the database port and restarting the database. (If you already followed the steps for Option 1 to set up a cloud-based Compose PostgreSQL database, skip ahead to the section on accessing data with Spark.)

  1. Go to http://www.postgresql.org/download/ and download PostgreSQL, accepting all default installation settings.
  2. Modify the pg_hba.conf file to allow external network access.
    By default, your PostgreSQL database listens on port number 5432, and only localhost can access it. For this tutorial, we'll open access to external programs and machines. To do so:

    1. Open your terminal or command window.
    2. Sign in as the Postgres user by entering
      su postgres
    3. If prompted, enter your PostgreSQL password.
    4. Open the pg_hba.conf file (located in /Library/PostgreSQL/9.4/data) in your favorite editing tool. We edited it through the command line interface using the vi command:

       vi pg_hba.conf
    5. Add the following line to the file:

      host all all 0.0.0.0/0 md5
  3. Restart the PostgreSQL database so the change takes effect, by running these 2 commands in Terminal:
     cd /Library/PostgreSQL/9.4/bin
    ./pg_ctl restart -D ../data/

    To confirm the server is running, enter ./pg_ctl status -D ../data/ and you see the following message:

    pg_ctl: server is running (PID: XXXX)
    /Library/PostgreSQL/9.4/bin/postgres "-D/Library/PostgreSQL/9.4/data"

    Take note of these additional commands:

    ./pg_ctl stop -D ../data/

    to stop the database

    ./pg_ctl start -D ../data/

    to start the database again

  4. Populate the database.

    Launch the SQL Shell (psql) application on your machine (located at /Library/PostgreSQL/9.4/bin) to connect to the database and populate it with some data. We used the command line, and entered:

    cd /Library/PostgreSQL/9.4/scripts/

    and then:

    ./runpsql.sh 

    and psql returns:

    Server [localhost]:
    Database [postgres]:
    Port [5432]:
    Username [postgres]:
    Password for user postgres:
    psql (9.4.4)
    Type "help" for help.
    
    postgres=# CREATE TABLE weather (
       city            varchar(80),
       temp_lo         int,           -- low temperature
       temp_hi         int,           -- high temperature
       prcp            real,          -- precipitation
       date            date
    );
    postgres=# INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');
    postgres=# INSERT INTO weather VALUES ('San Francisco', 43, 57, 0.0, '1994-11-29');
    postgres=# INSERT INTO weather (date, city, temp_hi, temp_lo) VALUES ('1994-11-29', 'Hayward', 54, 37);
    postgres=# select * from weather;
       city      | temp_lo | temp_hi | prcp |    date
    ---------------+---------+---------+------+------------
    San Francisco |      46 |      50 | 0.25 | 1994-11-27
    San Francisco |      43 |      57 |    0 | 1994-11-29
    Hayward       |      37 |      54 |      | 1994-11-29
    (3 rows)

    For more on working in PostgreSQL, see http://www.postgresql.org/docs/9.4/static/tutorial-table.html.

Access SQL data via Spark Shell

There are two ways to work with Spark:

  • Access a virtual machine where Spark is installed.

    For this tutorial, we used a VM with Apache Spark v1.3.1 installed, hosted via VirtualBox on Mac OS X version 10.9.4. The VM image is wrapped by Vagrant, a virtual development environment configuration tool.

  • Or download Apache Spark and run locally on your machine. Get it at: http://spark.apache.org/downloads.html

Once you’ve installed Spark or know where it lives:

  1. Download postgresql-9.4-1200.jdbc41.jar from https://jdbc.postgresql.org/download.html and save it to a location accessible to the Spark shell. Note its location; you'll need its path in a few minutes.
  2. In your terminal or command window, get to an environment where you can run the Spark shell.
    • If Spark is installed locally, cd to the directory that contains the spark-shell script.
    • If using a VM, ssh into the VM/machine where Spark is installed.
  3. Launch the Spark shell with the PostgreSQL JDBC driver on its classpath.

    In a command/terminal window, type:

    vagrant@sparkvm:~$ spark-shell --jars ./drivers/postgresql-9.4-1200.jdbc41.jar
  4. At the scala command prompt, create a new Spark DataFrame object using sqlContext.load.
    • If you're using a cloud-based Compose PostgreSQL database, retrieve the public hostname:port you copied in Step 10 and insert it within the URL value after jdbc:postgresql://. It should look something like this:

      scala> val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://haproxy425.aws-us-east-1-portal.3.dblayer.com:10425/compose?user=admin&password=XXXXXXXXXXXXXXXX", "dbtable" -> "weather"))

    • If you’re connecting to a locally deployed PostgreSQL database, enter the following command:

      scala> val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://192.168.1.15:5432/postgres?user=postgres&password=postgres", "dbtable" -> "weather"))

  5. Type these commands:
    scala> jdbcDF.show()
    scala> jdbcDF.printSchema()
    scala> jdbcDF.filter(jdbcDF("temp_hi") > 40).show()
  6. You see Scala output from the load, show(), printSchema(), and filter() calls, confirming the table's rows and schema.

    That’s it! You’ve accessed your PostgreSQL data via Spark SQL.
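
    If you'd rather write actual SQL than chain DataFrame calls, you can also register the DataFrame as a temporary table and query it with sqlContext.sql. Here's a minimal sketch that assumes the jdbcDF object created in Step 4; the query itself is just an example:

    scala> jdbcDF.registerTempTable("weather")   // expose the DataFrame to Spark SQL as a table
    scala> sqlContext.sql("SELECT city, temp_hi, prcp FROM weather WHERE prcp > 0").show()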

Access SQL data via IPython Notebook

In this part of the tutorial, we walk through how to modify Spark's classpath and run Spark SQL commands through IPython Notebook.

Note: This section assumes familiarity with Spark Server installation and IPython Notebook.

  1. Retrieve the complete path and name of the JDBC driver as a string value (you noted this info in the last section).
  2. Locate the compute-classpath.sh file under
    /usr/local/bin/spark-1.3.1-bin-hadoop2.6/bin
  3. Add the following line to the end of the file:

    appendToClasspath "/home/vagrant/drivers/postgresql-9.4-1200.jdbc41.jar"

  4. Restart the VM that runs Spark. Now the IPython Notebook is ready to connect and query the sample database.
  5. Launch the IPython Notebook.
  6. Insert a new cell.
  7. Create a new Spark DataFrame object using sqlContext.load.

    Tip: Here, you can use the same Spark commands you used at the scala command prompt in the previous section.

    You see Spark commands in gray input cells and, beneath each call, IPython Notebook shows the data returned.
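
    For example, a notebook cell along these lines (a sketch only, reusing the same placeholder JDBC URL and the weather table from earlier) loads the table and computes the average high temperature per city:

    val jdbcDF = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:postgresql://<host>:<port>/<database>?user=<user>&password=<password>",
      "dbtable" -> "weather"))
    jdbcDF.groupBy("city").avg("temp_hi").show()   // average high temperature per city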

Summary

Now you know how to connect Spark to a relational database and use Spark's API to perform SQL queries. Spark can also run as a cloud service, potentially unlocking your on-premises SQL data, which we'll explore more in future posts. To try these calls with another type of database, you'd follow these same steps, but download the JDBC driver supported by your RDBMS.
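
For instance, pointing the same load call at a MySQL server would only change the driver jar you pass to --jars and the JDBC URL. A rough sketch, where the driver jar name and URL are illustrative rather than a tested configuration:

  // Launch the shell with the MySQL driver on the classpath instead, e.g.:
  //   spark-shell --jars ./drivers/mysql-connector-java.jar
  val mysqlDF = sqlContext.load("jdbc", Map(
    "url" -> "jdbc:mysql://<host>:3306/<database>?user=<user>&password=<password>",
    "dbtable" -> "weather"))
  mysqlDF.show()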
