- Why deploy your Big Data application on Bluemix
- Cloud architecture overview for Big Data & Analytics solution hosting
- Introduction to Stock Volatility, our sample Big Data application
- Series Overview
- Workshop Tasks
- Set up your Bluemix account
- Create an SSH Keypair
- Create your Virtual Machine on Bluemix
- Install Docker on the virtual machine
- Create sample data warehouse on your virtual machine
- Deploy the Stock Volatility sample application
- Connect to your data warehouse with the Secure Gateway service
- Connect the Stock Volatility sample application to your secured data warehouse
Why deploy your Big Data application on Bluemix
A common theme in IT today is analytics: efforts to gain insights from the systematic analysis of data. IT today provides a multitude of options for developing and deploying applications commonly referred to as Big Data. These applications quickly perform complex data analytics on huge data sets, from multiple perspectives, while requiring much less hardware and overall investment than previous workloads. IBM Bluemix enables you to host your analytics applications in the cloud and provides a unique combination of additional benefits: a selection of database technologies for managing the data, storage for large sets of data, a variety of analytics tools for implementing the analysis logic, and plenty of computational capacity to run that logic at scale. Bluemix frees you from the details of managing data and running the analysis, so you can focus on specifying the data and devising customized analysis processes that provide a unique advantage to your business.
Bluemix handles traditional IT tasks so that you can focus on the business logic and data that differentiates your application. It enables you to begin your implementation immediately and avoid roadblocks of the traditional development scenarios such as provisioning hardware, network, storage, and middleware. Your organization can take advantage of automated provisioning and integration of the required components, as well as integrated security and monitoring capabilities built into the platform.
Cloud architecture overview for Big Data & Analytics solution hosting
Big data and analytics require a new view on business intelligence, data management, governance, and delivery. Cloud computing is a perfect vehicle for hosting big data and analytics workloads. The Big Data and Analytics on Cloud Reference Architecture functional view (shown above in Figure 1 and documented at the Cloud Standards Customer Council Resource Hub) depicts a set of capabilities that a business must consider as they enter the big data and analytics space. It includes capabilities around data integration, management, security, and analytics. Through this workshop series, you will build out a subset of the key Big Data & Analytics on Cloud reference architecture capabilities. The workshop tasks cover the Integrated Data Warehouse, Archive Repository, Data Load, and Analytics Application components.
Bluemix services used in this workshop
Since the Reference Architecture is vendor agnostic, we selected from a myriad of different technology choices for implementation in this workshop series.
Integrated Data Warehouse
Trusted data is stored in the traditional enterprise data warehouse. Data for this repository is modeled to support interactive business intelligence activities. Warehouse data is normalized, matched, cleansed, and validated. This repository typically requires high availability and disaster recovery. It is also the most expensive repository. The Enterprise Warehouse repository keeps detailed data for the most current month(s) and aggregates yearly data as opposed to maintaining years of detailed data in raw form. It has the following characteristics:
- Validated (Trusted)
Archive Repository
Archiving “cold” data from data warehousing environments is necessary as a way of reducing warehouse costs and improving performance. While this “cold” data may be of no interest for operational reporting or business intelligence, it is increasingly of relevance to users performing exploratory or deep analytics. Taking advantage of a cost effective Hadoop infrastructure within the Archive Zone to store the “cold” historic data provides data for deeper analytics. The Hadoop component for the archive repository in this workshop series is the IBM Analytics for Hadoop service. Analytics for Hadoop enables users to perform complex analysis of large data sets, built on the open-source Hadoop technology. Based on an enterprise-grade Hadoop offering, this service leverages Hadoop's distributed processing capabilities to provide easy access to large data sets with fast and efficient visualization of those data sets. Users can analyze and visualize Big Data on a single-node Hadoop cluster through a flexible pay-as-you-go payment model. Previously, similar jobs would have taken days or weeks, often run serially on hardware and technology of the past. Hadoop takes advantage of parallel processing capabilities to run these jobs in mere minutes, on commodity hardware, for a much smaller overall cost.
Data Load
This component focuses on the process of loading or inserting data into a target repository (or analytical source) and making it available for use in a Big Data application. The data load process might be scheduled in batch intervals or in near real-time/“just-in-time” intervals, depending on the nature of the data and the business purpose for its use. The IBM DataWorks service enables data management for three separate roles: developers, data stewards, and business analysts. DataWorks enables developers to quickly build high-quality applications, with data easily accessible, allowing them to focus on writing business logic, not data access logic. DataWorks provides cloud-based data refinement to move data across various cloud-based and on-premise data sources, making data available to the applications that need it, when they need it, and in the form they expect. For data stewards, DataWorks easily enables self-service data access, instills confidence in the data among end users, and maintains data governance and security controls. For business analysts, DataWorks accelerates the finding and using of refined data for their high-value analytic needs. For all of these roles, DataWorks enables IT teams to better meet business demands and facilitate rapid, self-service data access.
Analytics Application
IBM Bluemix allows developers to focus on developing applications and provides the necessary services to get the job done. Upon the completion of your application development, you can deploy your application to the cloud, bind a service to your application, and automatically generate access credentials for your application to connect to the new service. Your application can then fetch the credentials through an environment variable named VCAP_SERVICES and parse it to get the specific connection information. This binding allows your application to be independent of the environment or service instance, since it parses the information dynamically.
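To make the binding concrete, here is a minimal shell sketch of pulling one credential out of VCAP_SERVICES. It assumes the jq tool is installed. The service label example-service, the instance name, and the host and port credential fields are hypothetical placeholders, not the fields of any particular Bluemix service. A real application would do the equivalent parsing in its own language at startup.

```shell
# Sample VCAP_SERVICES value. The service label, instance name, and
# credential fields are hypothetical placeholders for illustration.
VCAP_SERVICES='{
  "example-service": [
    {
      "name": "example-instance",
      "label": "example-service",
      "credentials": { "host": "db.example.com", "port": 3306 }
    }
  ]
}'

# Print one credential field of the first bound instance of a service.
# Usage: vcap_credential <service-label> <field>
vcap_credential() {
  printf '%s' "$VCAP_SERVICES" |
    jq -r --arg svc "$1" --arg key "$2" '.[$svc][0].credentials[$key]'
}

# Only run the lookups when jq is available on this machine.
if command -v jq >/dev/null 2>&1; then
  echo "host=$(vcap_credential example-service host)"
  echo "port=$(vcap_credential example-service port)"
fi
```

Because the lookup happens at run time, the same application code can be bound to a different service instance without being changed.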
Introduction to Stock Volatility, our sample Big Data application
Our sample analytics application, called Stock Volatility, runs Hive queries against the data and displays the output in a bar chart. The goal of this application is to calculate the volatility of a selected stock during the following recession years: 2000, 2001, 2007, 2008, and 2009. The application UI allows you to select a stock and analyze the volatility. The developer's job is to write the queries that are needed to fetch the data. This application itself is simple and can easily be created using a SQL database, but if you are dealing with terabytes of complex data, you can leverage the power of Hadoop to analyze the data quickly on Bluemix and use tools like D3 libraries to visualize it.
Our Big Data workshop series integrates Big Data and data warehouse augmentation capabilities to increase operational efficiency. Figure 2 above illustrates the solution architecture used in this workshop series to augment a hybrid data warehouse. The goal of warehouse augmentation is to help organizations get more value from an existing data warehouse investment while reducing overall costs, for example:
- Optimizing storage by providing a queryable archive
- Rationalizing data for greater simplicity and reduced expense
- Speeding data queries to enable more complex analytical applications
- Improving the ability to scale predictive analytics and business intelligence operations
Big Data Workshop Series Overview
The Big Data workshop series is broken into two workshops, as follows:
- Actionable Architecture: Secure Hybrid Data Warehouse on Bluemix
- Actionable Architecture: Data Warehouse Augmentation on Bluemix
Workshop tasks
This workshop shows how to deploy a Big Data application and connect it to a secured data warehouse. It consists of eight tasks that include getting your Bluemix account, setting up a sample System of Record (SoR), deploying a Secure Gateway connection from Bluemix to the SoR, and deploying a Big Data application connecting to the SoR through the secured connection. For this workshop, the SoR is a SQL database on a virtual machine in Bluemix, providing a simplified simulation of what would be done in a real-world environment.
Task 1. Set up your Bluemix account
IBM Bluemix is an open-standards, cloud-based platform for building, managing, and running all types of applications: mobile, smart devices, web, and big data. The Bluemix capabilities include Java™, mobile back-end development, application monitoring, and features from ecosystem partners and open source, all through an as-a-service model in the cloud.
Before you can use the Bluemix capabilities, you must register for an account. You can sign up for one at no charge for a 30-day free trial. After the trial period, you will need to provide a credit card to pay as you go for your resource usage. After you sign up, you can find helpful information in the overview section of the Bluemix Docs.
Tip: If you are using a free or Trial account, you have a limit of 4 service instances. During subsequent workshops, you will create a number of service instances for use with the application. If you’ve already created other services during previous Bluemix activities, you may need to delete some unused or unnecessary service instances to proceed through these workshops. To delete a service, from the Bluemix Dashboard, highlight the settings icon in the top right of the service panel and select Delete. If you are asked to restage your application, click Restage and wait for your application to be redeployed before proceeding.
- If you do not already have a Bluemix account, sign up for one at no charge.
- Log in to Bluemix. The dashboard opens as shown:
Task 2. Create an SSH Keypair
For this workshop, we’re going to use a virtual machine (VM) to simulate the computer hosting the system of record (SoR). The Secure Gateway service’s client needs two network connections, to the SoR and to the Bluemix data center, and so it must run in a region of the data center network that has access to both. For this example, we'll give the VM a public Internet IP address so that it will be able to use the Internet to connect to Bluemix. This VM could be hosted by any cloud provider; we'll create it in the Bluemix VM service so that you can create the VM using your Bluemix account and capacity.
To log into the Bluemix VM, you will need an RSA keypair. Since we will need to specify the keypair when we create the VM, we will create the keypair first. A keypair is more secure than a password, preventing anyone without your key from hijacking your VM. Bluemix provides two ways to specify a key: You can import one or create one. With the create option, Bluemix generates a keypair and gives you the private key. Here, we’ll use the import option, where we generate our own keypair and import the public key into Bluemix.
Generate the keypair by using tools on your computer. We’ll call our keys bigdatakey. This generation will result in a pair of files, a private key (bigdatakey) and a public key (bigdatakey.pub).
- In Unix/Linux: Run
ssh-keygen -t rsa -f bigdatakey
- Windows: Use PuTTY.
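If you prefer to script the Unix/Linux step, the following sketch generates the keypair without prompting. It is only an illustration: it writes the keys to a scratch directory created with mktemp, and the empty passphrase set with -N "" is an assumption made so that the command does not stop for input. For a key you intend to keep, choose a passphrase and a permanent location.

```shell
# Scratch directory for the sketch; a real key would live somewhere permanent.
KEY_DIR=$(mktemp -d)

if command -v ssh-keygen >/dev/null 2>&1; then
  # -q suppresses progress output; -N "" sets an empty passphrase
  # (an assumption for this sketch so that nothing is prompted for).
  ssh-keygen -q -t rsa -N "" -f "$KEY_DIR/bigdatakey"

  # The private key (bigdatakey) and public key (bigdatakey.pub) now exist.
  ls -l "$KEY_DIR"

  # This is the text you would paste into a "Public Key to import" field.
  cat "$KEY_DIR/bigdatakey.pub"

  # Show the key's fingerprint so you can recognize it later.
  ssh-keygen -l -f "$KEY_DIR/bigdatakey.pub"
fi
```

Only the contents of bigdatakey.pub are meant to be shared; the private key file stays on your computer.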
Task 3. Create your Virtual Machine on Bluemix
Now that you have a keypair, go to the Bluemix Dashboard and follow these steps to create a virtual machine:
- Select CREATE VIRTUAL MACHINES to start creating your new VM.
- On the Create a Virtual Machine properties page, below the Security Key field, press the + Add Key button.
- In the Add Key dialog, name your key bigdatakey. Copy the contents of your bigdatakey.pub file and paste those contents into the Public Key to import field. Press OK to close the window and import the public key.
- On the Create a Virtual Machine page, name your VM group Big_Data. To make sure your VM group name is unique, add your initials or a timestamp to the end of the name. Ensure the other settings are as shown below, then press Create.
- VM Cloud: IBM Cloud Public
- Initial instances: 1
- Image: Ubuntu 14.04
- IBM image default user ID: ibmcloud
- VM group: Big_Data_XX
- VM size: m1.small
- Security Key: bigdatakey
- Network: private
- The Dashboard page for your VM, Big_Data_XX, shows the details about your VM, which initially are the same as those you set in the create properties. When Virtual Machines Health status panel says Your VMs are running, it also displays the IPs for your VM. The first address, 129.xxx.xxx.xxx, is public and is used for Internet clients to address the VM. The second address, 192.168.xxx.xxx, is private and is used for other VMs hosted by Bluemix to address this VM. For this workshop, the public address is referred to as
- Now that your VM is created, you can log into it using Secure Shell (SSH). The command is:
$ ssh -i <private-key> -l <username> <public-IP-address>
<private-key> is the path and filename of your bigdatakey private key file
<username> is ibmcloud for this VM created from one of the virtual images supplied by Bluemix
<public-IP-address> is your VM’s public IP address
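As an illustration of how the three placeholders fit together, the sketch below assembles the login command from variables and prints it. The key path ./bigdatakey and the address 203.0.113.10 (an address reserved for documentation) are assumptions; substitute your own values. The sketch also tightens the permissions on the private key if the file is present, because ssh refuses to use a private key file that other users can read.

```shell
# Hypothetical values for illustration; replace them with your own.
PRIVATE_KEY="./bigdatakey"
VM_USER="ibmcloud"
VM_PUBLIC_IP="203.0.113.10"   # documentation-only address, not a real VM

# Print the login command built from the three values above.
ssh_login_command() {
  printf 'ssh -i %s -l %s %s\n' "$PRIVATE_KEY" "$VM_USER" "$VM_PUBLIC_IP"
}

# Restrict the private key to its owner if the file exists here.
if [ -f "$PRIVATE_KEY" ]; then
  chmod 600 "$PRIVATE_KEY"
fi

ssh_login_command
```

With the values above, the sketch prints ssh -i ./bigdatakey -l ibmcloud 203.0.113.10, which you can copy and run once the VM is up.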
Task 4. Install Docker on the virtual machine
The Secure Gateway service in Bluemix requires a client that runs in the same data center as the system of record (SoR) the gateway will connect to. There are a couple of options for the Gateway client; the easiest one to configure, useful for development purposes, is one that runs in a Docker container. Since the Gateway client runs in a Docker container, install Docker on the VM that’s simulating a computer in a data center.
Docker documents the process for installing Docker on Ubuntu in Installing Docker on Ubuntu.
Tip: Notice that Docker’s instructions use wget to download from https://get.docker.com/, which installs the latest version of Docker. Do not follow any instructions that use docker.io; that approach typically installs an older version of Docker.
Log into your VM using SSH, as described above, and perform the following commands:
- Before installing any software, make sure your Ubuntu installation has an up-to-date package index, so that it knows about the latest versions of its packages. Run this command:
$ sudo apt-get update
- Install the Docker package
$ wget -qO- https://get.docker.com/ | sh
- Verify that Docker is installed correctly.
$ sudo docker run hello-world
If hello-world runs correctly, part of the output should say:
Hello from Docker. This message shows that your installation appears to be working correctly.
Task 5. Create a sample data warehouse on your virtual machine
To simulate the SoR, we will use a MySQL database running in a Docker container. Although a data warehouse would not typically be hosted in MySQL, for the purposes of this workshop, MySQL is a free SQL database that is already available in a Docker container. The VM already has Docker installed to run the gateway client’s container, so Docker can also run a container with MySQL. Because MySQL is already installed in the container, you won’t have to install MySQL; you’ll just need to download the MySQL container and run it.
Download the schema and sample data files
When you install the image for the MySQL Docker container, the container simply runs the database server. For this workshop, we not only need the database server, but we also need it to contain a database with some particular sample data. To create that database of sample data, we’ve provided a couple of files for you to download. We’ll put these files in the ibmcloud user’s home directory under bigdata-nasdaq.
- Create the directory to store the downloads in. The scripts to initialize the database will look for the files in this directory.
$ mkdir ~/bigdata-nasdaq
$ cd ~/bigdata-nasdaq
- Download the files for creating the schema and for creating the data.
$ wget https://hub.jazz.net/git/osowski%2Fbigdata-volatility/contents/master/data/bigdata-nasdaq-create.sql
$ wget https://hub.jazz.net/git/osowski%2Fbigdata-volatility/contents/master/data/bigdata-nasdaq-data.sql
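Before moving on, it can help to confirm that both downloads actually landed in the directory. The following is a minimal sketch of such a check; the function name check_sql_files is our own invention, and the check only confirms that each file exists and is not empty, not that its contents are valid SQL.

```shell
# Check that each expected SQL file exists and is not empty.
# Usage: check_sql_files <directory>
# Returns non-zero if any file is missing or empty.
check_sql_files() {
  status=0
  for name in bigdata-nasdaq-create.sql bigdata-nasdaq-data.sql; do
    if [ -s "$1/$name" ]; then
      echo "ok: $name"
    else
      echo "missing or empty: $name"
      status=1
    fi
  done
  return "$status"
}

# Check the workshop's download directory and report the overall result.
check_sql_files "$HOME/bigdata-nasdaq" || echo "Re-run the wget commands before continuing."
```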
Install the database’s Docker container
Now that we have the files for initializing the database, we’ll start a Docker container from an image that has MySQL installed and has a mechanism to enable us to execute the initialization scripts. For this workshop, the Docker image you will use is tutum/mysql. For more information about this Docker image, see the GitHub project: tutumcloud/tutum-docker-mysql.
- Create the MySQL container instance and load the sample data from the two downloaded files with the command below:
$ sudo docker run -d --name mysql-tutum -p 3306:3306 -v /home/ibmcloud:/home/ibmcloud -e ON_CREATE_DB="nasdaq" -e MYSQL_PASS=passw0rd -e STARTUP_SQL="/home/ibmcloud/bigdata-nasdaq/bigdata-nasdaq-create.sql /home/ibmcloud/bigdata-nasdaq/bigdata-nasdaq-data.sql" tutum/mysql
-d runs the container in the background, not interactively
mysql-tutum is the name to give the container created from the image
3306:3306 forwards the MySQL port to make it accessible from the host OS’s IP address
/home/ibmcloud:/home/ibmcloud binds the directory to make the directory in the host OS available within the container
ON_CREATE_DB instructs the container to create the "nasdaq" database when the container first starts
MYSQL_PASS sets the password of the database’s main user, in this example to passw0rd
STARTUP_SQL tells the container to run the SQL files in the order specified via the space-separated list
tutum/mysql is the name of the Docker image to create the container from
You now have a container created from tutum/mysql. That container has a MySQL database server running in it. The database server contains a database named nasdaq that contains a table named rawData, which holds the sample data for a set of Nasdaq quotes.
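The single-line command above is easier to review when each option sits on its own line. The sketch below restates it that way and adds two follow-up checks using docker ps and docker logs. The run helper and the DRY_RUN switch are our own additions for illustration: with DRY_RUN=1 (the default here) each command is printed rather than executed, and the printed form does not show the original quoting.

```shell
# Values taken from the workshop command; use a stronger password
# for anything beyond this exercise.
DATA_DIR=/home/ibmcloud/bigdata-nasdaq
MYSQL_PASS=passw0rd

# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# The same container as in the workshop, one option per line.
run sudo docker run -d --name mysql-tutum \
  -p 3306:3306 \
  -v /home/ibmcloud:/home/ibmcloud \
  -e ON_CREATE_DB=nasdaq \
  -e MYSQL_PASS="$MYSQL_PASS" \
  -e STARTUP_SQL="$DATA_DIR/bigdata-nasdaq-create.sql $DATA_DIR/bigdata-nasdaq-data.sql" \
  tutum/mysql

# Confirm that the container is running, then read its startup output.
run sudo docker ps --filter name=mysql-tutum
run sudo docker logs mysql-tutum
```

Set DRY_RUN=0 on the VM to execute the commands instead of printing them.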
Task 6. Deploy the Stock Volatility sample application
You will now deploy an application that will connect to the simulated data warehouse through the Secure Gateway. This application is already available in an IBM Bluemix DevOps Services project, which you will fork to get a copy of your own. You will then configure the project's pipeline to deploy the application to Bluemix and to automatically deploy future application updates.
Fork the bigdata-volatility sample project
- Access the bigdata-volatility sample application in IBM Bluemix DevOps Services.
- Click Fork Project. You may be prompted to log in or create a short name to log in with.
- Create a new name for your project. You are not required to change the name, since this project will be created in your account.
- Click Create. Your project is created and you are redirected to the new project page.
Configure the Build Pipeline
- Click Build & Deploy in the upper right corner of your new project.
- Click ADD STAGE
- Click MyStage and rename this stage to Build. All other defaults are acceptable.
- Click the JOBS tab at the top, click ADD JOB in the new tab, and select Build.
- For the Builder Type, select Ant. Our project uses a simple Ant script to build a WebSphere Liberty application on Bluemix. You can integrate your own build scripts for your application.
- The Build stage is now complete. Click SAVE.
- Click ADD STAGE again. Click MyStage again and rename to Deploy. Set the Input Type to Build Artifacts. All other defaults are acceptable here.
- Click the JOBS tab at the top, click ADD JOB in the new tab, and select Deploy.
- Most defaults are acceptable in this panel; however, ensure that your Application Name is specific enough to be unique across all of Bluemix, as this name will become the application's hostname. You can configure these values separately, but for now we will use them as one and the same. Add your initials or a time stamp to the end of your Application Name and click SAVE.
Deploy your application to Bluemix
In the Pipeline: All Stages view:
- In the Build stage, click the Run Stage button. Hint: It looks like a play button.
The project builds and automatically deploys the application to Bluemix.
Task 7. Connect to your data warehouse with the Secure Gateway service
The Secure Gateway service in Bluemix supports the development of hybrid cloud and hybrid IT applications, that is, ones with parts running in multiple cloud and non-cloud environments. It provides secure connectivity from Bluemix to other applications and data sources, commonly called systems of record (SoR), running on-premise or in other clouds. The service includes a remote client which enables secure connectivity. Most of the steps to set up the Secure Gateway must be performed in Bluemix. The step to set up the Secure Gateway Client must be performed on the remote system.
Create a Secure Gateway
Like any service instance, an instance of the Secure Gateway service is bound to a particular application. You can then use that Secure Gateway instance to connect that application to as many systems of record (SoRs) as you like.
Follow these steps to add a Secure Gateway to an application:
- In the Bluemix Dashboard, click ADD A SERVICE OR API.
- Search for Secure Gateway by typing the name in the search field.
- Click the Secure Gateway service to open the details.
- Make sure that under App you have your Java Liberty application selected. Leave the other default values and click CREATE.
- Click RESTAGE.
Why restage?: Because you added a new service to a running application, you are prompted to restage the application to update it with the new service. Bluemix is trying to make sure that the application code is up to date with any changes that were applied.
- The Secure Gateway service is now created and bound to your application.
Add a gateway and client
For a Secure Gateway to connect its application to a particular resource, you must define a gateway in the Secure Gateway and install a gateway client on that resource. The client only connects to that gateway and the gateway can only connect to one client. The client needs to have a network connection to the gateway, such as an Internet connection between the private data center and the Secure Gateway service instance in Bluemix. The client does not have to be installed on the same computer as the resource it will connect to, but the client does need to have a network connection to each resource. In this way, the client is a gateway connecting the Secure Gateway service instance to the resource.
Follow these steps to add a gateway and its client to the Secure Gateway:
- Go back to the Bluemix Dashboard. You should see the new Secure Gateway service that you created.
- Click the Secure Gateway service tile to open the Secure Gateway Dashboard.
- Click Add Gateway. The Add Gateway page is displayed.
- Provide a name for your gateway, such as Trading Systems, and click Connect it.
- The Connect it page is displayed and the bullet item for Name it is marked as complete. By default, the Docker option is automatically selected and shows the Docker command that you must run to create the gateway client.
- Copy and run the Docker command that is provided on your virtual machine. For example:
docker run -it bluemix/secure-gateway-client <configuration_id>
<configuration_id> is the configuration ID for the gateway this client will connect to. Note: Remember to add sudo if necessary at the beginning of the Docker command above.
- The Gateway is connected. The client logs a status that it is connected, and the Gateway page shows that it is connected.
Note: The docker run command is provided when you create your Secure Gateway configuration. Each configuration has a different command-line parameter that specifically defines that client. Unlike the docker command that created the MySQL database, this Docker container runs in the foreground and will take control of the terminal while running. It's best to open a new terminal window for use while connected.
Add a destination
The Gateway creates a connection between the Secure Gateway in Bluemix and the client running in the private data center, enabling the Bluemix application access to resources within the data center. Each system the application wants to connect to is represented by a destination. The system must have an IP address or hostname and a port that an IP client can use to connect to the system. The destination binds the system’s address to an Internet address that the application can use to access the system, and uses the client to access the system.
Follow these steps to add a destination to the gateway:
- In the Bluemix UI, click Add destinations. The Create Destinations page is displayed and the bullet item for Connect it is marked as complete.
Note: A destination is a Secure Gateway connection to a specific on-premises resource. The host name and port number provide direct access to that resource from the cloud side.
- Complete the following fields:
- Destination name, such as Stock Quote Warehouse
- Host name or IP address of your VM, the VM’s
- The port for accessing the resource, which for the MySQL container is 3306
- For the drop down about how you want to secure access to the destination, leave it as the default option of TCP. This means there will be no TLS (Transport Layer Security). Your application can communicate directly to the gateway without requiring any certificates.
- Click the + icon. The destination is added to the collection of destinations below the graph.
- Click I’m done to complete your configuration. The Secure Gateway Dashboard page is displayed.
Retrieve your secured Destination URL
- Click the connection card associated with the Trading Systems Gateway you just set up.
- Click the Info icon, which is associated with the Stock Quote Warehouse Destination card.
- Copy & paste the value under the Cloud Host : Port label to a temporary text file or another location you can easily reference in the next Task. This value will look something along the lines of the following:
This value will be needed to configure our connection to our data warehouse from our application in the next Task.
Task 8. Connect the Stock Volatility sample application to your secured data warehouse
Now we'll use the Secure Gateway to connect the sample application to the data warehouse.
- In the Bluemix Dashboard, click your newly deployed application. In the previous Task, bigdata-volatility-walkthrough was used as the application name.
- Click Environment Variables in the left side-menu and then click on USER-DEFINED.
- You will now create some user-defined variables for the application to connect to your data warehouse. Click ADD and create a variable for each row below, omitting the colon in the variable name:
- DATAWAREHOUSE_HOST: The host portion of the previously copied Cloud Host : Port value from the Secure Gateway Dashboard.
- DATAWAREHOUSE_PORT: The previously copied port from the Cloud Host : Port from the Secure Gateway Dashboard.
- DATAWAREHOUSE_DB: nasdaq
- DATAWAREHOUSE_USERNAME: admin
- DATAWAREHOUSE_PASSWORD: The value of MYSQL_PASS from Task 5.2. E.G. passw0rd
- Click SAVE and your application will be restarted. Click Overview in the left side-menu and wait for your application to restart.
- After your application is restarted, click the Routes link in the top of the page to access your running application.
- Hover over the left arrow and the menu will slide out. Select Recession Analysis.
- Select a stock from the drop down list and click Analyze.
- You are presented with a historical view of stock prices for the selected stock during years with recession-marked periods.
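The user-defined variables in the steps above split the copied Cloud Host : Port value into a host and a port. The sketch below shows one way to do that split in the shell and prints cf command-line equivalents of the dashboard steps. It is a hedged alternative, not part of the original workshop: gateway.example.com:15000 is a hypothetical value, and the commands are only printed here, so nothing is changed in your Bluemix account.

```shell
# Hypothetical Cloud Host : Port value; use the one copied from your dashboard.
CLOUD_HOST_PORT="gateway.example.com:15000"
# Application name used as the example in this task.
APP_NAME="bigdata-volatility-walkthrough"

# Everything before the last colon is the host; everything after it is the port.
DATAWAREHOUSE_HOST=${CLOUD_HOST_PORT%:*}
DATAWAREHOUSE_PORT=${CLOUD_HOST_PORT##*:}

# Print one cf set-env command per variable (name before "=", value after it).
for pair in \
  "DATAWAREHOUSE_HOST=$DATAWAREHOUSE_HOST" \
  "DATAWAREHOUSE_PORT=$DATAWAREHOUSE_PORT" \
  "DATAWAREHOUSE_DB=nasdaq" \
  "DATAWAREHOUSE_USERNAME=admin" \
  "DATAWAREHOUSE_PASSWORD=passw0rd"; do
  echo "cf set-env $APP_NAME ${pair%%=*} ${pair#*=}"
done

# Environment changes take effect after the application is restaged.
echo "cf restage $APP_NAME"
```

If you use the cf CLI instead of the dashboard, run the printed commands after logging in with cf login.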
Conclusion
You have now completed Actionable Architecture: Secure Hybrid Data Warehouse on Bluemix (Big Data Workshop 1). In this workshop, you:
- Created a Virtual Machine by using the Bluemix Virtual Machine service
- Set up and configured a simulated on-premise data warehouse using containers
- Securely connected a data warehouse to the cloud for secured access from Bluemix applications
- Deployed a stock volatility application, running on Bluemix, that utilizes data from the on-premise data warehouse via the Secure Gateway connection.
Acknowledgements
Many individuals contributed time and effort to the creation of this workshop series, from initial use case discussion to planning to hands-on development to documentation. I'd like to acknowledge and thank the core team of contributors to this Big Data workshop series, specifically:
- Bobby Woolf
- Katerina Goulioutkina
- Manav Gupta
- Rajeev Sikka
- Ruth Willenborg
- Shahir Daya
- Xiaomei Wang