The recipe titled ‘Weather Data meets IBM Cloud. Part 1 – Ingestion and Processing of Weather Data‘ showed how to ingest weather data from the Weather Company into the IBM Cloud Platform using IBM Functions. Once ingested, the data was then delegated onwards for further processing using IBM Event Stream which took the role as the central message broker in the overall architecture.
In this recipe we shall continue by storing the incoming weather data in IBM Cloud Object Storage which takes the role as a landing zone for weather data. Once the weather data has been stored in IBM Cloud Object Storage we will then show how the data can be queried using IBM SQL Query. A key component in this solution is the IBM Cloud Object Storage bridge which stores messages from a specific IBM Event Stream topic in a dedicated bucket (i.e. folder) of IBM Cloud Object Storage at well-defined intervals. The major advantage of the bridge is that it offers a solution for storing IBM Event Streams event data without any programming at all. All that is needed is a simple configuration.
In the recipe you will go through the following steps:
- Get started by provisioning the required services.
- Create and configure a bucket in Cloud Object Storage for storing weather data.
- Configure the IBM Event Streams to IBM Cloud Object Storage bridge so that ingested weather data will be stored automatically in the bucket.
- Define a simple SQL query to consolidate the weather observation data stored in numerous JSON files using IBM SQL Query.
In other words – the focus will be on Data at Rest and will be one important steps towards preparing the data for analysis and visualization (which will be looked into in other recipes).
Before you start it is important to provide you with some details on the architecture of the solution. The architecture diagram is shown below and is a typical IoT architecture with services for event data ingestion, event handling (message bus), landing zone (object storage) and event data processing to transform, aggregate, load and visualize event data in a form so that it becomes business relevant. In other words, it is a data refine pipeline turning “raw oil” into “business relevant information” covering data at rest.
In the first first recipe, weather data were retrieved from the Weather Company Data service by IBM Cloud Functions and delegated onwards to IBM Event Streams. We refer you to the pre-requisite recipe ‘Weather Data meets IBM Cloud. Part 1 – Ingestion and Processing of Weather Data’ for detailed information on these aspects. In this recipe we shall continue by provisioning and configuring IBM Cloud Object Storage so that it can serve as a landing zone for weather data.
In this recipe we will use to use the IBM Cloud Object Storage Bridge to write data from IBM Event Streams to IBM Cloud Object Storage. This bridge stores messages from a specific topic in a dedicated bucket of IBM Cloud Object Storage at well-defined intervals. Once that the weather data has been stored in IBM Cloud Object Storage, it is then possible to query and transform the data using IBM SQL Query. As a prerequisite for writing and reading data from IBM Cloud Object Storage, proper access permissions must however have been created using IBM Identity and Access Management.
In principle it would be possible to read data from IBM Event Streams and store it in IBM Cloud Object Storage in a manual fashion by defining Apache Kafka consumers – be it in the form of IBM Functions or in form of consumer applications that wait for messages to appear on a specific IBM Event Streams topic. Such applications can be developed in Node.js, Python or Java (or a language of your choice) and then deployed either to Cloud Foundry or to the IBM Kubernetes Service in form of Docker containers. Using the bridge is however a solution that can be done without any coding at all. All that is required is a simple configuration using e.g. IBM Identity and Access Management credentials.
IBM Cloud Object Storage
IBM Cloud Object Storage is a highly scalable, durable and resilient cloud storage service, that makes it possible to virtually store limitless amounts of data simply and cost effectively. It is commonly used for data archiving and backup, web and mobile applications, and as scalable, persistent storage for analytics and IoT. In this recipe we shall look at the last use case for Object Storage, namely that of serving as a landing zone for weather data.
The central concept in Cloud Object storage is that of a bucket which serves as a container for objects (i.e. files). The file size may range from 0 KB to 10 TB, i.e. IBM Cloud Object Storage may handle small JSON objects, larger video files as well as archive files. In this recipe the focus will be on storing small to very small JSON files.
Cloud Object Storage, originally named CleaverSafe, was developed with high scalability, availability and security in mind. In the range of the various kind of storage mediums provided on the Cloud it is one of the most cost effective and scalable solutions. IBM Cloud Object Storage offers 99.999999999% durability of data (11 nines). Two parameters are important for what concerns availability, performance (and consequently price): resiliency options and storage tier.
There are several options for achieving resiliency of the data through replication which is configured on a per bucket level. Replication may be Cross Region, where data is stored across three regions within a geography for highest availability and resiliency. The two other options currently are Regional, where data is stored in multiple Availability Zones within a single region for best availability and performance, and Single Data Center, where the data is stored across multiple devices in a single data center.
Cloud Object Storage offers four storage class tiers along with a policy-based archive capability, so that you can meet your workload performance needs while managing your storage costs most effectively. These include Standard Tier for data with active workloads, Vault for less-active workloads, Cold Vault for cold workloads and Flex for dynamic workloads. Archive can furthermore be used for long-term retention.
IBM Cloud Object Storage supports the S3 protocol of Amazons Simple Storage Service (S3). However, it offers several advantages that goes beyond S3. Using IBM Cloud Object Storage there are no data transfer charges to transfer data across regions. Moreover, the integrated IBM Aspera® high-speed data transfer option makes it easy to transfer data to and from Cloud Object Storage at very high speed. Moreover, IBM Cloud Object Storage offers immediate data consistency for both single-region & multi-region services. Last but not least, IBM Cloud Object Storage replicates the data accross instances using secure slicing and erasure coding.
Cloud Object Storage is designed to protect data and maintain availability through redundant network paths and redundant system components. It uses an Information Dispersal Algorithm (IDA) to divide files into unrecognizable slices that are then distributed to the storage nodes. No single node has all the data, which makes it safe and less susceptible to data breaches while needing only a subset of the storage nodes to be available to fully retrieve the stored data. This ability to reassemble all the data from a subset of the chunks dramatically increases the tolerance to node and disk failures. An Object can be read and written from a subset of slices contained in two of the three data centers. Since the data is encrypted and spread throughout multiple regions, you are able to retrieve that object from a subset of the slices contained in two of the three datacenters. We refer to the Redbook titled ‘Cloud Object Storage as a Service‘ for more information on the technology.
The technology protects against digital and physical breeches and provides strong data-at-rest security by combining encryption, erasure coding and data dispersal. Moreover, flexible encryption options to give customers more control of their data security. Cloud Object Storage provides additional options for customers to use their own keys in two models SSE-C, and SSE with IBM Key Protect (for Key Management) – including Bring Your Own Key (BYOK).
Additional security and control over data is achieved through an integration with IBM Cloud Identity and Access Management (IAM) that offers integrated support for policies and permissions. IBM Cloud Identity and Access Management enables you to securely authenticate users for both platform services and control access to resources consistently across IBM Cloud. Using IAM you can:
- Set bucket level permissions for users and applications
- Ability to use IBM Cloud user roles to set data access policies
- Ability to create and manage API Keys for developers and application access.
- Supports multiple unique credentials including AWS style HMAC credentials.
IBM Cloud Object Storage Bridge
The IBM Cloud Object Storage bridge reads data from an Event Streams Kafka topic and stores the data into IBM Cloud Object Storage. This way, event data can be archived from the Kafka topics in Event Streams to an instance of the Cloud Object Storage service. The bridge consumes batches of messages from Kafka and uploads the message data as objects to a bucket in the Cloud Object Storage service. By configuring the Cloud Object Storage bridge, it is possible to control how the data is uploaded as objects to Cloud Object Storage. For example, the properties that you can configure are as follows:
- The bucket name that the objects are written into.
- How frequently objects are uploaded to the Cloud Object Storage service.
- How much data is written to each object before it is uploaded to the Cloud Object Storage service.
For more information, please see the documentation on IBM Cloud.
IBM SQL Query
IBM SQL Query is a serverless, interactive querying service to quickly and easily analyzes data in IBM Cloud Object Storage. IBM Cloud SQL Query uses Apache Spark as an open source, fast, extensible, in-memory data processing engine optimized for low latency, ad hoc analysis of data and with full support for large joins, window functions and arrays. IBM Cloud SQL Query is highly available and executes queries using compute resources across multiple facilities. Direct reference to data stored in IBM Cloud Object Storage lets you take advantage of the scale, flexibility and data protection that it offers.
IBM SQL Query can be used to quickly submit and run SQL queries directly to Cloud Object Storage. There is no setup required, the data is queried where it resides and results can be written back to Cloud Object Storage. Access to the data can be conrolled by using IBM Identity and Access Management (IAM). With IAM policies, it is possible to grant users granular control to the IBM Cloud Object Storage buckets. IBM Cloud SQL Query supports a variety of data formats such as CSV, JSON and Parquet and allows for standard ANSI SQL.
In this recipe we shall just use IBM SQL Query to create and run simple queries for consolidating weather data once that it has been stored in IBM Cloud Object Storage. This will in turn require that the IBM Event Streams bridge for Cloud Object Storage has been configured correctly, that a bucket has been created and that access control has been properly defined for that bucket using IBM Identity and Access Management.
IBM Cloud Identity and Access Management
IBM Cloud Identity and Access Management (IAM) makes it possible to securely authenticate users and control access to resources consistently across IBM Cloud. A set of IBM Cloud services are enabled to use Cloud IAM for access control and are organized into resource groups within your account to enable giving users quick and easy access to more than one resource at a time.
The following shows an overview of the structure (adopted from the solution tutorial “Best practices for organizing users, teams, applications” on IBM Cloud):
Cloud IAM access policies are used to assign users and service IDs access to the resources within your account. You can group users and service IDs into an access group to easily give all entities within the group the same level of access. A policy assigns a subject, which is a user, service ID, or access group, one or more roles with a combination of attributes that define the scope of access to a target. The policy can provide access to a single service down to the instance level, to a set of resources organized together in a resource group, or to account management services. Depending on the IAM roles that you assign, the subject is allowed varying levels of access for completing account management tasks, working with service instances, or accessing a service by using the UI or completing API calls.
IBM Identity and Access Manager (IAM) has integrated support for policies and permissions for Cloud Object Storage:
- Set bucket level permissions for users and applications.
- Ability use IBM Cloud user roles to set data access policies.
- Ability to create and manage API Keys for developers and application access.
- Supports multiple unique credentials including AWS style HMAC credentials.
In this recipe we shall follow a simple out-of-the-box scheme by creating a credential that gives write access to the Cloud Object Storage bucket without restricting the access to the resources in a specific resource group.
To get started you will essentially need to go through the following 3 steps:
- Create a resource group for the services.
- Provision the IBM Cloud Object Storage service.
- Provision the IBM SQL Query service.
You will continue in this recipe by using the same configuration settings as in the first recipe, i.e. the same values for the region (‘eu-de’ in our case), organization (‘email@example.com’), space (‘devs’) and tags (‘weather-data-demo’).
Create Resource Group
To create a resource group for the IBM Cloud Object Storage and IBM SQL Query services do the following:
- Invoke Manage > Account in the IBM Cloud Portal.
- Select Account Resources > Resource Groups in the toolbar to the left.
- Click Create to create a new resource group.
- In the dialog ‘Create a new resource group’ enter ‘weather-data-demo’ as a name for the new resource group.
- Click Add to create the resource group.
You should now be able to see the new resource group in the list of resource groups for your account.
Provision the Cloud Object Storage service
Having created the resource group you can continue by creating an instance of the IBM Cloud Object Storage service. If you have already registered for IBM Watson Studio, you can either reuse the service that was created for IBM Watson Studio or alternatively create a new instance. However, if you create a new instance you will need to use the Standard Plan rather than the Lite Plan since the free plan is restricted to 1 instance per account.
To create the service do the following:
- Select the IBM Cloud Catalog again.
- Enter ‘object’ as search string.
- Select the ‘Object Storage’ service.
- On the Object Storage page:
- Provide a name for the service (optionally using ‘Weather Data Demo’ as prefix).
- As resource group choose the one that you just created.
- Add a tag (e.g. ‘weather-data-demo’) to the service. Tags will allow you to filter the resource later on so that you only see the services that are relevant for the current initiative.
- Scroll down and select for pricing the Lite plan.
- Click Create to create the service.
The dashboard for the IBM Cloud Object Storage service provides you with commands for creating buckets, managing access to the bucket using IBM Identity and Access Management and for uploading objects to the bucket. We will go through some of these steps in the next section.
Provision the IBM SQL Query service
Next, create an instance of the IBM SQL Query service:
- Select the IBM Cloud Catalog again.
- Enter ‘SQL’ as search string.
- Select the ‘SQL Query’ service.
- On the SQL Query page:
- Provide a name for the service.
- Select the region (e.g. Dallas).
- Select the ‘weather-data-demo’ resource group.
- Add a tag (e.g. ‘weather-data-demo’) to the service.
- Scroll down and select for the pricing the Lite plan (or alternatively the Standard Pricing Plan).
- Click Create to create the service.
To check the result, select IBM Cloud > Resource List in the upper right corner of the IBM Cloud UI. Filter the elements using the ‘weather-data-demo’ tag and expand the sections for Cloud Foundry Services, Services and Storage as shown below:
Creating and Configuring the Cloud Object Storage Bucket
In this section of the recipe you will create and configure a Cloud Object Storage bucket to store the ingested weather data. This will essentially require 3 steps:
- Create the bucket for the region of your choice.
- Obtain the Cloud Resource Name and the endpoints which will be needed in the next section.
- Create service credentials for the bucket so that it can be accessed by IBM Event Streams.
To create the new bucket do the following:
- Open the dashboard for your IBM Object Storage Service.
- Select ‘Getting started’ in the toolbar to the left. This should take you to the page for setting up new buckets and managing access.
- Invoke the Create Bucket command, which will take you to the ‘Create bucket’ dialog.
- Define the properties of the bucket as follows:
- Enter a Unique bucket name, e.g. ‘weather-data-demo’.
- For Resiliency select Regional.
- Select the Location (i.e. region) where you want the weather data to be stored. This should optimally be the same region as the one chosen for IBM Event Stream. In our case that yields ‘eu-de’, which means that the data will be stored and replicated in the 3 Availability Zones in Frankfurt.
- For the Storage class select Standard or Flex.
- Leave the check buttons unchecked. However, please feel free to select one and view the available parameters if you are curious. Then deselect the check button.
- Click Create Bucket to create the bucket as specified.
The next web page will provide you with means for uploading files to the bucket (if you select the Objects tab in the toolbar to the left), for viewing the properties and Cloud Resource Name (Configuration) and for setting policies and access rights (Policies):
You will not need to upload data to the bucket right now. The data should come from IBM Event Streams and you will therefore need to get hold on the public endpoint for the bucket in preparation for the next step:
- Select Configuration on the toolbar to the left.
- Copy and paste the Cloud Resource Name (CRN) for the bucket to a local file.
- Copy and paste the public endpoint to the file as well.
Last step required to configure the bucket is to create the credentials that will define the access rights for other services and applications:
- Select the Service credentials in the toolbar to the left:
- Invoke the New credentials command. This will open up the ‘Add new credential’ dialog.
- Define the properties of the new credential as follows:
- Enter a name for the credential, e.g. ‘WeatherDataWriter’.
- Select Writer as the role.
- For the Service Id, select ‘Auto Generate’.
- Click Add to create the new credential.
- Copy and paste the JSON object that defines the new credential to a local file.
The credential created is an IAM credential. If you invoke the command Manage > Access (IAM) from the IBM Cloud dashboard, and then select the Service Id tab in the toolbar to the left, then you will get a list of IAM service id’s. You should be able to scroll down and find the Service Id named ‘WeatherDataWriter’ which was created ‘behind the scene’ when you created the credential for the IBM Cloud Object Storage bucket. Select the ‘WeatherDataWriter’ service id to view ther details:
Notice that a service ID identifies a service or application similar to how a user ID identifies a user. Service IDs are consequently created to enable access to the IBM Cloud services by applications hosted both inside and outside of IBM Cloud. The defined credential gives write access to just this bucket but it does so independent of the client trying to access the bucket. This may be OK for testing purposes where we may want to access the bucket with a variety of clients (such as e.g. IBM Event Streams, IBM SQL Query and later on also IBM Watson Studio), but it may to lax for a production system.
Configuring the Cloud Object Storage Bridge
Having created and configured the bucket, the next step is to configure the IBM Event Stream bridge so that it will store weather data being produced by e.g. IBM Functions to the bucket. This configuration will require the credentials and endpoint that you obtained in the previous section.
With this information at hand the configuration of the bridge is straightforwardly done as follows:
- Open the dashboard for the IBM Event Streams service that you created in section 3.
- Select Manage in the toolbar to the left.
- Select Bridges to create a new bridge.
- Click the + Button. This will open up the Create Bridge dialog.
- Select the Cloud Object Storage Outbound bridge.
- In the resulting dialog
- Enter a name for the bridge, e.g. ‘weatherdatademo’.
- Enter the name of the bucket that you created (e.g. ‘weather-data-demo’).
- Select the topic that you created in the previous recipe. This is the topic that is used by IBM Functions for sending weather data.
- Keep the default settings for the other properties as shown above.
- Click Next.
- In the next dialog paste in the JSON object for the credential that you created in the previous section.
- Click Next.
- In the next dialog, paste in the public endpoint that you copied in the previous section.
- Add the prefix ‘https://’ to the endpoint.
- Click Create Bridge.
The bridge will be created within a few seconds. Eventually you should get the message that it is running:
The bridge is configured with thresholds defining an upload duration and upload size of the information to be send from IBM Event Streams to IBM Cloud Object Storage. This means that the upload is not real-time but happens at specific (user defined) intervals, in our case every 600 seconds. So wait a while (approximately 10 minutes), then:
- Go to the dashboard of your IBM Cloud Object Storage.
- Select Buckets in the toolbar to the left.
- Select your bucket (named ‘weather-data-demo’)
- Observe that objects have been created:
Notice that in our case the first objects are indeed created very 10 minutes (approximately), and then eventually only every hour. This is so because the trigger for reading weather data observations was changed from firing every minute to firing every hour. During testing we wanted to see results as fast as possible. Once it worked, we could do with less frequent data (for reporting purposes) and the trigger associated with the sequence was consequently changed.
Querying the Cloud Object Storage using SQL Query
The scene has now been set for creating queries over the weather data stored in the IBM Cloud Object Storage bucket. The input data has been uploaded and you have defined (at least) ‘Writer’ access to the Cloud Object Storage bucket.
In this section we will simply test one of the predefined queries first (as a kind of smoke test) and then continue with a query that extracts the weather observation data stored in the bucket. To achieve this, locate the SQL Query service that you created in the list of IBM Cloud resources and select it to open the dashboard for the service:
- Invoke the Launch SQL Query UI command.
- On the next page, click Samples in the upper right corner of the dashboard.
- Select the first SQL query named ‘Table exploration’. This will insert the SQL query into the query editor at the top of the window:
- Click the Run button to execute the query and wait until the results are displayed in the Results tab.
- Select the Query details tab to see the definition of the query.
IBM SQL Query maintains a list of executed jobs that are stored in IBM Cloud Object Storage at the specified Target location. You can therefore select previous jobs and view the results as well as the definition of the query that was run.
In the next step we shall work with a query that extracts objects from JSON files provided the object is of class “observation”. This query is defined as:
CAST(from_unixtime(valid_time_gmt) AS DATE) as date,
FROM cos://eu-de/weather-data-demo/* STORED AS JSON o
WHERE o.class = "observation"
ORDER BY valid_time_gmt DESC
The query defines the location of the bucket using a bucket URL in the FROM clause (‘FROM cos://eu-de/weather-data-demo/*’). This URL is likely to be different in your case so before you can execute the SQL query you will need to modify it to match the location of your bucket. So, to test the query on your data, do the following:
- Copy the query definition from the recipe.
- Insert it into the SQL editor in place of the existing sample query.
- Go back to the dashboard for IBM Cloud Object Storage.
- Select the Buckets tab in the toolbar to the left:
- For the weather-data-demo bucket, invoke the SQL URL command from the pull-down menu to the left.
- Copy the SQL URL by clicking the Copy command to the right.
- Go back to the SQL Query UI and modify the bucket URL in the FROM clause of the query to use your location.
- Run the query and wait for the results to be returned.
Notice that the query extracts just a subset of the properties returned by the Weather Data Service. It also adds a new column that computes the date from the ‘valid_gmt_time’ timestamp. This is a GMT time zone date which may differ from your local time zone. If you need to get more details on how to define SQL queries take a look at the samples available on IBM SQL Query or go to the online documentation titled SQL Reference.
In this recipe we have shown how to consume weather data from IBM Event Streams and then store it into a bucket of IBM Cloud Object Storage using the IBM Cloud Object Storage bridge. This part required a straightforward configuration of the bridge using IAM credentials. The bridge is configured with thresholds defining an upload duration and upload size of the information to be send from IBM Event Streams to IBM Cloud Object Storage which means that over time the bucket will contain a large set of small files containing JSON objects. In a second step of the recipe we have then shown how to use IBM SQL Query to access and query the eather observatzion data. We have also touched upon the use of IBM Identity and Access Management for granting write permission to the Cloud Object Storage bucket.