Introduction

BigSheets is an analytics application within IBM® BigInsights™. In many ways it looks like a spreadsheet, but under the surface it creates MapReduce jobs to work on data across the Hadoop cluster.

For working with different file formats, BigSheets has a number of “Readers” that divide the contents into columns and rows. Among the standard readers are those for comma-separated values (CSV), tab-separated values (TSV), and character-delimited data. Readers are also included for JSON and Hive files.

The Line Reader simply displays the contents of each line as one column; it is the default reader for all unknown file formats. However, many data files are formatted in other ways. A common format is fixed-length records, where each field in each record has the same offset and length. Because the records all have the same length, record and field separators are not required, although it is quite common to separate records with a newline.
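
To illustrate the idea, here is a minimal, self-contained Java sketch; the record layout, field names, and values are invented for this example and do not correspond to any real feed. Each field is pulled out of the line purely by its fixed offset:

[code language="java"]
// Hypothetical layout: symbol at offsets 0-7, quantity at 8-13, price at 14-22.
public class FixedLengthExample {
    public static void main(String[] args) {
        String record = "IBM     000100000182.50";                       // one fixed-length record, 23 characters
        String symbol   = record.substring(0, 8).trim();                 // "IBM"
        int    quantity = Integer.parseInt(record.substring(8, 14));     // 100
        double price    = Double.parseDouble(record.substring(14, 23));  // 182.50
        System.out.println(symbol + " " + quantity + " @ " + price);
    }
}
[/code]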

In this article I describe how to create and publish a new BigSheets reader. A reader is programmed in Java and packaged as a zip file, so it is easy to install into BigInsights. With the Eclipse integrated development environment and the BigInsights Tools for Eclipse installed, developing and deploying a new reader is straightforward.

As an example of a file of fixed-length records, I’m using a data file from New York Stock Exchange (NYSE). Daily trades are delivered as one file per day. The data format is a fixed-length record with one record per line. The first line in the data file contains the date of the trades.
An example data file, EQY_US_ALL_TRADE_20131218.zip (298 MB), can be downloaded from NYSE. The full description of the data format is found in the DAILY TAQ CLIENT SPECIFICATION.PDF document at NYSE Market Data. The screen capture below shows how the NYSE data looks when read with the standard Line Reader in BigSheets.

[Figure: NYSE data displayed with the standard Line Reader]

Note: The data files from NYSE come as zip files. Though BigSheets can read a number of compression formats, for example gz, bzip2, and cmx, it cannot read zip, so you must either unzip the file or recompress it to one of the supported formats.
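
If you prefer to recompress rather than simply unzip, a small standalone Java program can do the conversion. The following is a minimal sketch that assumes the zip archive contains exactly one entry; the output file name is just an example:

[code language="java"]
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipInputStream;

// Minimal sketch: copy the single entry of a zip archive into a gzip file.
public class ZipToGzip {
    public static void main(String[] args) throws Exception {
        try (ZipInputStream zin = new ZipInputStream(
                 new FileInputStream("EQY_US_ALL_TRADE_20131218.zip"));
             GZIPOutputStream gzout = new GZIPOutputStream(
                 new FileOutputStream("EQY_US_ALL_TRADE_20131218.gz"))) {
            zin.getNextEntry();                 // position at the first (and assumed only) entry
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = zin.read(buf)) > 0) {
                gzout.write(buf, 0, n);
            }
        }
    }
}
[/code]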

Preparing the development environment

The BigSheets readers are written in Java™. The BigInsights Tools for Eclipse is an ideal development platform for BigSheets readers. But before you start programming, you must prepare the Eclipse environment for BigInsights.

You should use Eclipse v4.2.2 and install the BigInsights Tools for Eclipse. See Installing the Tools for Eclipse for a detailed description.

Programming the new reader

A new reader is shipped to BigSheets as a plug-in file. This is a zip-compressed file containing a JSON file named plugin.json and a Java archive file named plugin.jar. These two files are created automatically during the publishing process; before that, you need to code the Java class that is the centerpiece of the reader.

  1. In Eclipse, switch to the BigInsights perspective.
  2. Open the New Project wizard, expand the BigInsights folder, and select BigInsights Project.
  3. Click Next and enter a project name. Here I use NYSEreader. Click Finish.

The NYSEreader project is created and shown in the Project Explorer.

  1. Now from the menu click File → New → BigSheets Reader.
  2. In the Name field enter the name for the reader. Here I use DailyTrades.
  3. You can leave the Package field empty, but it’s good practice to use a package name.
  4. Select the default AbstractTextReader in the Super class list.
  5. Click Finish.

[Figure: The New BigSheets Reader wizard]

The BigSheets Reader wizard creates the DailyTrades.java file with the overall structure of the new DailyTrades Java class, as shown below. The class has two public methods that you must implement: getNextValue and getResourceSchema.

[code language="java"]
package my.own.bigsheets.readers;

import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;
import org.apache.m2.exceptions.M2Exception;
import org.apache.pig.ResourceSchema;
import org.apache.pig.data.Tuple;

import com.ibm.bigsheets.reader.AbstractTextReader;

public class DailyTrades extends AbstractTextReader {

    @Override
    public Tuple getNextValue() throws IOException, M2Exception {
        // TODO: This method needs to return the next tuple (record) to be processed.
        return null;
    }

    @Override
    public ResourceSchema getResourceSchema(String arg0, Job arg1) {
        // TODO: The getResourceSchema method should return the output schema for your specific reader. Therefore, it needs to return the two columns:
        // NumberOfChars - the number of characters in the line; a numeric value with a return type of DataType.INTEGER.
        // LineContent - the actual content of the line read from the file; a string value with a return type of DataType.STRING.
        return null;
    }

}
[/code]

The wizard has created the DailyTrades Java class template with most of the necessary classes imported. However, this project needs a few extra classes, shown highlighted in the following code snippet.

[code language="java" highlight="10,11,12,13"]
package my.own.bigsheets.readers;

import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;
import org.apache.m2.exceptions.M2Exception;
import org.apache.pig.ResourceSchema;
import org.apache.pig.data.Tuple;

import org.apache.hadoop.io.Text;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;

import com.ibm.bigsheets.reader.AbstractTextReader;

[/code]

The NYSE Daily Trades data file contains a header record that has two spaces, an eight-character date (mmddyyyy), and 60 spaces. Because the data records in the file contain only the time, not the date, it is useful to be able to add the date to each row, especially when working with more than one file.

To make adding the date to each row optional, an argument can be defined in the plugin.json file. To receive this argument in the Java code, you need to add a constructor that takes it as a parameter; in general, you need one constructor for each combination of arguments.

[code language="java" firstline="17"]
public class DailyTrades extends AbstractTextReader {

    private String tradedate = null;
    private Boolean includeDate = false;

    public DailyTrades() throws M2Exception {
        includeDate = false;
    }

    public DailyTrades(String insertTradeDate) throws M2Exception {
        if (insertTradeDate.equalsIgnoreCase("true")) {
            includeDate = true;
        }
    }
[/code]

The getNextValue() method reads the input file line by line; this is where the record is parsed. Because each record is one long text string, the key technique used here is substring(). The header record containing the trading date is the only line that starts with a blank, so a line counter is not needed to detect it.

I decided to format the date as MM-DD-YYYY and the time as HH:MM:SS.sss. Whether to format could have been made an option, but a formatted date and time is much more readable.

[code language="java" firstline="30"]
@Override
public Tuple getNextValue() throws IOException, M2Exception {
    Tuple tuple = null;
    Text text = null;
    String value = null;
    try {
        if (in.nextKeyValue()) {
            text = (Text) in.getCurrentValue();   // Get the line as a Text field
            value = text.toString();              // Convert to a String
            if (value.charAt(0) == ' ') {         // First line is the header with the date
                tradedate = value.substring(2, 4) + "-" + value.substring(4, 6) + "-"
                        + value.substring(6, 10); // Date format: mmddyyyy
                in.nextKeyValue();                // Move on to the first data record
                text = (Text) in.getCurrentValue();
                value = text.toString();
            }
            // Create tuple and append all fields
            tuple = TupleFactory.getInstance().newTuple();
            if (includeDate) {
                tuple.append(tradedate);
            }
            // Extract the fields by their fixed offsets
            tuple.append(value.substring(0, 2) + ":" + value.substring(2, 4) + ":"
                    + value.substring(4, 6) + "." + value.substring(6, 9));    // time
            tuple.append(value.substring(9, 10));    // exchange
            tuple.append(value.substring(10, 26));   // symbol
            tuple.append(value.substring(26, 30));   // sale condition
            tuple.append(Integer.parseInt(value.substring(30, 39)));           // volume
            tuple.append(Double.valueOf(value.substring(39, 46))
                    + (Double.valueOf(value.substring(46, 50)) / 10000));      // price
            tuple.append(value.substring(50, 51));   // stop stock
            tuple.append(value.substring(51, 53));   // correction
            tuple.append(value.substring(53, 69));   // sequence number
            tuple.append(value.substring(69, 70));   // source
            tuple.append(value.substring(70, 71));   // reporting facility
        }
    } catch (InterruptedException e) {
        System.err.println(e.getMessage());
        throw new IOException(e);
    }
    return tuple;
}
[/code]
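
Before deploying, it can be handy to sanity-check the fixed offsets without a cluster. The following standalone sketch is not part of the reader: it builds a synthetic 71-character trade record (all values invented) and extracts a few fields with the same substring offsets used in getNextValue():

[code language="java"]
// Standalone offset check; the record content is fabricated for illustration only.
public class OffsetCheck {
    public static void main(String[] args) {
        String value = "093001123"                        // time HHMMSSXXX (9c)
                     + "N"                                 // exchange (1c)
                     + String.format("%-16s", "IBM")       // symbol (16c)
                     + String.format("%-4s", "@")          // sale condition (4c)
                     + String.format("%09d", 100)          // volume (9c)
                     + String.format("%07d", 182)          // price, whole part (7c)
                     + String.format("%04d", 5000)         // price, decimal part (4c)
                     + " "                                 // stop stock (1c)
                     + "00"                                // correction (2c)
                     + String.format("%016d", 1L)          // sequence number (16c)
                     + "C"                                 // source (1c)
                     + "A";                                // reporting facility (1c)

        System.out.println("length = " + value.length());  // expect 71
        System.out.println("time   = " + value.substring(0, 2) + ":" + value.substring(2, 4)
                + ":" + value.substring(4, 6) + "." + value.substring(6, 9));        // 09:30:01.123
        System.out.println("symbol = " + value.substring(10, 26).trim());            // IBM
        System.out.println("volume = " + Integer.parseInt(value.substring(30, 39))); // 100
        System.out.println("price  = " + (Double.valueOf(value.substring(39, 46))
                + Double.valueOf(value.substring(46, 50)) / 10000));                 // 182.5
    }
}
[/code]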

The getResourceSchema() method creates the column headers and defines the column data types. Because BigSheets generates Pig Latin code, the data types must be Pig data types. Pig Latin has no date or time types, so the string type chararray is used for the TradeDate and TradeTime columns.

[code language="java" firstline="70"]
@Override
public ResourceSchema getResourceSchema(String arg0, Job arg1)
        throws IOException {
    Schema result = new Schema();
    if (includeDate) {
        result.add(new Schema.FieldSchema("TradeDate", DataType.CHARARRAY));       // From line 1, MMDDYYYY
    }
    result.add(new Schema.FieldSchema("TradeTime", DataType.CHARARRAY));           // 9c  HHMMSSXXX
    result.add(new Schema.FieldSchema("Exchange", DataType.CHARARRAY));            // 1c  A = NYSE AMEX, B = NASDAQ ...
    result.add(new Schema.FieldSchema("Symbol", DataType.CHARARRAY));              // 16c Security symbol, e.g. "IBM"
    result.add(new Schema.FieldSchema("SaleCondition", DataType.CHARARRAY));       // 4c
    result.add(new Schema.FieldSchema("TradeVolume", DataType.INTEGER));           // 9c
    result.add(new Schema.FieldSchema("TradePrice", DataType.DOUBLE));             // 11c 9999999.9999 (7+4)
    result.add(new Schema.FieldSchema("StopStock", DataType.CHARARRAY));           // 1c  blank/N/Y
    result.add(new Schema.FieldSchema("TradeCorrection", DataType.CHARARRAY));     // 2c  00=Regular, 01=Original ...
    result.add(new Schema.FieldSchema("TradeSequenceNum", DataType.CHARARRAY));    // 16c
    result.add(new Schema.FieldSchema("SourceofTrade", DataType.CHARARRAY));       // 1c  C/N
    result.add(new Schema.FieldSchema("TradeReporting", DataType.CHARARRAY));      // 1c  A=NYSE, B=NASDAQ ...
    return new ResourceSchema(result);
}
[/code]

This is all the Java coding for the new reader. The next step is to publish the code.

Publishing and deploying the new reader

When the Java code is ready, you can publish and deploy it. The code is published from Eclipse, and the reader is then deployed from the BigInsights Web Console.

Before publishing the new reader, ensure that you have a connection from Eclipse to the BigInsights cluster. In the BigInsights Servers window, right-click the server name and click Test Connection, or create a new connection if none exists. You must connect to the cluster as a user with Application Administration or System Administration rights in order to publish the reader.

[Figure: Testing the connection to the BigInsights cluster]

To publish the reader, start the BigInsights Application Publish wizard by right-clicking the project name in the Project Explorer. The wizard compiles the Java code into a JAR file named plugin.jar and creates the JSON file plugin.json, which describes the plug-in. For a full description of the JSON objects, see Creating BigSheets plug-ins.

On the Specify Location page in the wizard, select the BigInsights server to publish to and click Next.

[Figure: The Specify Location page]

On the Specify Application page, click Create New Application. In the Name field, enter a unique name (NYSEreader). In the Description field, enter text explaining the purpose of the reader. Leave the Icon field as is (icons are not used for readers).
Click Next.

Tip: You cannot replace a plug-in while it is deployed. Before replacing an existing plug-in with a new version, first undeploy it from the BigInsights Console (do not delete it).

[Figure: The Specify Application page]

On the Application Type page, click BigSheets. When you select BigSheets, the Workflow and Text Analytics pages of the wizard are skipped.
Click Next.

[Figure: The Application Type page]

On the BigSheets page, enter a name in the ID field that will be unique in the BigInsights applications list. You will get an error message if the name is already in use.
Click the New button to add an extension; this opens the Extension window.

[Figure: The BigSheets page of the publish wizard]

On the Add BigSheets extensions to the plug-in page, enter the name of the Java class or browse to find it in the project. In the ID field use the class name as well. This ID must be unique.

Click the New button to add an argument for the reader. The argument for the Daily Trades reader controls whether the trade date should be inserted as the first column. The argument's Name is shown when the reader is selected, so choose it carefully. Select the data type BOOLEAN.

Close the Arguments window to return to the Extensions page and click Next.

[Figure: The Extensions page with the reader argument]

On the Zip and Publish Application page, the full structure of the new reader application is shown. When you click Finish, a new compressed file named BIApp.zip is created in the project and sent to the BigInsights cluster.

[Figure: The Zip and Publish Application page]

Deploying the reader

When the reader has been published to the BigInsights cluster, it is marked as NOT DEPLOYED. To make the reader available for use in BigSheets, log in to the console and open the Applications page.

From the Run | Manage | Link menu, click Manage. To quickly find the reader, type a few characters in the search field, for example nyse.

In the Applications window on the right side, select the reader and click Deploy.

[Figure: Deploying the reader from the Applications page]

Using the reader

In the BigInsights Web Console, open the Files page and select the NYSE Daily Trades file that you want to analyze. On the right side, click the Sheet radio button. Click the Pencil button next to Line Reader and from the drop-down list select the NYSE_Daily_Trades reader.

[Figure: Selecting the new reader]

To insert the trading date as the first column in the sheet, select the check box. Click the green check mark button and you're done.

[Figure: The new reader in action]

Troubleshooting

If you have problems running your new reader in BigSheets, look in the log file. The log file is in /var/ibm/biginsights/sheets/logs (if BigInsights was installed with the default configuration).

plugin.json

The plugin.json file contains the information about the plug-in and its extensions. The file consists of a single JSON object containing a plugin object, which must be defined for the plug-in to be installed and recognized by the system.

Sometimes it is handy to edit the plugin.json file manually, because it gives you more options than the wizard provides. For example, you can give the reader name as a quoted string and thus use spaces in the name.

The extension element is an array of records, each describing one reader.

The arguments element in the impl object is used to pass arguments to the reader. Each argument consists of a name element, a desc (description) element, and a type element.

The type element can be STRING or BOOLEAN. When the reader is selected from the reader drop-down list, arguments of type STRING are shown as text boxes and BOOLEAN arguments are shown as check boxes. A defaultValue element can be used to supply a default value (true or false for a BOOLEAN argument).

Listing 1. The plugin.json file

[code language="javascript"]
{
  plugin: {
    pluginId: "com.ibm.bigsheets.nyseReaders",
    name: "NYSEreaders",
    version: "0.1.0.0",
    runtime: "",
    requires: [],
    extension: [
      {
        id: "NyseDailyTrades",
        type: "bigsheets.reader",
        name: "NYSE Daily Trades",
        desc: "NYSE Daily Trades reader. Provided by IBM",
        impl: {
          classname: "com.ibm.bigsheets.nyseReaders.NyseDailyTrades",
          readerType: "fixed",
          arguments: [
            { name: "Insert Trade Date as first column ?",
              desc: "Select the checkbox, if you want to add the trade date as the first column",
              type: "BOOLEAN",
              defaultValue: false }
          ],
          installpath: "",
          datapath: "*",
          schemaRequired: false
        }
      }
    ]
  }
}
[/code]

If you’re new to BigSheets, try the BigSheets tutorial to get started.
