1. What is IBM BigInsights?
IBM® BigInsights™ is a hardware-agnostic software platform for storing and analyzing large-scale data collections. BigInsights includes IBM Open Platform with Apache Hadoop, along with tools and value-add services to get you started quickly and to simplify development and maintenance.
IBM Open Platform with Apache Hadoop is composed of 100% open source components for use in big data analysis. This product offering includes open source components such as Ambari, Hadoop, YARN, Hive, HBase, Knox, Avro, Flume, Pig, Slider, Sqoop, ZooKeeper, Oozie, Nagios, and more. IBM Open Platform has incorporated the most recent releases across all of the components of Enterprise Hadoop, including Hadoop 2.6.
The following list shows the value-add services that are available.
- BigInsights Business Analyst Module contains Big SQL, BigSheets, and the BigInsights Home page services.
- BigInsights Data Scientist Module adds Text Analytics, Big R, and Machine Learning for Big R to the Analyst module.
- BigInsights Enterprise Management Module provides a set of management tools.
2. What open source technologies are supported within BigInsights?
BigInsights supports current releases of a broad set of open source technologies. For a complete list of supported open source components and versions, see the documentation.
5. Where can I find product documentation for BigInsights?
Product documentation is now available in IBM Knowledge Center. IBM Knowledge Center includes the documentation for all IBM products and all supported releases of those products.
Big SQL FAQs
1. What is Big SQL?
Big SQL is a massively parallel processing (MPP) SQL engine that runs on Apache Hadoop to achieve vastly improved performance and SQL execution breadth over other SQL-on-Hadoop offerings. Big SQL delivers:
- Improved performance
Internal tests indicate that Big SQL returns results 20x faster than Apache Hive on average, with individual queries running 10-70x faster.
- Comprehensive SQL-on-Hadoop support
Big SQL successfully runs ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification.
- Enterprise-wide access to data
Big SQL enables seamless inclusion of Hadoop in your enterprise data environment. Its federated access supports multiple data sources, such as DB2, Oracle, and Teradata, within a single SQL statement.
- Support for existing business analytics tools
Big SQL enables simple extension of business analytics tools, such as IBM Cognos Business Intelligence, MicroStrategy, or Tableau, to Hadoop.
- Enhanced data security
Big SQL delivers authorization capabilities typically found in an RDBMS environment. It provides full GRANT/REVOKE-based security and group- and role-based hierarchical security, and it enables row and column access control for fine-grained control.
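The federated access described above can be sketched as follows. This is a minimal example with hypothetical names: web_logs is assumed to be a local Hadoop table, and ora_customers is assumed to be a nickname already defined for a table in a remote Oracle database.

```sql
-- Hypothetical names: web_logs is a local Hadoop table; ora_customers is a
-- nickname previously defined for a table in a remote Oracle database.
-- A single SQL statement joins data from both sources.
SELECT c.name, COUNT(*) AS visits
FROM web_logs w
JOIN ora_customers c
    ON w.cust_id = c.cust_id
GROUP BY c.name;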
2. What is the difference between Big SQL and Hive?
Big SQL provides the following support over Hive:
- More comprehensive SQL support (see below for details)
- Federated queries
- Statistics-driven optimization and query planning
- Automatic memory tuning for concurrency and parallelism
- Workload management for fault tolerance and cluster elasticity
Big SQL's support for query-related SQL:
- Subquery usage in FROM, WHERE and HAVING clauses
- Table expressions using the WITH..AS syntax
- Join operations in the WHERE clause that result in a left or right outer join
- Query hints for tables, subqueries and joins
- Full support for set operators UNION, EXCEPT, and INTERSECT, with or without the ALL keyword
- Full support for the DISTINCT keyword, including in the SELECT statement, in subqueries, and with functions such as COUNT
- Comprehensive sort support with NULLS FIRST and NULLS LAST
- More than 200 built-in scalar functions and 25 aggregate functions (for example, Ratio-to-Report)
- Stored procedures
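A single query can exercise several of these features at once. The following sketch, written against hypothetical sales and archived_sales tables, combines a WITH ... AS table expression, a subquery in the WHERE clause, a set operator, DISTINCT, and NULLS LAST ordering:

```sql
-- Hypothetical tables: sales(region, amount), archived_sales(region)
WITH region_totals AS (              -- table expression with WITH..AS
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
)
SELECT region, total
FROM region_totals
WHERE total > (SELECT AVG(total)     -- subquery in the WHERE clause
               FROM region_totals)
UNION ALL                            -- set operator with ALL
SELECT DISTINCT region,              -- DISTINCT in a branch of the union
       CAST(NULL AS DECIMAL(18,2))
FROM archived_sales
ORDER BY total DESC NULLS LAST;      -- NULLS LAST ordering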
3. What is the difference between Big SQL and Impala?
Primary differences between Big SQL and Impala are as follows:
- Comprehensive SQL support: Big SQL provides more comprehensive SQL support, satisfying SQL2003, whereas Impala supports only a subset of SQL92. Big SQL successfully runs ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification.
- Guaranteed execution for complex queries: Impala requires joined tables to fit into the aggregated memory of the data nodes, which is a significant limitation for IT organizations that support conventional BI tools. Big SQL provides automatic memory tuning for concurrency and parallelism.
- Federated queries: Big SQL handles queries across multiple data sources; Impala does not.
- Performance tuning: Big SQL supports query hints for tables, subqueries, and joins, as well as statistics-driven optimization and query planning, leveraging IBM's expertise in database technology.
4. What is the difference between Big SQL and SQL for relational databases?
Big SQL supports SQL2003, which is a subset of SQL2011. In addition, some SQL2003 data types and constructs may be limited by how Hive 0.12 represents metadata when querying Hadoop-based data.
For query-related SQL, Big SQL does not support the following features:
- Data types: spatial or geographic, BLOBs, JSON
- Functions: Time interval, percentile, quartile, Name/Value Pair related
- Hive indexes: see the related FAQ "Can Big SQL use Hive indexes?"
5. Can Big SQL use Hive Indexes?
Big SQL does not currently support indexes on Hadoop tables. You can, however, define unique and referential constraints on Hadoop tables and columns; these constraints are not enforced, but they can help the optimizer determine the best access plan.
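As a sketch, assuming a hypothetical orders table that references a customers table, such informational constraints can be declared with a NOT ENFORCED clause so that the optimizer can exploit them without Big SQL checking them:

```sql
-- Hypothetical tables; NOT ENFORCED marks each constraint as informational:
-- Big SQL does not verify it, but the optimizer may use it for planning.
ALTER TABLE orders
    ADD CONSTRAINT pk_orders PRIMARY KEY (order_id) NOT ENFORCED;

ALTER TABLE orders
    ADD CONSTRAINT fk_orders_cust FOREIGN KEY (cust_id)
        REFERENCES customers (cust_id) NOT ENFORCED;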
6. Does Big SQL run on other Hadoop distributions such as Cloudera or Hortonworks?
No. Big SQL is part of IBM's Hadoop distribution, BigInsights.
7. Which data sources can Big SQL access?
Supported data sources:
- IBM BigInsights
- IBM PureData System: IBM PureData System for Analytics, IBM PureData System for Operational Analytics
For details about supported versions, view the system requirements for BigInsights.
Big R FAQs
1. What is Big R?
Big R is a library of functions that provides end-to-end integration with the R language and BigInsights. By using Big R, you can use R as a query language for big data hosted on a BigInsights cluster. Learn more about Big R in Tutorial: Analyzing big data with IBM InfoSphere BigInsights Big R.
2. What are the prerequisites to run Big R?
- The R and Big R packages must be installed on the BigInsights console node and on every data node. Learn more about R and Big R installation in Installing the Big R service.
- Big R connects R users to BigInsights through a Big SQL JDBC connection. The BigInsights Hadoop services and the Big SQL server must be up and running before you use Big R.
3. What is the Big R package?
The Big R package is an R package that provides the classes and functions for R users to explore, manipulate, analyze, and visualize big data residing in the BigInsights cluster.
4. Are there any dependencies for Big R package?
Yes. The Big R package has three dependencies, including rJava. Users should install these three R packages before they install the Big R package.
5. What is a bigr.frame?
bigr.frame is Big R's proxy to the big data that resides on the BigInsights cluster. It is an abstraction of big data with structured column names and column type information, similar to an R data.frame. Users can perform many operations on a bigr.frame with R language syntax while the data remains on the cluster.
6. Where can I find error logs if my R query fails?
For most R operations, the error message is usually output directly to the screen. If users run the groupApply, rowApply, or tableApply functions, the R error message is captured in HDFS files that are saved on the cluster. Users can run bigr.logs() to view these logs.
7. When I run the groupApply function, the error says that R cannot be found. How can I fix this?
If R is installed in a path that is not in the PATH environment variable of the BigInsights Hadoop or Big SQL server, this error can occur. Before rerunning the query, users can issue
bigr.set.server.option('R_HOME', ''), supplying the R installation path as the second argument, to set the R installation path manually.
8. What diagnostic information should I provide when I contact IBM for help?
If something goes wrong with Big R and users' queries fail, users can run the bigr:::bigr.debug() API to turn on the debug option. This option shows more diagnostic information that can help IBM Support pinpoint the issue.
Text Analytics FAQs
1. What is BigInsights Text Analytics?
Text Analytics is a powerful mechanism for extracting structured data from unstructured or semi-structured text. The Text Analytics framework includes an all-new web-based visual tool for creating and running extractors on your input documents in supported formats. You can also continue to write your own extraction programs in AQL, the same language that is used to build the pre-built extractors.
2. What is Annotation Query Language (AQL)?
Annotation Query Language (AQL) is a query language used in IBM BigInsights to build extractors that extract structured information from unstructured or semi-structured content. You can write your own AQL, or use the new web tool to create extractors and export the AQL.
3. What are Text Analytics extractors?
Extractors are programs that extract structured information from unstructured or semi-structured text by using AQL constructs.
An extractor consists of compiled modules, or Text Analytics module (TAM) files, and content for external dictionary and table artifacts. At a high level, the extractor can be regarded as a collection of views, each of which defines a relationship. Some of these views are designated as output views, while others are non-output views. In addition, there is a special view called Document. This view represents the document that is being annotated. Furthermore, the extractor might also have external views whose content can be customized at run time with extra metadata about the document that is being annotated.
4. What are extractor modules?
Text Analytics modules are self-contained Text Analytics packages. They contain a set of text extraction rules that are created by using AQL, and other resources that are required for text extraction.
5. What are AQL files?
AQL files are text files that contain text extraction rules that are written in the AQL programming language. A module contains one or more AQL files.
6. How do you develop Text Analytics extractors?
You can develop Text Analytics extractors in the new web tool. After you get the Data Scientist Module or the Enterprise Management Module and install the Text Analytics service on IBM Open Platform with Apache Hadoop, navigate to the BigInsights home page and launch the new web tool. Inside this tool, you can create text extraction projects. You can use any of the provided extractors or customize them to your needs. Load sample documents into the tool so that you can run your extractor as you develop it and refine it to get the results that you want. Save your built extractor and run it against documents stored in HDFS.
7. Are there any pre-built Text Analytics extractor libraries?
Yes, BigInsights includes pre-built extractor libraries that you can use to extract a fixed set of entities. You can run these pre-built extractors as-is or extend them in the web tool. You can also code them yourself with AQL.
BigSheets FAQs
1. What is BigSheets?
BigSheets is a browser-based analytic tool that can be used to break large amounts of unstructured data into consumable, situation-specific business contexts. In BigSheets, you work with collections of data in master workbooks, child workbooks, and sheets.
2. What is a Workbook in BigSheets?
Workbooks contain a set of data from one or more master or child workbooks. You can create a workbook to save a particular set of data results and then tailor the format, content, and structure of those results to refine and explore only the data that is pertinent to your business questions. You can visualize workbook data in a map or a chart.
3. What is a sheet in a workbook?
Workbooks can have one or more sheets. Sheets within workbooks are representations of data that apply a different function to analyze and view subsets of the data. Each row in a sheet represents a record of data and each column represents a property of the record. Add sheets to workbooks to progressively edit and explore the data.
4. How do you administer BigSheets by using the REST API?
You can use the BigSheets REST APIs to complete BigSheets functions, such as creating and running workbooks.