RafieTarabay
Tags: Big Data and analytics
Published on September 5, 2018 / Updated on November 9, 2020
Skill Level: Any Skill Level
The speed at which data is generated, and the variety of forms it takes, make it difficult to manage and process with traditional databases. This is where big data technologies such as Hadoop come in. Big data matters for prediction, analysis, and better decision making.
The goal is to extract insight from a high volume, variety, and velocity of data in a timely and cost-effective manner:
Variety: Manage and benefit from diverse data types and data structures
Velocity: Analyze streaming data and large volumes of persistent data
Volume: Scale from terabytes to zettabytes
Traditional Approach (BI) vs Big Data Approach
- Traditional Approach (BI): [Structured & Repeatable Analysis] Business users determine what question to ask, then IT structures the data to answer that question
- Big Data Approach: [Iterative & Exploratory Analysis] IT delivers a platform that enables creative discovery, then business users explore what questions could be asked
What is Hadoop?
Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is at the center of a growing ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning applications. Hadoop can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing and analyzing data than relational databases and data warehouses provide.
Apache Hadoop is developed as part of an open source project.
Commercial distributions of Hadoop are currently offered by four primary vendors of big data platforms:
– Cloudera
– Hortonworks
– Amazon Web Services (AWS)
– MapR Technologies.
In addition, IBM, Google, Microsoft and other vendors offer cloud-based managed services that are built on top of Hadoop.
IBM, Microsoft's Azure HDInsight, and Pivotal (a Dell Technologies subsidiary) base their offerings on the Hortonworks platform, while Intel uses Cloudera.
The 4 Modules of Hadoop
1. Distributed File System (HDFS): allows data to be stored in an easily accessible format across a large number of linked storage devices
2. MapReduce: performs two basic operations: reading data and putting it into a format suitable for analysis (map), and carrying out aggregate calculations over it, e.g. counting the number of males aged 30+ in a customer database (reduce); see the word-count sketch after this list
3. Hadoop Common: provides the tools (in Java) needed for the user's computer systems (Windows, Unix, etc.) to read data stored under the Hadoop file system
4. YARN: manages the resources of the systems that store the data and run the analysis
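To make the map and reduce steps concrete, here is a minimal sketch that submits the word-count example bundled with Hadoop: the map step emits a (word, 1) pair for every word, and the reduce step sums the counts per word. The jar location and the HDFS paths are assumptions and differ between distributions.
# Put a local text file into HDFS as job input (paths are illustrative)
hadoop fs -mkdir -p /user/cloudera/wordcount/input
hadoop fs -put /tmp/sample.txt /user/cloudera/wordcount/input
# Submit the bundled word-count MapReduce job to YARN;
# the examples jar name and location vary by distribution/version
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount \
  /user/cloudera/wordcount/input /user/cloudera/wordcount/output
# Inspect the reduced (aggregated) output written by the reducers
hadoop fs -cat /user/cloudera/wordcount/output/part-r-00000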
Advantages and disadvantages of Hadoop
Hadoop is good for:
- processing massive amounts of data through parallelism
- handling a variety of data (structured, unstructured, semi-structured)
- using inexpensive commodity hardware
Hadoop is not good for:
- processing transactions (random access)
- when work cannot be parallelized
- fast access to data
- processing lots of small files
- intensive calculations with small amounts of data
What hardware is not used for Hadoop?
- RAID
- Linux Logical Volume Manager (LVM)
- Solid-state disk (SSD)
Big data tools associated with Hadoop
Apache Flume: a tool used to collect, aggregate and move huge amounts of streaming data into HDFS
Apache HBase: a distributed database that is often paired with Hadoop (a quick shell sketch follows this list)
Apache Hive: an SQL-on-Hadoop tool that provides data summarization, query and analysis
Apache Oozie: a server-based workflow scheduling system to manage Hadoop jobs
Apache Phoenix: an SQL-based massively parallel processing (MPP) database engine that uses HBase as its data store
Apache Pig: a high-level platform for creating programs that run on Hadoop clusters
Apache Sqoop: a tool to help transfer bulk data between Hadoop and structured data stores, such as relational databases
Apache ZooKeeper: a configuration, synchronization and naming registry service for large distributed systems.
Apache Solr: an enterprise search engine
Apache Spark: an in-memory processing framework and alternative to MapReduce; supports streaming, interactive queries, and machine learning
Apache Kafka: a distributed publish-subscribe messaging system and robust queue that can handle high volumes of data
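As a small, hedged taste of one of these tools, the HBase shell sketch below creates a table, writes a couple of cells, and reads them back; the table name 'customers' and the column family 'info' are made up for illustration.
# Launch the HBase shell and feed it a few basic commands
# (table and column-family names are illustrative only)
hbase shell <<'EOF'
create 'customers', 'info'
put 'customers', 'row1', 'info:name', 'Alice'
put 'customers', 'row1', 'info:age', '31'
get 'customers', 'row1'
scan 'customers'
EOF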
GUI tools to manage Hadoop
- Ambari: developed by Hortonworks.
- HUE: developed by Cloudera
What is IBM Big SQL?
– Industry-standard SQL query interface for BigInsights data
– New Hadoop query engine derived from decades of IBM R&D investment in RDBMS technology, including database parallelism and query optimization
Why Big SQL?
– Easy on-ramp to Hadoop for SQL professionals
– Support familiar SQL tools / applications (via JDBC and ODBC drivers)
What operations are supported?
– Create tables / views. Store data in DFS, HBase, or Hive warehouse
– Load data into tables (from local files, remote files, RDBMSs)
– Query data (project, restrict, join, union, a wide range of sub-queries, built-in functions, UDFs, etc.)
– GRANT / REVOKE privileges, create roles, create column masks and row permissions
– Transparently join / union data between Hadoop and RDBMSs in single query
– Collect statistics and inspect detailed data access plan
– Establish workload management controls
– Monitor Big SQL usage
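As a hedged sketch of what such statements can look like, the snippet below writes a small Big SQL script to a file; the table name, columns, and values are made up for illustration, and the script should be run with whatever SQL client you use against Big SQL (for example JSqsh, which ships with it, or any JDBC tool).
# Write an illustrative Big SQL script to a file; CREATE HADOOP TABLE is the
# Big SQL statement for tables whose data lives in the Hadoop file system.
cat > bigsql_demo.sql <<'EOF'
CREATE HADOOP TABLE sales (
  sale_id INT,
  region  VARCHAR(20),
  amount  DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

INSERT INTO sales VALUES (1, 'EMEA', 120.50), (2, 'APAC', 75.00);

SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region;
EOF
# Run bigsql_demo.sql from your Big SQL client (JSqsh or any JDBC-based tool).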
IBM Big SQL Main Features
1- Comprehensive, standard SQL
– SELECT: joins, unions, aggregates, subqueries
– GRANT/REVOKE, INSERT … INTO
– PL/SQL
– Stored procs, user-defined functions
– IBM data server JDBC and ODBC drivers
2- Optimization and performance
– IBM MPP engine (C++) replaces Java MapReduce layer
– Continuously running daemons (no start-up latency)
– Message passing allows data to flow between nodes without persisting intermediate results
– In-memory operations with ability to spill to disk (useful for aggregations that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules
3- Various storage formats supported
– Text (delimited), Sequence, RCFile, ORC, Avro, Parquet
– Data persisted in DFS, Hive, HBase
– No IBM proprietary format required
4- Integration with RDBMSs via LOAD, query federation
Why choose Big SQL instead of Hive and other vendors?
IBM Spectrum Scale (also known as "GPFS FPO")
What is expected from IBM Spectrum Scale?
• high scale, high performance, high availability, data integrity
• same data accessible from different computers
• logical isolation: filesets are separate filesystems inside a filesystem
• physical isolation: filesets can be put in separate storage pools
• enterprise features (quotas, security, ACLs, snapshots, etc.)
HDFS vs GPFS Commands
1) Copy File
– HDFS: hadoop fs -copyFromLocal /local/source/path /hdfs/target/path
– GPFS: cp /source/path /target/path
2) Move File
– HDFS: hadoop fs -mv path1/ path2/
– GPFS: mv path1/ path2/
3) Compare Files
– HDFS: diff <(hadoop fs -cat file1) <(hadoop fs -cat file2)
– GPFS: diff file1 file2
HDFS vs IBM Spectrum Scale
Log in to IBM Cloud (Bluemix) at https://console.bluemix.net/ and search for Hadoop.
You will get two plans, Lite (free) and Subscription (paid) -> choose "Analytics Engine".
Ambari (GUI tool to manage Hadoop)
Ambari and Ambari Views are developed by Hortonworks.
Ambari is a GUI tool you can use to create (install) and manage an entire Hadoop cluster. You can keep expanding the cluster by adding nodes, and monitor its health, space utilization, etc. through Ambari.
Ambari Views, by contrast, help users work with the installed components/services such as Hive, Pig, and the Capacity Scheduler: viewing cluster load, managing YARN workloads, provisioning cluster resources, managing files, and so on.
Download the Cloudera QuickStart VM from https://www.cloudera.com/downloads/quickstart_vms/5-13.html
How to use HUE
Example 1
Copy a file from the local disk to HDFS
Using command line
hadoop fs -put /local/path/temperature.csv /hdfs/path/temp
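A quick, optional check that the copy landed where expected (paths follow the placeholders above):
# List the target HDFS directory and peek at the first lines of the copied file
# (if the target directory already exists, the file lands inside it)
hadoop fs -ls /hdfs/path/temp
hadoop fs -cat /hdfs/path/temp/temperature.csv | head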
Using HUE GUI
Example 2
Use Sqoop to move MySQL database tables into the Hadoop file system under the Hive warehouse directory
> sqoop import-all-tables \
-m 1 \
--connect jdbc:mysql://localhost:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--compression-codec=snappy \
--as-parquetfile \
--warehouse-dir=/user/hive/warehouse \
--hive-import
Parameter descriptions
- The -m parameter sets the number of parallel mappers (here 1, so one .parquet file per table)
- /user/hive/warehouse is the default Hive warehouse path
- To view the tables after the import: hadoop fs -ls /user/hive/warehouse/
- To get the actual Hive warehouse path, open a terminal, type hive, then run: set hive.metastore.warehouse.dir;
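Once the import finishes, a quick hedged check from the terminal that Hive can see the data; the table name categories comes from the retail_db sample database on the Cloudera QuickStart VM and is an assumption here.
# Confirm Hive registered the imported tables and run a quick row count
# ('categories' is one of the retail_db tables on the Cloudera QuickStart VM)
hive -e "SHOW TABLES;"
hive -e "SELECT COUNT(*) FROM categories;"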
Example 3
Working with Hive tables
Task 1: To view current Hive tables
show tables;
Task 2: Run an SQL query on the Hive tables (top 10 categories by number of order items)
select c.category_name, count(order_item_quantity) as count
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name
order by count desc
limit 10;
Task 3: Run an SQL query on the Hive tables (top 10 products by revenue)
select p.product_id, p.product_name, r.revenue from products p inner join
(
select oi.order_item_product_id, sum(cast(oi.order_item_subtotal as float)) as revenue
from order_items oi inner join orders o
on oi.order_item_order_id = o.order_id where o.order_status <> 'CANCELED' and o.order_status <> 'SUSPECTED_FRAUD'
group by order_item_product_id
) r
on p.product_id = r.order_item_product_id
order by r.revenue desc
limit 10;
Example 4
Copy the temperature.csv file from the local disk to a new HDFS directory "temp", then expose that file through a new Hive table
hadoop fs -mkdir -p /user/cloudera/temp
hadoop fs -put /var/www/html/temperature.csv /user/cloudera/temp
Create a Hive table based on the CSV file
hive> Create database weather;
CREATE EXTERNAL TABLE IF NOT EXISTS weather.temperature (
place STRING COMMENT 'place',
year INT COMMENT 'Year',
month STRING COMMENT 'Month',
temp FLOAT COMMENT 'temperature')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/cloudera/temp/';
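A quick hedged check that the external table reads the CSV as expected (the aggregate query assumes the columns defined above; actual values depend on your temperature.csv):
# Peek at a few rows of the new external table
hive -e "SELECT * FROM weather.temperature LIMIT 5;"
# Simple aggregate over the same data (columns as defined above)
hive -e "SELECT place, AVG(temp) AS avg_temp FROM weather.temperature GROUP BY place LIMIT 10;"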
Example 5
Using HUE, create an Oozie workflow to move data from MySQL/CSV files to Hive
Step 1: get the virtual machine IP using ifconfig
Step 2: navigate to http://IP:8888 to get the HUE login screen (cloudera/cloudera)
Step 3: Open Oozie: Workflows > Editors > Workflows, then click the "Create" button
A simple Oozie workflow then consists of three actions:
- delete the target HDFS folder
- copy the MySQL table to HDFS as a text file
- create a Hive table based on this text file
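The same workflow that HUE builds can also be submitted from the Oozie command line; the sketch below is a minimal, hedged example in which the host names, ports, property values, and the HDFS application path are assumptions based on the Cloudera QuickStart VM.
# job.properties for the workflow; all values are illustrative and cluster-specific
cat > job.properties <<'EOF'
nameNode=hdfs://quickstart.cloudera:8020
jobTracker=quickstart.cloudera:8032
oozie.use.system.libpath=true
oozie.libpath=/user/oozie/share/lib/hive
oozie.wf.application.path=${nameNode}/user/cloudera/workflows/mysql_to_hive
EOF
# Submit and start the workflow, then list recent workflow jobs
oozie job -oozie http://quickstart.cloudera:11000/oozie -config job.properties -run
oozie jobs -oozie http://quickstart.cloudera:11000/oozie -jobtype wf -len 5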
Example 6
Create a schedule to run the workflow
Steps: Workflows > Editors > Coordinators
Notes:
Setup workflow settings
- A workflow can contain variables
- To define a new variable, use ${Variable}
- Sometimes you need to define the Hive libpath in HUE for Hive actions to work:
oozie.libpath : /user/oozie/share/lib/hive
Data representation formats used for big data
Common data representation formats used for big data include:
- Row- or record-based encodings:
−Flat files / text files
−CSV and delimited files
−Avro / SequenceFile
−JSON
−Other formats: XML, YAML
- Column-based storage formats:
−RC / ORC file
−Parquet
- NoSQL Database
What is Parquet, RC/ORC file formats, and Avro?
1) Parquet
- a columnar storage format
- allows compression schemes to be specified on a per-column level
- offers better write performance by storing metadata at the end of the file
- provides the best results in benchmark performance tests
2) RC/ORC file formats
- developed to support Hive; use a columnar storage format
- provide basic statistics such as min, max, sum, and count on columns
3) Avro
- Avro data files are a compact, efficient, row-oriented binary format that supports schema evolution
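To see how a storage format is chosen in practice, here is a hedged Hive sketch that stores the same toy table as text, Parquet, and ORC; the table and column names are made up, and INSERT ... VALUES assumes Hive 0.14 or later.
# Create one table per storage format and copy rows between them
# (table/column names are illustrative)
hive -e "
CREATE TABLE events_text    (id INT, payload STRING) STORED AS TEXTFILE;
CREATE TABLE events_parquet (id INT, payload STRING) STORED AS PARQUET;
CREATE TABLE events_orc     (id INT, payload STRING) STORED AS ORC;

INSERT INTO events_text VALUES (1, 'login'), (2, 'logout');
INSERT INTO events_parquet SELECT * FROM events_text;
INSERT INTO events_orc     SELECT * FROM events_text;
"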
What are NoSQL databases?
NoSQL is a different way of handling a variety of data. A NoSQL database can handle millions of queries per second, while a typical RDBMS handles only thousands of queries per second, and both are subject to the CAP theorem.
Types of NoSQL datastores:
• Key-value stores: MemCacheD, REDIS, and Riak
• Column stores: HBase and Cassandra
• Document stores: MongoDB, CouchDB, Cloudant, and MarkLogic
• Graph stores: Neo4j and Sesame
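As a tiny, hedged illustration of the key-value model, the redis-cli sketch below assumes a Redis server running on the default local port; the key names are made up.
# Store, update, and read back a counter
redis-cli SET page:home:hits 42
redis-cli INCR page:home:hits
redis-cli GET page:home:hits
# Store a small user profile as a hash and read it back
redis-cli HSET user:1001 name "Alice"
redis-cli HSET user:1001 city "Cairo"
redis-cli HGETALL user:1001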
CAP Theorem
CAP Theorem states that in the presence of a network partition, one has to choose between consistency and availability.
- Consistency: every read receives the most recent write or an error
- Availability: every request receives a (non-error) response, without a guarantee that it contains the most recent write
How Famous Databases align with CAP Theorem?
- HBase and MongoDB -> CP [give consistency but not availability]
- Cassandra and CouchDB -> AP [give availability but not consistency]
- Traditional relational DBMSs are CA [they support consistency and availability but do not tolerate network partitions]