Overview

Skill Level: Intermediate

This post covers modern-day solutions for implementing data warehouses.

Ingredients

A data warehouse is a relational database that stores information collected from multiple sources. Unlike a traditional database that is used for processing transactions, a warehouse is used for data analysis, real-time reporting and decision making. From a business perspective, data warehousing helps to structure historical transaction data and use that to generate high-quality information. This information could be further used to improve the business, the customer satisfaction, and the overall efficiency.

Step-by-step

  1. Understanding the Data Warehouse

    A business usually maintains at least two types of databases — an operational database that stores all the records of daily transactions, and a data warehouse that comprises of historical data. The operational data acquired passes through an operational data store and undergoes data extraction, transformation, loading and is processed through certain additional layers of data cleansing.

     

    Data_warehouse_overview

    Once the data is integrated and transformed, it is then stored in a data warehouse and later into data vaults which are all just relational databases. The data is structured into hierarchical groups often known as dimensions and into aggregate facts. Dimension helps to categorize and catalog the data and the common dimensions include users, products etc. Facts are concerned with sales and profit on a daily basis. The high-quality data is then used by business professionals for data mining, analytical research, generation of reports, market research and business decision making.

  2. Basics of Data Warehouse Architecture

    The data warehouse architecture has been ever evolving based on changing business requirements. Modern data warehouses use a hybrid approach that comprises of multiple cloud and analytic services that make up the data warehouse architecture. However, the basic concept revolving around the architecture has stayed the same.

    Datawarehouse

     

    For a medium-sized organization, the data warehouse should comprise of the following layers:

    1. Data Sources: The data is derived from several independent sources such as sales, marketing, customer expenditure etc.
    2. Staging Area: The data is cleansed and all the unessential information is filtered out at this stage. A staging area helps in building metadata and summaries that can be useful for the warehouse management.
    3. Data warehouse: The traditional OLTP consists of metadata and raw data. In addition, there is usually an additional type of data called summary data that helps to precompute some of the common operations in advance.
    4. Data mart: The data from the warehouse is loaded into individual data marts. Each data mart is focused on a single subject or a particular domain. Data marts are organized and maintained by a department who has a particular interest in that sector.
    5. End-user: The end user makes use of the Datamart technology to create reports and graphs for data mining and making strategic decisions.
  3. Developing a Traditional Data Warehouse

    In a traditional sense, data warehouses were built using Oracle or a similar relational database solution and everything else revolved around it.

    Here are the steps that you need to develop a data warehouse on your own.

    1. Identify and understand the requirements: The key business strategists need to be interviewed to understand and identify the requirements. To have better insights into how an organization wants to analyze the business data.
    2. Define the data sources: The bigger the size of the business, the bigger the data source. Identify the essential data sources based on the requirements that you’ve analyzed in step 1. The data sources should match the processes that you need to implement in the data warehouse.
    3. Designing a Dimensional Model: Dimensional models are important to make the data more accessible. A dimensional should make it easier to query data, it should be extensible and support OLAP cubes. Since Data warehouses are subject oriented, dimensions help to build a master table comprising of non-overlapping data elements. This helps you to filter and group data easier. The dimension for a sales business would include, product, user, Store, date, time etc.
    4. Developing the Relational Schema: This is the stage where the dimensions are actualized and connected to fact tables using a relational database model. There are many schema models that you can use. The popular ones include — Star Schema, Snow Flake Schema, Star Flake Schema, Distributed Star Schema, etc.
  4. Modern Data Warehousing Techniques

    Modern DWs require petabytes of storage and more optimized techniques to run complex analytic queries. The traditional methods are not cost effective and relatively less efficient to fit into the modern day data warehousing needs. There are tons of cloud solutions that can help you build data warehouses that are performance optimized, inexpensive, and support parallel query execution.

    Here are some of the popular and reliable options for data warehousing.

    1. Amazon Redshift
    2. TeraData
    3. Oracle 12c Database
    4. IBM db2 Warehouse on Cloud
    5. Google Cloud’s BigQuery

     

    To automate certain aspects of the process such as data warehousing, data analysis, and data management — there are a few cloud vendors like Panoply, Periscope Data, Tableau etc. that work with the underlying data warehousing stack. Personally, I’ve found these tools to be useful to easily warehouse, analyze, visualize and share data across a business.

  5. Choosing the Right DW Strategy

    Some of the platforms listed above have been around for long whereas others are new entrants into data warehousing. Amazon Redshift is reliable, fast and is a part of Amazon Web services. It’s built on top of the Massive Parallel Processing (MPP) technology to support the high-performance requirements of an organization. Google’s BigQuery is also an excellent replacement for traditional warehouses. If you need to crunch petabytes of data in a very limited time, BigQuery might be a good choice.

    Teradata has been the market leader in the DW space for a long time. Teradata has an enterprise version of data warehouse tools that lets businesses generate robust analytics in real-time. Oracle offers a state-of-the-art solution for data warehousing through Oracle 12c Database and Oracle Exadata machine. It’s one of the traditional methods for building a data warehouse that’s still popular today.

    IBM’s db2 warehouse on Cloud, formerly known as (DashDB) is performance scalable solution for managing a database in the cloud that can take heavy analytic workload.

  6. Summary

    When a business starts to grow, it’s essential to design and develop an analytics system to make strategic business decisions. In this post, I’ve discussed the data warehousing architecture that’s employed to keep track of historical data. I’ve also covered some of the popular data warehousing platforms that are built for enterprises.

Join The Discussion