Understanding the Data Warehouse
A business usually maintains at least two types of databases â€” an operational database that stores all the records of daily transactions, and a data warehouse that comprises of historical data. The operational data acquired passes through an operational data store and undergoes data extraction, transformation, loading and is processed through certain additional layers of data cleansing.
Once the data is integrated and transformed, it is then stored in a data warehouse and later into data vaults which are all just relational databases. The data is structured into hierarchical groups often known as dimensions and into aggregate facts. Dimension helps to categorize and catalog the data and the common dimensions include users, products etc. Facts are concerned with sales and profit on a daily basis. The high-quality data is then used by business professionals for data mining, analytical research, generation of reports, market research and business decision making.
Basics of Data Warehouse Architecture
The data warehouse architecture has been ever evolving based on changing business requirements. Modern data warehouses use a hybrid approach that comprises of multiple cloud and analytic services that make up the data warehouse architecture. However, the basic concept revolving around the architecture has stayed the same.
For a medium-sized organization, the data warehouse should comprise of the following layers:
- Data Sources: The data is derived from several independent sources such as sales, marketing, customer expenditure etc.
- Staging Area: The data is cleansed and all the unessentialÂ information is filtered out at this stage. A staging area helps in building metadata and summaries that can be useful for the warehouse management.
- Data warehouse: The traditional OLTP consists of metadata and raw data. In addition, there is usually an additional type of data called summary data that helps to precompute some of the common operations in advance.
- Data mart: The data from the warehouse is loaded into individual data marts. Each data mart is focused on a single subject or a particular domain. Data marts are organized and maintained by a department who has aÂ particular interest in that sector.
- End-user: The end user makes use of the Datamart technology to create reports and graphs for data mining and making strategic decisions.
Developing a Traditional Data Warehouse
In a traditional sense, data warehouses were built using Oracle or a similar relational database solution and everything else revolved around it.
Here are the steps that you need to develop a data warehouse on your own.
- Identify and understand the requirements: The key business strategists need to be interviewed to understand and identify the requirements. To have better insights into how an organization wants to analyze the business data.
- Define the data sources: The bigger the size of the business, the bigger the data source. Identify the essential data sources based on the requirements that youâ€™ve analyzed in step 1. The data sources should match the processes that you need to implement in the data warehouse.
- Designing a Dimensional Model: Dimensional models are important to make the data more accessible. A dimensional should make it easier to query data, it should be extensible and support OLAP cubes. Since Data warehouses are subject oriented, dimensions help to build a master table comprising of non-overlapping data elements. This helps you to filter and group data easier. The dimension for a sales business would include, product, user, Store, date, time etc.
- Developing the Relational Schema: This is the stage where the dimensions are actualized and connected to fact tables using a relational database model. There are many schema models that you can use. The popular ones include â€” Star Schema, Snow Flake Schema, Star Flake Schema, Distributed Star Schema, etc.
Modern Data Warehousing Techniques
Modern DWs require petabytes of storage and more optimized techniques to run complex analytic queries. The traditional methods are not cost effective and relatively less efficient to fit into the modern day data warehousing needs. There are tons of cloud solutions that can help you build data warehouses that are performance optimized, inexpensive, and support parallel query execution.
Here are some of the popular and reliable options for data warehousing.
- Amazon Redshift
- Oracle 12c Database
- IBM db2 Warehouse on Cloud
- Google Cloudâ€™s BigQuery
To automate certain aspects of the process such as data warehousing, data analysis,Â and data management â€” there are a few cloud vendors likeÂ Panoply, Periscope Data,Â Tableau etc. that workÂ with the underlying data warehousing stack. Personally, I’ve found these tools to be useful to easily warehouse, analyze, visualize and share data across a business.
Choosing the Right DW Strategy
Some of the platforms listed above have been around for long whereas others are new entrants into data warehousing. Amazon Redshift is reliable, fast and is a part of Amazon Web services. Itâ€™s built on top of the Massive Parallel Processing (MPP) technology to support the high-performance requirements of an organization. Googleâ€™s BigQuery is also an excellent replacement for traditional warehouses. If you need to crunch petabytes of data in a very limited time, BigQuery might be a good choice.
Teradata has been the market leader in the DW space for a long time. Teradata has an enterprise version of data warehouse tools that lets businesses generate robust analytics in real-time. Oracle offers a state-of-the-art solution for data warehousing through Oracle 12c Database and Oracle Exadata machine. Itâ€™s one of the traditional methods for building a data warehouse thatâ€™s still popular today.
IBMâ€™s db2 warehouse on Cloud,Â formerly known as (DashDB) is performance scalable solution for managing a database in the cloud that can take heavy analytic workload.
When a business starts to grow, itâ€™s essential to design and develop an analytics system to make strategic business decisions. In this post, Iâ€™ve discussed the data warehousing architecture thatâ€™s employed to keep track of historical data. Iâ€™ve also covered some of the popular data warehousing platforms that are built for enterprises.