Understanding Data Warehousing
A data warehouse is an essential core element for enterprise data infrastructure systems. It helps to centralize and organize an ever increasing amount of consolidated data, such as historical data, log files, and transaction data. Data Warehouses are technically different from Data Lakes and Data Marts, however, conceptually, they form a stream of data towards the end user. Data lakes generally hold the largest datasets in unstructured format, think Big Data, which then can be used as source data for a data warehouse. For further data refinement, data from data warehouses can then be moved into data marts. These data marts are designed to serve a specific function or department, like sales or marketing, whereas the data warehouse has multiple purposes.
Defining Data Warehousing
Data warehouses are repositories for processed data pooled together from raw data sources and made available for downstream processing, analysis, and reporting business intelligence (BI) to organizational decision makers. Common sources for data warehousing include: transactional systems, relational databases, data lakes, flat files, multimedia databases, spatial databases, and time-series databases.
Data warehouses are not singular databases. They contain multiple databases where all incoming data is organized into the rows and columns of data tables. Schemas help to organize these tables, allowing for query tools to easily access and analyze relevant data. It is often faster to access data warehouses than other types of databases.
Benefits of Data Warehousing
Data warehouses offer the overarching and unique benefit of allowing organizations to analyze a variety of data and extract significant value from it. Additional benefits include:
- Improved Data Quality
- Historical Data Records and Analysis
- Unified Access to Multiple Critical Data Sources
- Integrations Enhance Reporting and Analysis
- Better Informed and Faster Decision Making Capabilities
Approaches to Data Warehousing
There are two main approaches to data warehousing: Extract Transform and Load, and Extract Load and Transform.
ETL-based data warehousing
Extract Transform and Load, or ETL-based data warehousing is the most common approach. The ETL approach has three functional layers, the staging layer, integration layer, and access layer. The staging layer, or Extract, takes data from each disparate source and stores it on a staging server. The integration layer, or Transform, transforms the staged data into usable formats and then Loaded into the data warehouse tables.
ELT-based data warehousing
Extract Load and Transform, or ELT-based data warehousing performs all the same functions as its brother approach ETL. The difference, however, is that the staging database is inside the data warehouse, and transformations are performed inside the data warehouse before data is loaded into tables.
On the surface there looks to be little difference between the two approaches. The ETL approach is more common, but because the staging server stands between data sources and the DWH, it is a slower process, and often used for small data sets. The ELT approach is best for semi-structured and unstructured data, and is faster without the external staging area. Additionally, because raw data is loaded into the DWH, it creates a rich historical archive that can be queried again in new ways, a feature not possible with ETL because data is not stored in raw format.
Data Warehouse Architecture
Data warehouse architecture is the design and structuring of a data warehouse. With three common architecture models, or tiers, they progressively build analytic capabilities on top of a storage system adding more complexity to fulfill more demanding needs.
Single-tier Architecture — Single-tier architecture is a single layer of hardware designed to keep storage space at a minimum. These systems are used mainly for batch processing, but are avoided in the use of real-time systems.
Two-tier Architecture — A two-tier system is an advancement in control and efficiency of the single-tier system. By separating the analytical and business processes, data can be cleansed before entering the data warehouse.
Three-tier Architecture — Three-tier systems are typical of data warehouses. They have a bottom, middle, and top tier known as the source layer, reconciled layer, and the data warehouse layer. The bottom layer is concerned with collecting data from multiple sources. The middle layer provides interfaces for data scientists to prepare bottom level data for the warehouse. The top layer delivers reports, visualizations, and OLAP analysis.
Each data warehouse architecture must meet these requirements: separation, scalability, extensibility, security, and administrability.
Related systems
Data can reside in many different storage types besides data warehouses, including databases, data lakes, and data marts. They each provide advantages and disadvantages.
Data warehouse vs. database
Databases are simpler than data warehouses. Unlike data warehouses, databases do not hold historical information. They are updated with only the most relevant information. For example, for customer information, a customer database will only contain the most recent contact information for them, whereas a data warehouse could have all the addresses that the customer has lived at.
Data warehouse vs. data lake
Data lakes and data warehouses can hold a variety of data types. Data lakes, however, hold raw data, data yet to be processed. In some configurations, sensors, and other devices simply pour their data into data lakes to be stored, perhaps later a data scientist will come along and find value in it. Data warehouses differ from data lakes in that they hold processed data and store it in a refined format used by business operators.
Data warehouse vs. data mart
Data marts are smaller data warehouses that focus on a specific domain or department. Data marts may also be downstream pulling data from a data warehouse only to refine it further for a specific end user. They are used for analysis and reporting.
Examples of Data Warehouses
University Automates Data Warehouse Infrastructure
Cornell University serves over 22,000 students, and was faced with a critical problem with their data warehouse. IBM purchased the software that Cornell was dependent on for merging data into their Oracle Data Warehouse. Unfortunately for Cornell, IBM then soon afterwards discontinued support for the product. Cornell was faced with an opportunity to replace their outmoded infrastructure.
After a search for what looked to be the solution, they decided to try a proof of concept on a limited portion of their Data Warehouse, and it was a success. The solution automated the data warehouse and cut their nightly batch processing times in half, or more. After a few years, Cornell’s team had completed the conversion to the new system. Now they receive calls from other universities interested in the same solution.
Health Insurer Automates Data Warehouse Infrastructure
Helsana insures over 1.9 million customers, and relied on hand-coding, manual processes, and institutional knowledge to maintain their data. A reorganization occurred and the 55 person IT department was split in 8 groups and dispersed across the company. Ensuing collaboration challenges ensued. The situation deteriorated more and more, and more time was spent working on updates and fixes and there was no time to innovate.
After clearly defining their needs and prioritizing their most important, they began a process of selecting a short list of vendors. The vendor they selected proved ideal. After a demonstration the software had already automated the writing of some native code based on visual prototypes to reduce errors. Unlike other ETL and ELT tools, the chosen solution is metadata-driven, and automatically produces and maintains metadata to track upstream and downstream dependencies. So, developers can make changes knowing the underlying infrastructure will automatically remain integrated.
Data Warehousing FAQs
What is a data warehouse?
A data warehouse is a collection of data used for analytics and reporting. It provides a central repository or business analysts to find valuable, actionable insights that drive business operations. Source data comes from transnational systems, relational databases, internal and external sources.
What is business intelligence?
Business intelligence is the insights derived from analysis of business information. BI differs from analytics in that it “describes” what is happening now, whereas analytics attempts to “predict” what will happen based on available data.
What is metadata?
Metadata describes other data. It is used in data warehouses, and other storage types like data lakes to help understand massive sets of data without needing to look at the content of the data. In a way metadata acts like a summary of content data. This is useful, for example, to save time searching volumes of large video files when the metadata can more quickly be queried.
What is big data?
As the term implies, big data refers to very large data sets, sometimes beyond the terabyte range and into the petabyte range. Because big data is so large and complex, it requires advanced data tools to analyze. Big data represents many things, including media files, social media, logs from autonomous vehicles, and more.