w3resource logo


gallery w3resource

Data Warehousing

Secondary Nav

 

This presentation describes an introduction of Data Warehousing.

Transcript

Data Warehouse

► A data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a system, usually separated from the original system.
► It is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources.
► It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

► A process of transforming data into information and making it available to users in a timely enough manner to make a difference.
► It is a relational or multidimensional database management system designed to support management decision making.
► A data warehousing is a copy of transaction data specifically structured for querying and reporting.

History

★ The concept of data warehousing dates back to the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse".
★ 1960s - General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.
★ 1970s - ACNielsen and IRI provide dimensional data marts for retail sales.
★ 1983 – Teradata introduces a database management system specifically designed for decision support.
★ 1988 - Barry Devlin and Paul Murphy publish the article An architecture for a business and information systems in IBM Systems Journal where they introduce the term "business data warehouse".

Data Warehouse includes

★ Retrieving data
★ Analyzing data
★ Extracting data
★ Loading data
★ Transforming data
★ Managing data

Benefits

★ Maintains a copy of information from the source transaction systems
★ Congregate data from multiple sources into a single database
★ Maintain data history, even if the source transaction systems do not.
★ Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.
★ Present the organization's information consistently
★ Provide a single common data model for all data of interest regardless of the data's source
★ Restructure the data so that it makes sense to the business users
★ Saves Time
★ Generates a High ROI

Characteristics of Data Warehouse

► The concept of a Data Warehouse was introduced by Bill Inmon, the father of Data Warehouse. Here are the characteristics :
★ Subject Orientation
★ Time variance
★ Non-Volatile
★ Integrated

Subject Orientation

► Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Time variance

► In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.

Non-Volatile

► Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Integrated

► Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Types of Systems

★ Data Mart
★ Online analytical processing (OLAP)
★ Online Transaction Processing (OLTP)
★ Predictive analysis

Data Mart

► Data Mart : A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as sales, finance or marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.

Online analytical processing (OLAP)

► Online analytical processing (OLAP) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used by Data Mining techniques. OLAP databases store aggregated, historical data in multi-dimensional schemas (usually star schemas). OLAP systems typically have data latency of a few hours, as opposed to data marts, where latency is expected to be closer to one day.

Online Transaction Processing (OLTP)

► Online Transaction Processing (OLTP) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). OLTP systems emphasize very fast query processing and maintaining data integrity in multi-access environments. For OLTP systems, effectiveness is measured by the number of transactions per second. OLTP databases contain detailed and current data. The schema used to store transactional databases is the entity model (usually 3NF).

Predictive analysis

► Predictive analysis is about finding and quantifying hidden patterns in the data using complex mathematical models that can be used to predict future outcomes. Predictive analysis is different from OLAP in that OLAP focuses on historical data analysis and is reactive in nature, while predictive analysis focuses on the future. These systems are also used for CRM (Customer Relationship Management).

Data warehouse vs OLTP

► Data warehouse ► OLTP
★ Subject Oriented ★ Application Oriented
★ Used to analyze business ★ Used to run business
★ Summarized and refined ★ Detailed data
★ Snapshot data ★ Current up to date
★ Integrated Data ★ Isolated Data
★ Knowledge User (Manager) ★ Clerical User
★ Large volumes accessed at a time (millions) ★ Few Records accessed at a time
★ Mostly Read (Batch Update) ★ Read/Update Access
★ Redundancy present ★ No data redundancy
★ Database Size 100 GB - few terabytes ★ Database Size 100MB -100 GB
★ Query throughput is the performance metric ★ Transaction throughput is the performance metric
★ Hundreds of users ★ Thousands of users
★ Managed by subsets ★ Managed in entirety

Data warehouse environment

► The environment for data warehouses and marts includes the following :
★ Source systems that provide data to the warehouse or mart
★ Data integration technology and processes that are needed to prepare the data for use
★ Different architectures for storing data in an organization's data warehouse or data marts
★ Different tools and applications for the variety of users
★ Metadata, data quality, and governance processes must be in place to ensure that the warehouse or mart meets its purposes

Data Warehouse Components

► The primary components of data warehouse are :
★ Operational Data
★ Load Manager
★ Warehouse Manager
★ Query Manager
★ Detailed, Lightly and Highly summerized data
★ Archive and Backup Data
★ Meta Data
★ End user access tools

Data Warehouse Components
(Operational Data)

★ The data comes from the mainframe systems in the traditional network and hierarchical format.
★ Data can come from traditional RDBMS like Oracle, Informix etc.
★ Operational data can also come from external sources (e.g. commercial databases and databases associated with supplier and customers).

Data Warehouse Components
(Load Manager)

★ Performs all the operations associated with extraction and loading data into the data warehouse.
★ Operations include simple transformations of the data to prepare the data for entry into the warehouse.
★ The size and complexity of this component will vary between data warehouses and may be constructed using a combination of vendor data loading tools and custom built programs.

Data Warehouse Components
(Warehouse Manager)

★ Analysis of data to ensure consistency.
★ Transformation and merging the source data from temporary storage into data warehouse tables.
★ Create indexes and views on the base table.
★ Denormalization
★ Generation of aggregation
★ Backing up and archiving of data

Data Warehouse Components
(Query Manager)

★ Performs all operations associated with management of user queries.
★ This component is usually constructed using vendor end-user access tools, data warehousing monitoring tools, database facilities and custom built programs.
★ The complexity of a query manager is determined by facilities provided by the end-user access

Data Warehouse Components
(Detailed, Lightly and Highly summerized data)

★ This area of the warehouse stores all the detailed data in the database schema.
★ In most cases detailed data is not stored online but aggregated to the next level of details.
★ The area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data.
★ This area of the warehouse is transient as it will be subject to change on an ongoing basis in order to respond to the changing query profiles.
★ The purpose of the summarized information is to speed up the query performance.

Data Warehouse Components
( Archive and Backup Data)

★ This area of the warehouse stores detailed and summarized data for the purpose of archiving and backup.
★ The data is transferred to storage archives such as magnetic tapes or optical disks.

Data Warehouse Components
( Meta Data)

► The data warehouse stores all the Metadata (data about data) definitions used by all processes in the warehouse. It is used for variety of purpose including :
★ The extraction and loading process – Meta data is used to map data sources to a common view of information within the warehouse.
★ The warehouse management process – Meta data is used to automate the production of summary tables.
★ As part of Query Management process Meta data is used to direct a query to the most appropriate data source.

Data Warehouse Components
(Warehouse Manager)

★ Analysis of data to ensure consistency.
★ Transformation and merging the source data from temporary storage into data warehouse tables.
★ Create indexes and views on the base table.
★ Denormalization
★ Generation of aggregation
★ Backing up and archiving of data

ETL (extract, transform, and load)

Data Warehouse Components
ETL (extract, transform, and load)

► Extract, transform, and load (ETL) refers to a process in database usage and especially in data warehousing that:
★ Extracts data from outside sources
★ Transforms it to fit operational needs, which can include quality levels
★ Loads it into the end target (database, more specifically, operational data store, data mart, or data warehouse)

Data Warehouse Components
Extract

► The first part of an ETL process involves extracting the data from the source systems. In many cases this is the most challenging aspect of ETL, since extracting data correctly sets the stage for the success of subsequent processes.
★ Most data warehousing projects consolidate data from different source systems.
★ Each separate system may also use a different data organization and/or format.
★ Common data source formats are relational databases and flat files, but may include non-relational database
★ The streaming of the extracted data source and load on-the-fly to the destination database.
★ The goal of the extraction phase is to convert the data into a single format appropriate for transformation processing.

Data Warehouse Components Transform

► The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target.
★ Selecting only certain columns to load
★ Translating coded values
★ Encoding free-form values
★ Deriving a new calculated value
★ Sorting
★ Joining data from multiple sources and deduplicating the data
★ Aggregation
★ Generating surrogate-key values
★ Transposing or pivoting
★ Splitting a column into multiple columns

Data Warehouse Components Load

► The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating extracted data is frequently done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in an historical form at regular intervals—for example, hourly. To understand this, consider a data warehouse that is required to maintain sales records of the last year. This data warehouse overwrites any data older than a year with newer data.

Data Warehouse Architectures

► Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are :
★ Data Warehouse Architecture (Basic)
★ Data Warehouse Architecture (with a Staging Area)
★ Data Warehouse Architecture (with a Staging Area and Data Marts)

Data Warehouse Architecture
(Basic)

► End users directly access data derived from several source systems through the data warehouse.
The next figure illustrates three things :
★ Data Sources (operational systems and files)
★ Warehouse (metadata, summary data, and raw data)
★ Users (analysis, reporting, and mining)
★ Data Warehouse Architecture (Basic)

Data Warehouse Architecture
(with a Staging Area)

► You need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management.
Staging area - A place where data is processed before entering the warehouse.
The next figure illustrates three things :
★ Data Sources (operational systems and files)
★ Staging Area (where data sources go before the warehouse)
★ Warehouse (metadata, summary data, and raw data)
★ Users (analysis, reporting, and mining)

Data Warehouse Architecture
(with a Staging Area)

► To customize your warehouse's architecture for different groups within your organization.
This by adding data marts , which are systems designed for a particular line of business.
The next example illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.

Data Warehouse Architecture
(with a Staging Area)

► This illustrates five things :
★ Data Sources (operational systems and flat files)
★ Staging Area (where data sources go before the warehouse)
★ Warehouse (metadata, summary data, and raw data)
★ Data Marts (purchasing, sales, and inventory)
★ Users (analysis, reporting, and mining)

Data Warehousing
Software tools

► ELT (extract, transform, and load)  ► Reporting
★ Informatica                         ★ Cognos
★ Ab-Initio                           ★ Business Objects
★ DataStage                           ★ Exsion Corporate
► Database                             ★ Qlikview
★ Teradata                            ★ Microstrategy
★ Oracle                             ► Data Quality
★ Netezza                             ★ QualityStage
★ DB2                                 ★ Trillium
                                       ★ Business Objects
                                       ★ Informatica



Join our Question Answer community to learn and share your programming knowledge.

Help the community:

PHP: Loop

JavaScript: Delete duplicates in an array

Oracle: Round the number