An article by Ian Manning

DWheader.gif (32604 bytes)

Data Warehousing – what exactly is it ?

Heralded as the solution to the management information dilemma, the term "data warehouse" has become one of the most used and abused terms in the IT vocabulary. But ask a variety of vendors and professionals for their vision of what a data warehouse is and how it should be built, and the ambiguity of the term will quickly become apparent.

To a number of people, a data warehouse is any collection of summarised data from various sources, structured and optimised for query access using OLAP (on-line analytical processing) query tools. This view was originally propagated by the vendors of OLAP tools. To others, a data warehouse is virtually any database containing data from more than one source, collected for the purpose of providing management information. This definition is neither helpful nor visionary, since such databases have been a feature of decision support solutions since long before the coining of the term "data warehouse".

The concept of "data warehousing" dates back at least to the mid-1980s, and possibly earlier. In essence, it was intended to provide an architectural model for the flow of data from operational systems to decision support environments. It attempted to address the various problems associated with this flow, and the high costs associated with it. In the absence of such an architecture, there usually existed an enormous amount of redundancy in the delivery of management information. In larger corporations it was typical for multiple decision support projects to operate independently, each serving different users but often requiring much of the same data. The process of gathering, cleaning and integrating data from various sources, often legacy systems, was typically replicated for each project. Moreover, legacy systems were frequently being revisited as new requirements emerged, each requiring a subtly different view of the legacy data.

Based on analogies with real-life warehouses, data warehouses were intended as large-scale collection/storage/staging areas for legacy data. From here data could be distributed to "retail stores" or "data marts" which were tailored for access by decision support users. While the data warehouse was designed to manage the bulk supply of data from its suppliers (e.g. operational systems), and to handle the organization and storage of this data, the "retail stores" or "data marts" could be focused on packaging and presenting selections of the data to end-users, often to meet specialised needs.

Somewhere along the way this analogy and architectural vision was lost, often manipulated by suppliers of decision support software tools. Data warehousing "gurus" began to emerge at the end of the 80s, often themselves associated with such companies. The architectural vision was frequently replaced by studies of how to design decison support databases. Suddenly the data warehouse had become the miracle cure for the decision support headache, and suppliers jostled for position in the burgeoning data warehousing marketplace.

Despite the recent association of the term "data warehousing" with OLAP and multi-dimensional database technology, and the insistence of some people that data warehouses must be based on a "star schema" database structure, it is wise to restrict the use of such designs to data marts. The use of a star schema or multi-dimensional / OLAP design for a data warehouse can actually seriously compromise its value for a number of reasons:

(a) such designs assume that all queries on the warehouse will be of a quantitative nature - i.e. queries on aggregated numeric data. This overlooks the fact that data warehouses can also offer enormous benefit as repositories of text-based or qualitative data - e.g. the provision of a 360° view of customers by collecting profile information from a range of sources ;

(b) such designs require the pre-aggregation of data in the data warehouse. In doing so, and eliminating much of the original transactional data, much information can be lost. If information requirements change, requiring alternative aggregations from the transactional data, a star or multi-dimensional design will quickly become obsolete. A normalised design, on the other hand, which accommodates transactional level data would be able to support any number of alternative aggregations. While capacity and/or performance constraints may preclude this as an option for some data, the storage of low level transactional data in a data warehouse should not be ruled out, as this is often the only way of ensuring maximum flexibility to support future information needs ;

(c) optimised models such as star schemas are, in general, less flexible than normalised designs. Changes to business rules or requirements are generally more easily accommodated by normalised models.

Data marts provide the ideal solution to perhaps the most significant conflict in data warehouse design - performance versus flexibility. In general, the more normalised and flexible a warehouse data model is, the less well it performs when queried. This is because queries against normalised designs typically require significantly more table join operations than optimised designs. By directing all user queries to data marts, and retaining a flexible model for the data warehouse, designers can achieve flexibility and long term stability in the warehouse design as well as optimal performance for user queries.

 

Why is it so expensive?

While the data warehousing concept in its various forms continues to attract interest, many data warehousing projects are failing to deliver the benefits expected of them, and many are proving to be excessively expensive to develop and maintain. For this reason it is important to have a clear understanding of their real benefit, and of how to realise this benefit at a cost which is acceptable to the enterprise.

The costs of data warehousing projects are usually high. This is explained primarily by the requirement to collect, "clean" and integrate data from different sources - often legacy systems. Such exercises are inevitably labour-intensive and time-consuming, but are essential to the success of the project - poorly integrated or low quality data will deliver poor or worthless management information. The cost of extracting, cleaning and integrating data represents 60-80% of the total cost of a typical data warehousing project, or indeed any other decision support project.

Vendors who claim to offer fast, cheap data warehouse solutions should be asked to explain how they are able to avoid these costs, and the likely quality of the results of such solutions must be carefully considered. Such vendors typically place the emphasis on tools as a solution to the management information problem – OLAP tools, data integration technology, data extraction tools, graphical user query tools, etc. Such tools resolve only a fraction of the management information problem, and represent a small proportion of the cost of a successful data warehousing project.

Focus on technology rather than data quality is a common failing among data warehousing projects, and one which can fatally undermine any real business benefit.

 

 How can the cost be justified ?

Given the high costs, it is difficult to justify a data warehousing project in terms of short-term benefit. As a point solution to a specific management information need, a data warehouse will often struggle to justify the associated investment. It is as a long term delivery mechanism for ongoing management information needs that data warehousing reaps significant benefits. But how can this be achieved ?

Given the above facts about the loading of costs on data warehousing projects, it is clear that focus must be on the reduction of the ongoing cost of data extraction, cleaning and integration.

A number of years ago I conducted a study for a multi-billion dollar manufacturing and services organization. The purpose of the study was to identify why previous data warehousing projects had failed to deliver the expected benefits, and to make recommendations for how future projects could rectify this.

The study resulted in a number of significant findings, including the following:

1. 80% of the data used by the various data warehouses across the corporation came from the same 20% of source systems.

2. Each new data warehousing project usually carried out its own process to extract, clean and integrate data from the various sources, despite the fact that much of the same data had been the subject of previous exercises of a similar nature.

3. The choice of data to be populated in the data warehouse was usually based on needs of a specific group, with a particular set of information requirements. The needs of other groups for the same data were rarely considered.

Experience of other organizations showed a very similar pattern to the above. From these findings alone it is clear that there is scope for economies of scale when planning data warehousing projects ; if focus were to be placed initially on the 20% of source systems which supplied 80% of the data to decision support systems, then an initial project which simply warehouses "useful" data from these systems would clearly yield cost benefits to future MIS projects requiring that data. Rather than targeting a specific business process or function, benefits should be aimed at the wider audience for decision support. Such a project would form an invaluable foundation for an evolving data warehouse environment.

When building a data warehouse the use of multi-dimensional, star-schema or other optimised designs should be strongly discouraged, in view of the inherent inflexibilities in these approaches as outlined above. The use of a relational, normalised model as the backbone of the warehouse will ensure maximum flexibility to support future growth. If user query access is then strictly limited to data marts, the data warehouse needs only to support periodic extracts to data marts, rather than ad-hoc query access. Performance issues associated with these extracts can be addressed in a number of ways - for example through the use of staging areas (either temporary or permanent) where relational table structures are pre-joined or "flattened" to support specific extract processes.

Once this initial project is complete, emphasis can be placed on the growth of the warehouse as a global resource for unspecified future decision support needs, rather than as a solution to specific requirements at a particular time. In subsequent phases of the warehouse development, new data which is likely to play a major role in future decision support needs should be carefully selected, extracted and cleaned. It can then be stored alongside the existing data in the warehouse, hence maximising its information potential. As new information needs emerge, the cost of meeting them will be diminished due to the elimination of the need to perform much of the costly extraction, cleaning and integration functions usually associated with such systems.

Over time, this environment will grow to offer a permanent and invaluable repository of integrated, enterprise-wide data for management information. This in turn will lead to massively reduced time and cost to deliver new decision support offerings, and hence to true cost justification.

The effort required to achieve this must not be underestimated, however. Identifying which data is "useful" requires a great deal of experience and insight. The way in which the data is modelled in the warehouse is absolutely critical - a poor data model can render a data warehouse obsolete within months of implementation. The process used to identify, analyse and clean data prior to loading it into the warehouse, and the attendant user involvement, is critical to the success of the operation. Management of user expectations is also critical. The skills required to achieve all of the above are specialised.

Once in the warehouse, data can be distributed to any number of data marts for user query access. These data marts can take any number of forms, from client-server databases to desktop databases, OLAP cubes or even spreadsheets. The choice of user query tools can be wide, and can reflect the preferences and experience of the users concerned. The wide availability of such tools and their ease of implementation should make this the cheapest part of the data warehouse environment to implement. If data in the warehouse is well-structured and quality-assured, then exporting it to new data marts should be a routine and low-cost operation.

In summary, a data warehouse environment can offer enormous benefits to most major organizations if approached in the correct way, and if distractions from the main goal of delivering a flexible, long-term information delivery environment are placed in perspective.

Last Updated: 10/07/2000

About the author

Ian Manning is a freelance IT consultant of some 20 years' experience, specialising in data warehousing and decision support consultancy. Roles have included project management, systems architecture, business and systems analysis, system design, database design and tuning, data analysis and development. These have been carried out for a wide range of clients from a number of different sectors, including banking, manufacturing, local government, pharmaceuticals, energy and insurance. He can be contacted at : imanning@ioxis.com

Feedback

To: webmaster@manning.uk.com

From:

Return email address:

Message:

Back To Top

You are visitor number: Please load images to see this counter

FED UP WITH BAD SERVICE? CLICK HERE TO DO SOMETHING ABOUT IT: