29-09-2017 Door: Barry Devlin

Operating a Data Warehouse

Deel dit bericht

Having designed and built your data warehouse, I imagine that you’d like to deliver it successfully to the business and run it smoothly on a daily basis. That’s the topic of today’s article.

As digitalization continues apace across all industries, the role and value of a data warehouse—together with its attendant data marts and associated data lake—becomes ever more central to business success. With such informational systems now becoming as important as traditional operational systems, and often more so, it should be self-evident that highly reliable and efficient operating practices must be adopted.

Historically, however, approaches to operating a data warehouse environment have been somewhat lax. Manual and semi-automated methods for populating and updating the contents of the warehouse have been widespread. Advances made in the data warehouse itself have been offset by the spread of ad hoc approaches in departmentally managed data marts. The data lake has seen the re-emergence of a cottage industry of hand-crafted scripts, often operated by individual data scientists.

The challenges of data lake operations and management have recently attracted widespread comment (centered on the phrase data swamp) and increasing focus by vendors. Nonetheless, it is the data warehouse—as the repository of truth about the legally-binding history of the business and the basis for reliable exploration and analysis of business challenges and opportunities—where most can be gained by the adoption of advanced management and automation practices. The combination of data warehouse automation (DWA) and Data Vault address these needs from two perspectives: deployment of function and ongoing day-to-day operations.

Deployment seldom gets the attention it deserves; it’s not exactly the sexiest part of any IT project! However, for a data warehouse, deployment needs to be treated as a long-term, monogamous relationship. A data warehouse is substantially more complex than most IT projects, given the variety and number of systems involved. It is also significantly more important to get right as we move toward data-driven business and more agile development approaches.

As a data warehouse moves from the development phase (design and build discussed previously) to test, quality assurance, and on to production, seemingly mundane—yet highly important—issues such as packaging and installation of the code built in the previous phase must be addressed. In the case of DWA, where all cleansing, transformation, (and loading, of course) occur in the target databases of the data warehouse, mart and lake, this code consists of both definitional SQL (DDL) that builds the database structures such as tables, indexes, etc. and the processing code (DML) that creates the data to populate them.

In the context of a warehouse designed with the Data Vault model and methods, WhereScape Data Vault Express™ allows a set of objects, such as the related Hub, Satellite, and Link objects of a customer ensemble, to be bundled together, transported and installed with ease from the development environment to quality assurance and then subsequently into production. This bundle includes the selected object and related metadata—structure (DDL), processing code (ELT procedures, etc.), and jobs. When installed into an environment, the product determines what DDL changes need to be made if an object already exists (for example, add columns, new indexes, and so on) and constructs the appropriate DDL syntax statements. New ELT code such as stored-procedures are then installed and compiled in the database.

The clear aim—subject, of course, to internal policies—is to automate the deployment activities in order to speed deployment in agile development approaches and to reduce the chance of human error across the full life cycle.

Having deployed the system to production, the next—and ongoing—task is to schedule, execute, and monitor the continuing process of loading and transforming data into the data warehouse. In this phase, jobs defined by WhereScape consist of a sequence of interdependent tasks. For example, one sequence could be to drop certain indexes on a table, load new data to the table, and rebuild the indexes.  The administrator creating the job can specify the tasks and their interdependencies to ensure the objects are processed in the correct order.  During execution, the tasks run in parallel (up to a specified threshold, to ensure the system is not overloaded), subject to dependency constraints. This feature is particularly useful in a Data Vault 2.0 environment, where the design supports elevated levels of parallelization of load tasks.

To ensure that data consistency is maintained, if a task fails during execution, then all downstream dependent tasks are halted.  When the problem has been resolved, the job is restarted and will pick up from where it left off and continue through to completion.

As mentioned earlier, a modern analytical environment consists of a combination of data warehouse, marts, and data lake. From an operational point of view, given potential interdependencies of data across these systems, it makes sense to manage this ensemble as a single, logical environment. WhereScape supports this aim, both in terms of its job definitions that span multiple systems and in its provision of a centralized monitoring and logging repository. Here, all job execution activities such as processing times (start and finish), rows loaded, errors, exceptions, and so on are logged.  This provides historical information that can be used to track load performance over time, allowing administrators to make any necessary adjustments as data volumes grow.

The smooth, ongoing daily operation of the entire data warehouse environment is a fundamental prerequisite to its acceptance by users and its overall value to the business. Nothing will destroy users’ confidence more quickly than arriving every morning and not knowing whether the warehouse is up and running with the latest data updates.

And yet, there’s more! How do you support the changes in requirements that occur almost continuously when a data warehouse has been successful? Data Vault and Data Warehouse Automation can help here too. That is the topic of the fourth and final post of this series.

This blog is part of a series and has been published on WhereScape.com.

See part 1, See part 2.

Barry Devlin

Dr. Barry Devlin behoort tot de autoriteiten op het gebied van business insight en is een van de grondleggers van datawarehousing. Met meer dan 30 jaar ervaring in IT, waarvan 20 jaar bij IBM als "Distinguished Engineer", is hij een alom gewaardeerd consultant en docent, en auteur van het standaardwerk “Data Warehouse – from Architecture to Implementation.” Zijn meest recente boek “Business unIntelligence: Insight and Innovation Beyond Analytics and Big Data” verscheen in 2013. Hij is oprichter van 9sight Consulting, gespecialiseerd in de menselijke, organisatorische en IT-implicaties, en het ontwerp van oplossingen voor diepgaande inzichten in de business. Barry sprak op ons jaarlijkse congres DW&BI Summit

Alle blogs van deze auteur

Partners