BIDA0003 – Data Model Improvements


The data warehouse data model is the foundation of the BI Solution for any company.

In our many years of experience, we have consistently seen poor data models implemented even by large companies that really should know better.

Poor data models can cause issues for BI systems for many years, and once large numbers of reports and applications have been built on top of the data models, they are very difficult and expensive to migrate to a better set of models.

Poor data models can live on in production for many, many years, causing frustration and lost time and money for the users of the system and the developers who support the system.

It is much better to deliver the first version of the data warehouse a little later with better data models than to deliver sooner with poor data models.

With our many years of experience we have developed our own suite of data models, with a large number of specific design features that make them excellent models.

Here in Romania we do not expect many companies to wish to adopt our data models because most companies that need a data warehouse already have one in production.

Migration to our models would be a large undertaking that only a few companies would be willing to go ahead with.

However, do you have a data warehouse in production, and do you have questions about what improvements could be made to your data models?

Then we would be happy to provide you the free service of reviewing your data models and preparing a list of recommended improvements for you.

Of course, this service would be provided at no charge to prospective customers, as part of our process of establishing our credibility with you and your peers in your company in the area of data models.

In the near future we will be updating this post to talk about “common data model mistakes”.

If you are on our email list you will be notified by email when we update this post.

Would you like to ask us questions about your data models, or share your thoughts on what you believe might be issues in them?

Then please feel free to put your best email address in the form below.

The form also has a space for asking us any question you would like.

We will be emailing our subscribers useful and valuable information regarding business intelligence. Please subscribe here to be on our general emailing list.

BIDA0002 – ETL Performance & Reliability Improvement


The two most common descriptions of ETL Systems are:

  • It takes too long
  • It costs too much

These two descriptions apply equally to the development of the ETL and to the run times of the ETL subsystem.

Have you described your ETL system as “it takes too long” or “it costs too much”?

Then this blog post is for you.

In this blog post we describe some of the common issues that we have seen with ETL subsystems.

Firstly, we will discuss some of the run time issues.

Run Time Issues for ETL

The number one reason why ETL systems are slow in production is redundant processing.

This means that processing is being performed that is not actually necessary.

The usual cause of redundant processing is that the developers of the ETL system did not know how to write the more complicated code to reduce, or eliminate, redundant processing.

Some examples of redundant processing are as follows.

1. Extracting data that has not changed.

In many cases it is hard for developers to determine what data has changed in an operational system. So, during the development period, the developers simply extract all the data for those extracts where it is difficult to tell what has changed.

Under the pressure of moving the data warehouse into production, these full extracts often go live and remain in production until they cause too many problems and are finally fixed.

At a minimum, the ETL subsystem should be built to perform “delta detection” before data is placed into the staging area, so that the only rows that go into the staging area are rows that have been inserted or updated. If a row has been deleted, the “row deleted flag” should be set on the staging area row; the row itself should not be deleted.

This “delta detection” code can be difficult to write in tools such as DataStage and Informatica, which is one more reason why it is very often not written.
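To make this concrete, here is a minimal T-SQL sketch of one common delta detection pattern, comparing today's full extract against yesterday's image and writing only the differences to staging. All table and column names here (extract_customer, prev_customer, stg_customer) are illustrative assumptions, not part of any particular tool.

    -- Sketch only: extract_customer = today's full extract, prev_customer = yesterday's
    -- image, stg_customer = the staging area that should receive only the deltas.

    -- 1. Rows that are new or changed go into the staging area.
    INSERT INTO stg_customer (customer_code, customer_name, customer_city, row_deleted_flag)
    SELECT e.customer_code, e.customer_name, e.customer_city, 0
    FROM   extract_customer e
    LEFT   JOIN prev_customer p
           ON p.customer_code = e.customer_code
    WHERE  p.customer_code IS NULL                                        -- inserted row
       OR  ISNULL(e.customer_name, '') <> ISNULL(p.customer_name, '')     -- updated row
       OR  ISNULL(e.customer_city, '') <> ISNULL(p.customer_city, '');

    -- 2. Rows that disappeared from the source get a staging row with the
    --    "row deleted flag" set; nothing is physically deleted.
    INSERT INTO stg_customer (customer_code, customer_name, customer_city, row_deleted_flag)
    SELECT p.customer_code, p.customer_name, p.customer_city, 1
    FROM   prev_customer p
    LEFT   JOIN extract_customer e
           ON e.customer_code = p.customer_code
    WHERE  e.customer_code IS NULL;

A row hash over the comparable columns is a common alternative to the column-by-column comparison when the tables are wide.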

2. Applying changes to data that has not changed

The mirror image of extracting data that has not changed is applying changes to data in the data warehouse that has not actually changed.

This very often happens when there is no way to properly detect changed rows, so more data than necessary is extracted, and this data is then applied to the target tables without a delta detection process either before the staging area or against the target data warehouse tables.

In these cases every update is logged. If rows are deleted and then re-inserted, the deletes are logged and the indexes are updated.

In all cases this unnecessary work adds a large extra processing load.
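As a hedged sketch of the fix, the change-application step can be written so that unchanged rows are simply never touched, for example with a MERGE whose update branch fires only when a value actually differs. The names dim_customer and stg_customer are illustrative.

    -- Sketch only: apply staged changes to the warehouse table, but only where a
    -- column value actually differs, so unchanged rows cause no logging or index work.
    MERGE dim_customer AS tgt
    USING stg_customer AS src
       ON tgt.customer_code = src.customer_code
    WHEN MATCHED
         AND (   ISNULL(tgt.customer_name, '') <> ISNULL(src.customer_name, '')
              OR ISNULL(tgt.customer_city, '') <> ISNULL(src.customer_city, '')
              OR tgt.row_deleted_flag          <> src.row_deleted_flag )
        THEN UPDATE SET tgt.customer_name    = src.customer_name,
                        tgt.customer_city    = src.customer_city,
                        tgt.row_deleted_flag = src.row_deleted_flag
    WHEN NOT MATCHED BY TARGET
        THEN INSERT (customer_code, customer_name, customer_city, row_deleted_flag)
             VALUES (src.customer_code, src.customer_name, src.customer_city, src.row_deleted_flag);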

3. Resummarising data for summary tables

The #1 thing that can be done to improve query performance in a data warehouse is to create summaries to support the queries that can be answered using aggregated data. The larger the volume of the transaction records the larger the benefit of maintaining summaries to answer some queries.

However, maintaining incremental updates of summaries is not a simple thing to do, and many companies develop ETL systems where the updates of the summary tables or cubes perform a lot of redundant processing.
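One hedged illustration of avoiding that redundant work is to resummarise only the slices of the summary that the new batch actually touches. The table names below (stg_sales_fact, sales_fact, sum_sales_by_day) are purely illustrative.

    -- Sketch only: refresh a daily summary incrementally rather than rebuilding it.

    -- 1. Find the days touched by the newly loaded batch.
    SELECT DISTINCT date_key
    INTO   #touched_days
    FROM   stg_sales_fact;

    -- 2. Remove only those days from the summary table.
    DELETE s
    FROM   sum_sales_by_day s
    JOIN   #touched_days t ON t.date_key = s.date_key;

    -- 3. Resummarise only those days from the detail fact table.
    INSERT INTO sum_sales_by_day (date_key, product_key, sales_amount, sales_quantity)
    SELECT f.date_key, f.product_key, SUM(f.sales_amount), SUM(f.sales_quantity)
    FROM   sales_fact f
    JOIN   #touched_days t ON t.date_key = f.date_key
    GROUP  BY f.date_key, f.product_key;

The same idea applies to cubes: process only the partitions whose underlying days or months received new data.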

4. Lack of proper parallelism of ETL processing.

Everyone wants their data delivered to their data warehouse, and cubes and reports, by 9am when the business day starts. One of the biggest complaints we see in ETL run time systems is “our data is not updated by 9am, we have to wait.”

Most ETL tools make it very difficult to parallelize batch processing effectively, so that the machine running the ETL can be kept at high utilization during ETL processing.

Because it is difficult, it is generally not well implemented.

Further, in batches that run steps in parallel, when one command in the batch fails it generally requires an ETL developer to make some form of manual change to the “code” that runs the batch in order to complete the batch once the problem is fixed.

This introduces the opportunity for manual errors to be introduced into the ETL processing, especially because these changes might be made in the middle of the night when the developer is not as mentally sharp as he might otherwise be.
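One way to picture a more robust approach is a simple batch control table that records each job's status and predecessor, so a scheduler can launch every runnable job in parallel and resume a failed batch without hand-editing anything. This is only a sketch with assumed names, not the implementation of any particular tool.

    -- Sketch only: a restartable, parallel batch driven by a control table.
    CREATE TABLE etl_batch_job (
        job_name   varchar(100) NOT NULL PRIMARY KEY,
        depends_on varchar(100) NULL,                  -- predecessor job, NULL = none
        job_status varchar(20)  NOT NULL DEFAULT 'pending'
                   -- 'pending' | 'running' | 'succeeded' | 'failed'
    );

    -- Jobs that can be launched right now: still pending, and their predecessor
    -- (if any) has already succeeded. A scheduler can start all of these in parallel.
    SELECT j.job_name
    FROM   etl_batch_job j
    LEFT   JOIN etl_batch_job p ON p.job_name = j.depends_on
    WHERE  j.job_status = 'pending'
      AND (j.depends_on IS NULL OR p.job_status = 'succeeded');

    -- Once a failed job has been fixed, reset it and re-run the same query;
    -- the remainder of the batch resumes with no manual changes to the batch "code".
    UPDATE etl_batch_job SET job_status = 'pending' WHERE job_status = 'failed';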

Development Issues With ETL

We have had many years of ETL development experience across many tools, most notably DataStage and Informatica.

As good as these tools are, the development times of the ETL mappings are relatively slow.

The answer for most companies has been to reduce the cost of development by hiring cheaper people, especially from the Indian outsourcers such as Tata, Wipro, InfoSys and others. The booming IT developer market in India has been largely a result of companies sending development work to India, and ETL development work is no different.

As a result of this trend, the ETL tool vendors responded by creating graphical user interfaces that made it possible for relatively low-skilled, low-paid developers to develop ETL mappings.

This reduced the cost of ETL development, but it did not particularly reduce the time required to perform ETL development, and it did not improve the quality of ETL development either.

During our start up phase we will support ETL developed in DataStage or Informatica. However, we would openly admit we are no faster than any other good DataStage / Informatica developers. This is because the nature of the tools drives the development time, not the skill of the user of the tools.

We took a different approach and we have our own BIDA ETL software.

We took the approach of “make highly skilled people more productive writing ETL” rather than “enable less skilled people to write the ETL.”

Not only does this reduce the cost of ETL development, it improves the quality of the ETL subsystem because the people doing the development work are more skilled and experienced. It is normal for someone with much more skill and much more experience to do higher quality work than a less experienced person. It’s why we pay them more.

So our approach was to find a way to enable highly skilled, highly paid developers to develop ETL faster, cheaper, better than less skilled people who were paid less. We wanted to find a way where the overall cost of such development was considerably less than hiring the “Indian Outsourcing Company”.

And this is what we have done.

The #1 feature of the ETL tools is that they have a graphical user interface to make the development of the ETL possible by less skilled, lower paid developers.

The #1 feature of our BIDA ETL tool is that we avoided having a graphical user interface so as to make the more highly skilled, more highly paid, developers more productive.

It really is as simple as that.

We know that we can develop and support ETL systems for much less cost than anyone else in Romania can. Indeed, we can build ETL systems so quickly and so cost effectively that we can migrate the finished BIDA ETL system to Informatica or DataStage, if that is what you want, and still perform that development work faster than anyone else.

 

Collapse of Data Structures Plus Processing Algorithms

Because the ETL tools rely on a graphical user interface to develop their mappings, it is necessary to collapse the data structures and the processing algorithms into one “program object”, which we will call a “mapping”.

Many people do not understand what we mean by that so please allow us to explain.

When you write, say, a fact table mapping, you must define the source, you must define the target, and you must define the lookups for the dimension string keys from the source to the integer keys that will be put on the fact table. All fact table processing with dimensional models looks like this.

In ETL tools you must code those lookups inside the mapping along with any other processing you wish to do for the fact table processing. This means your algorithms for your lookup processing are inside the mapping.

Although this seems obvious and is therefore accepted by everyone, it is a serious issue that slows down development considerably.

Why? Because those same lookups, such as the lookup for dates, times, customers, products, geography, company structure etc are common across MANY fact tables. So the lookups must be coded by the ETL developer each time for each lookup for each fact table.

In a large data warehouse with 50+ fact tables you could be coding upwards of 500 such lookups, all of which have to be tested individually for correctness. And that takes developer time.

The far better way to develop ETL is to separate the data structures from the processing algorithm. This is, in fact, one of the most fundamental development strategies that has been taught for decades.

Always separate the data structures from the algorithms so that you have the algorithm coded only once, but you can give the algorithm many data structures to perform its processing on.

And this is the approach we have taken. In BIDA ETL the developer does not need to concern himself with the algorithm that will be applied to the data structures he is creating in mappings. The algorithm is separated from the data structures.

This provides a very large productivity boost, and that boost scales linearly with the number of mappings to be developed.

To explain simply:

With standard ETL tools using a graphical interface, the ETL developer must code the algorithm for the ETL processing into the graphical user interface as well as code the source and target data structures for processing.

With BIDA ETL the ETL developer only codes the sources and target data structures and processing that is unique to that mapping. He does not code the algorithm for the processing of the data. That is separated and not in the hands of the ETL developer.

This is the fundamental difference that makes BIDA ETL development faster, cheaper, better and more reliable.
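To make the idea tangible, here is a small T-SQL sketch of separating data structures from the algorithm. This is not the BIDA ETL implementation, and all names are invented: the developer fills in a metadata table describing the lookups, and a single piece of generic code performs every lookup on every fact table.

    -- Sketch only: the metadata describes WHICH lookups exist; the algorithm below
    -- is written once and is applied to every fact/dimension pair in the metadata.
    CREATE TABLE etl_key_lookup_meta (
        fact_table      sysname NOT NULL,   -- staged fact table to be keyed
        fact_source_col sysname NOT NULL,   -- business key column on the staged fact
        fact_key_col    sysname NOT NULL,   -- integer key column to be populated
        dim_table       sysname NOT NULL,   -- dimension table to look up against
        dim_source_col  sysname NOT NULL,   -- business key column on the dimension
        dim_key_col     sysname NOT NULL    -- integer surrogate key on the dimension
    );

    -- The one and only lookup algorithm: generate the same UPDATE for every mapping
    -- row and execute it. (STRING_AGG requires SQL Server 2017 or later.)
    DECLARE @sql nvarchar(max), @nl nchar(1) = NCHAR(10);
    SELECT @sql = STRING_AGG(CONVERT(nvarchar(max),
           'UPDATE f SET f.' + QUOTENAME(fact_key_col) + ' = d.' + QUOTENAME(dim_key_col)
         + ' FROM ' + QUOTENAME(fact_table) + ' f JOIN ' + QUOTENAME(dim_table) + ' d'
         + ' ON d.' + QUOTENAME(dim_source_col) + ' = f.' + QUOTENAME(fact_source_col) + ';'),
           @nl)
    FROM   etl_key_lookup_meta;
    EXEC sp_executesql @sql;

Adding the 500th lookup to a warehouse built this way means adding one row of metadata, not hand-wiring another lookup inside a graphical mapping.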

Summary

In this blog post we have talked about some of the common issues with ETL run times and ETL development. The topics covered included:

  • Extracting data that has not changed
  • Applying changes to data that has not changed
  • Resummarizing data for summary tables
  • Lack of proper parallelism of ETL processing
  • Collapse of Data Structures Plus Processing Algorithms

For a sizable ETL subsystem these things can add up to significant elapsed time per ETL batch cycle and result in the BI system not being up to date by 9am.

Obviously, we have long experience at solving all these problems and many more.

We have implemented many of these solutions in the data warehouse project of our development partner here in Bucharest. We are able to describe them in great detail to you if you would like us to do that.

Best Regards

The BIDA Team!

Do you have any of these ETL problems, and would you like to talk with us about how we solve them in the ETL systems we have implemented?

Please feel free to put your best email address in the form below. The form also has a space to ask us any question you would like.

We will be emailing our subscribers useful and valuable information regarding business intelligence. Please subscribe here to be on our general emailing list.

BIDA0001 – SQL Server Manual Partitioning


If you are not familiar with the benefits of manually partitioning a data warehouse when using SQL Server, then this post is intended to give you a brief introduction to those benefits.

Because we were building SQL Server data warehouses before database-supported partitioning was introduced by Microsoft in SQL Server 2005, we had to implement manual partitioning.

When Microsoft implemented database-supported partitioning in SQL Server 2005, everyone moved to it because there was very little extra cost involved.

Enterprise Edition could be purchased under a Server plus Client Access License model, and this was suitable for data warehouses.

All this changed in SQL Server 2012, when Enterprise Edition, the only edition with the partitioning feature, became available only as a per-core license at USD17,000+ per core.

USD17,000+ per core for the data warehouse database license is not affordable for a lot of Romanian companies.

So Romanian companies that wanted to use SQL Server for their data warehouse and did not know how to implement manual partitioning were faced with a decision:

  • Buy SQL Server Enterprise Edition for USD17,000+ per core.
  • Implement your data warehouse on Standard Edition without partitioning the large tables.

In talking with a number of Romanian companies who implemented SQL Server as their data warehouse we have found that many of them have implemented Standard Edition with no partitioning of the largest fact tables.

Further, we have often found that the tables are stored in a single filegroup, often “PRIMARY”, and are built with clustered indexes.

These three things all serve to slow down the performance of your data warehouse.

In SQL Server it is best to implement the large fact tables as follows:

  • Partition the tables, usually by month.
  • Place the different partitions for both indexes and data on different disks to spread the workload of queries for recent data over a number of disk drives.
  • Avoid the use of clustered indexes and place the data into separate files, using filegroups to direct the placement of the data.

This does take some time and effort to set up correctly and to maintain over the years, but the major benefits are:

  • Superior performance.
  • Ease of maintenance of the much smaller tables and indexes.
  • The much lower cost of the Standard Edition License either as a per core purchase or a Client Access License purchase.

Obviously, here in Romania, most companies would prefer to pay only the license fee for Standard Edition if they can get “nearly the same” performance as they could get from Enterprise Edition.

Table partitioning is the single largest performance improvement feature available in Enterprise Edition that is not available in Standard Edition.

Partitioning also improves the ability to manage the growth of the data warehouse over time. As years and years of data build up in the data warehouse, partitioning means only the newer partitions are updated. Tables and files are smaller. Indexes are smaller.

The benefits of implementing partitioning are widely documented.

What is not widely documented is how to implement manual partitioning on SQL Server. Certainly Microsoft has no interest in explaining to you how this is done!
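As a brief, hedged illustration of the general idea, manual partitioning in Standard Edition is usually built from one physical table per month, each on its own filegroup and carrying a CHECK constraint on the partitioning column, with a UNION ALL view on top. All names below are invented for the example and the filegroups are assumed to already exist.

    -- Sketch only: two monthly fact tables plus a partitioned view.
    CREATE TABLE sales_fact_202401 (
        date_key     int NOT NULL CHECK (date_key BETWEEN 20240101 AND 20240131),
        customer_key int NOT NULL,
        sales_amount decimal(18,2) NOT NULL
    ) ON fg_2024_01;

    CREATE TABLE sales_fact_202402 (
        date_key     int NOT NULL CHECK (date_key BETWEEN 20240201 AND 20240229),
        customer_key int NOT NULL,
        sales_amount decimal(18,2) NOT NULL
    ) ON fg_2024_02;
    GO

    -- Queries against the view that filter on date_key only touch the monthly
    -- tables whose CHECK constraints can satisfy the filter.
    CREATE VIEW sales_fact AS
    SELECT date_key, customer_key, sales_amount FROM sales_fact_202401
    UNION ALL
    SELECT date_key, customer_key, sales_amount FROM sales_fact_202402;
    GO

Loading a new month means creating a new monthly table and altering the view, and old months can be archived by simply moving or dropping their table.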

We have written a very detailed white paper to show you how to manually partition tables in SQL Server Standard Edition to get “almost the same” performance as the Enterprise Edition partitioning feature provides.

If you would like a copy of this detailed white paper, please just put your best email address into the form below and we will send you a link where you can download this document.

We will be emailing our subscribers useful and valuable information regarding business intelligence. Please subscribe here to be on our general emailing list.