The two most common descriptions of ETL Systems are:
- It takes too long
- It costs too much
These two descriptions apply equally when describing the development of ETL as well as the run times of the ETL subsystem.
If you have described your ETL System as “it takes too long” or it “costs too much”?
Then this blog post is for you.
In this blog post we describe some of the common issues that we have seen with ETL subsystems.
Firstly we will discuss some of the run time issues.
Run Time Issues for ETL
The number one reason why ETL systems are slow in production is redundant processing.
This means that processing is being performed that is not necessary to perform.
The usual cause of redundant processing is that the developers of the ETL system did not know how to write the more complicated code to reduce, or eliminate, redundant processing.
Some examples of redundant processing are as follows.
1. Extracting data that has not changed.
In many cases it is hard for developers to determine what data has changed in an operational system. So, during the development period, the developers simply extract all the data for those extracts where it is difficult to tell what has changed.
With pressures of moving the data warehouse in to production these full extracts often go in to production and remain in production until they are causing too many problems and are fixed.
At the very worst the ETL subsystem should be build to perform “delta detection” prior to data being placed in to the staging area so that the only rows that go in to the staging area are rows that have been inserted or updated. If a row has been deleted the “row deleted flag” should be set on the staging area row. The row itself should not be deleted.
This “delta detection” code can be difficult to write in such tools as DataStage and Informatica which is one more reason why it is very often not written.
2. Applying changes to data that has not changed
The mirror image of extracting data that has not changed is the ETL system is applying changes to data in the data warehouse where the data has not changed.
The very often happens when there is no ability to detect changes in rows properly and so more data than is necessary is extracted, and then this data is applied to the target tables without performing a delta detection process either before the staging area or against the target data warehouse tables.
In these cases the updates are logged. If the rows are deleted and then re-inserted the deletes are logged and the indexes are updated.
In all cases the work that is required causes a large extra, unnecessary, processing load.
3. Resummarising data for summary tables
The #1 thing that can be done to improve query performance in a data warehouse is to create summaries to support the queries that can be answered using aggregated data. The larger the volume of the transaction records the larger the benefit of maintaining summaries to answer some queries.
However, maintaining incremental updates of summaries is not a simple thing to do. And many companies develop ETL systems where the updates of the summary tables or cubes performs a lot of redundant processing.
4. Lack of proper parallelism of ETL processing.
Everyone wants their data delivered to their data warehouse, and cubes and reports, by 9am when the business day starts. One of the biggest complaints we see in ETL run time systems is “our data is not updated by 9am, we have to wait.”
Most ETL tools make it very difficult to most effectively parallelize the processing of batches such that the machine running the ETL can be run at high utilization during ETL processing.
Because it is difficult it is generally not well implemented.
Further, in batches that have parallisation in the batch, when one command in the batch fails it generally required an ETL developer make some form of manual changes to the “code” that runs the batch to complete the batch once the problem is fixed.
This introduces the opportunity for manual errors to be introduced in to that ETL processing. Especially because these changes might be being made in the middle of the night when the developer is not so mentally aware as he might otherwise be.
Development Issues With ETL
We have had many years of ETL development experience across many tools, most notably DataStage and Informatica.
As good as these tools are, the development times of the ETL mappings are relatively slow.
The answer for most companies has been to reduce the cost of development by hiring cheaper people, especially from the Indian outsourcers such as Tata, Wipro, InfoSys and others. The booming IT developer market in India has been largely a result of companies sending development work to India and ETL development work is no different.
As a result of this trend the ETL tools developers responded with creating graphical user interfaces that made it possible for a relatively low skilled, low paid developers to develop ETL mappings.
This has reduced the cost of ETL development but it did not particularly reduce the time required to perform ETL development. It did not improve the quality of ETL development either.
During our start up phase we will support ETL developed in DataStage or Informatica. However, we would openly admit we are no faster than any other good DataStage / Informatica developers. This is because the nature of the tools drives the development time, not the skill of the user of the tools.
We took a different approach and we have our own BIDA ETL software.
We took the approach of “make highly skilled people more productive writing ETL” rather than “enable less skilled people to write the ETL.”
Not only does this reduce the cost of ETL development, it improves the quality of the ETL subsystem because the people doing the development work are more skilled and experienced. It is normal for someone with much more skill and much more experience to do higher quality work than a less experienced person. It’s why we pay them more.
So our approach was to find a way to enable highly skilled, highly paid developers to develop ETL faster, cheaper, better than less skilled people who were paid less. We wanted to find a way where the overall cost of such development was considerably less than hiring the “Indian Outsourcing Company”.
And this is what we have done.
The #1 feature of the ETL tools is that they have a graphical user interface to make the development of the ETL possible by less skilled, lower paid developers.
The #1 feature of our BIDA ETL tool is that we avoided having a graphical user interface so as to make the more highly skilled, more highly paid, developers more productive.
It really is as simple as that.
We know that we can develop and support ETL systems for much less cost than anyone else in Romania can. Indeed, we can build ETL systems so quickly and so cost effectively we can migrate the finished BIDA ETL system to Informatica or DataStage if that is what you want and still perform that development work fast than anyone else.
Collapse of Data Structures Plus Processing Algorithms
Because the ETL tools rely on a graphical user interface in order to develop their mappings it is necessary to collapse the data structure and the processing algorithms in to one “program object” which we will call a “mapping”.
Many people do not understand what we mean by that so please allow us to explain.
When you write, say, a fact table mapping, you must define the source, you must define the target, and you must define the lookups for the dimension string keys from the source to the integer keys that will be put on the fact table. All fact table processing with dimensional models looks like this.
In ETL tools you must code those lookups inside the mapping along with any other processing you wish to do for the fact table processing. This means your algorithms for your lookup processing are inside the mapping.
Although this seems obvious and is therefore accepted by everyone, it is a serious issue that slows down development considerably.
Why? Because those same lookups, such as the lookup for dates, times, customers, products, geography, company structure etc are common across MANY fact tables. So the lookups must be coded by the ETL developer each time for each lookup for each fact table.
In a large data warehouse with 50+ fact tables you could be coding upwards of 500 such lookups, all of which have to be tested individually for correctness. And that takes developer time.
The far better way to develop ETL is to separate the data structures from the processing algorithm. This is, in fact, one of the most fundamental development strategies that has been taught for decades.
Always separate the data structures from the algorithms so that you have the algorithm coded only once but you can give the algorithm many data structures to perform it’s processing on.
And this is the approach we have taken. In BIDA ETL the developer does not need to concern himself with the algorithm that will be applied to the data structures he is creating in mappings. The algorithm is separated from the data structures.
This provides a very large productivity boost and that boost is linear to the amount of mappings to be developed.
To explain simply?
With standard ETL tools using a graphical interface the ETL developer must code the algorithm for the ETL processing in to the graphical user interface as well as code the sources and target data structures for processing.
With BIDA ETL the ETL developer only codes the sources and target data structures and processing that is unique to that mapping. He does not code the algorithm for the processing of the data. That is separated and not in the hands of the ETL developer.
This is the fundamental difference that makes BIDA ETL development faster, cheaper, better and more reliable.
In this blog post we have talked about some of the issues with ETL development. The topics covered included:
- Extracting data that has not changed
- Applying changes to data that has not changed
- Resummarizing data for summary tables
- Lack of proper parallelism of ETL processing
- Collapse of Data Structures Plus Processing Algorithms
For a sizable ETL subsystem these thing can add up to significant elapsed time per ETL batch cycle and result in the BI system not being up to date at 9am in the morning.
Obviously, we have long experience at solving all these problems and many more.
We have implemented many of these solutions in the data warehouse project of our development partner here in Bucharest. We are able to describe them in great detail to you if you would like us to do that.
The BIDA Team!
If you have any of these ETL problems and would like to talk with us about how we solve them in the ETL systems we have implemented?
Please feel free to put your best email address in the form below. You also have a space to ask us any question you would like in the form.
We will be emailing our subscribers useful and
valuable information regarding business intelligence.
Please subscribe here to be on our general emailing list.