BIDA0015-ETL Performance Improvements Case Study

0
305

Hello and welcome to our latest blog post.

We really appreciate you coming by to read or listen to our latest blog post!

So thank you for dropping by!

The text of this video will be a little less formal because this blog post is intended to be listened to as an audio rather than read on line.

We realize that you are very busy and you are more likely to listen to this blog post as an audio than you are to read the blog post itself.

So because this blog post is intended to be listened to? The text will be in a little more detail than usual.


Introduction

Now, on the front page of our web site we ask you questions about whether you are satisfied with your ETL system.

We pose questions about the speed of processing, the delivery time of the data and the cost of adding new tables and fields.

And we ask that if you are not satisfied with your ETL in any way, shape or form, we would like to talk with you and hear what your problems are.

We are pretty sure we can help you.

Today we are going to share with you an example of performance improvement at a local Romanian company we have been working with over the last 5 years.

Indeed, we were able to gain the confidence of this company by showing them some of the improvements we could make to their ETL way back in 2015.

To give you just a little background? They are a retailer with a chain of stores nationwide.

Pretty much everyone in Romania knows their name.


The Story Of ETL Performance Improvement for this Customer

So here we go with the story of how we have improved the ETL performance, and data coverage, for this Romanian retailer.

Ok?

The ERP involved is Microsofts Navision and so SQL Server was chosen as the target data warehouse platform.

The first release of the data warehouse had been implemented in 2012 using SQL Server 2008R2 because it was the latest stable release when the data warehouse went live.

After some years of operation the customer was having problems getting the data delivered to their users at a reasonable time.

In this case the data from the overnight processing in Navision was not available to the business users until after lunch, around 1pm.

What were some of the issues in the ETL System? Perhaps your ETL system contains some of these issues.

1. Using Stored Procedures For ETL Development

We do not recommend using Stored Procedures to build ETL Systems where they can be avoided.

Because Stored Procedure Languages differ from database to database you immediately lock your self in to that database if you use SPs as your ETL.

Today, we recommend ETL be written in ANSI Standard SQL Statements.

Where tools are used to perform ETL they should be database independent.

It should always be possible to move your data warehouse database to a different database even if you never actually do that.

It helps in negotiating license fees.

Stored Procedures are also the single most expensive approach to writing ETL systems.

2. Serial Execution Of ETL Processing

Because the ETL was written with Stored Procedures they were executed in serial.

It is difficult to design sql server jobs to execute Stored Procedures in parallel with reliable testing for success of groups of stored procedures.

This lack of parallel processing of the ETL system meant longer elapsed times for the ETL.

3. No Staging Area

There was no staging area constructed.

So the ETL processing, which lasted around 7 hours, was drawing data from the ERP production system directly that whole period.

Not only does this slow down the production ERP but the data that is being extracted is changing over the time of the extraction because business processing was occurring during the extraction period.

4. Incremental Updates Not Implemented

Incremental extracts and updates were not properly implemented.

It is quite difficult to implement incremental update processing properly across an entire data warehouse.

When incremental update processing is not properly implemented this leads to reprocessing of data that does not need to be reprocessed.

This reprocessing adds directly to the ETL run times and the delay of data for the business users.

In this particular case tens of millions of rows of data that had not changed were being re-created and re-processed.

5. Clustered Fact Tables Rather Than Partitioned Fact Tables

The largest fact tables were created as clustered tables using a clustered primary index.

The ETL was then reprocessing many rows in each of the largest fact tables.

To do this re-processing the ETL was re-creating the rows for up to 2 months, deleting the rows in the target fact table, and inserting the re-created rows.

These deletes and inserts were logged and the combination of the logging and the maintenance of the clustered indexes added a great deal of time to the fact table processing.

6. No Multi-Level Summary Tables

There were no multi-level summary tables as per Ralph Kimballs publications.

So once all the data was processed in to the data warehouse the reporting side of the project had to read the detailed transaction fact tables to populate “reports” for the users.

So not only did the ETL finish around 1pm. The reports that were then produced also had to be run by the users to get their results and the run times were long because detailed fact tables were being scanned.

So that was the situation we were faced with 5 years ago.


How did we address some of these issues over the last 5 years?

1. Manually Partition Fact Tables

The first thing we did was to implement an example of a partitioned fact table in test.

Then we ran the exact same ETL for the production fact table against the manually partitioned fact table in test.

Obviously the performance was about 30% faster than the production clustered fact table.

This got their interest because they had never considered manual partitioning.

Database partitioning was not supported in SQL Server 2012 Standard Edition.

They did not want to pay for SQL Server 2012 Enterprise Edition to get database partitioning.

2. Delta Detection To Eliminate Unnecessary Database Processing

We then implemented a simple delta detection algorithm for the partitioned fact table.

This meant that rather than deleting and re-inserting up to 2 months worth of rows, just those rows that needed to be inserted, updated or deleted were processed.

This cut run times by about 60% from the original processing times for this prototype fact table.

3. Introduction of Multi-Level Summary Fact Tables

In his early works in the 90s Ralph Kimball wrote about how a great deal of query time can be saved by having multi-level summary fact tables.

His product “Red Brick” implemented this idea.

SSAS in SQL Server and all cube products implement a version of this idea.

The customer had not implemented any multi-level summary fact tables.

We implemented sample tables for the customers sales transactions and GL transactions to show how the reports would run much faster.

The customer was a MicroStrategy customer.

MicroStrategy has built in features to navigate multi-level summary fact tables because many of the employees who worked at MicroStrategy in the 90s were trained by Ralph Kimball.

Once the then prospect saw that we could make all these improvements in their ETL Processing and Report Query times with relatively simple changes they chose to engage us to support their ETL and data models.

This was way back in 2015.

We then implemented these very simple ideas in to their production ETL system and the run times went down to about 5 hours from about 7 hours.


In 2016 the client wanted to upgrade both the ERP and the Data Warehouse from SQL Server 2008R2 to a later release.

They chose SQL Server 2014 Standard Edition.

Unfortunately the Optimizer had been altered and the extraction SQL from the production ERP could not be made to work on the new version of SQL Server.

After exhaustive testing to get the old SPs to work on the new version of the database failed we chose to implement a staging area.

We had planned to implement a staging area at some point in time.

The database upgrades simply brought the plan forward.

We used the Navision timestamp field to perform incremental extracts from the Navision ERP.

We chose to replicate the entire production Navision System in to the staging area and then repoint the production ETL SPs to the staging area.

Of course, this added an hour of ETL processing time so our processing times went back up from around 5 hours to around 6 hours.

This would be a temporary increase as we would now be able to have greater control over the staging area in terms of indexing and processing loads.

We had previously had to be very careful of processing loads on the production ERP caused by the ETL SPs.

We devised a way to create the staging area that was very quick to create.

And from our learning and experience in that area we can now create staging areas for Navision systems in a weeks work.

We have published a detailed white paper on how to create staging areas for Navision on our web site.

By having the staging area on the same machine as the data warehouse we were able to persuade the optimizer to query the staging area using the same access paths as had previously been used on the Navision Production ERP.

More importantly the full extraction process ran in just 30 minutes per day so the extraction could be run out of business hours and when almost no changes were happening inside the production ERP.

We implemented a delta detection processing algorithm for the 800 or so tables that had data in them.

So we could easily tell which rows in the staging area had changed using a flag rather than using a timestamp.

This was important in implementing incremental updates in later years for more tables being sent in to the data warehouse.

So in 2016 we were having run times of around 6 hours and the old Stored Procedures we inherited were still running serially.

The next major improvement we proposed was to implement the BIDA Scheduler.

The BIDA Scheduler is a very good scheduler that allows the creation and management of “processing groups” that can be processed in parallel.

These “processing groups” can be followed by a “semaphore” group to make sure all processing groups in the current group are complete prior to continuing to the next set of processing groups.

So, please let me provide a simple example of why parallel processing groups is a good idea.

In this customer about 800 tables in their Navision system have data in them.

Most tables are small. But, of course, the sales transactions and the GL transaction tables have quite a lot of rows added to them every day.

And these tables are important to be reported on as early as possible each day. Especially the sales table.

When using Stored Procedures to perform ETL processing the ETL processing is done one table after another.

So the sales transactions must be extracted, and then the GL transactions must be extracted.

The BIDA Scheduler allows you to create a schedule where the sales transactions and the GL transactions can be extracted at the same time.

Then it can guarantee that all extract processing has been completed successfully before allowing the processing to move on to the delta detection and application of all the changes to the staging area.

With the BIDA Scheduler we were able to define processing so that we could run portions of the ETL in parallel.

This reduces run times.

With the customers agreement we implemented the BIDA Scheduler and we split up the processing of the largest fact tables and ran them in parallel.

With all these changes applied, back in 2016, we were able to get processing down to around 4.5-5 hours.

Much better than the 6 hours or so we were then achieving and much better than the 7 hours or so we had started with.

And with processing times less than 5 hours we could make the 9am deadline pretty reliably every day.

Of course, between 2016 and May 2020 we added much more data to the ETL processing and data warehouse.

The volumes of transactions have increased as their number of stores have increased and the number of sales in stores have increased.

We would keep doing performance tuning as much as was possible to the old Stored Procedures and we wrote all the new ETL using the BIDA ETL tools and we implemented new tables using the BIDA data models.

So the fact that we maintained processing times to around 4.5-5 hours while adding a lot more data and volumes going up is also important.

With improvements in processing in the production ERP itself we were able to consistently move the ETL start time to earlier and earlier in the morning.

We also did a machine upgrade and database upgrade in 2019.

Now, the big reason for this blog post is this.

Recently the customer wanted to decrease the ETL run time significantly.

The customer wanted access to the results of prior day sales at 7am rather than 9am.

Without re-writing much of the now 8 year old ETL Stored Procedures we could not get that performance.

We opted instead for a CPU upgrade and some SSD.

We also chose to do some performance tuning of the old Stored Procedures.

Just enough to get the run times down to where we wanted them.

So on our blog post you will see two images of run times from the BIDA Scheduler.

One before and one after our recent performance tuning.

In the diagrams you can see that on the 5th of May the run time for the Week Day Batch was 4 hours, 29 minutes, and one second.

This is right on or about our target processing times.

We have been happy to see processing times in the 4.5-5 hour range for week day processing.

That has been our target range for the last four years.

Remember, that was down from around 7 hours in 2015 with Monday morning processing often being around 8 hours.

In the second diagram, on the 27th of May, you can see that the processing time for the Week Day Batch was 2h 34m 30s.

That is nearly two full hours reduction in ETL processing!

You can see that we can start the batch at 2:15am and so with processing times of around two and a half hours we know we can reliably, every day, be sure the ETL will be long finished by 7am for the business users.

Indeed, business users who come in very early at 6am will also have access to previous day sales final figures by 6am.

Just as a side note you can see the batch number is 1,272 on the 27th of May 2020. And yes, that does mean that there have been 1,272 batches run using the BIDA Scheduler since we put it in to production in 2016.

It has been very reliable in running this customers ETL System even though the original ETL System was written as Stored Procedures and run from the jobs menu in SQL Server Manager.

Now? What changes did we make to reduce the ETL run times by another 2 hours?

With the faster CPU and the SSD we placed the data that is most used in the ETL processing on to the SSDs for faster processing.

For the most part we put the monthly partitions of large fact tables for “this years data” on to the SSDs.

We also adjusted the level of parallelism for portions of the batch to run more of the workload in parallel.

This required some schedule changes that were relatively easy to do.

We also went in and looked at the most time consuming Stored Procedures and we altered some of the now very old Stored Procedures to improve their performance.

But all in all?

When you consider we had already done a lot of performance work to get the Week Day Batch to run in around 4.5 to 5 hours?

And when you consider we were now able to get that elapsed time down to around 2.5 hours?

We think we did something that is worth telling you about in a blog post!


Summary

Now I would like to summarize what we have shared with you in this blog post because we feel it is important.

We would also like to leave you with some “food for thought” about your ETL system on SQL Server, especially if it is written using Stored Procedures.

We are a startup company and while our company is in start up mode we are focusing on finding customers who are using SQL Server as their data warehouse database and who have problems with their ETL Subsystems.

Of course, we don’t expect you to volunteer the fact if you do have ETL problems.

We know we have to earn your respect by demonstrating to you we know how to fix ETL problems and we know how to improve ETL performance.

If we prove to you that we can do that and you come to respect that we can do that?

Then maybe you will be willing to share with us what you feel are the issues that you have with your ETL system.

Ok?

So in this blog post we have told you about how we were introduced to a prospect here in Romania who were getting their data in to their data warehouse only at about 1pm each day. And then reports had to be run off that data.

They would have liked to get their data earlier in the day.

There were a number of limitations in the ETL and Data Warehouse. They:

  1. Used Stored Procedures For ETL Development.
  2. Had Serial Execution Of ETL Processing
  3. Had No Staging Area.
  4. Had Not Implemented Incremental Updates Properly.
  5. Had Used Clustered Facts Tables Rather Than Partitioned Fact Tables.
  6. Had No Multi-Level Summary Tables.

Over time we were able to address all these issues without vastly re-writing the ETL Stored Procedures.

We are maintaining more than 60,000 lines of Stored Procedure code written as long ago as 2012.

By addressing those issues, and even with adding a staging area, we were able to cut the ETL processing time from 7 hours to 6 hours.

When we introduced the BIDA Scheduler to be able to properly run processing in parallel we cut that further to about 4.5-5 hours.

We have maintained the 4.5-5 hour processing window for the last 4 years despite adding a lot more processing and a lot more data to the daily ETL processing.

The customer recently asked us to reduce the processing time more and we chose to move forward with a combination of a faster CPU, an SSD and some relatively simple to implement improvements to the old Stored Procedures.

In the end we were able to reduce processing time to just over 2.5 hours.

This is what we have talked about in this blog post today.

We hope that by reading or listening to this post we have given you some reasonable evidence we know how to improve processing times for ETL systems implemented on SQL Server, especially those implemented with Stored Procedures.

If you are using SQL Server for your data warehouse and you have any of the six limitations we have listed above?

We would really like to show you how we can reduce your ETL run times and get your business users the data they need on their devices, sooner, each day.


Before we finish this blog post we want to leave you with one last thought.

Ok?

One of the main reasons Stored Procedures should never be used for ETL is this.

The programmers of the Stored Procedures often cut corners and produce code that is inefficient, slow and hard for someone else to come along and maintain.

They do this because when they are writing the Stored Procedure the emphasis is on:

“How long does it take to write, test and implement the Stored Procedure in to production?”

This emphasis is because there is an easily recognized and easily quantifiable cost to the development of the Stored Procedures.

The programming consultants work on a paid by the hour basis.

So if it takes 8 hours to write a Stored Procedure and put it into production?

The easily quantifiable cost is 8 hours times the hourly rate.

Many customers sincerely believe that this cost represents the majority of the cost of the development of the Stored Procedure.

It does not.

In fact, when an ETL Stored Procedure is implemented in to production, depending on the skill of the developer, you have only incurred between 20%-30% of the cost of that Stored Procedure.

The other 70%-80% of the cost of that Stored Procedure is yet to be incurred.

This is called “maintenance” and it can be 3-5 times the cost of development.

Since we know these are the numbers?

We know how to implement ETL systems that are not only faster and cheaper to develop, but they must be faster and cheaper to maintain too.

If you are using SQL Server and you have your ETL System written using Stored Procedures?

Or if you think your ETL maintenance costs are too high and would like to reduce them?

We would be very interested in showing you how our BIDA ETL will significantly reduce your cost of ownership of your data warehouse over the next 5 year period.

Ok?


Now, here at the end of our blog post?

I would like to say again that we are very pleased you have chosen to read or listen to our blog post.

We hope we have given you some real food for thought about your ETL system you have implemented on SQL Server.

We hope by sharing with you some of the story of the ETL support we have provided for this particular Romanian customer that we have earned just a little credit towards demonstrating that we know what ETL systems are about and how to improve them.

My contact details are on our web site!

I would love to talk with you and hear what you consider are your issues with your ETL system.

I would like to say thank you very much for reading our blog post or listening to me today.

I really appreciate you!

Thank you!

Mihai Neacsu


Contact Us


We will be emailing our subscribers useful and
valuable information regarding business intelligence.
Please subscribe here to be on our general emailing list.

LEAVE A REPLY

Please enter your comment!
Please enter your name here