Monday 18 March 2013

Populating billions of rows in a SQL Server table

A few learnings from a recent experience, when my team was trying to fulfill a seemingly benign requirement: populate a few billion rows in a table. The table in question had 140 columns, 30 mandatory and the rest optional. The optional columns were also marked as SPARSE in order to save space. Since the table was to be used for performance testing, it was not necessary to populate all the columns for each row.
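By way of illustration only (all names, types and sizes below are invented), the shape of such a table is something like this:

    -- Illustrative shape of the table: a handful of mandatory columns plus optional
    -- columns declared as SPARSE so that NULL values take no storage.
    CREATE TABLE dbo.PerfTestTable
    (
        Id            BIGINT IDENTITY(1,1) NOT NULL,
        CustomerName  VARCHAR(100)         NOT NULL,
        OrderDate     DATE                 NOT NULL,
        -- ... 27 more mandatory columns ...
        OptionalCol1  INT          SPARSE NULL,
        OptionalCol2  VARCHAR(50)  SPARSE NULL
        -- ... remaining optional columns, all SPARSE ...
    );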
 
We started out by doing some basic math around how much disk space we would require: calculate the space needed by the mandatory columns per row and multiply by the number of rows. In our case it came out to 120 GB. Later we used Microsoft's recommended approach and got a better estimate; the space required jumped to 190+ GB. That settled the questions about space requirements.
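The first rough calculation can be scripted against the column metadata along these lines; the table name and the row count below are placeholders, and the query deliberately ignores row overhead, variable-length columns and SPARSE storage, which is roughly why the documented estimation formula produced a higher figure.

    -- Back-of-the-envelope estimate: sum the declared sizes of the mandatory (NOT NULL)
    -- columns and multiply by the planned row count. Note that max_length is -1 for
    -- (MAX) types, so those would need to be handled separately.
    DECLARE @PlannedRows BIGINT = 2000000000;  -- placeholder for "a few billion"

    SELECT (CAST(SUM(c.max_length) AS BIGINT) * @PlannedRows) / 1024 / 1024 / 1024 AS EstimatedGB
    FROM sys.columns AS c
    WHERE c.object_id = OBJECT_ID('dbo.PerfTestTable')
      AND c.is_nullable = 0;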
 
Now came the most intriguing part: how to generate so many rows with minimum effort. Since the database was to be used for performance testing, we figured that not all of the rows needed to be unique.
 
We used the Data Generation Plan that comes free with the database project template in Visual Studio 2010 Ultimate edition to generate 1 million unique rows. Then we duplicated those rows, with slight variations in a few random columns, using INSERT statements until we reached 4 million rows, and exported all the rows to a CSV file. All of the activities described so far were done in the development environment.
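The duplication step was essentially a self-insert with small variations; a minimal sketch, reusing the invented column names from above (the real statement listed all 140 columns):

    -- Each run doubles the row count with slight variations in a couple of columns,
    -- so 1 million rows become 2 million, then 4 million.
    INSERT INTO dbo.PerfTestTable (CustomerName, OrderDate /* , ...remaining columns... */)
    SELECT
        CustomerName + CAST(Id % 97 AS VARCHAR(10)),  -- vary a text column slightly
        DATEADD(DAY, Id % 31, OrderDate)              -- vary a date column slightly
        /* , ...remaining columns copied as-is... */
    FROM dbo.PerfTestTable;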
 
Now we took the exported CSV file containing 4 million rows to the test environment and used the BULK INSERT command to import the rows in parallel sessions, each running in a loop. Before doing so, we changed the recovery model of the database to Bulk Logged and changed the file growth settings of the log and data files to grow by 3 GB instead of the default "By 10 percent, Limited to 2097152 MB". The process took a full day to complete.
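In T-SQL terms, the setup and the per-session load loop looked roughly like the following; the database name, logical file names, file path and iteration count are placeholders, not the actual ones we used.

    ALTER DATABASE PerfTestDb SET RECOVERY BULK_LOGGED;

    -- Grow in fixed 3 GB increments instead of the default percentage-based growth.
    ALTER DATABASE PerfTestDb MODIFY FILE (NAME = PerfTestDb_Data, FILEGROWTH = 3GB);
    ALTER DATABASE PerfTestDb MODIFY FILE (NAME = PerfTestDb_Log,  FILEGROWTH = 3GB);

    -- Each parallel session ran a loop along these lines.
    DECLARE @i INT = 0;
    WHILE @i < 250   -- iterations per session (placeholder)
    BEGIN
        BULK INSERT dbo.PerfTestTable
        FROM 'D:\Data\rows_4_million.csv'
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);  -- TABLOCK helps minimal logging on a heap
        SET @i += 1;
    END;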
 
One interesting learning was that the BCP utility runs about 40% slower than the BULK INSERT command: importing the 4 million rows took 5-6 minutes with BCP against the 3-4 minutes taken by BULK INSERT in the development environment.
 
I am guessing this process would have been much faster had we also introduced a partitioning scheme for the table, as that might have allowed faster parallel data loads for rows falling in different partition key ranges.
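For reference, a minimal sketch of what such a partitioning setup could look like; the partition key, boundary values and filegroup placement are all hypothetical:

    -- Hypothetical range partitioning on a numeric key so that each parallel session
    -- loads rows falling in a disjoint key range (and hence a different partition).
    CREATE PARTITION FUNCTION pf_PerfTest (BIGINT)
        AS RANGE RIGHT FOR VALUES (1000000000, 2000000000, 3000000000);

    CREATE PARTITION SCHEME ps_PerfTest
        AS PARTITION pf_PerfTest ALL TO ([PRIMARY]);

    -- The table (or per-range staging tables) would then be created ON ps_PerfTest(<partition key>)
    -- so that concurrent BULK INSERT sessions do not contend on the same partition.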

Tuesday 12 March 2013

Semantic Logging (without ETW)

Microsoft's Patterns and Practices team recently released a CTP of the Semantic Logging Application Block (SLAB) as part of the Enterprise Library suite. It can be downloaded from their CodePlex site.
UPDATE: The first official release of SLAB has been included in Enterprise Library 6. There is a new post on it here.
 
It provides a solution for a very common problem: logging. It enables developers to log events (business, technical, error etc.) in a structured way so that later analysis of the logged information can produce valuable input for the business. It is not that SLAB brings a new way of looking at things; rather, it offers a prescriptive solution that uses ETW (Event Tracing for Windows) for high-performance logging and provides two flavors of logging, synchronous and asynchronous. The underlying intent and approach remain the same though.

I want to highlight two general points about logging that apply to enterprise applications irrespective of the solution you choose for logging (SLAB or no SLAB, it should not matter):
  1. Semantic logging is a generic concept that can be implemented even with the good old Logging Application Block of Enterprise Library.
  2. You need to decide upfront, perhaps during the planning phase, which data points you would like to collect. For example, if your application is a flight booking website, you may be interested in storing information about the flight searches being done on the website (to find a trend and maybe show a cheap fare as a link on the home page a week later to attract new visitors), the steps at which users abandon the booking flow (to find a pattern), the time taken by the payment gateway and insurance providers, the number of times you contacted a third party (to keep an account of usage for later reconciliation), and so on. Once these events and the data related to each event are decided, the rest of the steps are fairly easy.
I created a simple demo application that implements the idea of semantic logging. Its current implementation logs information directly to the EventLog, but it can easily be modified to use Enterprise Library or SLAB instead; the key point is that the infrastructure and separation required for logging key business information is moved into one project, SemanticLogging.Demo.Analytics. You can choose a different store that supports querying (e.g. a database) to enable a dashboard-like feature that business analysts can use to measure the effectiveness of new changes.


If you use the Logging Application Block of EntLib, you can assign a category to each business area so that logging can easily be enabled or disabled for a particular business area. For example, you can log analytics data related to airfare searches under the "AirFareSearch" category and analytics data related to booking air tickets under "AirTicketBooking".

Of course, this is just a demo of the concept, and the actual solution can be improved with time and effort.

You can download the solution here.