Few learning from recent experience gained when my team was trying to fulfill a benign requirement - populate a few billion rows in a table. The table in question had 140 columns - 30 mandatory and rest were optional. Optional columns were also marked as SPARSE in order to save space. Since the table was to be used for performance testing, it was not necessary to populate all the columns for each row.
We started out by doing some basic math around how much space we will require on the disk - calculate the space required for mandatory columns per row and then multiply by number of rows. In our case it turned out to be 120 GB. Later we used the Microsoft's recommended approach to get even better estimate. The estimate of the space required jumped to 190+ GB. Questions related to space requirements were resolved.
Now came the most intriguing part. How to generate so many rows with minimum effort. Since this database was to be used for performance testing, we figured that not all of the rows need to be unique.
We used Data Generation Plan that comes free with database project template of Visual Studio 2010 Ultimate edition to generate 1 million unique rows. Then we duplicated the rows with slight variation in some random columns using INSERT statement to reach up to 4 million rows and exported all the rows into a CSV file. All the activities detailed till now were done on the development environment.
Now we took the exported CSV file containing 4 million rows to the test environment and used BULK INSERT command to import rows in parallel sessions in a loop. Before doing so, we changed the recovery mode of the database to Bulk Logged and changed the file growth settings for log and data files to increase by 3 GB instead of the default setting of "By 10 percent, Limited to 2097152 MB". The process took a full day to complete.
One interesting learning was that BCP utility runs about 40% slower than BULK INSERT command. Import of 4 million rows took 5-6 minutes using BCP utility against the 3-4 minutes taken by BULK INSERT command on development environment.
I am guessing this process would have been much faster had we also introduced partitioning scheme for the table as that might have allowed faster parallel data load for rows that had different partition key ranges.
No comments:
Post a Comment