O'Reilly Databases

oreilly.comSafari Books Online.Conferences.

We've expanded our coverage and improved our search! Search for all things Database across O'Reilly!

Search Search Tips

advertisement
AddThis Social Bookmark Button

Listen Print Discuss Subscribe to Databases Subscribe to Newsletters

Faster Filtered SQL Server Imports

by Colin Goddard
12/21/2006

This article shows how to bulk load data from text files into SQL Server, using Perl's Win32::OLE module to control Microsoft's Database Transformation Services (DTS).

If you have ever tried to load large amounts of data using SQL inserts, you know how slow it method is. A million records loaded in this way could take over 40 minutes, which is why the various database providers have there own bulk-loading tools to speed up the task (SQL Server DTS, for example). If, however, you use Perl for all your data parsing, and would also like to perform database imports directly from Perl, how do you get around the speed issue?

Using Win32::OLE to tap into the DTS object model allows you to combine Perl's file handling and data parsing capabilities with the bulk loading speed of DTS to load data at the rate of over a million records per minute.

Tip: When loading large datasets into SQL Server, set the recovery model to BULK_LOGGED, as the FULL recovery models build up massive log files for large imports.

Running the Test Script

First, download the test script dtsBulkLoad.pl into your scripts directory, the Perl module dtsBulkLoad.pm into your lib/ directory (plus the data file customers.txt into C:\temp\ if you want to try the test example).

Then check that you have the necessary dependencies installed and configured on your PC:

  1. A Win32 version of Perl, such as ActiveState v5.8.6.
  2. The Win32::OLE, DBI, and DBD::ODBC modules from CPAN.
  3. A System DSN for your database. For the test example, create one called NWIND for the Northwind database that comes with SQL Server.

Talk to your local IT support if you are unsure about any of the above.

Import data into the Customers table. If, for example, your SQL Server database runs on a server called Server01, use the command:-

C:\> dtsBulkLoad.pl Server01 NWIND Customers c:\temp\customers.txt

By calling this script with your own server, database, table, and data-file options, you can bulk load data into your SQL Server database.

How does it work?

The test script dtsBulkLoad.pl walks through the process, while the module, dtsBulkLoad.pm, provides the subroutines that do the main work.

#!/usr/local/bin/perl
#
# Program dtsBulkLoad.pl for using Win32::OLE to create and run a
# dts package for importing a text file into SQL Server
    
use strict;
use dtsBulkLoad;
use Win32::OLE::Variant;

my ($server, $database, $table, $dataFile);

# Check for 4 arguments; server, database, table and data-file
if (@ARGV == 4){
    ($server, $database, $table, $dataFile) = @ARGV;
} else {
    die "Call with 4 arguments: server, database, table and data-file\n";
}

# Define variables
my $dataTable= "[$database].[dbo].[$table]";
my $sourceId = 1;
my $destId = 2;
my $rowDelim = "\r\n";
my $colDelim = ",";
my $textQual = "\"";

dtsBulkLoad.pl starts by loading the dtsBulkLoad.pm module, checking for four command line arguments; server, database, table, and data-file, then defining and setting some variables. It then makes a series of calls to dtsBulkLoad.pm to perform the tasks:

  1. Create a new Win32::OLE 'DTS.Package' object.

    my $Package = dtsBulkLoad::getPackage("DataLoader");

    This calls the getPackage() subroutine to initialise a Package object called DataLoader.

    sub getPackage {
      my ($name) = @_;
      my $Package = Win32::OLE->new( 'DTS.Package' );
      $Package->{Name} = $name;
      $Package->{FailOnError} = 0;
      return $Package;
    }
  2. Create a source connection object for the text file.

    dtsBulkLoad::connectToFile($Package,"Source",$sourceId,$dataFile,$rowDelim,
      $colDelim,$textQual);

    This calls the connectToFile() subroutine to create a DTSFlatFile source connection object and configure various parameters such as Row Delimiter, Column Delimiter, and Text Qualifier:

    sub connectToFile {
        my ($Package,$name,$id,$dataFile, $rowDelim, $colDelim, $textQual) = @_;
        my $Conn = $Package->Connections->New("DTSFlatFile");
        $Conn->{Name} = $name;
        $Conn->{ID} = $id;
        $Conn->{DataSource} = $dataFile;
        $Conn->{Reusable} = 1;
        $Conn->{ConnectImmediate} = 0;
        $Conn->{ConnectionTimeout} = 600;
        $Conn->{UseTrustedConnection} = 0;
        $Conn->{UseDSL} = 0;
        $Conn->ConnectionProperties->{"Data Source"}->{Value} = $dataFile;
        $Conn->ConnectionProperties->{"File Format"}->{Value} = 1;
        $Conn->ConnectionProperties->{"Row Delimiter"}->{Value} = $rowDelim || "\r\n";
        $Conn->ConnectionProperties->{"Column Delimiter"}->{Value} = $colDelim || ",";
        $Conn->ConnectionProperties->{"Text Qualifier"}->{Value} = $textQual || "\"";
        $Package->Connections->Add($Conn);
    }
  3. Create a destination connection object for the database.

    dtsBulkLoad::connectToDb($Package,"Destination",$destId,$server,$database);

    This calls the connectToDb() subroutine to create a SQLOLEDB destination connection object and configure various parameters such as Data Source, Initial Catalog, and Integrated Security:

    sub connectToDb {
        my ($Package,$name,$id,$server,$database) = @_;
        my $Conn = $Package->Connections->New("SQLOLEDB");
        $Conn ->{"Name"} = $name;
        $Conn ->{"ID"} = $id;
        $Conn ->{"DataSource"} = $server;
        $Conn ->{"Catalog"} = $database;
        $Conn->{"UseTrustedConnection"} = 1;
        $Conn->ConnectionProperties->{"Data Source"}->{Value} = $server;
        $Conn->ConnectionProperties->{"Initial Catalog"}->{Value} = $database;
        $Conn->ConnectionProperties->{"Integrated Security"}->{Value} = "SSPI";
        $Conn->ConnectionProperties->{"Persist Security Info"}->{Value} = 1;
        $Package->Connections->Add($Conn);
    }
  4. Add a step object to the package.

    dtsBulkLoad::addStep($Package,"Step1","CopyColumns");

    This calls the addStep() subroutine to create a step called CopyColumns:

    sub addStep {
        my ($Package,$stepName,$taskName) = @_;
        my $Step = $Package->Steps->New();
        $Step->{Name} = $stepName;
        $Step ->{TaskName} = $taskName;
        $Step ->{ExecuteInMainThread} = 1;
        $Package->Steps->Add($Step);
    }
  5. Get the column definitions for the table.

    (my $rDefs = dtsBulkLoad::getColumnName($database, $table)) || die "No def\n";

    This calls the getColumnName() subroutine which uses the DBI module to return a reference to an array of column definitions for the database table. This is use when configuring the Transformation object in the next stage.

    sub getColumnName {
        my ($database, $tableName) = @_;
        my @columnDefs = ();
        my $dbh = DBI->connect("DBI:ODBC:$database");
        my $sql = "select * from $tableName";
        my $sth = $dbh->prepare($sql) or return 0;
        $sth->execute();
        for (my $count = 0; $count < @{$sth->{NAME}}; $count++) {
          push @columnDefs, [$sth->{NAME_lc}[$count], $sth->{TYPE}[$count]];
        }
        if (@columnDefs > 0) {
          return \@columnDefs;
        } else {
          return 0;
        }
    }
  6. Add a task object and configure CustomTask and Transformation objects.

    dtsBulkLoad::addTask($Package,"CopyColumns",$dataFile,$sourceId,$dataTable,$destId, $rDefs);

    This calls the addTask() subroutine which in turn calls the addColumns() subroutine. addTask() creates a new DTSDataPumpTask called CopyColumns and a CustomTask of the same name. It configures the source and destination connections in the CustomTask, and then creates a Transform object. It then calls the addColumns() subroutine, passing the Transform object and column definitions from the previous step. This simplifies the trickiest part of the procedure, by using the column definitions to correctly configure column name and data type information in the Transform object.

    sub addTask {
        my ($Package,$taskName,$SourceObj,$sourceId,$DestObj,$destId, $rColDefs) = @_;
        my $Task = $Package->Tasks->New("DTSDataPumpTask");
        $Task ->{Name} = $taskName;
        my $CustomTask = $Task->{CustomTask};
        $CustomTask ->{Name} = $taskName;
        $CustomTask ->{SourceConnectionID} = $sourceId;
        $CustomTask ->{SourceObjectName} = $SourceObj;
        $CustomTask ->{DestinationConnectionID} = $destId;
        $CustomTask ->{DestinationObjectName} = $DestObj;
        # Create a transformation object, with column details and add to custom task
        my $Transform = $CustomTask ->Transformations->New("DTS.DataPumpTransformCopy");
        $Transform ->{Name} = "TransformData";
        $Transform ->{TransformFlags} = 20;  # AllowNullChange and AllowStringTruncation
        addColumns($Transform, $rColDefs);   # Add columns using information in $rColDefs
        $CustomTask->Transformations->Add($Transform);
        # Add task to package
        $Package->Tasks->Add($Task);
    }
    
    sub addColumns {
        my ($Transform, $rColDefs) = @_;
        my $colNum = 1;
        
        # For each entry from the column definitions array
        foreach my $def (@{$rColDefs}) {
            
          # Add source column info for the text file
          my $colName = sprintf("Col%03d", $colNum);
          my $SourceColumn = $Transform->SourceColumns->New($colName , 1);
          $SourceColumn->{Name} = $colName;
          $SourceColumn->{DataType} = 129;
          $Transform->SourceColumns->Add($SourceColumn);
            
          # Add destination column info for the database
          my $dataType = ($def->[1] ==  4) ? 3: 129;
          my $DestColumn = $Transform->DestinationColumns->New($def->[0] , 1);
          $DestColumn->{DataType} = $dataType;
          $DestColumn->{Nullable} = 1;
          $Transform->DestinationColumns->Add($DestColumn);
            
          # Increment column number
          $colNum++;
        }
    }

The script finally calls $Package->Execute() to run the package and performs some basic error checking:-

#Execute the package
$Package->Execute();

#Check for errors
my ($ErrMsg, $Err) = StepErrors();

if ($Err == -1){
    print "$ErrMsg\n";
} else {
    print "Package Completed Successfully\n";
}
$Package=undef;    
exit $Err;

#Loop through the steps and look for errors based on the ExecutionResult
sub StepErrors {
    my $ErrorCode=Variant(VT_I4|VT_BYREF, "-1");
    my $Source=Variant(VT_BSTR|VT_BYREF, "");
    my $Description=Variant(VT_BSTR|VT_BYREF, "");
    foreach my $Steps (in {$Package->{Steps}})    {
      if ($Steps->{ExecutionStatus} == 4)        {
        if ($Steps->{ExecutionResult} == 1)            {
          $Err = -1;
          $Steps->GetExecutionErrorInfo($ErrorCode,$Source,$Description);        
          $ErrMsg = "$ErrMsg \n Step $Steps->{Name} failed, error:\n  $ErrorCode \n $Description \n";
        }
      }
    }
    return $ErrMsg, $Err;
}

With an understanding of how the test script works, you can modify your own Perl scripts to use dtsBulkLoad.pm for loading any format of text file into any SQL Server data table at bulk-load speed.

Colin Goddard is a Graduate of the University of St Andrews and has worked in the Pharmaceutical / Biotech industry for the last 22 years.


Return to O'Reilly Databases


Have you used a similar technique? Share your tips here.
You must be logged in to the O'Reilly Network to post a talkback.
Post Comment
Full Threads Oldest First

Showing messages 1 through 7 of 7.

  • Bug in test script
    2007-05-17 06:26:30  CPG [Reply | View]

    My thanks to Zhen Liu for alerting me to an error in the test script.

    If the System DSN name is not the same as the name of the database it points to (‘NWIND’ points to ‘Northwind’ in the test script) the following error results:-

    Step Step1 failed, error:
    -2147467259
    Cannot open database requested in login 'NWIND'. Login fails.

    The quick fix is to name the system DSN that attaches to the Northwind database 'Northwind' rather than 'NWIND', but to allow the program to work with dsn names that are different to the database name, make the following changes to the code.

    Modify the code that handles the command line arguments:-

    my ($server, $dsn, $database, $table, $dataFile);
    # Check for 5 arguments; server, dsn, database, table and data-file
    if (scalar @ARGV == 5){
    ($server, $dsn, $database, $table, $dataFile) = @ARGV;
    } else {
    die "Call with 5 arguments: server, dsn, database, table and data-file\n";
    }

    Change the parameter that is passed to getColumnName():-

    ### Get the column definitions for the table
    (my $rDefs = dtsBulkLoad::getColumnName($dsn, $table)) || die "No def\n";

    and call with the following command line arguments:-

    dtsBulkLoad.pl Server01 NWIND Northwind Customers c:\temp\customers.txt

  • Questions about BCP and CPAN
    2007-01-06 23:17:21  glasswalk3r [Reply | View]

    Hello there,

    Nice article! I really liked it but I cannot avoid asking two questions about the process you described:

    1 - Why not using bcp program?
    One could try to execute bcp directly using system function or any other. Just supposing, this would not be faster than using Win32::OLE to get access to DTS API? The Perl module could generate a configuration file for BCP in a fly.

    2 - Are you thinking about in creating this module of yours as a generic API in CPAN?

    Thanks!
    • Questions about BCP and CPAN
      2007-01-15 01:21:41  CPG [Reply | View]

      As an example of the flexibility of using this DTS module, you only need to adjust two parameters to import fixed width data from a file. In the connectToFile() subroutine, set the ‘File Format’ to 2, and replace the ‘Column Delimiter’ parameter with ‘Column Lengths’, set to the lengths of the fixed width fields. For example, if there are three fields of length 6, 25 and 25, use the following settings (passed as parameters):-

      $Conn->ConnectionProperties->{"File Format"}->{Value} = 2;
      $Conn->ConnectionProperties->{"Column Lengths"}->{Value} = "6,25,25";
    • Questions about BCP and CPAN
      2007-01-08 08:06:09  CPG [Reply | View]

      Thanks for the questions.

      1. The simple answer is that, before I wrote this module, I used DTS rather than BCP for my data loading. BCP probably would have been quicker for bulk loading data from flat-files, but it did not have the flexibility of DTS for controlling the different data sources, performing data transformations, etc. The article gives an example of working with the DTS object model but, due to space considerations, does not use the full range of available DTS features.

      2. I had not planned to put this on CPAN, but if people were interested, I could do.
      • Publishing on CPAN
        2007-01-16 11:26:51  glasswalk3r [Reply | View]

        Hello there,

        I've been working in a API layer over Win32::OLE and MS SQL Server API to fetch data related to DTS packages. This API hides all complex structures required from Win32::OLE and, with some degree of luck, the need to convert VBscript from the SQL Server documentation.

        Probably I'll take a namespace simple as "DTS". Till now, I have just some classes representing the DTS tasks like SendMailTask and ExecutePackageTask. I had used this API successfully to implement unit tests in the DTS packages I had to develop as ETL interfaces in a project I had worked.

        I'll probably publish it on CPAN in a week or two. Maybe you would like to extend the DTS::Task class to implement the BulkInsertTask object. ;-)
        • Publishing on CPAN
          2007-03-06 06:35:18  glasswalk3r [Reply | View]

          All right guys, here it is: http://search.cpan.org/search?query=dts&mode=all

          It would be nice to received some feedback about the new API.
          • Publishing on CPAN
            2007-03-06 07:24:50  CPG [Reply | View]

            Hi,

            Sorry for not replyint to your last email; we have just gone through a merger! I will try to take a look at this and get back to you.

            CPG


Tagged Articles

Be the first to post this article to del.icio.us

Sponsored Resources

  • Inside Lightroom

Related to this Article

SQL Hacks SQL Hacks
by Andrew Cumming , Gordon Russell
November 2006
$29.99 USD

PHP/SQL 1: Introduction to Database ProgrammingPHP/SQL 1: Introduction to Database Programming
Gain valuable skills in building SQL and mySQL applications using PHP.

Advertisement
O'Reilly Media

©2009, O'Reilly Media, Inc.
(707) 827-7000 / (800) 998-9938
All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
About O'Reilly
Academic Solutions
Authors
Contacts
Customer Service
Jobs
Newsletters
O'Reilly Labs
Press Room
Privacy Policy
RSS Feeds
Terms of Service
User Groups
Writing for O'Reilly
Content Archive
Business Technology
Computer Technology
Google
Microsoft
Mobile
Network
Operating System
Digital Photography
Programming
Software
Web
Web Design
More O'Reilly Sites
O'Reilly Radar
Ignite
Tools of Change for Publishing
Digital Media
Inside iPhone
O'Reilly FYI
makezine.com
craftzine.com
hackszine.com
perl.com
xml.com

Partner Sites
InsideRIA
java.net
O'Reilly Insights on Forbes.com