Sunday, 1 May 2011

Download source file from website with SSIS

Case
I have to download my source file from a website. Can SSIS do that for me?

Solution
There are a few options:
1) Third party tools such as CozyRoc or BlueSSIS.
2) Script Task with WebClient
3) Script Task with HTTP Connection Manager

I will show you the second and third option. First the WebClient solution.


The WebClient solution:
1) Start situation
I have a Data Flow that reads a CSV file (Products.csv), does some adjustments and inserts the records into a database table. The CSV file is on my harddrive, but I want a fresh download each time I run the package.
Start situation

















2) Variables
Create a SSIS string variable named DownloadURL and fill it with a correct value.
Variable with the download URL









3) Script Task
Add a Script Task to your Control Flow and give it a suitable name. Connect it to your Data Flow Task.
The Script Task


















4) ReadOnly Variable
Add the DownloadURL variable as ReadOnly to the Script Task.
ReadOnly Variable



















5) The Script
Add the following C#/VB.net Script to your Script Task. It downloads the file and uses the Flat File Connectionstring to save the file.
// C# code
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.Net;           // Added

namespace ST_929b5207bd37455a882c35e500ab2950.csproj
{
    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {

        #region VSTA generated code
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
        #endregion

        public void Main()
        {
            try
            {
                // Logging start of download
                bool fireAgain = true;
                Dts.Events.FireInformation(0, "Download File", "Start downloading " + Dts.Variables["DownloadURL"].Value.ToString(), string.Empty, 0, ref fireAgain);

                // Create a webclient to download a file
                WebClient mySSISWebClient = new WebClient();

                // Download file and use the Flat File Connectionstring (D:\SourceFiles\Products.csv)
                // to save the file (and replace the existing file)
                mySSISWebClient.DownloadFile(Dts.Variables["DownloadURL"].Value.ToString(), Dts.Connections["myProductFile"].ConnectionString);

                // Logging end of download
                Dts.Events.FireInformation(0, "Download File", "Finished downloading " + Dts.Connections["myProductFile"].ConnectionString, string.Empty, 0, ref fireAgain);

                // Quit Script Task succesful
                Dts.TaskResult = (int)ScriptResults.Success;
            }
            catch (Exception ex)
            {
                // Logging why download failed
                Dts.Events.FireError(0, "Download File", "Download failed: " + ex.Message, string.Empty, 0);

                // Quit Script Task unsuccesful
                Dts.TaskResult = (int)ScriptResults.Failure;
            }
        }
    }
}

' VB.net Code
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.Net      ' Added

<System.AddIn.AddIn("ScriptMain", Version:="1.0", Publisher:="", Description:="")> _
<System.CLSCompliantAttribute(False)> _
Partial Public Class ScriptMain
 Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase

 Enum ScriptResults
  Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
  Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
 End Enum
 
    Public Sub Main()
        Try
            ' Logging start of download
            Dim fireAgain As Boolean = True
            Dts.Events.FireInformation(0, "Download File", "Start downloading " + Dts.Variables("DownloadURL").Value.ToString(), String.Empty, 0, fireAgain)

            ' Create a webclient to download a file 
            Dim mySSISWebClient As WebClient = New WebClient()

            ' Download file and use the Flat File Connectionstring (D:\SourceFiles\Products.csv)                 
            ' to save the file (and replace the existing file)
            mySSISWebClient.DownloadFile(Dts.Variables("DownloadURL").Value.ToString(), Dts.Connections("myProductFile").ConnectionString)

            ' Logging end of download                 
            Dts.Events.FireInformation(0, "Download File", "Finished downloading " + Dts.Connections("myProductFile").ConnectionString, String.Empty, 0, fireAgain)

            ' Quit Script Task succesful
            Dts.TaskResult = ScriptResults.Success
        Catch ex As Exception
            ' Logging why download failed
            Dts.Events.FireError(0, "Download File", "Download failed: " + ex.Message, String.Empty, 0)

            ' Quit Script Task unsuccesful
            Dts.TaskResult = ScriptResults.Failure
        End Try
    End Sub
End Class
Note: This is a script with some basic eventlogging and error handling. See this article for more details.

6) Testing
Now run your package to test the result. You can use this CSV file to test it:
https://sites.google.com/site/ssisblogspot/products.csv

Note: See MSDN WebClient page for other properties of the WebClient Class such as Credentials and Proxy. If you don't want to add Credentials and Proxy settings in your Script Task the next solution might be better.





The HTTP Connection Manager solution:
This solution uses the HTTP Connection Manager that is standard available in SSIS. You can configure all Credentials and Proxy settings in this connection manager itself which makes is a little more easier to configure. You still have to do the download part with a Script Task.

1) Start situation
I have a Data Flow that reads a CSV file (Products.csv), does some adjustments and inserts the records into a database table. The CSV file is on my harddrive, but I want a fresh download each time I run the package.
Start situation

















2) HTTP Connection manager
You don't need the URL variable for this solution. Instead add a new connection of the type HTTP.
Right Click in the Connection Manager window and select..


















3) Configure HTTP Connection manager
Add the download URL in the Server URL box and optional set all Credentials and Proxy settings if you need them for your internet connection. I'm directly connected to the internet so I don't have to change any of the settings here.
HTTP Connection Manager

















4) Script Task
Add a Script Task to your Control Flow and give it a suitable name. Connect it to your Data Flow Task.
The Script Task


















5) The Script
Add the following C#/VN.Net Script to your Script Task. It downloads the file and uses the Flat File Connectionstring to save the file.
// C# code
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;

namespace ST_6799d08685cb4ad78633d035fab12178.csproj
{
    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {

        #region VSTA generated code
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
        #endregion

        public void Main()
        {
            try
            {
                // Logging start of download
                bool fireAgain = true;
                Dts.Events.FireInformation(0, "Download File", "Start downloading " + Dts.Connections["HTTP Connection Manager"].ConnectionString, string.Empty, 0, ref fireAgain);

                // Get your newly added HTTP Connection Manager
                Object mySSISConnection = Dts.Connections["HTTP Connection Manager"].AcquireConnection(null);

                // Create a new connection
                HttpClientConnection myConnection = new HttpClientConnection(mySSISConnection);

                // Download file and use the Flat File Connectionstring (D:\SourceFiles\Products.csv)
                // to save the file (and replace the existing file)
                myConnection.DownloadFile(Dts.Connections["myProductFile"].ConnectionString, true);

                // Logging end of download
                Dts.Events.FireInformation(0, "Download File", "Finished downloading " + Dts.Connections["myProductFile"].ConnectionString, string.Empty, 0, ref fireAgain);

                // Quit Script Task succesful
                Dts.TaskResult = (int)ScriptResults.Success;
            }
            catch (Exception ex)
            {
                // Logging why download failed
                Dts.Events.FireError(0, "Download File", "Download failed: " + ex.Message, string.Empty, 0);

                // Quit Script Task unsuccesful
                Dts.TaskResult = (int)ScriptResults.Failure;
            }
        }
    }
}

' VB.Net code
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime

<System.AddIn.AddIn("ScriptMain", Version:="1.0", Publisher:="", Description:="")> _
<System.CLSCompliantAttribute(False)> _
Partial Public Class ScriptMain
 Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase

 Enum ScriptResults
  Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
  Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
 End Enum

    Public Sub Main()
        Try
            ' Logging start of download 
            Dim fireAgain As Boolean = True
            Dts.Events.FireInformation(0, "Download File", "Start downloading " + Dts.Connections("HTTP Connection Manager").ConnectionString, String.Empty, 0, fireAgain)

            ' Get your newly added HTTP Connection Manager
            Dim mySSISConnection As Object = Dts.Connections("HTTP Connection Manager").AcquireConnection(Nothing)
            ' Create a new connection                 
            Dim myConnection As HttpClientConnection = New HttpClientConnection(mySSISConnection)

            ' Download file and use the Flat File Connectionstring (D:\SourceFiles\Products.csv)
            ' to save the file (and replace the existing file)

            myConnection.DownloadFile(Dts.Connections("myProductFile").ConnectionString, True)
            ' Logging end of download                 
            Dts.Events.FireInformation(0, "Download File", "Finished downloading " + Dts.Connections("myProductFile").ConnectionString, String.Empty, 0, fireAgain)

            ' Quit Script Task succesful
            Dts.TaskResult = ScriptResults.Success
        Catch ex As Exception
            ' Logging why download failed                 
            Dts.Events.FireError(0, "Download File", "Download failed: " + ex.Message, String.Empty, 0)

            ' Quit Script Task unsuccesful
            Dts.TaskResult = ScriptResults.Failure
        End Try
    End Sub
End Class
Note: This is a script with some basic eventlogging and error handling. See this article for more details.

6) Testing
Now run your package to test the result. You can use this CSV file to test it:
https://sites.google.com/site/ssisblogspot/products.csv

Conclusion
Both methods (WebClient and HTTP Connection Manager) will have the same result, but the HTTP Connection Manager is probably a little easier to configure.


NOTE: This script example cannot do website logins or other manual actions!

51 comments:

  1. Hello there,
    i found this article very interesting.

    Qtn: using the "Http Connection Manager"
    How would one change the "ServerUrl" dynamically in order to download several files in one go(with a loop)?

    ReplyDelete
  2. @Anonymous: You can use an expression on the connection manager. The URL can be found in the ConnectionString property. Let me know if there are any problems.

    ReplyDelete
    Replies
    1. Hi Joost, I am trying to update the URL using the expression manager and cycle through several files, but it keeps simply updating the same file... it seems it doesn't actually update the URL dynamically.... can you help

      Delete
    2. Instead of using Dts.Connections("myProductFile").ConnectionString

      try using Dts.Connections("myProductFile").AcquireConnection(Dts.Transaction).ToString()

      This will allow the use of expressions on the connection manager.

      Delete
    3. i am actually trying to change the URL like this: in a foreachloop, i want to do something like this:
      https://portal.whatever.com/XXX/account.csv,
      then change to the next file download:
      https://portal.whatever.com/XXX/bankAccount.csv
      and then the next file
      https://portal.whatever.com/XXX/CustomerAccount.csv

      so in essence it is continuously changing this part:

      Dim mySSISConnection As Object = Dts.Connections("HTTP Connection Manager").AcquireConnection(Nothing)

      Do I make any sense?

      Delete
    4. Ok are you using the second script? If so then you need to add an expression on the HTTP connection manager to replace the hardcoded url with the variable from the foreach loop.

      But in your case I would use the first script that uses a variable instead of a connection manager (since you already need a variable in the foreach loop).

      But you also need to do something with the destination, else it will always use the local file.

      Delete
    5. the expressions do not seem to overwrite the http url ... hence my frustration. I am now trying the first script, but now i need to pass the credentials. Will you give me some guidance / example, please?

      Delete
    6. Hi, I did it! Thanks for your script! saved me doing 100 scripts for 100 files!

      Delete
  3. Great article. Great script. Well documented. Just what I was looking for. You helped me a lot. Thanks!

    ReplyDelete
  4. Great article! Two questions:
    1. Could this be made to work in 2005 (by using VB instead of C#)?
    2. How could this be used to pass a parameter (really a POST value) to the URL?

    Thanks!

    ReplyDelete
  5. @Anonymous: should be no problem. The HTTP Connection Manager already existed in 2005. I added a VB.Net version for SSIS 2008, but that should work in 2005 as well (maybe with some minor changes)

    ReplyDelete
  6. Hi Jost,
    top article, i m beginning to create soemthing similar to this article . just wondering is it possible to post value to the url ?
    many thanks

    ReplyDelete
  7. @Anonymous: Could you contact me via the webform to explain yourself?

    ReplyDelete
  8. I am trying to Download .zip file from web using HTTP connection manager.Howevere when I try to open the donloaded file I get 'The Compressed (Zipped) Folder is invalid' error . Did anyone experience this?

    ReplyDelete
    Replies
    1. Is it working when you manually download the file (without logging in or other actions)? Is the filesize as expected? What do you see when you open it in notepad? All weird chars or html/text?

      Delete
    2. Yes when I manually download the file from web the zip file downloads fine and I can open the content of it and file size is as expected.
      It is only when I use script to download the zip file I am getting the error

      Delete
    3. So the size when you download manually is the same when you download it by Script Task? Let me know what you see when you open the zip (that's downloaded by the Script Task) what you see: All weird chars or html/text?

      Is the zipfile on a public/internet server so I can test it myself?

      Delete
    4. Hi Joost,

      After downloading a .ZIP file successfully, I'm getting a data conversion error on the second column with a return status value 4, 'Text was truncated or one or more characters had no match...'. when attempting to load it to a DB table. Does the .ZIP file need to be extracted before attempting to load it? If so, how would that be accomplished?

      Delete
    5. You need to unzip it first before you can load the data into a Data Flow Task. Check out the SSIS Addon menu item. There you will find an unzip task.

      Delete
  9. First your article is great, but I have one question. I am using the Http version of the above explanation. When the Package runs I get no errors, and the file is populated with "Virtual User is Logged in" the original file actually has data. The file is in a folder on the site called Output. When I set up the file connection I placed the file in a folder called Output locally and it has the same name as the file I wish to download. Any idea or suggestion as to how I can have it get the actual file. additionally, I am forced to run in SQL 2005 at the moment. Thanks

    ReplyDelete
    Replies
    1. Is the document available without user login in and without doing any manual actions?

      Delete
  10. Thanks for this very valuable article. How about including codes to access website with username and password? Would greatly appreciate it.

    ReplyDelete
    Replies
    1. There are too many different login methods to describe them all, but aunt Google can help you: https://www.google.com/search?q=c%23+login+website

      Delete
  11. Thanks for the great article! I got this implemented and working for retrieving zip files, no problem.

    ReplyDelete
  12. Hi, I assume my question is quite stupid but it's the first time I try using scripting in SSIS. When I'm trying to run the above script I get the following error

    The type or namespace name 'AddIn' does not exist in the namespace 'System' (are you missing an assembly reference?)

    How do I add that namespace to system?

    Thanks.

    ReplyDelete
    Replies
    1. You're probably using SSIS 2005. This script was created with 2008. Just copy the extra imports-rows and copy the content of my main method to your main method, but keep the row where the task result is set and replace my rows (Dts.TaskResult = ScriptResults.Failure / Dts.TaskResult = ScriptResults.Success) with the one from 2005

      Delete
    2. I used to have the same "Addin" issue. It had to do with me just pasting in all the code into the Script editor. When I trimmed the code to not paste in the namespaces, and use what was generated when I opened the script editor, it worked fine.
      I will add that my auto-generated namespace from SSIS /VS 2010 did not have "Addin" in the text. It had:
      [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute]
      public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase

      Delete
    3. Hey Joost, this is great, but doesn't work for me as I'm trying to access a page that requires a log-in and gives back a cookie. Do you have any suggestions for C# code to log-in to this site? BTW, the site is Mint.com. Trying to download all my personal financial data, so my SSIS package can process all of it automatically.

      Cheers, John

      Delete
    4. Hi jpk,

      As noted in the bold note at the bottom, this script example cannot login for you. There are just to many ways for logging in to a website. Best way is to add a question in a C# forum.

      Regards,

      Joost

      Delete
  13. Hi, great article! I try to use it for excel files, but I can't get the myConnection.DownloadFile(Dts.Connections["myProductFile"].ConnectionString, true); right, because the connectionstring for a excel source is not only the file path, and I can't use ExcelFilePath. Any ide?

    ReplyDelete
    Replies
    1. You only want the part of the connectiontring after 'Source=' until the next semicolon (;). You could do that with a substring. Something like:
      myConnectionString = Dts.Connections["myExcelFile"].ConnectionString.Substring(Dts.Connections["myExcelFile"].ConnectionString.IndexOf("Source=") + 6);
      myConnectionString = myConnectionString.Substring(1, myConnectionString.IndexOf(";") - 1);


      or you could try to get the property, something like:
      Dts.Connections["myExcelFile"].Properties["ExcelFilePath"].GetValue(Dts.Connections["myExcelFile"]).ToString();

      Delete
  14. great and helpful thanks for this..
    Please let me know what modification can i do on the Vb.net or c# script
    in order to download multiple XLS files to the system hard drive

    ReplyDelete
    Replies
    1. If you have a list of URL's in a database table. Then you could loop that table with a foreach loop. Let the foreach loop fill the DownloadURL variable and then but the script task within the foreach loop.

      Delete
  15. Hi, thanks for the guide.

    I've a question.
    Can I use the above method for each file in a path, without knowing the name of the files?
    For example all the files csv: *.csv or *.txt or....

    Thank you so much

    ReplyDelete
    Replies
    1. No. And in most cases the webserver won't allow folder browsing. So if you don't now the name and folder browsing isn't enabled you can't download all files.

      Delete
  16. Hi Joost,
    Thanks for the guide. It's excellent.
    i could follow your instruction to download the files via the path you provided,but, when i tried my own one, i found sth strange...


    i setup the http connection with "Use credential" (user name & password inputted). test connect success...
    but when i use the script to download file, it prompted error sth like "credential error/ incorrect"...

    Do u hv any ideas??

    Thanks so much..

    Alvin

    ReplyDelete
    Replies
    1. The script only gets the connectionstring (not the username and password). You have to provide the username and password to the HttpClientConnection via some .Net code. See this example on MSDN:
      http://technet.microsoft.com/en-us/library/microsoft.sqlserver.dts.runtime.httpclientconnection.serverpassword.aspx

      You can get other properties from the connection manager with something like:
      Dts.Connections["myHTTPconnection"].Properties["ServerUserName"].GetValue(Dts.Connections["myHTTPconnection"]).ToString();
      But it would probably not work for sensitive properties like passwords. The workaround is to store the password in an SSIS variable. Here is an example for SSIS variables in a Script Task: http://microsoft-ssis.blogspot.com/2010/12/how-to-use-variables-in-script-task.html

      Delete
    2. Hi Joost,

      i hv already setup the username & password in connection manager, do i still need to provide the name and password to HttpClientConnection?

      Dim mySSISConnection As Object = Dts.Connections("HTTP Connection Master").AcquireConnection(Nothing)

      Dim myConnectionHttp As HttpClientConnection = New HttpClientConnection(mySSISConnection)

      Delete
    3. try to messagebox properties like the username to see if they are available in the Script Task.
      MessageBox.Show(myConnectionHttp.ServerUserName)

      Delete
  17. I found your article very useful and was able to download the ZIP file programmatically. However, the download URL for the file is changed each month. The base path and filename are the same each month, but there is an identifier in the center of the path that changes each month. I have tried using wildcards in an expression. The download succeeds but the file seems to be corrupt. Any help would be appreciated

    ReplyDelete
    Replies
    1. Wildcards will not work on a website. If the identifier is an integer that is one higher each month then you could store the ID in a database. If it's a random integer of guid then you're probably out of luck.

      Delete
    2. I fear it is as you say "a random integer of guid". Below is this month's URL and the part after "files/" and the "?" is what is changing. The rest is the same each month:
      https://data.medicare.gov/views/bg9k-emty/files/CJP62BvKCE7mEG9ufmZCah9VMIm3bbgNVx_07wSgpbs?content_type=application%2Fzip%3B%20charset%3Dbinary&filename=DMG_CSV_DOWNLOAD20150801.zip

      Delete
    3. Sorry "random integer or guid"...

      Url seems to be random... Isn't there an index page where you can find that URL? Then you could read that page to find the correct download url...but probably not an easy solution.

      Delete
  18. This is the site that holds the button with the embedded URL in the prior post, "https://data.medicare.gov/data/nursing-home-compare". I do not know how to grab that dynamic path from the home page. I have searched google for vb script to do it, but no luck so far.

    ReplyDelete
    Replies
    1. You could just download this file (same way as above) then read it as a FLAT FILE with one big column. Add a Conditional Split to filter on that file name... after that you have one row left. With a Derived Column you could strip all the extras so that you only have an URL left.

      Now you can either store this URL in a database and then get it out again with an Execute SQL Task to store the URL in a variable, or use a Script Component to store it in a variable. Does that sounds workable?

      Delete
    2. Yep! As soon as I read your comment, I got the first part knocked out. Now working on the find and substring as I only really need to grab the GUID portion of the string and store it in a variable. Thank You SO MUCH!

      Delete
    3. Ok let me know if you got it working or if you need any help. In the PostExecute method of the Script Component you can fill a readwrite variable. Something like:

      Private URLPart As string = ""

      ' When all rows are processed, set variable
      Public Overrides Sub PostExecute()
      Me.Variables.URLPartVariable = URLPart
      End Sub

      Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
      URLPart = Row.URL
      End Sub

      Delete
  19. @Joost Hello. I'm having the exact same problem as @goingtoseed.

    But my package crashes after the first iteration of the for each loop. I posted a question about it here. any help would be greatly appreciated.

    http://stackoverflow.com/questions/33630358/dynamic-sourceurl-in-http-connection-manger-with-for-each-loop?noredirect=1#comment55037913_33630358

    ReplyDelete
    Replies
    1. Have you tried debugging to find the actual line that is failing? I would start commenting out as much code as possible and slightly add code if it is succesful. Start by only messageboxing/showing the URL from the connection manager.

      Delete
  20. @Joost : I am downloading files in iteration based on the 'LastWriteTimeUtc'. The problem now I am facing is the files which are downloaded are only 2057 KB, though they are bigger in size. But, All the files are getting downloaded. Something to deal with buffer? Any quick help will be more helpful.

    Note: I am using the second method(HTTP). and the files are of .xlsx type

    ReplyDelete
  21. Hi everyone,

    Great Article. I am other side of the road. I am looking to UPLOAD the file. I am using HTTP Connection Manager for URL and Credential. Please any advise would be great appreciate.

    Thank You

    ReplyDelete

Please use the SSIS MSDN forum for general SSIS questions that are not about this post. I'm a regular reader of that forum and will gladly answer those questions over there.

All comments are moderated manually to prevent spam.

Related Posts Plugin for WordPress, Blogger...