0845 643 64 63

Pattern matching in SSIS using Regular Expressions and the Script component

One of my favourite features of SSIS is the script component, and I know I’m not alone. Why? Because it brings the entire might of the .NET framework to SSIS, providing C# (in SQL 2008 onwards) and VB.NET extensibility for all those times where SSIS doesn’t quite have enough functionality out of the box.

Such as?

Well a problem I’ve come across a number of times is string parsing. Trying to search for and extract a specific pattern of characters from a larger text string. I’ve seen developers build crazy convoluted expressions in the derived column transform, some of which are very impressive in their complexity! This is a bad thing not good, although it shows a level of prowess in building SSIS expressions (not the most intuitive expression language!), it becomes a support nightmare for other developers.

Take for example extracting a house number from the first line of an address, we want to convert “29 Acacia Road” into “29”

Or extracting a version number from a product, converting “iPhone 4G, IOS 4.3.2, black” into “4.3.2”

Or extract the html page from a URL, converting “http://www.wibble.com/folder/page.aspx” into “page.aspx”

Regular Expressions

The .NET framework includes full support for regular expressions (Regex), in the System.Text.RegularExpressions namespace. Regex provide an incredibly powerful way of defining and finding string patterns. They take some getting used to, but once you get the hang of them you can unleash their power in your SSIS dataflow pipelines.

To find out more about regular expressions, look at the following links


Let’s look at an example

Let’s take our first example from above, extracting a house number, converting “29 Acacia Road” into “29”.

The first thing we need to do is define our Regex search pattern. In this case we know that it must be at the start of the string, and must be an integer, with any number of characters 0-9.

The pattern for this is “^[0-9]+”, which is broken down as
    ^ means the start of the line
    [0-9] means any number
    + means 1 or more of the preceding item.
    i.e. 1 or more integers at the start of the line.

What if we wanted this to also cope with a single letter after the number? i.e. “221b Baker Street”

We can add “[A-Za-z]?” to our pattern, in which
    [A-Za-z] means any character A-Z in either upper or lower case
    ? means 0 or 1 occurrences of this

We should also add “\b” to the end of this, which indicates a word boundary. This means that 221b should be a whole word, not part of a larger sequence “221BakerSt”. We can wrap up the [A-Za-z]\b together into brackets so that the ? applies to the combination, so that any single letter must be the end of a word or it will be ignored. In this way “221BakerSt” will return 221, as will “221 Baker St”, whereas “221B Baker St” will return “221B”.

So our new pattern is “^[0-9]+([A-Za-z]\b)?”

You’ve probably gathered by now that regular expressions can get quite complicated. I’m not going to go into any more detail about them here, but hopefully this gives you some idea of what they can do. There’s plenty of reading on the web if you want to know more. You should also make use of the Regex expression tester in the link above – it will save you lots of debugging!

How do we use Regular Expressions in SSIS?

Well it turns out this is the easy bit, with the help of the script component.

Step 1 – Add a script component into your data flow pipeline, configure it as a data transform. I’m using C#, but you can use VB.NET if you want

Step 2 – Give the script access to the input and output columns

Open the script component and select the input field from the “Input Columns” screen, in this case “Address1”. This can be ReadOnly.

Go to the “Inputs and Outputs” screen and add an output column to “Output 0”. We want to set the datatype to string, length 10. This new field will contain the results of our Regex pattern matching.

Step 3 – Create the script code

Click on “Edit Script” on the Script screen which will open up Visual Studio.

Add a reference to System.Text.RegularExpressions at the top of the script

      using System.Text.RegularExpressions;

Then place the necessary code in the Input0_ProcessInputRow function.

    public override void Input0_ProcessInputRow(Input0Buffer Row)
        //Replace each \ with \\ so that C# doesn't treat \ as escape character
        //Pattern: Start of string, any integers, 0 or 1 letter, end of word
        string sPattern = "^[0-9]+([A-Za-z]\\b)?";
        string sString = Row.Address1 ?? ""; //Coalesce to empty string if NULL
        //Find any matches of the pattern in the string
        Match match = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase);
        //If a match is found
        if (match.Success)
            //Return the first match into the new
            //HouseNumber field
            Row.HouseNumber = match.Groups[0].Value;
            //If not found, leave the HouseNumber blank
            Row.HouseNumber = "";

When you save and exit the script, any component downstream of the script component will have access to the new HouseNumber field.


33 comments on “Pattern matching in SSIS using Regular Expressions and the Script component

  1. This is fantastic. Thank You Sooooo much for this information. It’s very powerful and works really well

  2. One question: How would you add anohter output Column “StreetName” using the same “Address1” in your C## code?

    Let assume that you want to ouptut the following columns.


    Where and how would you put that in your code?

    Thank you and sorry for my ignorance about C##


  3. Hi Fran

    That’s a great question thanks.

    You can output as many fields as you like from the code, just by adding more “Row.xxx = yyy;” lines.

    The first thing you need to do is add a second output column (as per the second screenshot in the post), for street you’d probably want a string 50 or similar.

    Then, in the code, just add another line at the end setting the new field to the value you want by using Row.StreetName = xxx;

    I’ve modified the code slightly to do this here…

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    //Replace each with \\ so that C# doesn’t treat as escape character
    //Pattern: Start of string, any integers, 0 or 1 letter, end of word
    string sPattern = “^[0-9]+([A-Za-z]\\b)?”;
    string sString = Row.Address1 ?? “”; //Coalesce to empty string if NULL
    string sHouseNumber = “”;

    //Find any matches of the pattern in the string
    Match match = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase);
    //If a match is found
    if (match.Success)
    //Return the first match into the new
    //HouseNumber field
    sHouseNumber = match.Groups[0].Value;

    Row.HouseNumber = sHouseNumber;
    Row.StreetName = sString.Replace(sHouseNumber, “”).Trim();


    Good luck with it

  4. kumar on said:

    Hi Alex,

    Thanks for your help.
    I am using ssis 2005 , can you please help us to give the code in vb.net
    Thanks in Advance.

    • Hi Kumar

      No problem at all. It’s the same structure and code, just VB-ified

      Imports System.Text.RegularExpressions
      Public Overrides Sub Input0_ProcessInputRow(ByVal ROW AS Input0Buffer)
          Dim sPattern AS String = "^[0-9]+([A-Za-z]b)?"
          Dim sString AS String = ROW.Address1 & ""
          Dim MATCH AS MATCH
          MATCH = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase)
          IF MATCH.Success THEN
              ROW.HouseNumber = MATCH.Groups(0).Value
              ROW.HouseNumber = ""
          END IF
      END Sub
  5. Ira Warren Whiteside on said:

    Nice article very helpful , this is a passion of mine, here is another example a little older
    Kimball ETL (subsystem 1) – Data Profiling via SSIS Data Flow – Pipeline Column Data Profiling


  6. chandu on said:

    Great article very informative .

  7. Anthony on said:

    Hi Alex,
    Great article and thank you very much. I’m having an issue with a seperate class I created in the Vsta accessible through the Script button. Here is what I got:
    public override void Input0_ProcessInputRow(Input0Buffer Row)

    string rawAddress = Row.address + “, ” + Row.city + “, ” + Row.st + ‘ ‘ + Row.zip;
    var parser = new AddressParser.AddressParser();

    var result = parser.ParseAddress(rawAddress); // <>

    //string stdAddress = result.StreetLine + “, ” + (result.SecondaryUnit != “” ? result.SecondaryUnit + @”: ” : ” “) + (result.SecondaryNumber != “” ? result.SecondaryNumber : “”);
    Row.Address = result.StreetLine;
    Row.Suite = result.SecondaryNumber;

    The cursor just sits and the program doesn’t lock up or anything. Works perfectly in my other non SSIS projects. I have a separate class called AddressParser where I initialize my Regex Patterns. This class is in another namespace and I do have a using statement and explicitly call it as above. This class file exists within the Vsta project. I’d greatly appreciate any thought on this. Thanks a bunch.

    • Anthony on said:

      Adding to the above, my line marker got wiped ot when I posted it, the cursor sits at the line var result = parser.ParseAddress(rawAddress);

    • Alex on said:

      Hmm interesting, there’s no reason why it should stall there – seems pretty standard to me.
      Have you put any logging into the AddressParser class to see what happens inside the ParseAddress function?

  8. kavyokls on said:


  9. sujith on said:

    can you help me to convert a url to regular expression in windows form using c#

    • Alex on said:

      You can use exactly the same code, transplanted into a c# winform app. The difficult bit will be defining what the actual regex pattern should be for a URL.
      There’s no single right answer here, it depends on what you consider a valid url. i.e. which protocols are valid? (http, https, ftp, mailto, etc. etc.), whether you want to allow query parameters, etc. I’d recommend finding an existing pattern that someone else has created and tested, then adapt to your needs.

      I’ve just done a quick Google and the first link that comes up looks like a pretty good starting point for you…


  10. Raj Thakkar on said:

    Great article ! This is fantastic. Thank You Sooooo much for this information. It’s very powerful and works really well – Parse Integration Services

  11. Vipin Jha on said:

    Hi ,

    I have to perform several data checks before loading data into target table.

    Foe example I am having 1 flat file with below column

    Id Name Age
    Int Varchar(100) Int
    My requirement is to create package, checks will be performed on each record, column of the files.

    Any records which failed the checks considered as error records and will be written to the exception table.
    if i have character in int datatype column , it should get the error notification with line number
    Can any one help me to achieve the above scenario.is there any example on web , as I am searching the solution of above scenario.

  12. Keith on said:

    I have checked everything three times; but when I attempt to compile the code in the SQL Server side of things; I get an error saying: ‘ScriptMain.CreateNewOutputRows()’: no suitable method found to override @ line 135

    This line is: public override void CreateNewOutputRows()

    • Hi Keith
      The CreateNewOutputRows() function is only relevant when the Script component is configured as a source, not a transformation.
      When you create the transform you need to select either source, destination or transformation. The code in this post relates to a transformation, not source.

  13. Chandan on said:

    Can any one tell me the logic that how can i parse USA addresses by c#.

  14. Jitendra Sankhla on said:

    Hi Alex,

    This is a great code, helped me a lot. However I am getting an exception while trying to execute the SSIS package. Actually I trying to format PhoneNumbers using regex. The Input column is from a sql table and may have some Null values, therefore I am getting an exception as

    Script Component has encountered an exception in user code:
    Project name: —–
    Value cannot be null.
    Parameter name: Input
    at System.Text.RegularExpressions.Regex.Match(String input)
    at System.Text.RegularExpressions.Regex.Match(String input, String pattern, RegexOptions options)
    at ScriptMain.Input0_ProcessInputRow(Input0Buffer Row)
    at UserComponent.Input0_ProcessInput(Input0Buffer Buffer)
    at UserComponent.ProcessInput(Int32 InputID, PipelineBuffer Buffer)
    at Microsoft.SqlServer.Dts.Pipeline.ScriptComponentHost.ProcessInput(Int32 inputID, PipelineBuffer buffer)

    Please help me to fix this.

    • Alex on said:

      You just need to replace nulls with empty strings and it should fix the problem.
      You can either do this in the source query, or in a derived column task in the SSIS data flow, or in the C# script, all will work equally well.

  15. Priyadarshi on said:

    Hi Alex is there a way to select a field from sql and put that in the regular expression like var sql = Select *
    And the string sPattern = “^sql+[0-9]+([A-Za-z]\\b)?”;

    • Alex on said:

      If you wanted to embed a field value into the RegEx then yes, just select the relevant field in the input columns of the script task, then you can refer to the contents of that column within the script.
      So just build a RegEx pattern dynamically in a C# string using the field contents, something like:
      string RegEx = “^” + Row.MyField + “+[0-9]+([A-Za-z]\\b)?”;

  16. I am curious about how you would use this to extract 0 to many IP addresses from a string of text, such as might be found in a log file reporting connections to a server.

    • Hi Lynn
      You just need to define a regular expression that matches the IP address (& delimiter). Then just loop through all the matches.
      SSIS data flows require a fixed pre defined set of columns, so either define more columns than you need, or you could generate multiple rows each with a single address; it depends how you need the data structured.

  17. Rahul on said:

    Hello Alex,

    i am having a sql column having comment data as shown below:

    03/06/1995 10:39 AM Opened file and called employee for update and requested medical.

    06/01/1995 09:12 AM 2nd request for medical report. Employee still off complaining of pain in right hip and difficulty walking. Scheduling appt with Dr. Kevorkian for late June.

    07/15/1995 10:42 AM Employee mysteriously died following consultation with physician. Not thought to be related to this accident. Will document and keep file open pending outcome of autopsy.

    07/10/1996 10:46 AM Received summons today alleging employee’s death job related. Referred to counsel.

    All the above rows are in a single column. As we can see, there were comments added on 4 different dates, but they keep on appending in the same sql column.

    i need to parse them out and make 4 rows of this single row of data. (one row for each date and comment text).

    Is it possible to do this using regular expressions?

    any help would be highly appreciated.


    • Firstly you need an asynchronous script transform, where the output is a different data buffer than the input. You can then generate as many output rows as you want from each input row.
      Then you just need to determine how to split the data up.

      Then yes you can use RegEx to find each of the matches, using a pattern similar to this:
      [0-9]{2}/[0|1][0-9]/[1|2][0-9]{3} [0-2][0-9]:[0-9]{2} (A|P)M .*

      i.e. a date, then time, then any number of characters until a line break.


  18. Hemanth Sharma on said:

    Thank You, Mr.Alex Really a Nice Tutorial

  19. aravind on said:


  20. Thanks

  21. Cristián Fernando on said:

    Hola Alex, mi duda es más con el script de c# que las expresiones regulares. Quiero agregar una columna de input pero por código c#, no conozco ni su nombre ni tipo porque son parámetros pero los tengo como valores de un campo de una tabla. De igual manera lo quiero replicar pero ahora para una columna de output y todo por c#. ¿Como lo hago?, ¿ se puede hacer?, gracias

    • Alex on said:

      Hola Cristian
      En una SSIS C# tienes que conocer el nombre y el tipo de la columna de salida, no creo que haya ninguna forma de hacer esto dinámico.
      Gracias Alex

  22. Shrenb on said:

    Hello Alex,
    Extremely good and helpful article.
    I am going to implement this strategy for validating Email values in the incoming file before transferring the rows to a destination
    My question is how is the performance of this technique when parsing 100s of thousands of rows and each row has two email fields to validate?

    • We use this technique to process millions of rows without any problems, so you shouldn’t have any issue at all with only a few hundred thousand.

Leave a Reply

Your email address will not be published. Required fields are marked *


371,615 Spambots Blocked by Simple Comments

HTML tags are not allowed.

Power BI Sentinel
The Frog Blog

Team Purple Frog specialise in designing and implementing Microsoft Data Analytics solutions, including Data Warehouses, Cubes, SQL Server, SSIS, ADF, SSAS, Power BI, MDX, DAX, Machine Learning and more.

This is a collection of thoughts, ramblings and ideas that we think would be useful to share.


Alex Whittles
Jeet Kainth
Jon Fletcher
Nick Edwards
Joe Billingham
Lewis Prince
Reiss McSporran
Microsoft Gold Partner

Data Platform MVP

Power BI Sentinel
Frog Blog Out