Apr 23, 2015

Term extraction transformation SSIS example

This transformation extracts the noun and noun phrases from text. Let's understand Term extraction transformation SSIS by simple example:

Step 1: Create a new table and populate with some data:

CREATE TABLE tblDocument (
     ntDocumentID BIGINT
     ,vcDoucumentName VARCHAR(100)
     ,vcData VARCHAR(MAX)

INSERT INTO tblDocument
     (1,'Math','I am learning mathematics.')
     ,(2,'Che','I am learning chemistry.')
     ,(3,'Phy','I am learning physics.')

Step 2:
Drag any data source adapter in data flow task area. In this example in is ADO .NET source adapter. It is connected with Sql server:

Step 3:  Drag term extract transformation and connect with data source adapter. After double clicking it will pop up term exaction transformation editor. Select the any on text data type column from where we want to extract the nouns or noun phrases. 

Step 4: If we to exclude a set of nouns or noun phrases. In this example we want to exclude the noun "Chemistry".  We are creating a new table in database and populating exclude terms:

CREATE TABLE tblExclude (
     ntExcludeID BIGINT
     ,vcNoiseWord VARCHAR(100)

INSERT INTO tblExclude

Again come back to SSIS. Under exclusion tab, create a new database connect where we have just created a table named "tblExclude". Select the table and column name from drop down.

Step 5:
Under advance tab:
Term type:   Noun means a single word noun and Noun phrase means groups of words (Noun, adjective etc.) In this example we are choose "Noun"

Score Type:  Frequency means total occurrence of a word. TFIDF is a fractional type score. In this example we are choosing TFIDF.

Parameters:  Frequency threshold means minimum occurrence of word to exact from text column. In this example we are choosing only occurrence is sufficient.   

Step 6: Populate the output to any destination and run it. Sample output:

1 comment:

James Zicrov said...

I think you will not need a lookup transformation if you use SSIS Upsert. Although this is a third-party product but I still think it will be useful.

SSIS Upsert