When you have very large dimensions in SQL Server Analysis Services (SSAS) cubes, the time taken to process them can become a problem. This post builds upon an excellent walkthrough of the ProcessAdd option written by Daniel Calbimonte on MSSQLTips.com, and shows how to automate the identification of new data to be added.
What are the main options for processing a dimension? (Technet details here)
- Process Full – the whole dimension is dropped and rebuilt
- Process Update – a new copy of the dimension is processed in full and the results compared, with changes being copied to the original dimension
- Process Add – just add new rows to the dimension, don’t do anything with existing rows
There are lots of posts around the internet that discuss the pros and cons of each of these, so I won’t go into the details. But as the title suggests, we’ll focus on the ProcessAdd option.
How do you implement ProcessAdd?
Daniel provides a fantastic explanation and code walkthrough of how to use SSIS (Integration Services) to implement a ProcessAdd solution, please read through his post before continuing. He talks us through using an SSIS data flow to create a source query to retrieve the new rows, and then the Dimension Processing component as a destination to receive the new rows.
And this works really well. The problem then becomes how to identify which rows need to be added. Daniel overcomes this by having a separate table of records to be added; the source query just selects the entire table. This table could be populated by the ETL logic, a trigger, or any number of other processes. But can we improve upon this and make the process more seamless and transparent?
Getting all MDX
For me, the best way of enhancing this is to build a solution which self maintains. So, we need to look inside the cube dimension, find the last record that has already been added, and then use that to find any new records from the warehouse.
Step 1 – Find the maximum key for a dimension
WITH MEMBER [Measures].[MaxKey] AS
MAX([Source Currency].[Source Currency Code].ALLMEMBERS
, StrToValue([Source Currency].[Source Currency Code].currentmember.MEMBER_KEY))
SELECT
{[Measures].[MaxKey]} ON 0
FROM
[Adventure Works]
The MDX above creates a calculated member called MaxKey, which finds the maximum Key (the surrogate key identifier for the dimension) across all members of the dimension. Note that we have to use StrToValue() to convert the key from a string to an integer so that it sorts numerically rather than alphabetically. If you run this in SSMS against the AdventureWorksDW2012 database the result should be 105. You can verify this by querying the DimCurrency table in the warehouse, where you’ll find that 105 is the largest CurrencyKey value.
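For reference, that verification query is simply:
-- Should also return 105 against AdventureWorksDW2012
SELECT MAX(CurrencyKey) AS MaxKey
FROM dbo.DimCurrency;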
Step 2 – How do we use MDX within SSIS?
In the Control Flow of the SSIS package, add an ‘Execute SQL’ task. Connect it to a new OLE DB connection to Analysis Services, as per the following screenshot.
Set the SQL Statement of the Data Source to the MDX query, and set the ResultSet property to ‘Single Row’.
On the Result Set tab, set the [Measures].[MaxKey] result to be saved to a new variable; ‘User::MaxKey’ (An Int32).
The Execute SQL task should precede the existing Data Flow Task.
Step 3 – Dynamically create the SQL Query from the MDX results
Now that we have the maximum Key value stored in a variable, we can create another variable, called ‘SQL’ (a string), with an expression which includes the Key value.
The expression should be:
"SELECT CurrencyKey, CurrencyAlternateKey, CurrencyName
FROM DimCurrency
WHERE CurrencyKey>" + (DT_WSTR,20)@[User::MaxKey]
This builds a SQL statement dynamically, based upon the maximum key identified from the MDX statement.
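So when User::MaxKey is 105, the ‘SQL’ variable evaluates to:
SELECT CurrencyKey, CurrencyAlternateKey, CurrencyName
FROM DimCurrency
WHERE CurrencyKey>105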
We can then set the source query in the Data Flow to use the resulting SQL Query, by setting its ‘Data Access Mode’ to ‘SQL Command From Variable’, and the ‘Variable Name’ to ‘User::SQL’.
And that’s it. This process will automatically find any new dimension members in the warehouse table, and add them into the cube dimension.
This process does rely on a number of things – so watch out…!
- The dimension must only contain inserts, not updates or deletes (you’ll need to perform a ProcessUpdate for those)
- The Key must always be incrementing – new records with a smaller key will not get picked up!
- You’ll need to ignore duplicate key errors during processing (see the XMLA sketch after this list). Plenty of discussion around why in other blog posts…
- SSIS is not great at handling MDX connections – unless you want to go down the linked server route, only ever have a single MDX query per SSIS package.
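On the duplicate key point: if you process via XMLA rather than the SSIS component, the error configuration can be set on the Process command itself. A minimal sketch, assuming hypothetical object IDs:
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <!-- Hypothetical IDs; substitute your own database and dimension -->
    <DatabaseID>MyDatabase</DatabaseID>
    <DimensionID>MyDimension</DimensionID>
  </Object>
  <Type>ProcessAdd</Type>
  <ErrorConfiguration>
    <!-- Ignore duplicate attribute keys rather than failing the job -->
    <KeyDuplicate>IgnoreError</KeyDuplicate>
  </ErrorConfiguration>
</Process>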
And finally, another shout out to Daniel Calbimonte and MSSQLTips for providing the basis of this post.
Frog-Blog-Out
Really an excellent presentation on dimension ProcessAdd in SSAS.
I love the way you’ve approached this, but I’m stuck on the initial MDX query.
The dimension Key is an integer, and although I have over 1,000 rows in the dimension, it always pulls 9999 as the MAX value. I’ve tried this on other dimensions (over 100 rows, returns 99) and it seems to compare keys character by character, as strings, to determine the highest value.
Can you post an example where the Key is an int, for those of us who use SSIS a lot but never touch MDX?
Thanks a bunch.
Hi Dale
In step 1 I mention StrToValue(), which forces MDX to treat the value as an int, and should solve the problem.
Regards
Alex
Alex,
Not certain we are on the same page… So you are saying I should be able to use the following for the Address Dimension (which has an integer for the key):
WITH MEMBER [Measures].[MaxKey] AS
MAX([SOURCE ADDRESS].[SOURCE ADDRESSID].ALLMEMBERS
,StrToValue([SOURCE ADDRESS].[SOURCE ADDRESSID].currentmember.MEMBER_KEY))
SELECT
{[Measures].[MaxKey]} ON 0
FROM
[Adventure Works]
Thanks again,
Dale
Yes that should work.
This seems to work with a narrow dimension, where the values on the new row are all new additions to their attributes. I’m trying to set this up on a slightly wider dimension where there is a new row, but one of the columns may contain a value that already exists as a member in the dimension.
I’m getting an error when trying to run this that I have duplicates on this column. Does the ProcessAdd ONLY work when all column values on the new rows being pulled don’t already exist as members in the associated attributes in the dimension?
Yes, you’re absolutely right. ProcessAdd will only work where all attribute values on the new rows are themselves new, which is usually only the case in very narrow dimensions.
I’ve not found a way around this for wide dimensions yet.
This should be edited into the blog post itself. I wasted two days before seeing this response.
Have you, by any chance, worked out a solution for wide dimensions ?
Hi.
I have a number of dimension tables each with their own dimension_id field.
Using your code to find the maximum id currently in the cube I can retrieve the maximum IDs currently in each dimension.
However, I want to write XMLA code myself to process each dimension using a Process Add. I can easily retrieve the maximum id from the cube into SSMS by using the Openquery command, so knowing which new dimension ids to process is easy.
I don’t know how to construct the XMLA to select the new dimensions though. Any ideas or examples?
If I use your method in full, I can only process one dimension via ProcessAdd, since you mentioned that one should only have a single MDX query per SSIS package?
Any ideas?
You can create a single xmla command containing multiple ‘Process’ elements, all within a single ‘Parallel’ element. This will then process multiple dimensions in a single batch, in a single command, with a single connection.
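A minimal sketch, assuming hypothetical database and dimension IDs:
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <!-- One Process element per dimension; all run in a single batch and connection -->
    <Process>
      <Object>
        <DatabaseID>MyDatabase</DatabaseID>
        <DimensionID>MyDimension1</DimensionID>
      </Object>
      <Type>ProcessAdd</Type>
    </Process>
    <Process>
      <Object>
        <DatabaseID>MyDatabase</DatabaseID>
        <DimensionID>MyDimension2</DimensionID>
      </Object>
      <Type>ProcessAdd</Type>
    </Process>
  </Parallel>
</Batch>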
Thanks
Alex
Wow that was quick 🙂
That is exactly what I do for measures and partitions in the fact table at the moment. I identify which partitions were updated and only ProcessAdd new data in the partitions.
The syntax there is quite easy and I’ve found quite a number of examples online.
Basically under one batch I have multiple process commands for the measure group for each updated partition.
Then under the binding command I have a querydefinition that specifies the newly added data.
I essentially want to find an example of something similar to the binding where I can specify which dimension ids should be added.
For dimensions with relatively small numbers of entries it’s okay; I can just set ErrorConfiguration to ignore duplicate keys. I do however have another dimension with millions of IDs, where I would like to send only the new IDs to the ProcessAdd, similar to your solution.
Essentially, I need to know what to specify in addition to the Process command – i.e. how ProcessAdd can be told to select only the new IDs:
<Process>
  <Object>
    <DatabaseID>MyDatabase</DatabaseID>
    <CubeID>MyCube</CubeID>
    <DimensionID>MyDimension</DimensionID>
  </Object>
  <Type>ProcessAdd</Type>
  <WriteBackTableCreation>UseExisting</WriteBackTableCreation>
</Process>
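In other words, roughly the dimension equivalent of the partition binding pattern described above. A sketch only, with hypothetical object, data source and table names:
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>MyDatabase</DatabaseID>
        <DimensionID>MyDimension</DimensionID>
      </Object>
      <Type>ProcessAdd</Type>
    </Process>
  </Parallel>
  <!-- Out-of-line binding: supplies only the new rows to the ProcessAdd -->
  <Bindings>
    <Binding>
      <DatabaseID>MyDatabase</DatabaseID>
      <DimensionID>MyDimension</DimensionID>
      <Source xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="QueryBinding">
        <DataSourceID>MyDataSource</DataSourceID>
        <QueryDefinition>SELECT DimensionId, DimensionName FROM dbo.MyDimension WHERE DimensionId > 105</QueryDefinition>
      </Source>
    </Binding>
  </Bindings>
</Batch>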
Regards
Thank You Alex for the wonderful post, this trick works absolutely perfect.
This will help us to reduce the processing time of our large dimensions.
Regards,
Shovan
Hi,
I’ve been trying to use ProcessAdd on one of my partitions, for incremental loads and faster partition processing, instead of full cube processing.
However, every time I process the partition my data gets doubled/duplicated. Can you please suggest the reason and a solution?
Awaiting your response.
Thanks
ProcessAdd adds any records that you provide to the existing data in the partition, so you need to only give it the new records that you know aren’t already in the partition.
Hi Alex,
Can you please tell me how to automate this?
What I’m doing is: while processing the partition, I set the process option to ProcessAdd and, in the configuration settings, a query such as:
Select * from TableName;
Is this correct? Please share your inputs.
Thanks
Do you only have a single partition within this measure group? If so then you need to do a ProcessFull.
You could do a ProcessAdd, but you would have to restrict the query to contain only the new records that you want to add, making sure none of them have previously been processed into the partition. This can just be a WHERE clause filtering by a date/time etc.
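For example, a sketch with hypothetical table and column names:
-- Only pull rows loaded since the partition was last processed
SELECT SalesKey, OrderDateKey, SalesAmount
FROM dbo.FactSales
WHERE LoadDateTime > '20150101';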
In reality it’s usually easier to create multiple partitions, and then just selectively process those partitions which need updating. Normally partitions are created by week, month or year, depending on your data volume.
Automating this is not that straightforward and requires a more detailed discussion. If you’d like our help in this let me know and I can email you.
Yes please email me.
Also, we’re in a test phase, creating only a single partition to process incremental data automatically and minimise the processing time. We already use a multiple-partition approach, but because of the very large data volume, a “ProcessFull” or “ProcessUpdate” of a partition can take a day or two. Hence we thought to implement this ProcessAdd approach so that only newly added data gets refreshed, which in turn requires less processing time.
Below is the direct link for ProcessAdd:
https://www.mssqltips.com/sqlservertip/3073/sql-server-analysis-services-processadd-for-a-partition/
Also, can you tell me if Tabular and Multidimensional can be implemented together? Right now we’re using the Multidimensional model; would it be good to migrate to Tabular, or can a single cube use both approaches?