This site will no longer be updated and my blog has been redirected to
Thanks and Happy New Year!
I wanted to write this around Halloween, when the title would have been more fitting. I also thought about changing it to "The Grinch Who Stole the BI Project Delivery Date," but I didn't really like it, even though it is funny (or not so funny when it happens to your project).
If you have ever been on a Business Intelligence project, or any Information Technology project for that matter, it is no secret that scope creep is a killer and can send any project into a tailspin. Some of you might not agree with my next statement about who is responsible for ensuring this does not happen, which is okay; I welcome any healthy debate on the subject. The person responsible is the Project Manager.
One of the many duties of a Project Manager is to ensure that the delivery date of a BI project is met. For this to happen, a project plan must be crafted with milestones that collectively build toward the delivery date, with some buffer room because things happen. The following milestones are examples and should not be followed verbatim: first, a fully defined scope containing business requirements; second, technical specifications created from those requirements; third, the architecture or design the engineers produce; fourth, the build; and fifth, testing and verification of the solution.

So the VP of Marketing, who is paying for the project out of his budget, comes to you when you're near the third or fourth milestone and says (via email), "You know that marketing spend per product measure we talked about two months ago? We need to add this, that, and the other because it's not really complete as we defined it, so please do that and add these additional figures." This very situation has happened to me more than once, and sometimes you are lucky to get an email and not just a phone call. This is how it usually starts; then you immediately go to the team and ask if we can do this because the VP said it needs to be done, and so on and so forth.

The best way to handle this situation is to document the exact conversation about what is desired, get most of the questions out of the way up front, then meet with your project team and explain the situation. Have the team provide an honest assessment with some buffer room, and then communicate their detailed feedback back to the VP. I am not saying he or she will be happy if you tell them the team needs an additional three weeks, but they will understand if you communicate the facts, and you will have just done your job of managing the project. Just saying yes has the potential to blow up your well-oiled plan to deliver on time.
Also, be creative: explain that the three weeks can be added to the plan but will delay the delivery date, or that you can circle back and address the request in an additional phase after the initial delivery date.
There are also some things you can do to mitigate scope creep. Keep the time between scoping and rollout as short as possible, with smaller, more iterative subject-area releases. The longer it takes your team to deliver a solution, the more you expose your project to business changes: requirements that were valid six months ago may not be today. I have had conversations with business folks who say it all needs to be released at the same time because it is all very important. When you discuss the possibility of the business changing if they wait six months, and show that their business measures really can be segregated into, say, four areas released separately, they will see what you mean. They want the solution sooner rather than later in most cases, and in my experience it is not all collectively critical; certain pieces matter more than others. Here is another area where you have just managed the project successfully.
When I go into a client to assess a project that is in trouble, it usually has experienced scope creep once or multiple times. You can throw more resources at a project to shorten the time to delivery, but there still needs to be some re-scoping, or the added resources themselves have the potential to become another source of scope creep. Your original plan was probably devised around your planned resources (if done correctly), and tasks now have to be reassigned to the new resources to make the addition useful. Don't forget that it will take the new resources some time to get up to speed. I hope this helps those who manage BI projects, or IT projects in general.
Happy Holidays as well. I want to write another blog entry before the Holidays get here, but who am I kidding: this one was supposed to be done for Halloween 🙂
Have you ever had one of those conversations that makes you say, "Huh, that's a strange coincidence"? I figured I would blog about a conversation I had on the differences between mirrored databases and an ODS. This might seem like, "come on, what's this guy talking about, it's obvious"; yep, I thought the same thing, until a client asked me to explain further and we got into it, with a surprise.
So let's define each. A mirrored database is an exact copy of a source database kept in sync on another server, typically for availability or to offload reads; nothing is combined or transformed. An Operational Data Store (ODS) is a database that brings current operational data from one or more source systems together in one place, usually with some integration, for operational reporting and querying.
My client said, "We have five different databases that do not use the same data. But they can all be tied together, as one database captures transactions that feed another database for a different portion of our business, that feeds another, and so on and so forth. If we don't combine the sources, is it really, truly an ODS, or is it just a mirror?" I had a good long pause on this one, then smiled and said, "Yes." My client said, "What do you mean, yes? Which one is it?" I said, "Pick one. Your database systems are inherently designed in a way that, if mirrored, you essentially have an ODS. The databases don't have any duplicate data elements in different structures, and they are all contained in one location (a single server in this case). They can be queried with the data combined easily, so you have an ODS, but all it took to get you one was to build a mirror of your five databases." I also explained that theirs is a rare case: usually data is duplicated in different systems with a twist on it (more or different attributes and data), and some transformation or cleansing is usually required.
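To make the "queried with the data combined easily" point concrete, here is a minimal sketch of what reporting against that mirror server could look like. The database, table, and column names are entirely hypothetical; the point is only that, with all five mirrored databases on one server, three-part names let you join across them as if they were one ODS:

```sql
-- Hypothetical sketch: on the mirror server, the replicated databases
-- can be combined in one query using three-part names.
-- OrdersDB, FulfillmentDB, BillingDB and their tables are made up for illustration.
select o.OrderID,
       o.OrderDate,
       f.ShipmentID,
       b.InvoiceAmount
from OrdersDB.dbo.Orders o
join FulfillmentDB.dbo.Shipments f on f.OrderID = o.OrderID
join BillingDB.dbo.Invoices b on b.OrderID = o.OrderID;
```

In a more typical environment the same join would first require conforming duplicated or differently shaped data, which is exactly the transformation work this client happened not to need.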
So my point is that mirroring your transactional (OLTP) source systems doesn't by itself mean you have an ODS, but in rare instances like this one it may. Jen Stirrup has a really good blog post about SQL Server Denali and the place of the ODS that gets into a lot of ODS detail.
I ask this question, how big are your databases, at every client that hires me to enhance their Microsoft SQL Server Business Intelligence solution. It seems like a pretty easy and straightforward question, right? You would be surprised how many companies, or BI teams within companies, struggle to answer it easily or have up-to-date information. The answer usually requires a DBA to find, copy, or manually create an Excel sheet containing the data file information, then use formulas to sum it all up. There is nothing wrong with this approach, but it is time consuming.
There are usually several data marts, or a warehouse that is physically split across many different databases, and it can take a few days to figure this out if there are many data files. Another question I ask is: how full are your data marts or warehouse? You would be surprised how many companies or BI teams struggle with this information, or just don't know that it is important. It can help with capacity planning and performance as well: if your data files have to grow a few times a week to meet increasing data needs, you are taking a performance hit each time a data file has to grow.
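As a quick companion check on that file-growth point, a query along these lines shows every data and log file's current size and its autogrowth setting. This is a sketch, not part of the script below, and it assumes SQL Server 2005 or later, where the sys.master_files catalog view exists:

```sql
-- Current size and autogrowth setting for every database file on the instance.
-- Requires SQL Server 2005+ (sys.master_files); size/growth are stored in 8 KB pages.
select DB_NAME(database_id) as DatabaseName,
       name as LogicalFileName,
       type_desc as FileType,              -- ROWS (data) or LOG
       size * 8 / 1024 as SizeMB,
       case when is_percent_growth = 1
            then cast(growth as varchar(10)) + ' %'
            else cast(growth * 8 / 1024 as varchar(10)) + ' MB'
       end as GrowthSetting
from sys.master_files
order by DatabaseName, FileType;
```

Files that grow by a tiny fixed increment, or by a percentage on a very large file, are the ones most likely to be growing several times a week.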
To simplify answering these questions, I wrote a script a few years ago that makes the answers easy to attain. Disclaimer: I gathered some of the guts of the #datainfo T-SQL from a SQL Server forum a few years ago. Credit goes to its author, as it got me thinking to create this script, and his or her thought process was very cool.
I hope this script helps those who struggle with these questions. It can obviously be used to answer any database space question, or modified to suit your needs rather easily.
--This script will gather all the database sizes and log file sizes on a SQL server, rolled up.
set nocount on

create table #datainfo (
    [databasename] varchar(250),
    [datafile] varchar(250),
    [totaldbspace(mb)] numeric(10,4),
    [useddbspace(mb)] numeric(10,4),
    [freedbspace(mb)] numeric(10,4))

create table #loginfo (
    [databasename] varchar(150),
    [totallogspace(mb)] numeric(10,4),
    [logspaceused(%)] numeric(10,4),
    [status] int)

insert into #loginfo
exec ('dbcc sqlperf(logspace) with no_infomsgs')

select database_id, name
into #dbs
from sys.databases

declare @dbname varchar(250)
declare @dbid int
declare @sql varchar(5000)

set @dbid = 1
while @dbid <= (select max(database_id) from #dbs)
begin
    --database_ids can have gaps (dropped databases), so only process ids that exist
    if exists (select 1 from #dbs where database_id = @dbid)
    begin
        select @dbname = name from #dbs where database_id = @dbid
        set @sql = 'use ['+@dbname+']
            create table #showfilestats (
                fileid int,
                filegroup int,
                totalextents float,
                usedextents float,
                name varchar(150),
                filename varchar(150))
            create table #helpfile (
                name varchar(150),
                fileid int,
                filename varchar(500),
                filegroup varchar(100),
                size varchar(150),
                maxsize varchar(150),
                growth varchar(150),
                usage varchar(100))
            insert into #helpfile exec (''sp_helpfile'')
            insert into #showfilestats exec (''dbcc showfilestats with no_infomsgs'')
            insert into #datainfo
            select '''+@dbname+''',
                h.name,
                (totalextents*64/1024),
                (usedextents*64/1024),
                ((totalextents - usedextents)*64/1024)
            from #showfilestats s
            join #helpfile h on h.fileid = s.fileid
            drop table #showfilestats
            drop table #helpfile'
        exec (@sql)
    end
    set @dbid = @dbid + 1
end

--This can get you all the raw data, without summing, if you want to see individual files, etc.:
--select *
--from #datainfo d
--join #loginfo l on l.[DatabaseName]=d.[DatabaseName]

select d.DatabaseName as DatabaseName,
    SUM(d.[TotalDBSpace(MB)]) as [TotalDBSpace(MB)],
    SUM(d.[UsedDBSpace(MB)]) as [UsedDBSpace(MB)],
    SUM(d.[FreeDBSpace(MB)]) as [FreeDBSpace(MB)],
    SUM(l.[TotalLogSpace(MB)]) as [TotalLogSpace(MB)],
    --LogSpaceUsed(%) is a percentage, so divide the total by 100 when converting to MB
    SUM(cast(l.[LogSpaceUsed(%)]*(l.[TotalLogSpace(MB)]/100) as numeric(10,4))) as [UsedLogSpace(MB)],
    SUM(cast(l.[TotalLogSpace(MB)] - (l.[LogSpaceUsed(%)]*(l.[TotalLogSpace(MB)]/100)) as numeric(10,4))) as [FreeLogSpace(MB)],
    SUM(d.[TotalDBSpace(MB)]+l.[TotalLogSpace(MB)]) as [TotalSpaceLogAndDB(MB)]
from #datainfo d
join #loginfo l on l.[DatabaseName]=d.[DatabaseName]
group by d.DatabaseName

drop table #loginfo
drop table #datainfo
drop table #dbs
GO
The results look like:
A common question my clients ask when they purchase hardware for the ETL layer of a Business Intelligence architecture using SQL Server Integration Services (SSIS) is, "Does this server need to be attached to SAN disk or a dedicated external disk, or will the server's local disk suffice?" The answer is "it depends," and I will explain why.
Disclaimer: I usually recommend a dedicated SSIS server, used exclusively for running ETL packages, for larger environments or for smaller environments that have the potential to become large. For throughput, recovery, etc., SAN disk or a dedicated external disk is usually a better option than local server disk. With budgets being tight in this economy it is not always an option, so I won't make this a "SAN or not" article, because that also depends.
I find that there have been two schools of thought regarding SSIS package design in the last few years. The first perspective usually comes from the Database Engineers and DBAs, who think that anything related to data should, for the most part, use Execute SQL Tasks, utilizing the database engine with optimized Transact-SQL (T-SQL). The second perspective comes from the SSIS Developers, who believe packages should use lookups, data flow tasks, script tasks, raw data files, etc., and that Execute SQL Tasks should be used as little as possible.
Another disclaimer: I can relate to both perspectives, as I have filled both roles on BI projects. This is a debate I have seen at more than one client, and I am sure others have this discussion as well, so why not blog about it and hopefully help.
Drum roll please: the answer is that both the Database Engineers/DBAs and the SSIS Developers are correct. From the SSIS Developer's point of view, everything is done and contained within the SSIS packages, and local disk will suffice, since most operations are done in memory and there is no need for SAN or dedicated external disk. From the Database Engineer's/DBA's point of view, the SSIS server can be configured with the database engine present, attached to SAN or dedicated external disk, and use staging database(s) that run optimized T-SQL within Execute SQL Tasks.
The point is that this is not a clear-cut question and answer; it really depends on many factors. If you are the person making the architectural decisions for the SSIS server, you should understand how the packages will be designed first in order to answer this question correctly.
This is my first blog post and surely won’t be my last. I have talked about creating a SQL BI blog for a few years and over the last few weekends I finally gathered my thoughts and documents I wrote over the years to share, so here goes the RTM version. If there are any errors, they will be fixed in SP1 🙂
A common situation I encounter at various clients is heavy reporting performed out of the same transactional databases that power revenue-driving sites. This usually hinders performance and decreases revenue generation, which is not a good situation. The cost of additional hardware or additional SQL Server licenses is sometimes hard to justify, but if it's possible to show that slow transactional throughput hinders revenue generation, or that the lack of timely reporting limits visibility into additional revenue opportunities, that might be a good place to start.
So now that you have wowed the folks with budgets who can buy servers and SQL Server licenses with this amazing information, how do you take the data and have it duplicated to your new SQL Server in real time so reporting can be performed there? (Real time was the requirement for these particular clients.) The answer is simple: transactional replication. I am using transactional replication in this example because the last two times I provided this architectural solution to clients, one was still using SQL 2000 and the other only needed a few tables of a very large SQL 2005 database. You could also use snapshot replication, database mirroring (SQL 2005 or later), or even log shipping. It all depends on the requirements; those options won't be covered in this example, and you should research each one for your situation.
To implement SQL Server replication you need to set up the database server you would like to replicate as a publisher; a distributor, which manages the transactions in terms of reading, storing, and sending them; and a subscriber, which contains the duplicated database(s) you will use for reporting. Clients I have provided consulting services for struggled with the next step of the replication architecture on more than one occasion: where to configure the distributor and subscriber. This is where it gets interesting, because you may or may not need a separate server for each. As a general rule of thumb, for smaller publisher database implementations (small in size, small number of concurrent users, small number of transactions), I usually place the distributor and subscriber together on one server, separate from the publisher. For larger implementations (large in size, large number of concurrent users, large number of transactions), I usually place the distributor on its own server and the subscriber on its own server, both separate from the publisher. I'll say it again: always separate the distributor and subscriber from the publisher, as that usually avoids additional overhead on the publisher. Your original goal was to take load off the publisher with this architectural solution; you are dividing and conquering, not adding load and then dividing and conquering, in my opinion. You should test your scenario thoroughly before deciding on the final replication architecture and see what works best for you.
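For orientation, the moving parts above map to a handful of system stored procedures. This is a hedged sketch of the sequence, not a complete setup script: all server, database, publication, and article names are hypothetical, and each section runs on the server indicated in the comments:

```sql
-- On the distributor server (names below are made up for illustration):
exec sp_adddistributor @distributor = 'DistServer';
exec sp_adddistributiondb @database = 'distribution';

-- On the publisher: enable the database for publication and create one.
exec sp_replicationdboption @dbname = 'SalesDB',
     @optname = 'publish', @value = 'true';
exec sp_addpublication @publication = 'SalesPub',
     @repl_freq = 'continuous';   -- continuous = transactional replication

-- Add the table(s) to replicate as articles.
exec sp_addarticle @publication = 'SalesPub',
     @article = 'Orders', @source_object = 'Orders';

-- Still on the publisher: point a subscription at the subscriber server.
exec sp_addsubscription @publication = 'SalesPub',
     @subscriber = 'SubServer', @destination_db = 'SalesReporting';
```

Whether 'DistServer' and 'SubServer' are one machine or two is exactly the small-versus-large sizing decision discussed above.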
One more transactional replication recommendation: after your publisher, distributor, and subscriber are all set up and you begin to add the tables (articles) you would like to replicate, make sure you DO NOT use a separate distribution agent per publication if you set up what I call per-table publications. The reason is parent-child relationships. For example, if a parent table is heavily transactional and the child table is lightly transactional, the child records can be delivered to the subscriber database before the corresponding parent records; reports that join the two for a complete data set will then fail to make the join and return missing results.
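In setup terms, this recommendation corresponds to sp_addpublication's @independent_agent parameter: leaving it at 'false' means the publications on a database share one distribution agent, so rows arrive at the subscriber in commit order across publications. A hedged sketch, with hypothetical publication names:

```sql
-- Two per-table publications sharing one distribution agent
-- (@independent_agent = 'false'), so parent and child rows are
-- delivered to the subscriber in commit order relative to each other.
exec sp_addpublication @publication = 'OrdersPub',
     @repl_freq = 'continuous', @independent_agent = 'false';
exec sp_addpublication @publication = 'OrderLinesPub',
     @repl_freq = 'continuous', @independent_agent = 'false';
```

Setting @independent_agent = 'true' on each publication instead would give each its own agent, which is exactly the configuration that can deliver child rows ahead of their parents.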
I feel like I could write a lot more on this topic, with code examples and so on, but I am trying to keep it architectural in nature. Also, there are always exceptions to the above, but generally speaking, in my experience, this is what worked for my clients. I hope this helps you divide and conquer your transactional database performance and reporting.