Tuesday, 8 September 2015

Strategies for managing an enterprise - Policy Based Management (T-SQL Tuesday)

You may be aware of T-SQL Tuesday; it's a concept conceived by Adam Machanic ( b | t ) some years ago, to encourage a number of posts on a different specific topic on a monthly basis. This month's topic, as hosted by Jen,McCown, is Strategies for managing an enterprise.

A Strategy is a plan of action designed to achieve a long term goal or aim, so I thought I'd write about how I'd planned and achieved a solution to the problem of assuring myself and the business that our SQL servers met the policies and best practices we aspired to achieve.

I had recently been faced with the need to gain confidence in the state of a set of SQL Servers whose management hitherto has at best been a part time activity, and not owned by any one person or group. This had meant that, understandably, consistency and best practice have suffered, to the extent that a base level of audit and assurance is required. There's a few ways to achieve this, and I needed to come up with a strategy that would hit several goals.

Easy to implement

Technically advanced technologies are great, but sometimes you just need basic checks and the ability to build on the basis they provide. In our case, we needed to use the existing skills base to achieve our goal.


I didn't have the luxury of assuming there will be a full time DBA in the organisation in the future, and in any case we are always pushed to do more with less resource, so it is likely that any solution that involves manual steps will not be carried out regularly. We therefore needed an automated solution.


For any audit tool to be effective, particularly a tactical one, we needed to be able to provide a level of reporting, to allow the business to perceive the health of the assets, and have confidence that issues are addressed.


There's no point in having a solution that doesn't cover all the bases. We needed one that will allow us to expand on the 'out of the box' functions to do custom metrics.

Cost effective

There are a lot of valuable solutions out there that really do provide great monitoring for large server estates, allowing real-time automated resolution of issues, and confirmation of policy adherence. Our needs were more modest however, and it should be noted that the goal was to achieve the monitoring without spending out on licences; this is something we can aspire to once the base level of audit is in place.

So once these goals are defined, I looked at some options. Third party solutions like Nagios, which is a great monitoring tool I've used in the past, were not an option due to existing skills bases. I needed to use something that would use existing developer or administrative skillsets. I explored Policy Based Management (PBM) in SQL server, and decided this was a good avenue because it allowed simple best practices to be easily monitored, in some cases bad practices prevented, and could be easily rolled out to the enterprise via a central management server

Using PBM, you can write policies that prevent some behaviours, but due partly to existing codebases, and in some cases third party products that we can't modify, as well as a relatively relaxed regulatory framework, I've chosen to alert rather than prevent. I can write conditions that only target certain server classes, and have taken to using extended properties on databases to denote whether certain conditions apply (and on the master database to denote server environment characteristics, such as Production/UAT/Dev). This gives the flexibility to write one policy for a condition and have it apply wherever an applicable system is hosted in the enterprise.

This setup was still missing one part however - the visible bit. By default, when you have a non-compliant server, all you get is an icon in SQL Server Management Studio, and an error code fired which means you can alert via an event. This isn't even very visible if you know what you're looking for - and doesn't tell you what's wrong.

The solution was to use the Enterprise Policy Management Framework, a combination of Powershell and SSRS to query an estate of servers and present the results in a report, which could be displayed centrally (think intranet, or screen in the management office), or scheduled and circulated like other SSRS reports, and is simple to visually check on a regular basis - a screen full of red is bad!

Did it work? Certainly, and it allowed identification of issues such as backups being missed off the schedules, security permissions which were against policy, orphaned users and the like. This allowed the strategy to achieve the aim of giving confidence in the level of servers across the enterprise.

There are some areas for further thought, not least in terms of the security permissions required to run reports, which are necessarily high - however any system which monitors areas such as sensitive configuration will require a comparatively high level of access. I'm also looking at modifying some of the reports to better fit our needs, but this is a minor change given the data is now in place.

I found that adopting a strategy of implementing a monitoring solution to check the basics are in place has given me the confidence that we have a firmer base to build on. The strategies and goals they aim to achieve will evolve, as they all should, as time is freed up and the needs of the business dictate.  I've not given up on implementing a more integrated monitoring solution in the longer term, but this is a great way of assuring the business that the service we provide is fit for purpose, and can be relied upon.

Monday, 20 July 2015

Solving data problems cost effectively

I believe that most bad data can be categorised as due to one of:

  • Human data entry issues
  • Bad processing of data we hold
  • Corruption of data in storage
The last of these is a longer topic, and whilst there are mitigation strategies they are mostly a cost / spending exercise to achieve the required level of risk reduction. Entry issues have a few mitigation strategies - double entry can be used (principally used historically in finance audit systems) to prevent typos, but in an increasingly real-time data collection this is less practical. However what can be used are methods such as field validation rules, and repeating back to the user related data, to check the details. An example might be to look up the postcode a user entered and present a selection of valid addresses (assuming you have a source of these) for that post code. Thus, if an incorrect postcode has been entered the mistake should be obvious immediately, allowing it to be checked. Another method would be to indicate the location on a map. The method of checking will depend on the way that we have of communicating with the person from whom the data is collected. Part of the goal here is not only to ensure that our data is not erroneous, but that we do actually have the correct data. If you think of a scenario where we are collecting data from people in Plymouth, who will all have a post code starting PL... then if a users enters a post code starting PO.. (Portsmouth) then this is clearly incorrect - however it doesn't tell us what the correct post code actually is. If looking at questions of a more view based type, it may be possible to ask a similar question in two ways and cross check the answers.

Once we have the correct data then we need to ensure the correct processing of it, and it is here that the issues can get costly. The cost of getting data is small compared to the cost of getting it *again* if that data is lost - not only is there a monetary cost but also a cost to reputation. So this implies that we need to ensure our processing of data is valid. The best defence we have against the human quality of fallibility is testing, and more generally quality control. I would suggest that any software (most data processing is now done in software, but the principles can still be applied to manual processes) that is not meeting a basic quality standard should not be judged fit for use. Such standards, whilst defined to meet your project / organisational needs, should consider:
  • Have we identified the requirement this fulfils?
  • Does the documentation explain how we fulfil it?
  • Does the testing check all aspects of the requirement?
  • Does the testing check for entry of unusual data?
  • Does the unit of code work well on it's own?
  • How about if we put it within the overall system?
  • Is it maintainable (i.e. according to policies, design standards, etc.)?
  • Has someone else code-reviewed it, and the documentation and test, to check for errors?
The last point is important because a test failure would simply imply that the results didn't match what the test expected - it is up to the requirements documentation (or business representative) to arbitrate as to which is incorrect, the test or the code. I believe that all of these are valid and should form your definition of when code is "done" however you develop your solution - a traditional waterfall approach, an Agile methodology, or a more ad-hoc process. By catching these issues before a product is released, and actively processing real data, we remove the costs associated with loss of data, or correcting for accuracy problems which result from our processing.

Whilst we have above covered some basics this is clearly a much broader topic; it amazes me that I still see organisations who do not have robust processes to ensure their data processing does not only meet the legal obligations such as the Data Protection Act (such as Principal 4, accuracy), but also avoid compiling information inaccurately and therefore leave themselves open to making decisions on a flawed set of data, potentially invalidating the decision, and risking significant financial and reputation damage.