Musings on Data, Data Management and SQL Server

Tuesday, 12 September 2017

Using PowerShell to drive ETL

You may be aware of T-SQL Tuesday; it's a concept conceived by Adam Machanic ( b | t ) some years ago, to encourage a number of posts on a different specific topic on a monthly basis. This month's topic, as hosted by Rob Sewell ( b | t ), is Let's All Get POSH.

For some time I've used SSIS to move data from one place to another, as part of a wider ETL strategy. Recently however, I've faced some challenges to this model, and have been using PowerShell as an alternative.

So firstly, here's some of the challenges I've faced:

Support for the multiple data formats I need to connect to

SSIS is a Microsoft product, and therefore focused on manipulating data within the Microsoft environment. It does attempt to interact with other systems, for example by use of ODBC to allow connection to read data, but this is not always as successful as might be hoped.

Flat files, and Excel files etc. are great, as long as the file format remains the same. Increasingly however, the data files we are loading are changing in format, and incorporate factors such as multiple line fields which SSIS requires a lot of work to utilise properly.

Programmability and support

SSIS uses VB.Net or C# programming to provide its interactivity, and this does provide a level of flexibility. That said, it can be tricky to get the right level of function whilst not making the solution so fragile that it is an active barrier to change and presents a problem for support teams.

Testing

I like to know that what I'm deploying works as I expect it to. When making changes, I'd like to know I've not created any problems whilst I've been doing it. This means testing, and ideally in an automated way, so it's repeatable whenever a change to the system is required. However, I've not found a good way to unit test within SSIS, and this means that there is a testing overhead, or a larger risk to accept whenever there is a change to deploy.

A possible replacement

Well, the modular approach of PowerShell can give an ability to reuse smaller aspects of the system to allow you to create re-usable building blocks, and assemble them in any order, in a similar way to the tools you might have used in SSIS to interact with different DBMS systems.

PowerShell also has a unit testing framework (Pester) built in, which allows us to check each building block works as expected.

PowerShell modules are developed by the community and benefit from the collective experiences of all those contributors. This also means that we have the things that those contributors wanted - and for ETL, that often means an element of automatic discovery of the structure of data sources, whilst allowing a manual specification where that is helpful.

PowerShell is increasingly used in the Windows arena, lending to increased knowledge of the language for support purposes.

A flexible way forwards

So - how can I use this to get a solution to tackle the above limitations, and generate a flexible solution?

Well, we have built a number of functions to read from, and write to, the specific DBMSs that we interact with most - Microsoft SQL Server, MariaDB, Excel and flat files. These allow a simple read from, and write to, of each of the DBMSs. To introduce flexibility, we pull in the query to run as a parameter to each function.

These functions can then be called, in whatever order is required, to read a query from a configuration file, run it against a given DBMS to read data, and store it in a PowerShell object. Any necessary manipulation can be done at this point, before the file is then written into the target DBMS. The process can then read a second query from a configuration file and use it to perform any required transformation - for example to write into a destination file. This allows us to take either an ETL or ELT approach (transforming either after extract, or after load.

This use of functions have allowed us to test each one works in isolation, and now we have a set of proven functions, we can then re-use these building blocks, to repeatedly move data, whether the data is in the same format, or needs manipulation between the systems.

As an unintended benefit, the modular approach allowed a distribution of the development work among colleagues, to a defined way of working, providing a useful focus to colleagues learning to use PowerShell.

From a support point of view, the use of a standardised language understood by other teams such as DevOps, as well as operation support teams, means that we have more confidence that issues can be resolved promptly should any issues occur.

Sunday, 30 July 2017

Keeping track of databases

In most production settings, you end up with a number of databases. Some conform to a naming scheme - perhaps a name of a well used application, perhaps according to customer, perhaps according to some internal identifier. However, in most cases, the current scheme has not always been in place, and there's a collection of unknown databases (or ones that simply don't fit the pattern), which don't cause much trouble, but simply sit there, waiting to trip up the person who has rarely if ever interacted with that database, but is charged with supporting it.

As regulations such as the EU's GDPR come into focus, we are increasingly called to account for all the data we hold, particularly if it is personally identifiable. What this means, is we need to be able to identify what's in these 'miscellaneous' databases, and to what they relate. Frankly, it's a good practice anyway, so you can better manage the data that you hold. It's also helpful for maintenance work to understand the effect if you take a database instance offline; even if your organisation is a 24x7 operation, you will probably have databases that don't need to be online all the time - for example the payroll database (as the team who look at this only work in the day). You may have varying definitions of 'day', particularly if you cross timezones.

The question then is, how to identify what these systems relate to. You may have a documentation source for each database (I often find ours to be regrettably out of date), but even if this is in date, it is designed to be looked at for a specific system, so this still means trawling lots of documentation to find out what we want to know in this case - and the documentation doesn't automatically update as databases move.

The solution I've started to use is extended properties on the database. These are a place to put an sql variant value (which can hold text or numbers for example) properties of an entity, and which forms part of that object and can be scripted out with it - so they can be source controlled. They move with the database, so as it is backed up / moved, the properties stay with the object to which they relate.

So why doesn't everyone do this? Well, firstly, it's a schema change. This is a blocker in some organisations, or for some supplier-supported systems. I've found that asking, and explaining why, goes a long way here. Secondly, there's the question of how to use them.

Extended properties can be viewed in the SQL Server Management Studio GUI, however this is clunky at best, and I tend to use scripts to manipulate them, to ensure consistency.

Extended properties can be added, edited, or deleted using system stored procedures. To read them, a function (fn_listextendedproperty) is called, or a system view is available. The key problem here is how to read from all the databases you have in an estate, both within an instance and across multiple instances.
I use policy based management, with a check that looks for the specific extended property I use, to gather this information, but you could equally well query this from Powershell, as a property on the database object as outlined by Buck Woody.

Using properties like this mean you can also tie to other data you might have - like who is responsible for which systems in the wider business - to give a list of people to notify about potential issues to database instances, which can be updated without updating lots of documents; this means it's more likely to be up to date, when you need it most.

This list of databases, purposes and potentially business owners may also be helpful to your information assurance & governance teams, as they look to ensure they have a good understanding of what databases you hold, and why.