Tuesday 12 September 2017

Using PowerShell to drive ETL

You may be aware of T-SQL Tuesday; it's a concept conceived by Adam Machanic ( b | t ) some years ago to encourage a collection of posts on a specific topic each month. This month's topic, hosted by Rob Sewell ( b | t ), is Let's All Get POSH.

For some time I've used SSIS to move data from one place to another, as part of a wider ETL strategy. Recently however, I've faced some challenges to this model, and have been using PowerShell as an alternative.

So firstly, here are some of the challenges I've faced:

Support for the multiple data formats I need to connect to

SSIS is a Microsoft product, and is therefore focused on manipulating data within the Microsoft environment. It can interact with other systems, for example by using ODBC connections to read data, but this is not always as successful as might be hoped.
Flat files and Excel files are fine, as long as the file format remains the same. Increasingly, however, the data files we load change format over time, and include features such as multi-line fields, which SSIS requires a lot of work to handle properly.

Programmability and support

SSIS uses VB.Net or C# scripting to provide its extensibility, and this does give a level of flexibility. That said, it can be tricky to get the functionality you need without making the solution so fragile that it becomes an active barrier to change and a problem for support teams.

Testing

I like to know that what I'm deploying works as I expect it to, and that when I make changes I haven't broken anything in the process. That means testing, ideally automated so it's repeatable whenever the system changes. However, I've not found a good way to unit test within SSIS, which means either a manual testing overhead or a larger risk to accept whenever there is a change to deploy.

A possible replacement

Well, PowerShell's modular approach lets you build small, re-usable building blocks and assemble them in any order, much like the components you might have used in SSIS to interact with different DBMS systems.

PowerShell also has a unit testing framework (Pester) built in, which allows us to check that each building block works as expected - there's a small sketch of this below.
PowerShell modules are developed by the community and benefit from the collective experience of all those contributors. That also means they include the features those contributors wanted - and for ETL, that often means automatic discovery of the structure of a data source, while still allowing a manual specification where that is more helpful.
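To make that concrete, here is a minimal sketch of a building block and a Pester test for it. The function, its behaviour and the file names are invented purely for illustration - they aren't from our actual codebase - but they show how each block can be exercised in isolation before it's trusted in a load.

# ConvertTo-TrimmedRow.ps1 - a tiny, hypothetical transformation block
function ConvertTo-TrimmedRow {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [pscustomobject]$Row
    )
    process {
        # Trim whitespace from every string property before the row is loaded
        foreach ($property in $Row.PSObject.Properties) {
            if ($property.Value -is [string]) { $property.Value = $property.Value.Trim() }
        }
        $Row
    }
}

# ConvertTo-TrimmedRow.Tests.ps1 - the matching Pester test
Describe 'ConvertTo-TrimmedRow' {
    It 'removes leading and trailing whitespace from string fields' {
        $result = [pscustomobject]@{ Name = '  Bob  '; Age = 42 } | ConvertTo-TrimmedRow
        $result.Name | Should Be 'Bob'
        $result.Age  | Should Be 42
    }
}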

PowerShell is increasingly used across the Windows arena, which means more people know the language for support purposes.

A flexible way forwards

So - how can I use this to tackle the above limitations and produce a flexible solution?

Well, we have built a number of functions to read from, and write to, the data stores we interact with most - Microsoft SQL Server, MariaDB, Excel and flat files. Each function performs a simple read from, or write to, one of those stores. To keep things flexible, the query to run is passed into each function as a parameter.
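As a rough sketch of what the SQL Server pair might look like (the names and parameters here are illustrative, not our actual code), the read function takes a connection string and a query, and the write function bulk-loads the resulting DataTable into a named table:

function Read-SqlData {
    param(
        [Parameter(Mandatory)][string]$ConnectionString,
        [Parameter(Mandatory)][string]$Query          # the query is passed in, not hard-coded
    )
    $connection = New-Object System.Data.SqlClient.SqlConnection $ConnectionString
    $adapter    = New-Object System.Data.SqlClient.SqlDataAdapter $Query, $connection
    $table      = New-Object System.Data.DataTable
    [void]$adapter.Fill($table)
    $connection.Dispose()
    ,$table                                           # leading comma stops PowerShell unrolling the DataTable
}

function Write-SqlData {
    param(
        [Parameter(Mandatory)][string]$ConnectionString,
        [Parameter(Mandatory)][string]$TableName,
        [Parameter(Mandatory)][System.Data.DataTable]$Data
    )
    $connection = New-Object System.Data.SqlClient.SqlConnection $ConnectionString
    $connection.Open()
    $bulkCopy = New-Object System.Data.SqlClient.SqlBulkCopy $connection
    $bulkCopy.DestinationTableName = $TableName
    $bulkCopy.WriteToServer($Data)                    # fast bulk insert of the whole table
    $connection.Dispose()
}

The equivalent pairs for the other stores might lean on, for example, the MySQL/MariaDB .NET connector, Import-Csv and Export-Csv for flat files, or a module such as ImportExcel - the calling code doesn't need to care which.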

These functions can then be called, in whatever order is required: read a query from a configuration file, run it against a given data source, and store the result in a PowerShell object. Any necessary manipulation can be done at this point, before the data is written into the target system. The process can then read a second query from the configuration file and use it to perform any required transformation - for example, to write into a destination file. This allows us to take either an ETL or an ELT approach (transforming either after extract, or after load).
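Putting it together, the orchestration step might look something like the sketch below, re-using the hypothetical Read-SqlData and Write-SqlData functions from above; the config file layout, table and column names are assumptions made up for the example:

# etl-config.psd1 (assumed layout) holds the connection strings and queries to run
$config = Import-PowerShellDataFile '.\etl-config.psd1'

# Extract: run the source query and hold the result in a PowerShell object
$data = Read-SqlData -ConnectionString $config.SourceConnection -Query $config.ExtractQuery

# Transform: any in-flight manipulation happens here, e.g. renaming a column...
$data.Columns['cust_id'].ColumnName = 'CustomerId'
# ...or skip this step and transform after loading instead, for an ELT approach

# Load: push the shaped data into the target
Write-SqlData -ConnectionString $config.TargetConnection -TableName 'dbo.Customer' -Data $data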

This use of functions has allowed us to test that each one works in isolation, and now that we have a set of proven functions, we can re-use these building blocks to move data repeatedly, whether the data stays in the same format or needs manipulation between systems.

As an unintended benefit, the modular approach allowed the development work to be distributed among colleagues against a defined way of working, which provided a useful focus for those learning PowerShell.

From a support point of view, using a standardised language understood by other teams, such as DevOps and operational support, gives us more confidence that any issues that do occur can be resolved promptly.
