Monday, 20 July 2015

Solving data problems cost effectively

I believe that most bad data can be categorised as due to one of:

  • Human data entry issues
  • Bad processing of data we hold
  • Corruption of data in storage
The last of these is a longer topic, and whilst there are mitigation strategies, they are mostly a cost / spending exercise to achieve the required level of risk reduction.

Entry issues have a few mitigation strategies. Double entry (used historically, principally in finance audit systems) can prevent typos, but as data collection increasingly happens in real time this is less practical. What can be used instead are methods such as field validation rules, and repeating related data back to the user so that they can check the details. An example might be to look up the postcode a user entered and present a selection of valid addresses (assuming you have a source of these) for that postcode. Thus, if an incorrect postcode has been entered, the mistake should be obvious immediately, allowing it to be corrected. Another method would be to indicate the location on a map. The method of checking will depend on the channel we have for communicating with the person from whom the data is collected.

Part of the goal here is not only to ensure that our data is not erroneous, but that we do actually have the correct data. Consider a scenario where we are collecting data from people in Plymouth, who will all have a postcode starting PL...; if a user enters a postcode starting PO... (Portsmouth) then this is clearly incorrect. However, it doesn't tell us what the correct postcode actually is. For more opinion-based questions, it may be possible to ask a similar question in two ways and cross-check the answers.
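
As a rough illustration of the Plymouth postcode check described above, a field validation rule might look something like the sketch below. The function name `check_postcode`, the `expected_area` parameter and the simplified pattern are my own assumptions for the example; the pattern only approximates the general shape of a UK postcode and is not the full official format.

```python
import re

# Simplified shape of a UK postcode: area letters, district, optional space,
# then the inward code. An approximation for illustration, not the full spec.
POSTCODE_RE = re.compile(r"^([A-Z]{1,2})(\d[A-Z\d]?) ?(\d[A-Z]{2})$")


def check_postcode(raw, expected_area="PL"):
    """Return (normalised_postcode, problems) for one entered value.

    normalised_postcode is None when the value does not look like a postcode.
    problems is a list of human-readable messages to show back to the user.
    """
    # Upper-case the input and collapse any runs of whitespace.
    cleaned = " ".join(raw.upper().split())
    match = POSTCODE_RE.match(cleaned)
    if match is None:
        return None, ["not a recognisable postcode format"]

    area, district, inward = match.groups()
    normalised = f"{area}{district} {inward}"

    problems = []
    if area != expected_area:
        # We know the entry is wrong for this collection, but not what the
        # right value is, so the user has to be asked again.
        problems.append(
            f"area {area!r} is outside the expected {expected_area!r} area"
        )
    return normalised, problems
```

A value such as "po1 3ax" comes back normalised, together with one problem to put in front of the user. As noted above, the rule can only flag the entry; it cannot supply the correct postcode.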

Once we have the correct data then we need to ensure the correct processing of it, and it is here that the issues can get costly. The cost of getting data is small compared to the cost of getting it *again* if that data is lost - not only is there a monetary cost but also a cost to reputation. So this implies that we need to ensure our processing of data is valid. The best defence we have against the human quality of fallibility is testing, and more generally quality control. I would suggest that any software (most data processing is now done in software, but the principles can still be applied to manual processes) that is not meeting a basic quality standard should not be judged fit for use. Such standards, whilst defined to meet your project / organisational needs, should consider:
  • Have we identified the requirement this fulfils?
  • Does the documentation explain how we fulfil it?
  • Does the testing check all aspects of the requirement?
  • Does the testing check for entry of unusual data?
  • Does the unit of code work well on its own?
  • How about if we put it within the overall system?
  • Is it maintainable (i.e. according to policies, design standards, etc.)?
  • Has someone else reviewed the code, the documentation and the tests, to check for errors?
The last point is important because a test failure simply implies that the results didn't match what the test expected - it is up to the requirements documentation (or business representative) to arbitrate as to which is incorrect, the test or the code. I believe that all of these are valid and should form your definition of when code is "done", however you develop your solution - a traditional waterfall approach, an Agile methodology, or a more ad-hoc process. By catching these issues before a product is released and starts processing real data, we avoid the costs associated with loss of data, or with correcting accuracy problems that result from our processing.
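
To make the "unusual data" item in the checklist above a little more concrete, here is a minimal sketch using Python's standard unittest module. The function `parse_age` and its plausible range of 0 to 130 are hypothetical, invented purely for the example; your own units and limits would come from the requirement.

```python
import unittest


def parse_age(raw):
    """Hypothetical unit under test: turn a form field into an age in years."""
    text = raw.strip()
    if not text.isdecimal():
        raise ValueError(f"age must be a whole number, got {raw!r}")
    age = int(text)
    if not 0 <= age <= 130:  # assumed plausible range, for illustration only
        raise ValueError(f"age {age} is outside the plausible range")
    return age


class ParseAgeTests(unittest.TestCase):
    def test_typical_value(self):
        # The ordinary case: surrounding whitespace is tolerated.
        self.assertEqual(parse_age(" 42 "), 42)

    def test_unusual_values_are_rejected(self):
        # Empty, negative, fractional, non-numeric and implausible entries.
        for raw in ["", "-1", "4.5", "forty", "999"]:
            with self.subTest(raw=raw):
                with self.assertRaises(ValueError):
                    parse_age(raw)
```

Saved as a module, this can be run with `python -m unittest`. If one of these tests fails, it is still the requirement that decides whether the test or the code is at fault.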

Whilst the above covers some basics, this is clearly a much broader topic. It amazes me that I still see organisations that do not have robust processes to ensure their data processing meets legal obligations such as the Data Protection Act (for example Principle 4, accuracy). They also risk compiling information inaccurately, leaving themselves open to making decisions on a flawed set of data, potentially invalidating those decisions and risking significant financial and reputational damage.

Tuesday, 14 April 2015

Information Vs Data Vs Security

A subject that has been on my mind recently is when data becomes information, and how the security of that asset is managed - and by whom.

We have probably all seen the triangle:
By Longlivetheux (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons

Which, as you will gather, denotes that Information is composed of (useful sets of) Data. There's also the concept, at least in UK law, of data, which consists of Information:

Data means information which –
(a) is being processed by means of equipment operating automatically in response to instructions given for that purpose,
(b) is recorded with the intention that it should be processed by means of such equipment,
(c) is recorded as part of a relevant filing system or with the intention that it should form part of a relevant filing system,
(d) does not fall within paragraph (a), (b) or (c) but forms part of an accessible record as defined by section 68, or
(e) is recorded information held by a public authority and does not fall within any of paragraphs (a) to (d).
https://ico.org.uk/for-organisations/guide-to-data-protection/key-definitions/

So it is clear that these terms are not defined exclusively, and that the terminology alone may present a problem.

The business often has a nominated Information Asset Manager, who looks after the types of information an organisation processes, and is accountable for them (perhaps to a Senior Information Risk Officer, or SIRO). The organisational policy may well give this person (by delegation from the SIRO) the last word on policy - however, this doesn't mean this person has the technical background to ensure that the policy is implemented, or that it is even practical.

There may also be someone in the IT infrastructure team whose remit is security - either a dedicated IT Security Officer, or perhaps someone who has this responsibility as part of their work. Again, this is quite a wide-ranging remit, and this person may feel the detail is better left to subject matter experts in specific systems, and simply lay down guidance on the principles to be observed.

This leaves the DBA with two potential places to go for security instructions, yet neither may understand the detailed technical processes necessary to actually carry them out, as their remit is either business focused or too wide to cover the specific technology of databases. That's where the DBA brings his or her expertise to bear, and as such brings value to the business.

Often the DBA role is also to mediate where the tension exists between data security and information use. This is because they are the guardian of the security on the data - and so the first place that gets approached when technical permission is denied. This brings to light the difference between a technical prevention measure, and a business policy allowing, forbidding, or laying restrictions on a course of action.

Whilst clearly data security will almost always win the balance between security and flexibility, the role of the data professional is increasingly to suggest a way to merge the data once permission is secured and in accordance with best practice. This is a difficult balancing act, which may not be fully understood by information users.

So how can we alleviate the issues this generates? Firstly - publicity. Explain what is necessary, and have a ready-to-use example of why you wouldn't like your own data to be misused. Perhaps an example might be of sensitive personal data being unexpectedly available to the world for misuse, or the potential for processing of data in a way it wasn't gathered for - both breaches of the DPA, but still things that get requested disturbingly often.

A second avenue is to engage with the other data professionals in a proactive way, and ensure that clear paths are laid down to get issues circulated for discussion - and engage with the business to resolve them. Generally, there's a good way to do things, and a bad way - for example, would anonymised data do for a task? Could the risk be mitigated somehow? Perhaps we can't do something with the data we have, but changing how it is captured to ensure the correct permissions are granted gives us the ability to do this in the future. If the policies don't exist (or haven't been reviewed recently) then these can be bolstered to ensure they are fit for purpose - and in line with recent changes to regulation.
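
On the question above of whether anonymised data would do for a task, one common middle ground is to hand over a pseudonymised extract. The sketch below is only that: the function `pseudonymise` and the field names `customer_id`, `name` and `address` are hypothetical. Note that this is pseudonymisation rather than true anonymisation, since anyone holding the key (or enough of the remaining fields) may still be able to re-identify people, so whether it is acceptable remains a policy decision for the business.

```python
import hashlib
import hmac


def pseudonymise(records, secret_key, id_field="customer_id",
                 drop_fields=("name", "address")):
    """Return copies of records with direct identifiers removed.

    The identifier is replaced by a keyed hash, so the same person always
    gets the same token and records can still be linked to one another.
    """
    result = []
    for record in records:
        # Drop the fields that identify a person directly.
        cleaned = {k: v for k, v in record.items() if k not in drop_fields}
        # Keyed hash (HMAC-SHA256) of the identifier; secret_key must be bytes.
        digest = hmac.new(
            secret_key, str(record[id_field]).encode("utf-8"), hashlib.sha256
        ).hexdigest()
        cleaned[id_field] = digest[:16]  # shortened token, for readability
        result.append(cleaned)
    return result
```

The key would need to be held by whoever guards the data, not shipped with the extract; otherwise the tokens are easy to recompute from known identifiers.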

With these approaches, we lay down a safe way of working that also keeps us within legal boundaries, providing value to the business on both counts.