Monday, 20 July 2015

Solving data problems cost effectively

I believe that most bad data can be categorised as due to one of:

  • Human data entry issues
  • Bad processing of data we hold
  • Corruption of data in storage
The last of these is a longer topic, and whilst there are mitigation strategies they are mostly a cost / spending exercise to achieve the required level of risk reduction. Entry issues have a few mitigation strategies - double entry can be used (principally used historically in finance audit systems) to prevent typos, but in an increasingly real-time data collection this is less practical. However what can be used are methods such as field validation rules, and repeating back to the user related data, to check the details. An example might be to look up the postcode a user entered and present a selection of valid addresses (assuming you have a source of these) for that post code. Thus, if an incorrect postcode has been entered the mistake should be obvious immediately, allowing it to be checked. Another method would be to indicate the location on a map. The method of checking will depend on the way that we have of communicating with the person from whom the data is collected. Part of the goal here is not only to ensure that our data is not erroneous, but that we do actually have the correct data. If you think of a scenario where we are collecting data from people in Plymouth, who will all have a post code starting PL... then if a users enters a post code starting PO.. (Portsmouth) then this is clearly incorrect - however it doesn't tell us what the correct post code actually is. If looking at questions of a more view based type, it may be possible to ask a similar question in two ways and cross check the answers.

Once we have the correct data then we need to ensure the correct processing of it, and it is here that the issues can get costly. The cost of getting data is small compared to the cost of getting it *again* if that data is lost - not only is there a monetary cost but also a cost to reputation. So this implies that we need to ensure our processing of data is valid. The best defence we have against the human quality of fallibility is testing, and more generally quality control. I would suggest that any software (most data processing is now done in software, but the principles can still be applied to manual processes) that is not meeting a basic quality standard should not be judged fit for use. Such standards, whilst defined to meet your project / organisational needs, should consider:
  • Have we identified the requirement this fulfils?
  • Does the documentation explain how we fulfil it?
  • Does the testing check all aspects of the requirement?
  • Does the testing check for entry of unusual data?
  • Does the unit of code work well on it's own?
  • How about if we put it within the overall system?
  • Is it maintainable (i.e. according to policies, design standards, etc.)?
  • Has someone else code-reviewed it, and the documentation and test, to check for errors?
The last point is important because a test failure would simply imply that the results didn't match what the test expected - it is up to the requirements documentation (or business representative) to arbitrate as to which is incorrect, the test or the code. I believe that all of these are valid and should form your definition of when code is "done" however you develop your solution - a traditional waterfall approach, an Agile methodology, or a more ad-hoc process. By catching these issues before a product is released, and actively processing real data, we remove the costs associated with loss of data, or correcting for accuracy problems which result from our processing.

Whilst we have above covered some basics this is clearly a much broader topic; it amazes me that I still see organisations who do not have robust processes to ensure their data processing does not only meet the legal obligations such as the Data Protection Act (such as Principal 4, accuracy), but also avoid compiling information inaccurately and therefore leave themselves open to making decisions on a flawed set of data, potentially invalidating the decision, and risking significant financial and reputation damage.