Monday 30 September 2013

A word on data types

I came across a situation today where there was clearly confusion as to data types and numeric representation.

The basic issue at hand was how to represent a large number in SQL Server.

Now, SQL Server has a number of data types, which are essentially ways of representing data so that SQL Server can understand something about them. Each bit of data in SQL server needs to have one of these types. Some of these are used for numbers, and that is where our problem starts.

There are specific ranges of data that can be represented in each, and a corresponding amount of storage used. Clearly this is a concern for DBAs and DB designers who need to calculate storage space for millions of potential records. For example, the Numeric / Decimal types have the following possibilities:

Precision
Storage bytes
1 - 9
5
10-19
9
20-28
13
29-38
17

What this really means is that a number with a precision of 10 takes up nearly twice the storage space of a number with a precision of 9, but that storage space is only affected when you get beyond certain values - which may mean that you can use more precision for no storage space cost. Precision here is the number of digits that are stored (the total number including those both to the left and right of the decimal point).

The general rule I use is that one should use an appropriate type for the meaning of the data (i.e. if you're storing a number, use a type that is recognised as a number, if you are storing text, use a character-based type, if you are storing a date and time, use a type designed for the purpose, etc).

Of course, there is an obvious problem here. What can we do to store a number that has 39 digits in it?

Well, you cannot store a number in a data type that is outside the range of it. You need to pick an appropriate value. There are two solutions here - The application could report an error as the data is not able to be represented, or (more sensibly) the appropriate data type should be used. For some applications, you may only care about the significant digits (and not the final accuracy) so a float data type may be appropriate.

These are decisions that ideally would be taken by a database designer before an application is commissioned with full knowledge as to the intended use of the field, and certainly changing data types takes communication between various parties as you would ideally change the data type in the application, and everywhere that data is used in the database, to minimise implicit conversions and any resulting performance issues or rounding / truncation.It is something to be very careful of in an agile environment where you may find that data changes during development.

A word on the storage of numbers. If you are looking to store the data '9E10' then this could be a string data type (a digit nine, an E, a one and a zero), perhaps an equipment model number, or it could be representing 90000000000 (9*10^10). This is one reason why the data type is important, and typed data is more meaningful than untyped data. It would be stored differently, depending on the data type, and to SQL Server these different interpretations are not necessarily the same information (although if implicitly converted they could be perceived as such). This can lead to unexpected behaviour from your application.

Dates have a similar ambiguity to them, but with the added ambiguity that you don't know if 3/2/13 represents the 3rd of Feb (UK format) or 2nd of March (US format). For this reason it is always a good idea to use a specific date-based data-type (there are some such as the Date type that only store the day, and others such as datetime, datetime2 or smalldatetime that store various precisions of time as well). If you must use a string type to represent data (for example in printed output) then it is advisable to use an unanbiguous format, such as ISO8601 detailed - CCYY-MM-DD.

I hope this has clarified some of the uncertainty over date types.