Pitfalls of assigning a wrong data type to a database column

A recent debugging session on a web application surfaced a recurrent issue in database design – that of assigning a wrong data type to a database field.

Storing numeric data in a numeric field, when is it not recommended?

Many programmers, when designing a database schema, default to using a numeric data type for a numeric field – like zip codes, telephone numbers, age, price etc. However, not every numeric value needs to be stored as a numeric type. Many times it is prudent to store numeric values as a ‘text’ type – ‘varchar’ to be specific.

Take the case of a zip code – it usually gets assigned a numeric type. However, if you check zip codes in the US you can see that many of them start with a zero. A few random examples are given below.

07097	JERSEY CITY
07403	BLOOMINGDALE
00610	ANASCO
01541	PRINCETON

I’ve a database of around 43,000 US zip codes, out of which 3,000 (~ 7%) zip codes start with a zero. Saving this zip codes in a numeric field will truncate the initial zeros, losing information in the process.

This actually happened with a recent client project. I was debugging a legacy web application which was used by ground service teams to assign service personnel for home TV installations. Upon arrival of a new request the technician closest to the client location was assigned the job. This required calculating distances using the client zip code and the technicians location. However the application was sometimes giving wrong distances, giving the location distance as thousands of miles when actually it was just 12 miles. Digging around it was noticed that the corresponding zip codes were truncated and hence the wrong distances.

The data loss probably occurred when importing a ‘zip codes’ CSV into the database. As the database ‘zip codes’ field was set to a numeric type, while importing the initial zeros got truncated.

So when should numeric data be stored in a numeric type field and when in a ‘text’ type. The rule of thumb is that if you are going to do calculations on the numeric data – addition, subtraction, averaging etc. than use a numeric field type to store the data. You do not, or rather cannot, make any meaningful calculations on a zip code or a telephone number, so it is wise to store it as a text ‘char’ type.

Contrast zip codes with a ‘age’ field. One can do various calculations on it – average, min, max, standard deviation etc. So one can, or should, use a numeric field type for ‘age’.

6 thoughts on “Pitfalls of assigning a wrong data type to a database column

  1. Sameer,

    I may be wrong, yet, text type must usually be avoided when possible for a better performance.

    As far as I know, MySQL can not create temporary tables in memory when a table/query includes text types and creates them on-disk which is a huge performance killer on active databases.

    When the max character size of a column can be estimated, using varchar for text input is usually a better choice to avoid this.

  2. Thanks for the view. However, I was referring to a ‘text’ type in a generic sense, which includes varchar, text, tinytext etc.

  3. importing the initial zeros got truncated

    Mathematically, the leading zero is irrelevant. It is only an output formatting issue.

    In my opinion, putting an inherently numeric value into a CHAR field is an inappropriate “fix,” which should have been done by properly formatting the output and properly cleansing the input.

    As for importing CSV data, it must be properly “cleaned” before import. I think it is more likely that basic 5-digit Zip and extended nine-digit Zip codes were mixed in the input, possibly even including a dash in between the two sub-fields. That would indeed cause “truncation,” whereas a leading zero should not.

    A more important reason to use a numeric field is computational efficiency. If you’re doing sorting and selecting on zip code fields, comparing INTs will be hundreds of times faster than comparing CHARs. VARCHAR has other computational overhead, in memory allocation and string-end-finding. And working with fixed-length CHAR fields is still potentially subject to “truncation” if the size is incorrect and if CSV input data has not been cleansed. (Is it five bytes, or nine bytes? Does it include a dash, or not?)

  4. Thanks. Some valid points made. However, although leading zero is irrelevant mathematically, it is still relevant in a zip code. “ZIP codes aren’t numbers — they just happen to be coded with a restricted alphabet”. More reference on this issue:

    https://stackoverflow.com/questions/893454/is-it-a-good-idea-to-use-an-integer-column-for-storing-us-zip-codes-in-a-databas

    Also, agreed on the optimization issue. However for me, it is better to be correct than fast.

  5. The article doesn’t say much about the core of the issue. “The corresponding zip codes were truncated and hence the wrong distances. The data loss probably occurred when importing a ‘zip codes’ CSV into the database. As the database ‘zip codes’ field was set to a numeric type, while importing the initial zeros got truncated.” is a very poor description. It isn’t clear why changing the type would solve the truncation. Maybe the truncation occurred in the import script and had nothing to do with the database type. Also, why truncating the data (e.g. 07097 -> 7097) should cause any problems if other parts of the app take it into account?

    Storing ZIP codes in text columns leads to risks of storing invalid ZIP codes such as ‘ABC’. It’s also inefficient. The correct datatype for US ZIP codes is DECIMAL(5) ZEROFILL.

Leave a Reply

Your email address will not be published. Required fields are marked *