Semantic Normalization of Data

Every database programmer knows about database normalization – the process of organizing the columns and tables of a relational database to minimize data redundancy. Semantic normalization is the process of reshaping free-form user input into a canonical form. I first encountered the concept in Nathan Marz’s Big Data: Principles and best practices of scalable realtime data systems.

For example, different users may enter ‘San Francisco’ as ‘SF’, ‘San Francisco’, ‘SF CA’, or ‘North Beach’ (a neighborhood in the northeast of San Francisco). A semantic normalization algorithm should map all of these to ‘San Francisco’. The key point is that data should be normalized before it is stored in the database.
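
A minimal sketch of what such a lookup-based normalizer might look like in Python follows; the alias table and the normalize_city function are my own illustrative assumptions, not a production algorithm:

```python
# A minimal sketch of lookup-based semantic normalization.
# The alias table below is illustrative; a real system would need
# a far larger mapping, likely backed by an external data source.

CITY_ALIASES = {
    "sf": "San Francisco",
    "sf ca": "San Francisco",
    "san francisco": "San Francisco",
    "north beach": "San Francisco",  # neighborhood rolled up into its city
}

def normalize_city(raw: str) -> str:
    """Map free-form city input to a canonical name before storing it."""
    # Collapse whitespace and case so 'SF', ' sf ' and 'Sf' all match.
    key = " ".join(raw.strip().lower().split())
    # Fall back to the cleaned-up input when no alias matches,
    # so unknown values are preserved rather than silently dropped.
    return CITY_ALIASES.get(key, raw.strip())

print(normalize_city("SF CA"))        # San Francisco
print(normalize_city("North Beach"))  # San Francisco
```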

In web development we usually use a pre-filled drop-down of city or state names to avoid this kind of ambiguity. In some cases with free-form input – entering addresses, for example – it is better to normalize the data before storing it. Of course, this is easier said than done, as the algorithm can get quite complex. One example of semantic normalization I encountered in the past was entering a shipping address on Amazon.com: Amazon neatly corrected the address I had entered, adding a few numbers, deleting others, and so on.

The question now is whether to store the corrected address or the original one entered by the user. In the Amazon example, it is obviously the corrected address that should be stored. In other cases it is not so clear. For example, when extracting data from unstructured HTML text, should we store the complete HTML string or only the extracted data? As Nathan Marz explains in his book, it depends.

As a rule of thumb, if your algorithm for extracting the data is simple and accurate – like extracting an age from an HTML page – you should store the results of that algorithm. If the algorithm is subject to change, due to improvements or broadening requirements, store the unstructured form of the data.
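
As a rough illustration of that trade-off – the markup, field names, and regular expression here are assumptions made for the sake of the example:

```python
import re
from typing import Optional

def extract_age(html: str) -> Optional[int]:
    """Simple, accurate extraction: storing only its result is safe."""
    match = re.search(r'<span class="age">(\d+)</span>', html)
    return int(match.group(1)) if match else None

def record_profile(html: str) -> dict:
    """When the extraction logic may change, keep the raw HTML too,
    so an improved algorithm can reprocess the original data later."""
    return {
        "age": extract_age(html),  # normalized, queryable field
        "raw_html": html,          # unstructured source of truth
    }

profile = record_profile('<span class="age">34</span>')
print(profile["age"])  # 34
```

Keeping the raw HTML costs storage, but it buys the ability to recompute every derived field when the extraction logic improves.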

In this age of big data, with records running into the millions, semantic normalization algorithms can be hard to code, often requiring external data sources to validate and normalize against. I have yet to design a satisfactory algorithm for one of my projects, and only the experience of coding one will give me the insights for a better version.