Quite often we come across situations where someone has a slow running database and it needs optimizing to make a web application or site work faster.

A common thing we see in a number of inefficient databases is a lack of thought on the datatypes used to define fields in database tables, particularly in tables that are just used to verify information.

An example of a table like this might be one used to verify that a user’s request came from the right place (e.g. when submitting a web form).

Initially, you might consider trying to verify the user’s request against user ip, server ip (in the case of a load balanced system), user agent, referring url, and so on, leading to table which may look something like this:

Field Name Field Type Comment
ip varchar(15) At most an IP might be 15 chars
server_ip varchar(15) At most an IP might be 15 chars
user_agent varchar(255) UA probably shouldn’t be more than 255, but if it is this could be a text field
url varchar(255) Very unlikely to be over 255 chars

Which at first glance seems reasonable.

However, what happens when this table grows or is submitted to high traffic loads leading to it growing and shrinking rapidly? The answer is it’s potentially quite inefficient, and will become slow.

So, how can we make it faster?

Well, first lets consider  how varchar works in MySQL.

The best analogy for php developers is probably that it stores information in a similar form to that which get when serialize an object. i.e. MySQL stores a header before the field telling MySQL how long the field is and then stores the data. This means it’s great for saving disk space in tables which have very different field lengths in every row, but for tables where the field length is approximately the same and consistently short, the overhead of using varchar is likely to be sub-optimal.

So, considering the fields used to store IP addresses, consider the data they will store. IP addresses will at least be 8 characters long and at their most 15, and quite likely to average out at being generally 12-15 characters long (not many people have a 1.1.1.1 ip address!). So, at the extremes this data doesn’t fluctuate much in size and as it’s biased towards the longer end of the scale char(15) might be a better option here. Unlike a varchar field char is a fixed length and as a result mysql can navigate around this data faster (since it doesn’t have to continually recalculate where it is/should be in the table due to variable length fields.

So our table becomes:

Field Name Field Type Comment
ip char(15) At most an IP will be 15 chars
server_ip char(15) At most an IP will be 15 chars
user_agent varchar(255) UA probabaly shouldn’t be more than 255, but if it is this could be a text field
url varchar(255) Very unlikely to be over 255 chars

Next, let’s consider the user_agent and url fields.

In these cases we want to verify that this information matches known information about the current request elsewhere in our app (in $_REQUEST for instance) but from a system functionality point of view we don’t actually need to know the full detail of the user agent or the url (though this may help us from a human point of view when debugging), as they’re just unique strings for us to compare against. As we don’t actually need to know the original information contained here, we can use a message digest algorithm such as md5 to create a unique representation of the user agent and url strings which, in most cases will be much shorter than the original string. Additionally as md5 always produces a 32 character string we can set these fields to be char(32) rather than varchar and therefore save both space and access time, making our table look like this:

Field Name Field Type Comment
ip char(15) At most an IP will be 15 chars
server_ip char(15) At most an IP will be 15 chars
user_agent char(32) md5 of the user agent string to be used as an identifier
url char(32) md5 of the url to be used as an identifier
by Tom
June 22, 2011