THE BOTTLENECK

In most websites, the slowest part is the part which interacts with a database. Another common bottleneck is any module which depends on data from external sources, such as an RSS feed, an XML feed or data crawled from another website. In this day and age, speed is essential to a good end-user experience. People are not likely to use a website if it is consistently slow.

Look at Google. One of the main reasons why it became the number 1 search engine in the world, way back in the late 90s, was not that it had more sites indexed than the other search engines existing at that time (AltaVista, Lycos, Mamma, Dogpile, Yahoo etc.) but that it was the fastest search engine. If I search for something in search engine A and in search engine B and both return more or less the same results, then I will go for the one which brings me the results faster.

The same applies to your website. Given the huge technical improvements in hardware and software, it seems sacrilege if your website takes ages to show content.

Even if you know some parts of your code are time-consuming, it's unlikely that you can remove that code, because it's most likely a core part which is required.

What are the options available to you when faced with a module which is a bottleneck?

Refactor the code to make it more efficient
Completely rewrite the application logic to remove the bottleneck
Use caching to improve the existing code

Caching is the simplest and fastest way to improve code performance without altering the basic logic. But there is a caveat here.

Caching only applies if the bottleneck has to do with data retrieval. Caching is irrelevant if the bottleneck is just pure code.

CACHING IS NOT A NEW CONCEPT

The concept of caching has been around since the early days of computer science, so it's not something new. CPUs have implemented caching at the hardware level for decades to make their execution faster. What does caching do and mean anyway?

If something is to be retrieved repeatedly and the retrieved data requires the same processing logic, then process it only the first time and store the results. The next time, get the results directly from the store instead of repeating the same processing again.
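The idea can be sketched in a few lines of PHP. This is only an illustration under assumptions: the remember() helper and the $store array are hypothetical names, and the store lives in memory for a single request rather than on disk.

```php
<?php
// In-memory store for results that were already computed in this request.
$store = [];

// Hypothetical helper: returns the stored result for $key if there is one,
// otherwise runs $producer once, stores its result and returns it.
function remember(array &$store, string $key, callable $producer)
{
    if (!array_key_exists($key, $store)) {
        $store[$key] = $producer();   // the expensive processing happens only here
    }
    return $store[$key];
}

$runs = 0;
$producer = function () use (&$runs) {
    $runs++;                          // count how often the expensive work runs
    return ['country A', 'country B'];
};

$first  = remember($store, 'countries', $producer);
$second = remember($store, 'countries', $producer);
// $runs is 1 here: the second call was served from the store.
```

The rest of this article applies the same pattern, with files on disk as the store and an expiry time on each entry.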

Example cases:

Retrieving a list of countries from a database. The list of countries is not likely to change for months, so it makes sense to cache it.

Showing matching jobs for a skillset on a job site. If you add jobs only once a day, then it makes sense to cache job searches for at least a few hours.

Showing RSS feeds from an external website. If the external website updates its feed only once every 5 hours, then it's efficient to cache the fetched RSS data for at least a couple of hours.

The most crucial aspect of efficient caching is to know how long to cache the data. Without that knowledge it's best not to do any caching.

Understanding caching is quite similar to database query optimisation. Identify the most frequently requested data and then optimise it – in this case cache it.

LET'S GET STARTED

What I am going to do is show you a real-world example of caching in operation. This does not mean that the caching logic used here is the best or perfect for all situations. The objective is to show you the process of implementing your own caching logic.

THE APPLICATION

Five years ago we built a crawling module as part of a website which was used to search and browse movies and TV shows from various movie and torrent sites. There is no database involved here. We will use the server filesystem to do the caching.

The website takes in a search phrase from a form and executes the search across several external websites, gets the results, parses them and presents them in a single aggregated search result listing.

For example, I type in “Spiderman” and it will fetch matching results from imdb.com, isohunt.com and megaupload.com (there are more sites, but they are not required for this example).

So after the search is executed, we end up with three sets of data (in array format), one from each of the above sites. These arrays are then parsed and presented on the front end to the user.

We know that these sites will return the same results for at least 30 minutes for the same search term. So what we do is create a cache for each of the three result sets. Each cache is valid for a period of 30 minutes. During those 30 minutes, searches for “Spiderman” are served from the cache, and after that the data is fetched from the actual sites again.

In this case, when data is retrieved from the sites, it takes about 7 seconds to display it. When it's fetched from the cache it takes less than 1 second. This means that for the next 30 minutes, anyone searching “Spiderman” will get near-instant results. After that, the first person to search “Spiderman” again will have to wait for 7 seconds, and then for the next 30 minutes anyone searching “Spiderman” will get it in less than 1 second.

THE CACHING CLASS

Since we are using the filesystem to implement our caching, we need to implement the following things:

Function to write data to a cache
Function to retrieve data from a cache
Each cache is written as a file in a designated location on the server
Each cache must have a uniquely identifiable filename
Each cache must have a predefined Time-To-Live (TTL) value in terms of seconds
There should be a way of clearing the entire cache if needed

Each of these points is explained in detail below:

Cache Location – This is a folder on the server with write permissions. Let's call it ‘/cache’.

Cache Extension – Not very important, but it serves to give a custom extension to all cache files created. We will use the extension .temp.

Write function – The function takes in two arguments. The first is a unique string which identifies the cache, and this string will become the filename. The string has two components: the identifier and the TTL value in seconds. So for our ‘spiderman’ search results fetched from imdb.com, which we want to cache for 30 minutes, the filename would be “spiderman_imdb_1800.temp”. Remember to urlencode the filename string first so that spaces, special characters etc. which can't be part of a filename are encoded. We use the underscore as a delimiter to later retrieve the identifier and the TTL from the filename. You can use any delimiter as long as it's not part of the search identifier.

The second argument is the data which will be written to the cache. The format of the data can be anything (string, array, object) as long as the data is serializable. You can't write PHP variables or objects to disk if they are not serializable.

Read function – This reads the data from a cache file and returns it to the application. It takes in a single argument, which is the search identifier string. In this case it will be “spiderman_imdb_1800”.
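As a rough sketch of the naming scheme, the identifier can be built and taken apart like this. The helper names buildCacheId() and parseTtl() are hypothetical, and lowercasing the search term and swapping any underscore in it for a hyphen are assumptions made here to keep the delimiter unambiguous.

```php
<?php
// Hypothetical helper: builds an identifier such as "spiderman_imdb_1800".
function buildCacheId(string $term, string $source, int $ttlSeconds): string
{
    // urlencode() leaves "_" untouched, so an underscore inside the search
    // term would be mistaken for the delimiter; replace it up front.
    $term = str_replace('_', '-', strtolower($term));
    return $term . '_' . $source . '_' . $ttlSeconds;
}

// Hypothetical helper: reads the TTL back out of an identifier.
function parseTtl(string $id): int
{
    $parts = explode('_', $id);
    return (int) end($parts);   // the TTL is always the last component
}

$id   = buildCacheId('Iron Man 2', 'imdb', 1800);
// urlencode() turns the spaces into "+" so the id is safe as a filename.
$file = '/cache/' . urlencode($id) . '.temp';
$ttl  = parseTtl($id);
```

Here $file ends up as /cache/iron+man+2_imdb_1800.temp and $ttl as 1800.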

The read does the following things:

Find out if a cache file exists with this identifier. If not, it simply returns nothing.
If the cache file exists, parse the TTL from the filename, compare the current system time with the time the file was last written and see if the TTL has expired.
If the TTL is still valid, it will read the contents from the file, unserialize them and return the data to the application.
If the TTL has expired, it will delete the cache file and return nothing. This is the signal for the calling application that the cache has expired and that it should retrieve the data from the actual website.
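The steps above can be sketched as a standalone function. This is not the class method shown later, just a minimal illustration: readIfFresh() is a hypothetical name, it takes the TTL as a parameter instead of parsing it from the filename, and it returns null on a miss.

```php
<?php
// Hypothetical standalone version of the read steps; returns null on a miss.
function readIfFresh(string $file, int $ttlSeconds)
{
    if (!file_exists($file)) {
        return null;                          // no cache file for this identifier
    }
    clearstatcache(true, $file);              // make sure filemtime() is not stale
    $age = time() - filemtime($file);         // seconds since the file was written
    if ($age > $ttlSeconds) {
        unlink($file);                        // expired: remove it and report a miss
        return null;
    }
    return unserialize(file_get_contents($file));
}

// Usage with a throwaway file in the system temp folder.
$file = sys_get_temp_dir() . '/demo_cache_' . uniqid() . '.temp';
file_put_contents($file, serialize(['a' => 1]));

$fresh = readIfFresh($file, 1800);            // written just now, so still valid

touch($file, time() - 3600);                  // pretend it was written an hour ago
$stale = readIfFresh($file, 1800);            // older than 30 minutes: a miss
```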

Clearing the Cache – This function is left unimplemented in the class because it depends on the application logic. But a very basic implementation would be to simply delete all the cache files in the cache folder.
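A very basic version could look like the sketch below. clearCacheDir() is a hypothetical helper, not part of the class, and it assumes the folder path ends with a slash and that only files with the cache extension should be removed.

```php
<?php
// Hypothetical helper: deletes every file with the cache extension in $dir
// and returns the number of files removed. $dir must end with "/".
function clearCacheDir(string $dir, string $extn = 'temp'): int
{
    $removed = 0;
    $files = glob($dir . '*.' . $extn) ?: [];   // glob() returns false on error
    foreach ($files as $file) {
        if (is_file($file) && unlink($file)) {
            $removed++;
        }
    }
    return $removed;
}

// Usage with a throwaway folder in the system temp folder.
$dir = sys_get_temp_dir() . '/cache_demo_' . uniqid() . '/';
mkdir($dir);
file_put_contents($dir . 'spiderman_imdb_1800.temp', 'x');
file_put_contents($dir . 'notes.txt', 'not a cache file');

$removed = clearCacheDir($dir);   // only the .temp file is deleted
```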

THE CACHING CLASS AND HOW TO USE IT

The code for the caching class is given below. It's a small class with fewer than 100 lines of code.

<?php

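class clsCache {
	public $m_path;   // cache folder, with a trailing slash
	public $m_extn;   // extension given to every cache file
	
	/***
	 * Constructor method
	 * (named __construct because PHP 8 no longer treats a method
	 *  named after the class as the constructor)
	 * Parameters: path
				   extn (optional)
	 * Returns   : None
	 ****/
	function __construct($path, $extn = NULL) {
		$this->m_path = $path;
		if($extn == NULL)
			$this->m_extn = 'temp';
		else
			$this->m_extn = $extn;

	}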
	

	/***
	 * put method to write into cache
	 * Parameters: id->string containing the search identifier and the time to live
				   data->any serializable data, e.g. the result array from imdb
	 * Returns   : None
	 ****/
	function put($id,$data)
	{
		$id = urlencode($id);
		$file = $this->m_path . $id . "." . $this->m_extn;
		// write to folder; LOCK_EX stops two requests writing the same file at once
		file_put_contents($file, serialize($data), LOCK_EX);
	}

	/***
	 * get method to get contents from cache folder
	 * Parameters: id
	 * Returns   : cached data if present and not expired, else an empty string
	 ****/
	function get($id)
	{
		// put() urlencodes the id before writing, so the same must happen here
		$fileName = $this->m_path.urlencode($id).'.'.$this->m_extn;
		if (!file_exists($fileName))
			return "";

		// filemtime() is the last modification time, i.e. when put() wrote the file.
		// Both values are Unix timestamps, so no timezone setup is needed.
		$file_write_time = filemtime($fileName);
		$current_time = time();
		$difference = $current_time - $file_write_time;

		// the expiry interval is the last underscore-separated part of the id
		$parts = explode("_", $id);
		$expiry_interval = (int) end($parts);

		if($expiry_interval >= $difference)
		{
			$data = file_get_contents($fileName);
			if($data != "")
				return unserialize($data);
			else
				return "";
		}

		// expired: delete the file and report a miss even if the delete fails
		unlink($fileName);
		return "";
	}


	/***
	 * clear the cache folder
	 * Parameters: none
	 * Returns   : None
	 ****/
	function clear()
	{
		// intentionally left empty: how the cache is cleared depends on the application
	}
}
?>

Here is an example of the code which uses the class:

include_once("classes/clsCache.php");

$search_string = "Spiderman";

// two parameters: path, extn. Note that path expects a trailing slash.
$dataCache = new clsCache("/var/websites/mysite/cache/","temp");

$data = $dataCache->get($search_string.'_1800');
if($data != "")
{
	// whatever you want to do with the retrieved cached data
}
else
{
	// cache is not there or expired so fetch data from actual source and put it in new cache
	$searchData = fetchSearchResults(); // dummy function
	$dataCache->put($search_string.'_1800',$searchData);
}

One important thing to note here is that the identifier string passed to put() must carry the same TTL value as the one passed to get(). Otherwise get() will look for a different filename, will not find the cached file, and a new cache file will be created each time. So if you are using different TTL values for different cache files, make sure the TTL value is the same when writing and retrieving the same cache data.
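One way to rule out such a mismatch is to build the identifier in a single place. The sketch below is only a suggestion: cachedFetch() is a hypothetical wrapper, and the anonymous class is an in-memory stand-in with the same get()/put() behaviour, used here only so the example runs on its own.

```php
<?php
// Hypothetical wrapper: builds the identifier once so that get() and put()
// can never disagree about the TTL. $cache is any object with the
// get($id) / put($id, $data) methods described in this article.
function cachedFetch($cache, string $key, int $ttlSeconds, callable $fetch)
{
    $id = $key . '_' . $ttlSeconds;
    $data = $cache->get($id);
    if ($data != "") {
        return $data;             // cache hit
    }
    $data = $fetch();             // missing or expired: go to the real source
    $cache->put($id, $data);
    return $data;
}

// In-memory stand-in, only here to make the sketch runnable on its own.
$cache = new class {
    private $items = [];
    public function get($id) { return $this->items[$id] ?? ""; }
    public function put($id, $data) { $this->items[$id] = $data; }
};

$fetches = 0;
$fetch = function () use (&$fetches) {
    $fetches++;                   // count trips to the "real" source
    return ['result 1', 'result 2'];
};

$first  = cachedFetch($cache, 'Spiderman', 1800, $fetch);
$second = cachedFetch($cache, 'Spiderman', 1800, $fetch);
// $fetches is 1 here: the second call found the entry written by the first.
```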

As you can see, the caching logic is very small and simple. But its impact is huge. It can dramatically speed up your website without having to alter the core logic. What you have above is a simple implementation of caching. You can build on this to make it as complex as you want. One thing not covered here is when to clear out your cache. The caching class does not have any logic to clear out expired cache files which are never requested again. You can easily add a method to iterate through all the cache files and delete them if their TTL has expired. If you don't clear the cache periodically, you may end up with hundreds of cache files which are of no use.

Always remember, caching is like optimisation. It's best done when the application is completed and running. Premature optimisation and premature caching are evils to be avoided.
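Such a clean-up method might look roughly like this. purgeExpired() is a hypothetical helper; it assumes the filename convention used in this article, where the TTL is the last underscore-separated part of the filename, and a folder path that ends with a slash.

```php
<?php
// Hypothetical helper: removes cache files whose TTL has elapsed since they
// were last written, and returns the number of files removed.
function purgeExpired(string $dir, string $extn = 'temp'): int
{
    clearstatcache();                          // avoid stale modification times
    $removed = 0;
    $now = time();
    foreach (glob($dir . '*.' . $extn) ?: [] as $file) {
        $id = basename($file, '.' . $extn);    // e.g. "spiderman_imdb_1800"
        $parts = explode('_', $id);
        $ttl = (int) end($parts);              // TTL is the last component
        if ($now - filemtime($file) > $ttl && unlink($file)) {
            $removed++;
        }
    }
    return $removed;
}

// Usage with a throwaway folder in the system temp folder.
$dir = sys_get_temp_dir() . '/purge_demo_' . uniqid() . '/';
mkdir($dir);
file_put_contents($dir . 'fresh_imdb_1800.temp', 'x');
file_put_contents($dir . 'old_imdb_1800.temp', 'x');
touch($dir . 'old_imdb_1800.temp', time() - 3600);   // written an hour ago

$removed = purgeExpired($dir);   // only the hour-old file is past its TTL
```

A function like this could be run from a scheduled job so that the clean-up does not slow down normal page requests.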

Truelogic Blog

Notes from the world of software development, technology and strategy

A simple caching class to turbo-charge your php website.
