Sunday, May 11, 2008

PHP Tag Cloud

I was trying to develop a tag cloud for one of my sites, and had some difficulty finding the best way to calculate what the size of each tag should be. Since I couldn't find a good solution online, I wrote my own.

A little bit of Googling landed me at Prism-Perfect. Here we find a very basic PHP script that really doesn't cut the mustard. The problem is that she takes a strictly linear approach: the tag proportion (percentage) is used directly to determine the font size of each tag. This is a poor choice because tag frequency tends not to be evenly distributed. Instead, a few tags usually have a very high frequency of use, and then the frequency drops off rapidly. A tag cloud that uses tag frequency directly to determine font size will have a few tags that are very large, many tags that are very small, and almost nothing in between.
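To make the problem concrete, here is a minimal sketch of strictly linear scaling. It is not the Prism-Perfect script: the function name `linearSizes`, the sample counts, and the 10 to 30 size range are all made up for illustration.

```php
<?php
// Hypothetical tag counts with a long tail: one popular tag, several rare ones.
$counts = array('php' => 100, 'mysql' => 10, 'css' => 5, 'rss' => 1);

// Illustrative helper (not from any library): font size grows in direct
// proportion to the raw count, scaled between $minSize and $maxSize.
function linearSizes($counts, $minSize, $maxSize)
{
    $max = max($counts);
    $min = min($counts);
    $sizes = array();
    foreach ($counts as $tag => $count) {
        // Fraction of the way from the rarest tag to the most popular one.
        $fraction = ($max > $min) ? ($count - $min) / ($max - $min) : 0.5;
        $sizes[$tag] = round($minSize + $fraction * ($maxSize - $minSize), 1);
    }
    return $sizes;
}

$sizes = linearSizes($counts, 10, 30);
print_r($sizes);
```

With these counts the most popular tag gets the full 30, while every other tag lands within about two units of the minimum, which is exactly the "few huge tags, many tiny tags" effect described above.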

In the comments section of the Prism-Perfect post, I found a link to Echo Chamber, where they take a non-linear approach to calculating tag size. Unfortunately, their calculations are wrong, or at least they do not accomplish what we are trying to do, which is to distribute font sizes more evenly within the tag cloud. The equations in the post are flawed, and the corrected equations posted in the comments still do not achieve the linearized font distribution that we are looking for. (I was unable to register on Echo Chamber to comment on the post; the site appears to be dead.)

Finally, I came across an article on Dr Dobbs on constructing tag clouds. I do not know exactly what language their sample scripts are in (VB?), but they do exactly what we want. I have taken the script that linearizes the Pareto distribution and converted it to PHP.
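For reference, the scaling performed by the converted function further down can be written as a single formula. The symbols are shorthand introduced here, not notation from the Dr Dobbs article: w_i is the frequency of tag i, w_min and w_max are the smallest and largest frequencies, and s_min and s_max are the smallest and largest font sizes.

```latex
s_i = s_{\min} + (s_{\max} - s_{\min}) \cdot
      \frac{\ln w_i - \ln w_{\min}}{\ln w_{\max} - \ln w_{\min}}
```

As a quick check, with frequencies 1, 10 and 100 and a size range of 10 to 30, the middle tag gets 10 + 20 * (ln 10 / ln 100) = 20, exactly halfway. When every tag has the same frequency the denominator is zero, so the function assigns the midpoint (s_min + s_max) / 2 instead.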

Here is how it works:

  1. I pull the tags from a database and put them into an array. The array key is the tag itself, and the value is its frequency. I pull other info needed to construct a functioning cloud at the same time and save it into other arrays.


    $query = "SELECT news_categories.category, news_categories.idx, COUNT(news_tag.news_idx) AS quantity
              FROM news_tag
              JOIN news_categories ON news_tag.category_idx = news_categories.idx
              GROUP BY news_tag.category_idx
              ORDER BY quantity DESC $limit";
    $result = mysql_query($query) or trigger_error("MySQL error nr ".mysql_errno().": ".mysql_error());

    $tags = array();        // tag name => frequency (replaced by font size below)
    $tagcount = array();    // tag name => raw frequency
    $category_id = array(); // tag name => category id

    while (list($tagname, $tagid, $quantity) = mysql_fetch_row($result))
    {
        $tagname = trim(stripslashes($tagname));
        $tags[$tagname] = $quantity;
        $tagcount[$tagname] = $quantity;
        $category_id[$tagname] = $tagid;
    }
    $tags = FromParetoCurve($tags, $minSize, $maxSize);


  2. I pass the array containing the tag frequencies to the Dr Dobbs function to linearize it.

    function FromParetoCurve($weights, $minSize, $maxSize)
    {
        $logweights = array(); // log value of each count
        $output = array();     // linearized size for each tag

        // Convert each weight to its log value.
        foreach ($weights as $tagname => $w)
        {
            $logweights[$tagname] = log($w);
        }

        // Nothing to scale if there are no tags (max() fails on an empty array).
        if (count($logweights) == 0)
        {
            return $output;
        }

        // Largest and smallest log values.
        $max = max($logweights);
        $min = min($logweights);

        // Now calculate the slope of a straight line, from min to max.
        if ($max > $min)
        {
            $slope = ($maxSize - $minSize) / ($max - $min);
        }

        $middle = ($minSize + $maxSize) / 2;

        foreach ($logweights as $tagname => $lw)
        {
            if ($max <= $min)
            {
                // With max = min all tags have the same weight.
                $output[$tagname] = $middle;
            }
            else
            {
                // Calculate the distance from the minimum for this weight.
                $distance = $lw - $min;
                // Calculate the position on the slope for this distance.
                $result = $slope * $distance + $minSize;
                // If the tag turned out too small, set minSize.
                if ($result < $minSize) { $result = $minSize; }
                // If the tag turned out too big, set maxSize.
                if ($result > $maxSize) { $result = $maxSize; }
                $output[$tagname] = $result;
            }
        }
        return $output;
    }

  3. The returned array is then sorted and looped through to construct the tag cloud.
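The three steps can be tied together in one rough sketch that runs without a database. Several things here are assumptions rather than part of the original scripts: the sample counts and category ids are made up, `renderTagCloud` is a hypothetical helper, the `news.php?category=` URL pattern is invented, and alphabetical order is just one possible sort. `FromParetoCurve` is repeated in condensed form only so the snippet runs on its own.

```php
<?php
// Condensed copy of FromParetoCurve() from step 2, repeated so this sketch is self-contained.
function FromParetoCurve($weights, $minSize, $maxSize)
{
    if (count($weights) == 0) { return array(); }
    $logweights = array_map('log', $weights); // string keys are preserved
    $max = max($logweights);
    $min = min($logweights);
    $output = array();
    foreach ($logweights as $tagname => $lw) {
        if ($max <= $min) {
            // All tags have the same weight: use the middle size.
            $output[$tagname] = ($minSize + $maxSize) / 2;
        } else {
            // Position on the straight line from min to max, clamped to the size range.
            $size = $minSize + ($maxSize - $minSize) * ($lw - $min) / ($max - $min);
            $output[$tagname] = min($maxSize, max($minSize, $size));
        }
    }
    return $output;
}

// Hypothetical helper: turns the sized tags into a string of links.
function renderTagCloud($sizes, $tagcount, $category_id)
{
    ksort($sizes); // assumption: alphabetical order, so big and small tags are mixed
    $links = array();
    foreach ($sizes as $tagname => $size) {
        $links[] = sprintf(
            '<a href="news.php?category=%d" style="font-size: %dpx" title="%d posts">%s</a>',
            $category_id[$tagname],
            round($size),
            $tagcount[$tagname],
            htmlspecialchars($tagname)
        );
    }
    return implode(' ', $links);
}

// Made-up sample data standing in for the query results from step 1.
$tagcount = array('php' => 100, 'mysql' => 10, 'rss' => 1);
$category_id = array('php' => 3, 'mysql' => 7, 'rss' => 12);

$sizes = FromParetoCurve($tagcount, 10, 30); // roughly php => 30, mysql => 20, rss => 10
$html = renderTagCloud($sizes, $tagcount, $category_id);
echo $html, "\n";
```

Note that the tag with a count of 10 ends up halfway between the extremes, whereas proportional scaling would leave it almost indistinguishable from the rarest tag.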

You can download the complete scripts here. I hope this helps you in constructing your own tag cloud!

3 comments:

matt said...

Rhodopsin, thanks for posting this. Your FromParetoCurve() function upgraded my tag cloud from 12 lines of mousetype with 3 legible terms to something that actually resembles a tag cloud.

With less actual size range, the log approach gives a substantially more variable distribution of sizes, and is significantly more effective at communicating the relative sizes of item collections behind each tag. I knew with a glance at my original (linear) implementation that I'd need to use a log distribution, and you've saved me the pain of having to write my own. (Fortunately, I didn't write the linear one either.)

If you're interested in microphones, you'll see your own code in action. :)

Steven Trevino said...

Call it picky aesthetics (that's my job as a son), but I would like to suggest using [code] labels/CSS on your post. See SyntaxHighlighter.

Glad you finally got this working and decided to post your results!

rhodopsin said...

Thanks for the feedback, guys.

I tried SyntaxHighlighter, but ran into the line break issue (Blogger inserts line break tags into my posts). I think I would have to add paragraph tags to my old posts before I can use this solution.