Code Snippet 3 - Create post slugs

If you’re used to WordPress, you have probably noticed that blogs rarely use the default permalink structure (like http://site.com/?p=43, where 43 is the post ID stored in the database). Instead, almost all blog owners use the built-in options form to set the permalinks to something like http://site.com/a-great-post and leave the rest to Apache’s mod_rewrite. In this case, a-great-post is called a post slug, or slug for short. According to the WordPress Codex:

A slug is a few words that describe a post or a page. Slugs are usually a URL friendly version of the post title (which has been automatically generated by WordPress), but a slug can be anything you like. Slugs are meant to be used with permalinks as they help describe what the content at the URL is.
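To make the relationship between a permalink and its slug concrete, here is a minimal sketch that pulls the slug out of a URL. The helper name extract_slug_from_url() is hypothetical and not part of WordPress, and the sketch assumes the slug is simply the last segment of the URL path:

```php
<?php
// Hypothetical helper (not a WordPress function): return the last path
// segment of a URL, which is the slug under the assumption stated above.
function extract_slug_from_url($url)
{
    // parse_url() with PHP_URL_PATH returns only the path, e.g. "/a-great-post"
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }
    // drop leading and trailing slashes
    $path = trim($path, '/');
    if ($path === '') {
        return '';
    }
    // keep only the last segment of the path
    $segments = explode('/', $path);
    return end($segments);
}

echo extract_slug_from_url('http://site.com/a-great-post'), "\n"; // a-great-post
echo extract_slug_from_url('http://site.com/?p=43'), "\n";        // (empty line)
```

The default permalink carries no slug at all, so the second call returns an empty string.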

In case you are wondering, slugs play an important part in SEO. Search engines like Google analyze a URL, and if it is relevant to the page’s content, the page’s ranking may improve. The same goes for us humans: ?p=43 doesn’t tell us anything, but how-to-create-post-slugs surely does.

So how is a slug generated? Most of the time, the post/page title is involved.

  1. First, notice that all slugs are in lowercase.
  2. Second, all non-alphanumeric characters - those ugly &?#$()… - are removed (spaces and dashes are kept for now).
  3. Third, spaces are replaced by dashes.

Let’s codify them into PHP:

function create_slug($post_title)
{
    // 1. convert into lower case
    $post_title = strtolower($post_title);
    // 2. only accepts alphanumerical characters (a-z, 0-9), spaces, and dashes
    // to do that, we use some RegEx magic
    $post_title = preg_replace('/[^a-z0-9 -]/', '', $post_title); 
    // 3. replace the spaces with dashes
    $post_title = str_replace(' ', '-', $post_title);
    return $post_title;
}

Now test the function - we’ll be using some real world examples from Digg:

echo create_slug("High-Speed 'Other' Internet Goes Global ") . "\n";
echo create_slug('Proposed Anti-Piracy Legislation is Flawed, ISP   Says') . "\n";
echo create_slug('Cheetah, Gecko and Spiders Inspire Robotic Designs (PICS)') . "\n";
echo create_slug('In-App Sales & iTablet: The Killer Combo to Save Publishing?') . "\n";

The above code produces:

<code>high-speed-other-internet-goes-global-
proposed-anti-piracy-legislation-is-flawed-isp---says
cheetah-gecko-and-spiders-inspire-robotic-designs-pics
in-app-sales--itablet-the-killer-combo-to-save-publishing</code>

Not bad, huh? There are some problems, however. First, if the title has consecutive spaces, the slug will contain consecutive dashes, and a trailing space leaves a stray dash at the end, neither of which is quite right. Second, and much more important, we didn’t take into account a concept called stop words. Long story short, stop words are words that don’t carry important information and are often filtered out of search queries by search engines. Lists of English stop words are readily available online.
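The first problem can also be tackled on its own, independently of stop words. A minimal sketch, assuming a hypothetical helper named collapse_dashes() that post-processes a slug produced by create_slug():

```php
<?php
// Hypothetical helper: tidy up the dashes in an already generated slug.
function collapse_dashes($slug)
{
    // squeeze any run of two or more dashes into a single dash
    $slug = preg_replace('/-{2,}/', '-', $slug);
    // strip dashes left over at either end
    return trim($slug, '-');
}

echo collapse_dashes('proposed-anti-piracy-legislation-is-flawed-isp---says'), "\n";
// proposed-anti-piracy-legislation-is-flawed-isp-says
echo collapse_dashes('high-speed-other-internet-goes-global-'), "\n";
// high-speed-other-internet-goes-global
```

The improved function in this article takes a different route and treats the empty string as a stop word, which has the same effect on the examples shown.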

With this information in hand, we can improve our code a bit:

function create_slug($post_title)
{
    // 1. convert into lower case
    $post_title = strtolower($post_title);
    // 2. only accepts alphanumerical characters (a-z, 0-9), spaces, and dashes
    // to do that, we use some RegEx magic
    $post_title = preg_replace('/[^a-z0-9 -]/', '', $post_title); 
    // 3. replace the spaces with dashes
    $post_title = str_replace(' ', '-', $post_title);
    // 4. deal with stop words. '' (the empty string) is in the stop words array too,
    // so the empty segments left by consecutive or trailing dashes are dropped as well.
    $stop_words = array('', 'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at',
        'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry',
        'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except',
        'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go',
        'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred',
        'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd',
        'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself',
        'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere',
        'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own',
        'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty',
        'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system',
        'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two',
        'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would',
        'yet', 'you', 'your', 'yours', 'yourself', 'yourselves');
    $slug = array();
    // explode() the post title into single words
    $segments = explode('-', $post_title);
    foreach ($segments as $segment)
    {
        // if the segment is not a stop word, add it to the $slug array
        if (!in_array($segment, $stop_words))
        {
            $slug[] = $segment;
        }
    }
    // now convert the $slug array into a string with dashes being the connector
    $slug = implode('-', $slug);
    return $slug;
}

Running the previous examples again, we get:

<code>high-speed-internet-goes-global
proposed-anti-piracy-legislation-flawed-isp-says
cheetah-gecko-spiders-inspire-robotic-designs-pics
app-sales-itablet-killer-combo-save-publishing</code>

That’s better, and much more usable this time, isn’t it? The next step is the database part: create a `slug` field with a unique key, and start querying on it instead of the ID. I’ll leave that one to you!
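For readers who want a starting point, here is a rough sketch of that database part. Everything in it is an assumption made for illustration: an in-memory SQLite database accessed through PDO stands in for the real blog database, and the `posts` table layout and the find_post_by_slug() helper are hypothetical names.

```php
<?php
// Assumption: in-memory SQLite via PDO stands in for the real blog database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical `posts` table; UNIQUE on slug allows only one post per slug.
$db->exec('CREATE TABLE posts (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    slug  TEXT NOT NULL UNIQUE
)');

// Store a post together with its slug (the value create_slug() produced above).
$insert = $db->prepare('INSERT INTO posts (title, slug) VALUES (:title, :slug)');
$insert->execute(array(
    ':title' => 'Proposed Anti-Piracy Legislation is Flawed, ISP   Says',
    ':slug'  => 'proposed-anti-piracy-legislation-flawed-isp-says',
));

// Hypothetical helper: look a post up by its slug instead of its ID.
// Returns the row as an associative array, or null when no post matches.
function find_post_by_slug(PDO $db, $slug)
{
    $select = $db->prepare('SELECT id, title, slug FROM posts WHERE slug = :slug');
    $select->execute(array(':slug' => $slug));
    $post = $select->fetch(PDO::FETCH_ASSOC);
    return $post === false ? null : $post;
}

$post = find_post_by_slug($db, 'proposed-anti-piracy-legislation-flawed-isp-says');
echo $post['title'], "\n";
```

With the error mode set to exceptions, inserting a second post with the same slug raises a PDOException because of the UNIQUE constraint, so real code would need to catch that and pick a different slug.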
