Article URLs week: Principles

JULY 27, 2003, 10:27 pm

Too many news sites still post articles at ugly URLs like

http://www.al.com/news/birminghamnews/index.ssf?/xml/story.ssf/html_standard.xsl?/base/news/105929756463150.xml

rather than simple, pretty ones like

http://www.nytimes.com/2003/07/27/business/27MCI.html

Each day this week, I will evaluate typical article URLs at a bunch of news sites, concluding Saturday with recommendations for what the “ideal” format for article URLs might be. To repeat classic recommendations still not always followed, here are some principles of good URL design:

URLs should be human-readable. From the nytimes.com URL above, I can correctly guess that it goes to a business story about MCI published on July 27. But with the al.com URL, all I can guess is that it was published in the news section. While the page is still loading, an informative URL gives your readers important confirmation that they’re getting what they want.
URLs should be short. Short URLs encourage e-mailing and linking, which bring in traffic. Long URLs often get split onto multiple lines in e-mails, generating frustrating 404 errors when they’re clicked. If your site sends out e-mail newsletters, you may have problems if your own site has too-long URLs.
Corollary to the first two principles: URLs should not contain useless parts. Why does all the index.ssf garbage need to be in the al.com URL? In addition to confusing people, the content management system mumbo-jumbo just wastes bandwidth on the site’s home page, which has to link to all the articles like that.
URLs should be hierarchical. As you read from left to right, they should move from general to specific. This lets them be “hackable” so users can move to a more general level of the hierarchy by chopping off the end of the URL. The nytimes.com URL is not hackable, but it does present the date in the correct hierarchical way. Other date formats some sites use, like the American month-day-year, are not hierarchical.
URLs should be permanently unique. An article tomorrow or next week should not have the same address as an article today. What if someone clicked a link in one of your own e-mails a few days late and got a completely different article than he or she was expecting? Even if your site removes articles after a period of free availability, never reuse those URLs.
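
To make the hierarchical and "hackable" principles concrete, here is a minimal Python sketch. It is only one possible layout, not the format this series ends up recommending. The function names `article_url` and `parent_urls`, the host www.example.com, and the slug `mci` are all hypothetical.

```python
from datetime import date


def article_url(base, section, published, slug):
    """Build a general-to-specific URL: section, then year/month/day, then slug.

    The layout is an assumption for illustration, not a recommended standard.
    """
    return "{}/{}/{:04d}/{:02d}/{:02d}/{}.html".format(
        base.rstrip("/"), section,
        published.year, published.month, published.day, slug,
    )


def parent_urls(url, base):
    """List the URLs a reader reaches by chopping segments off the end.

    Assumes `url` starts with `base`.
    """
    base = base.rstrip("/")
    segments = url[len(base):].strip("/").split("/")[:-1]  # drop the article itself
    parents = []
    while segments:
        parents.append(base + "/" + "/".join(segments) + "/")
        segments.pop()  # move one level up the hierarchy
    parents.append(base + "/")  # the site's home page
    return parents


url = article_url("http://www.example.com", "business", date(2003, 7, 27), "mci")
for parent in parent_urls(url, "http://www.example.com"):
    print(parent)
```

For a URL to be truly hackable, the site would also have to serve something useful at every one of those parent addresses, such as a listing of that day's or that section's articles.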

Tomorrow I’ll begin judging many more news sites’ article URLs on these principles.

Comment by Adrian, posted July 28, 2003, 12:27 am

Sweet! I look forward to this series.

Comment by Richard Tallent, posted August 2, 2003, 2:58 pm

Good stuff... my only criticisms:
1. Stories often fall into multiple categories.
2. Making reporters, etc. come up with a filename is silly.

I'd like to see a CMS that would work with URLs like this:

http://www.acme.com/article?news,business,MCI,2003-08-02,fraud

IOW, all it has is some combination of keywords that creates a "Google-I'm-feeling-lucky" sort of fingerprint. Thus, the publisher can reuse the existing "keywords" field, and if the keyword fingerprint ever becomes non-unique, the CMS can just spit out a list of all matching articles. This adds a step for the user, but they can also see other articles they might be interested in.
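
A rough Python sketch of how such a keyword lookup might behave, assuming the CMS keeps a set of keywords per article. The function names `find_articles` and `resolve` and the article IDs `a1` and `a2` are made up for illustration. No real CMS is being described.

```python
def find_articles(query, articles):
    """Return IDs of articles whose keywords include every keyword in the query.

    `query` is the comma-separated part after the "?" in the URL.
    `articles` maps a (hypothetical) article ID to its list of keywords.
    Keyword order in the query does not matter.
    """
    wanted = {k.strip().lower() for k in query.split(",") if k.strip()}
    return [
        article_id
        for article_id, keywords in articles.items()
        if wanted <= {k.lower() for k in keywords}
    ]


def resolve(query, articles):
    """Serve the article if the fingerprint is unique, otherwise a list."""
    matches = find_articles(query, articles)
    if not matches:
        return ("not_found", None)
    if len(matches) == 1:
        return ("article", matches[0])
    return ("list", matches)


articles = {
    "a1": ["news", "business", "MCI", "2003-08-02", "fraud"],
    "a2": ["news", "business", "MCI", "2003-07-27"],
}
print(resolve("news,business,MCI,2003-08-02,fraud", articles))
print(resolve("MCI,business", articles))
```

Note that under this scheme a URL that is unique today can start returning a list once a later article shares the same keywords.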

Comment by Dominic Mitchell, posted August 2, 2003, 5:33 pm

Heh, if more URLs were short, we wouldn't need those damnable shortening "services" that everybody is so keen on... The ones where you have no idea where you're going when you click on them (shades of slashdot trolls), and that might not be around next week/month/year.

-Dom

Comment by Nathan Ashby-Kuhlman, posted August 2, 2003, 7:50 pm

Richard, using a combination of keywords to identify an article is an interesting idea. I’ve been arguing that news sites need to fit the section, date and article name/number into a hierarchical structure, but if your keyword system let the keywords come in any order, that would add a lot of flexibility for people guessing at URLs.

I’m not sure exactly what you mean when you say that coming up with a filename is silly. I’ve never run across a newspaper that does not use such filenames (slugs) to produce its print publication, so all I’m saying is I’d rather URLs use those existing descriptive names than invent their own meaningless sequential ID numbers. In your sample URL someone would have to come up with those keywords. Is the only difference in what we’re talking about that your URLs would allow more than one?

Comment by Nancy McGough, posted August 3, 2003, 3:00 am

Another thing that I like to do with my URLs is to have them point to the directory, like this:

http://www.ii.com/internet/messaging/imap/isps/

and then let the HTTP server determine the file name, which at the moment is index.html but in the future might be index.shtml or something that hasn't yet been invented.

Another thing that's useful about using this type of hierarchical naming, where the hierarchy names are meaningful, is that Google seems to use the words in the path as part of its indexing algorithm. Of course, this may change.

Comment by Már Örlygsson, posted August 5, 2003, 9:26 pm

"URLs should be hierarchical", sure, but you should add, "but the hierarchy should *not* necessarily reflect the site's navigation tree".

Site structures (i.e. sitemaps) change all the time, so it's usually best to avoid having the URLs reflect some navigational structure that will most likely be revamped 6 months from now.

Comment by roman orszanski, posted November 20, 2003, 8:06 pm

You might want to add another principle: dates should be optional.
While you might want to view an article as it was published on a given date, you might equally want to view the latest version of an article.
Thus while the first draft of an article might be
http://fred.org/theory/urls/2002/nov/13/strange.htm,
the updated article would be
http://fred.org/theory/urls/2003/feb/3/strange.htm,
but the permanent link would be to
http://fred.org/theory/urls/strange.htm, which always shows
the current version of an article (or the version with all comments to date, etc).
If your blog automatically used the last update date, the earlier URL
may well fail — ideally it should point to the same article in its current state (possibly with a "diffs" link to take you back to the original).

It really depends upon whether you view the two drafts as separate articles, or separate views in the evolution of a single article.

In either case, the hierarchy minus the date should form
the permalink. The article itself should link to the "other" versions.
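
One way to picture this is the Python sketch below. It assumes each article's versions are stored by date under the undated permalink path. The function name `resolve_version`, the storage layout, and the three-letter month names are assumptions for illustration only.

```python
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def resolve_version(path, versions):
    """Map a dated or undated path to (permalink, version date, text).

    A dated path such as theory/urls/2002/nov/13/strange.htm returns that
    day's version. An undated path returns the most recent version.
    `versions` maps each permalink to a dict of {date: text}.
    """
    parts = path.strip("/").split("/")
    name = parts[-1]
    dated = (
        len(parts) >= 4
        and parts[-4].isdigit()
        and parts[-3] in MONTHS
        and parts[-2].isdigit()
    )
    if dated:
        wanted = date(int(parts[-4]), MONTHS.index(parts[-3]) + 1, int(parts[-2]))
        permalink = "/".join(parts[:-4] + [name])  # hierarchy minus the date
    else:
        wanted = None
        permalink = "/".join(parts)
    history = versions[permalink]
    if wanted is None:
        wanted = max(history)  # latest version
    return permalink, wanted, history[wanted]


versions = {
    "theory/urls/strange.htm": {
        date(2002, 11, 13): "first draft",
        date(2003, 2, 3): "updated article",
    }
}
print(resolve_version("/theory/urls/2002/nov/13/strange.htm", versions))
print(resolve_version("/theory/urls/strange.htm", versions))
```

Because every dated URL maps back to the same permalink, each version can link to the others, and an old dated URL keeps working after the article is updated.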

Comment by nyob, posted January 9, 2004, 2:06 am

whatever nathan. it's real easy to tell everyone what they "should" do and not tell them "how" to do it. you must want to be a consultant or something.

Comment by Nathan Ashby-Kuhlman, posted January 9, 2004, 2:35 am

Nyob, since the techniques for implementing cleaner URLs are well-documented elsewhere, I did not want to rehash them here. Instead, I wanted to discuss the goals of using those techniques.

Comment by Christoph, posted February 12, 2004, 5:31 pm

Yeah, check out weblog.cemper.com for really great URL codings.
