Friday, August 17, 2012

Why relative URLs should be forbidden for web developers

Twitter is the website / web service most of us love and sometimes hate, a service that has become an integral part of most of our online identities. It’s one of the services we expect to be there when we Google ourselves or other people. So when you Google yourself and, instead of twitter.com, you see a weird result, you think “huh”. Then, if you’re like me, you try to figure out what caused it, and once you’ve figured it out, you think “d0h!”. You’d think the people at Twitter would know better than to use relative URLs or, even worse, a HOST header to determine the domain, which leads to this result when you search for my name:
[Screenshot: yoast - twitter]
Relative URLs stink. They really do. All sorts of SEO problems on the web are caused by the use of relative URLs in links, canonicals and more. We find issues with them in our website reviews on a regular basis, but as you can see bigger sites like Twitter also have massive issues because of them. I’ll try to explain why you shouldn’t use them and what you could do instead, as it might be simple things like this that hold you back from performing well with your website.


What are relative URLs?


Relative URLs are all URLs that do not contain a fully qualified domain name and path, but instead just the path or a portion of the path. So when your website is example.com, you could be linking to your contact page from your homepage like this:
<a href="contact.html">Contact</a>

And back to your homepage from your contact page like this:
<a href="/">Home</a>

The / refers to the directory / on the domain. So even when you’re three levels deep in a directory structure, linking to / would link to the frontpage. Lastly, when you’re on the corporate page of your about section, for instance example.com/about/corporate.html, you could link to your contact page like this:
<a href="../contact.html">Contact</a>

All the resulting URLs are calculated by your browser based on the base URL. By default, this is the current URL that’s in your location bar, but using the base element, you could set it to something else, like this:
<base href="http://www.example.com/subdirectory/">

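Doing this would make the first link above, the link to contact.html, resolve to http://www.example.com/subdirectory/contact.html. The second link, the link to /, would still resolve to http://www.example.com/, because a path starting with a slash is always resolved from the root of the host.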
This was all fine when HTML was invented and websites consisted of real static HTML pages in directory structures. Now though, most of the web is built with content management systems, changing URLs is easier and some URLs might behave differently than what you’d expect. Because of that, relative URLs can cause a few different types of issues, all of which can be pretty detrimental for your SEO and your server performance.
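The resolution rules described above can be checked with a few lines of Python. This is only a sketch: it uses the standard library’s urljoin, which follows the same general resolution rules a browser does, and the resolve helper is a name made up for this example.

```python
from urllib.parse import urljoin


def resolve(base_url, href):
    """Resolve a (possibly relative) href against a base URL."""
    return urljoin(base_url, href)


# The page the example links appear on.
page = "http://example.com/about/corporate.html"

print(resolve(page, "contact.html"))     # http://example.com/about/contact.html
print(resolve(page, "/"))                # http://example.com/
print(resolve(page, "../contact.html"))  # http://example.com/contact.html

# The same links with the base element from the example in effect.
base = "http://www.example.com/subdirectory/"

print(resolve(base, "contact.html"))  # http://www.example.com/subdirectory/contact.html
print(resolve(base, "/"))             # http://www.example.com/
```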


Why are relative URLs still being used?


Relative URLs are often used because developers have a test environment on another hostname and it makes it easy for them to move stuff between their test environment and their live environment. Other reasons include that it’s “just easier in website maintenance”. They’re also, in my opinion falsely, promoted by some websites about site speed because they’re “shorter” and thus “faster”.
In reality, all of these reasons are false when you look at the bigger picture. The few minutes a developer might save by using relative URLs are offset by countless hours an SEO might be spending to solve the issues caused.


Some of the problems caused by relative URLs


Issues caused by the use of relative URLs are vast and plentiful, and any seasoned SEO can probably give you a few examples of clients that have had huge losses because of them. Let me show you a couple of them:


A completely indexed test environment


When you have a menu structure that relies on relative URLs, one wrong link in your content to your test environment would cause the entire test environment to be spidered and indexed, causing massive duplicate content issues. This happens more often than you think. In fact, have you checked whether the test environments you used for your last few development projects are indexed by Google? I bet some of you will now find out that they are.


Spider traps


Most of the times I’ve found what we call “spider traps”, they were caused by wrongly used relative URLs. Let me show you an example: a site linking to ./example/ instead of ../example/ from the /contact/ page. A link to ./ means you’re linking to the current directory. When the current URL ends in /contact/, this means that a link to ./example/ resolves to /contact/example/. So clicking that link would take me to http://www.example.com/contact/example/. If your CMS is set up to serve the same page for /contact/example/ as it serves for /contact/, which is a very common case, you’ll now have a spider trap, because that /contact/example/ page also links to ./example/, which now resolves to /contact/example/example/, which then links to ./example/ again and thus to /contact/example/example/example/, and so on. You probably get the issue, and I hope you also understand why this could be very detrimental for your search engine rankings.
These kinds of issues are very easily found using a tool like Screaming Frog, which I think every webmaster should have in their arsenal.
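To make the trap concrete, here’s a small sketch that follows the same relative link a few times, the way a crawler would. It assumes, as in the example above, that the CMS serves the same page (with the same link in it) for every one of these URLs; follow_link is a hypothetical helper, not part of any crawler.

```python
from urllib.parse import urljoin


def follow_link(start_url, href, hops):
    """Follow the same relative link `hops` times and return every URL visited."""
    visited = [start_url]
    for _ in range(hops):
        visited.append(urljoin(visited[-1], href))
    return visited


# The wrong link: every hop produces a brand new, deeper URL.
for url in follow_link("http://www.example.com/contact/", "./example/", 3):
    print(url)

# The intended link: the crawler ends up on one stable URL.
for url in follow_link("http://www.example.com/contact/", "../example/", 3):
    print(url)
```

With ./example/ the list of URLs never stops growing, while with ../example/ every hop after the first lands on http://www.example.com/example/ again.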


Relative canonical URLs


Issues can also be caused by using relative canonical URLs. A canonical URL is supposed to link to the “perfect” URL for a piece of content on your website. If you use a relative link and also have a subdomain or test environment that’s indexed, you suddenly have several versions of a piece of content that all proclaim themselves as the canonical version of that piece of content… You can understand a search engine having a hard time dealing with this.
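A quick sketch of why that happens, assuming a search engine resolves the canonical href against the URL of the page it fetched. The hostname test.example.com and the /products/widget/ path are made up for this example.

```python
from urllib.parse import urljoin


def effective_canonical(page_url, canonical_href):
    """Return the URL a canonical href points to once it has been resolved."""
    return urljoin(page_url, canonical_href)


live_page = "http://www.example.com/products/widget/"
test_page = "http://test.example.com/products/widget/"  # hypothetical test host

# Relative canonical: each copy of the page names itself as the canonical version.
print(effective_canonical(live_page, "/products/widget/"))
print(effective_canonical(test_page, "/products/widget/"))

# Absolute canonical: both copies point at the live URL.
absolute = "http://www.example.com/products/widget/"
print(effective_canonical(live_page, absolute))
print(effective_canonical(test_page, absolute))
```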


A little knowledge is a dangerous thing…


At Twitter, they figured out that they shouldn’t use relative canonicals. So a developer there thought he was smart and probably defined the domain part of the canonical URL using the HOST header information. This causes the very issue that I talked about in the introduction above, because now the IP result in the screenshot above has a canonical URL pointing to itself, causing Google to show Twitter’s IP addresses in search results everywhere instead of the proper domain…
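I don’t know what Twitter’s code looks like, so the following is only a guess at the pattern, written as a sketch. The function name, the headers dictionary and the IP address (taken from a range reserved for documentation) are all assumptions made for this example.

```python
def canonical_from_host(headers, path):
    """Flawed approach: build the canonical URL from whatever host the client asked for."""
    return "http://" + headers["Host"] + path


# A normal visitor requests the site by its domain name...
print(canonical_from_host({"Host": "www.example.com"}, "/yoast"))
# http://www.example.com/yoast

# ...but a crawler that reaches the server by IP gets a canonical pointing at the IP.
print(canonical_from_host({"Host": "192.0.2.10"}, "/yoast"))
# http://192.0.2.10/yoast
```

The canonical URL is technically absolute in both cases, but it simply echoes the request, so it’s no better than a relative one.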


Protocol-relative URLs


Another issue is the so-called protocol-relative URL. This is a URL that leaves off the http:// or https:// bit. This type of relative URL does have its uses, but it should not be deployed outside of those useful cases. The useful cases are when it’s used inside JavaScript or CSS, so files are served over the same protocol as the current page, especially because when you’re on an https URL, serving anything over http basically breaks the security. Using protocol-relative URLs within links or canonical URLs is a very bad idea though, because you can still have duplicate content issues between the http and https versions of a website.
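A minimal sketch of that last point: the same protocol-relative href ends up as two different URLs, depending on the protocol of the page it sits on. The /page/ path is just an example.

```python
from urllib.parse import urljoin

href = "//www.example.com/page/"  # protocol-relative: no http: or https: in front

from_http = urljoin("http://www.example.com/", href)
from_https = urljoin("https://www.example.com/", href)

print(from_http)   # http://www.example.com/page/
print(from_https)  # https://www.example.com/page/
```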


The solution


WordPress core has this problem solved in a very nice way, using a couple of techniques:


Absolute URLs everywhere


Whenever WordPress outputs a URL, it’s always a full, absolute URL. For the domain name part of that, it uses the domain you set in the General settings. This is the type of solution everyone should use: the domain name should be in a configuration file. This would allow you to still easily migrate between your development environment and your live environment by just using different configuration files.
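Outside of WordPress, the same idea could look roughly like the sketch below. This is not how WordPress implements it; the CONFIGS dictionary, the site_url key, the absolute_url helper and the dev.example.test hostname are all invented for this example, and in a real project each environment would load its own configuration file.

```python
# One configuration per environment; only the site URL differs.
CONFIGS = {
    "development": {"site_url": "http://dev.example.test"},
    "live": {"site_url": "http://www.example.com"},
}


def absolute_url(config, path):
    """Build a full URL from the configured site URL and a path on the site."""
    return config["site_url"].rstrip("/") + "/" + path.lstrip("/")


# The templates stay identical; switching environments means switching the config.
print(absolute_url(CONFIGS["development"], "/contact.html"))
print(absolute_url(CONFIGS["live"], "/contact.html"))
```

Because the hostname never comes from the request or from the location of the current page, a crawler that reaches the site through an unexpected hostname still only sees links to the configured domain.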


Canonical Redirects


Whenever WordPress detects that you are on a specific article but are not using the proper “canonical” URL, it’ll try to 301 redirect you to the correct version. For the cases when it doesn’t detect this (it for instance ignores query parameters added to the URL), there is:


The canonical link element


When you’re on a single post or page, WordPress puts out a canonical link element, based on what the URL of the current article should be, regardless of what’s in your browser’s location bar. My WordPress SEO plugin extends this functionality to display canonical link elements just about everywhere within WordPress, and you should do this in your CMS too.
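If you’re building this into your own CMS, the combination of canonical redirects and canonical link elements could be sketched as follows. This is a simplified illustration and not WordPress’s actual logic: canonical_response is a hypothetical function that redirects when the scheme, host or path is wrong, and falls back to a canonical link element when only the query string differs.

```python
from html import escape
from urllib.parse import urlsplit


def canonical_response(requested_url, canonical_url):
    """Return ("301", target) for a wrong URL, or ("200", link_element) otherwise."""
    requested = urlsplit(requested_url)
    canonical = urlsplit(canonical_url)

    same_page = (requested.scheme, requested.netloc, requested.path) == (
        canonical.scheme,
        canonical.netloc,
        canonical.path,
    )
    if not same_page:
        # Wrong scheme, host or path: send the visitor to the proper URL.
        return ("301", canonical_url)

    # Only the query string can differ here, so serve the page and
    # declare the canonical URL in the head instead.
    tag = '<link rel="canonical" href="%s">' % escape(canonical_url, quote=True)
    return ("200", tag)


canonical = "http://www.example.com/post/"
print(canonical_response("http://www.example.com/post", canonical))
print(canonical_response("http://www.example.com/post/?utm_source=feed", canonical))
```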


Conclusion


Twitter’s issue could be rather easily resolved, as we’ve discussed, by using proper absolute URLs everywhere in their code. There are no really good arguments against doing that. While Twitter is not a direct e-commerce site and might not have the biggest of issues with losing a bit of traffic, I’ve had issues with relative URLs and relative canonicals at clients that have cost those clients upwards of a hundred thousand euros. The very small gain in web development time, if any, is never, ever, worth that.
So you should be using absolute URLs at all times and canonical redirects when possible, and canonical link elements should ideally be on every page you serve. After all, when you’re building a brand, do you really want to lose that brand in the search result pages? I think that’s a waste and I’m guessing you do too.
Why relative URLs should be forbidden for web developers is a post on Yoast - Tweaking Websites.

