5 lines of code that could ruin your internal linking

For years, I have provided technical SEO support for sites that had anything from thousands to millions of pages.

This experience taught me that, although technical SEO is undoubtedly important, the importance of certain factors is often overstated. A lot of stock is put into things like crawl budget and parameterised URLs, but these aren’t the make-or-break factors that will help you overtake another page in the rankings, and they rarely matter more than creating relevant content and earning authority.

Having said that, there are aspects of technical SEO that are critical to get right, and without them, your site or page may never be found, rendering all of your relevant and authoritative content redundant.

Often these may be just small bits of code, unnoticeable to your average user…but not to Google.

Here are a few examples of easy-to-overlook errors that can prevent a bot from finding your page(s) and potentially ruin your internal linking too:

Robots.txt Disallow

Let’s start with one that most people familiar with technical SEO will be aware of. For the benefit of those who don’t work in SEO, here’s a quick summary of robots.txt:

Robots.txt helps control what search engines crawl. You might want to stop a certain type of bot, or all bots, from crawling a certain section of your site – e.g. anything behind /admin/, or parameterised internal search pages that can generate a near-endless number of URLs.

This does not mean the page won’t be found or indexed by Google. It just means Google won’t crawl the URLs matching those rules once it enters your site.

This is an important distinction to make because if other pages link to your page with descriptive text, Google could still index your page’s URL without visiting it. Google is also incredibly good at finding URLs if they are somewhere out there in the giant ether of the internet. A good example of this is people assuming that robots.txt alone is enough to block a staging domain. It isn’t. The search below is proof:

https://www.google.com/search?q=site:staging.*.com

Indexing aside, when it comes to internal linking, the most devastating thing you can do within your robots.txt is include these two lines:

User-agent: *

Disallow: /

This will tell all search engines not to crawl any URL on your site. Again, though, Google might still find some of those URLs and index them. It just won’t be able to read anything on the page, so the result typically appears in search with no description at all.

The other thing to consider is how the rules work.

Disallow: / matches every URL that begins with a slash, i.e. every URL on the site, hence it being the most devastating rule and (usually) the most quickly spotted. But say you are a property website and you want to block the test pages built by your web developers from being crawled; you might look to add a rule like this:

Disallow: /dev

However, robots.txt rules are simple prefix matches, so this will also block any URL that starts with /dev, including /development/ pages, which could be almost as damaging as blocking the whole site if that’s where the lion’s share of your traffic comes from.

So how can you test your robots.txt rules?

  • Use a robots.txt testing tool

A robots.txt testing tool lets you safely test a new rule, or check a specific URL, against your existing robots.txt to see whether any issues arise.

These tools are intuitive and fairly helpful but, unfortunately, they usually only allow you to test one URL at a time, which can be slow (there’s a rough sketch for testing rules in bulk after this list).

  • Test custom robots.txt rules with a crawling tool

You can also test how a new set of rules would affect your whole site before you add them, by applying custom robots.txt rules in a technical SEO crawler like Screaming Frog – a personal favourite of mine.

  • Use Google Search Console to see what Google itself has found

And what about what is already blocked? To check this, log in to Google Search Console and review the coverage report. It shows all the URLs that are blocked by robots.txt but still being indexed, labelled under ‘Warnings’ as ‘Indexed, though blocked by robots.txt’.

It’s worth reviewing these URLs to check whether they really should be blocked, and whether a different approach – such as allowing them to be crawled but adding a noindex tag – is needed to keep them out of the index.
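If you want to test draft rules against more than one URL at a time, Python’s standard library includes a robots.txt parser you can script. Here’s a minimal sketch – the rules, domain and URLs are made up for illustration – which also demonstrates the /dev vs /development/ prefix issue from earlier:

# A rough sketch for bulk-testing a draft set of robots.txt rules.
# The rules and URLs below are illustrative assumptions, not a real configuration.
from urllib.robotparser import RobotFileParser

draft_rules = """
User-agent: *
Disallow: /dev
"""

parser = RobotFileParser()
parser.parse(draft_rules.splitlines())

# Hypothetical URLs to check - swap in a list exported from your own crawl
urls_to_check = [
    "https://www.example.com/dev/test-page/",
    "https://www.example.com/development/london/",  # blocked too: /dev is a prefix match
    "https://www.example.com/contact/",
]

for url in urls_to_check:
    allowed = parser.can_fetch("*", url)
    print(("ALLOWED  " if allowed else "BLOCKED  ") + url)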

There’s a lot more to Robots.txt, the tools that can help you with implementing the right rules, and indexability in general, but that’s not what I’m looking to address with this post.

However, if you would like to know more about this, I definitely encourage you to check out some of the guidelines Google has published. They’re really helpful!

Misusing relative links

If you’ve ever crawled a site and spotted strange, repeating folder names in a URL, it might not be immediately clear what has caused them.

Let me explain how relative links work:

Absolute link (the full URL):

<a href="https://spaceandtime.co.uk/blog/how-to-cope-without-views-in-ga4/">

Relative link (if you are on the homepage):

<a href="/blog/how-to-cope-without-views-in-ga4/">

Relative link (if you are on https://spaceandtime.co.uk/blog/):

<a href="how-to-cope-without-views-in-ga4/">

The last relative link is the one that’s most often misused, because how it resolves depends on the page it sits on. A relative link that starts with a slash is resolved from the root of the domain, whereas one without a leading slash is simply appended to the directory of the URL that is currently loaded.

For example, /blog/how-to-cope-without-views-in-ga4/ is the correct link on the homepage because:

https://spaceandtime.co.uk + /blog/how-to-cope-without-views-in-ga4/ = https://spaceandtime.co.uk/blog/how-to-cope-without-views-in-ga4/

However, if the link is written without the leading slash – blog/how-to-cope-without-views-in-ga4/ – and it sits on https://spaceandtime.co.uk/blog/, it will create a link like https://spaceandtime.co.uk/blog/blog/how-to-cope-without-views-in-ga4/.

In this instance, our website has code in place to protect against this kind of error. However, on a site without that protection, if the broken URL loads and the page repeats the same relative link, the error compounds every time it is followed, leading to odd, repetitive URLs like the one below:

https://spaceandtime.co.uk/blog/blog/blog/blog/blog/blog/how-to-cope-without-views-in-ga4/
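If you want to see how a browser (or a bot) resolves these links, Python’s standard library mirrors the behaviour. A quick sketch using the example URLs above:

from urllib.parse import urljoin

# A root-relative link (leading slash) always resolves from the domain root
print(urljoin("https://spaceandtime.co.uk/blog/", "/blog/how-to-cope-without-views-in-ga4/"))
# -> https://spaceandtime.co.uk/blog/how-to-cope-without-views-in-ga4/

# A path-relative link (no leading slash) is appended to the current directory
print(urljoin("https://spaceandtime.co.uk/blog/", "blog/how-to-cope-without-views-in-ga4/"))
# -> https://spaceandtime.co.uk/blog/blog/how-to-cope-without-views-in-ga4/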

How can you find broken relative links?

The best way to do this is to crawl your site using a tool like Screaming Frog. This process allows you to see the internal linking structure of your site, and it will be apparent when these URLs start to appear, as they are so long!

Other common relative link issues involve putting slashes in the wrong place, or thinking that you are writing an absolute link (e.g. spaceandtime.co.uk) but missing out key details like the https://. Without the protocol, it’s not a complete URL; browsers and bots treat it as a relative path, so your link will end up looking like this:

https://spaceandtime.co.uk/spaceandtime.co.uk

Absolute links being moved over from staging

If you are wondering why we even use relative links, let me give you an example of when they are actually useful.

If you are using a staging website (one hidden from public view for testing), then using relative links will make it so much easier when you’re ready to move them over to the live version. If you were to use absolute URLs on your staging site – like https://staging.spaceandtime.co.uk – then when you migrate you might miss some, and they will pop up in your crawls and on the web.

This happens more often than you’d think, as we demonstrated with the example in point one:

https://www.google.com/search?q=site:staging.*.com

How to check staging links on your site:

Once again, it all comes down to crawling. It is crucial that when any changes to a site go live, you do an immediate crawl of your entire site.

Simply forgetting to change one single link can lead to your staging site being found and crawled by Google.  It could be something as simple as forgetting to update your sitemaps, hence why you’ve got to crawl it all.
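As a rough sketch of that post-launch check, you could scan a crawl export for any links that still point at a staging hostname. The file name, column names and staging markers below are assumptions – adjust them to match your own export:

import csv

# Hypothetical crawl export (e.g. an "all outlinks" report saved as outlinks.csv
# with Source and Destination columns). File and column names are assumptions.
STAGING_MARKERS = ("staging.", "dev.", "test.")

with open("outlinks.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        destination = row.get("Destination", "")
        if any(marker in destination for marker in STAGING_MARKERS):
            print(row.get("Source", "?") + " links to a staging URL: " + destination)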

Client-side JavaScript

This is another much-trodden subject in the technical SEO world. Google can render JavaScript… but not in all circumstances (as clear as mud, as they say).

To demonstrate what Google can and can’t do, here are some examples:

  • You want to use some JavaScript to open up a carousel, but the content within the page is already in the HTML – this is fine, as users can see it by interacting with the JavaScript and bots can read it in the code. Great stuff, no issues for Google there.
  • However, if content is only generated when a user hovers over or clicks an element, that content isn’t in the HTML when bots first look at the page, so they can’t read it. This can affect key elements of your page, such as its headings or internal links to other parts of your site. It’s important to remember that Google doesn’t move a mouse when reading your HTML; it can only follow the links it can see.

How do you find client-side JavaScript issues?

Crawling tools behave much like Google, so these issues are often hard to pick up by crawling alone. A manual check is needed to compare what you see on the site with what comes back in your crawl. Is there content on the site that a crawler hasn’t found? Are there pages you have internally linked to that haven’t appeared in the crawl? Has your search visibility changed for certain pages since you implemented some new functionality? One quick manual check, sketched below, is to look at the raw HTML before any JavaScript runs.
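Here’s a minimal sketch of that raw-HTML check – the page URL and the link it looks for are made up, so swap in your own:

from urllib.request import Request, urlopen

# Hypothetical page and internal link to check - replace with your own
PAGE = "https://www.example.com/category/"
EXPECTED_LINK = 'href="/category/sub-category/"'

# Fetch the unrendered HTML, roughly what a bot sees before JavaScript executes
request = Request(PAGE, headers={"User-Agent": "Mozilla/5.0 (compatible; link-check)"})
html = urlopen(request).read().decode("utf-8", errors="ignore")

if EXPECTED_LINK in html:
    print("Link is present in the raw HTML - crawlable without JavaScript.")
else:
    print("Link is missing from the raw HTML - it may only exist client-side.")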

Links with data-url instead of href

Although this rarely occurs, it is still important and was my inspiration for writing this blog.

Nike have actually used this technique to stop their various filters from being read by crawlers. While I can’t be sure if this works successfully for Nike (or if it was even their intention in the first place), I do know that I’ve seen it negatively impact sites.

Links need to be written using the <a href="https://www.example.com"> format for Google to understand them. You can, however, have a site that functions perfectly well for users even when these links are marked up with a different attribute. Our testing has found that these links aren’t often followed by bots.

If you do decide to use ‘data-url’ (or any variation of it) instead of ‘href’ across your site, you may find that none of your internal links are followed. As a result, your pages will be greatly devalued and become hard, or even impossible, for Google to find.

How do you find incorrectly tagged internal links? 

Because this occurs so rarely, these links can be quite difficult to find using a standard crawling tool. The attribute could be called ‘data-url’, ‘data-urls’, ‘url-info’ (or almost anything else).

As a result, there’s unfortunately no easy way to pick them up. But if you know which pages should be live on the site, you can compare them with the ones you’ve found whilst crawling it. An easy way to pull the full list of URLs that should exist is to download them from your content management system (CMS), or to use Google Analytics or Google Search Console data to see what has previously been found. You can then compare them to the URLs identified during your crawl, as in the rough sketch below.
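Here’s what that comparison can look like in a few lines of Python. The file names are assumptions: plain text files with one URL per line, one exported from your CMS or Search Console and one from your crawler:

# Compare the URLs you expect to exist with the URLs a crawl actually found.
def load_urls(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().rstrip("/") for line in f if line.strip()}

expected = load_urls("cms_urls.txt")    # hypothetical export from the CMS or Search Console
crawled = load_urls("crawl_urls.txt")   # hypothetical export from your crawler

missing = expected - crawled
print(str(len(missing)) + " pages exist but were not found by the crawl:")
for url in sorted(missing):
    print(url)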

Occasionally you’re lucky and it’s quite obvious (e.g. you crawled a site with 1,000 pages and only 100 came back). However, I’ve also been incredibly unlucky in the past and had to manually review the code on page after page to find the error!
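And when you do have to dig through the code itself, a small script can at least flag the suspect anchors for you. This sketch uses Python’s standard HTML parser and a made-up snippet of HTML; in practice you would feed it the page source you’ve downloaded:

from html.parser import HTMLParser

class SuspectLinkFinder(HTMLParser):
    """Flags <a> tags that have no href but do carry a URL-like attribute."""
    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        names = {name for name, _ in attrs}
        if "href" not in names:
            url_like = [(n, v) for n, v in attrs if "url" in n or "href" in n]
            if url_like:
                print("Suspect link markup: " + str(url_like))

# Made-up HTML for illustration - the first link would not be followed by Google
sample_html = '<a data-url="/mens-shoes/">Shoes</a> <a href="/sale/">Sale</a>'
SuspectLinkFinder().feed(sample_html)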

Conclusion 

As an SEO team, our job is to understand the key principles that can impact your brand’s visibility online. There are many tools out there that will comb through all the possible issues a website has – and there are very few websites with none.

However, it’s important for a skilled SEO team to convey what the bigger priorities are when it comes to improving your site from a technical perspective, to ensure that the fixes made actually impact Google’s rankings and, ultimately, the bottom line.

So, if any of what I’ve talked about concerns you, we have a team of specialists that do just that. We can perform an audit on your site and make sure it’s optimised as it should be, with all links and code intact. Get in touch here!
