PHP: Dealing with absolute and relative URLs

Published:

I'm currently writing a post on how to get image tags from a remote HTML page using PHP. One sticky issue with that is that the image URLs you find is a joyful mix of absolute and relative URLs.

Luckily, I found a function on nashruddin.com which seem to handle them alright. After a bit of clean up and fixing an error, we have this function:

<?php
function make_absolute($url, $base)
{
    // Return base if no url
    if( ! $url)
      return $base;

    // Return if already absolute URL
    if(parse_url($url, PHP_URL_SCHEME) != '')
      return $url;

    // Urls only containing query or anchor
    if($url[0] == '#' || $url[0] == '?')
      return $base.$url;

    // Parse base URL and convert to local variables: $scheme, $host, $path
    extract(parse_url($base));

    // If no path, use /
    if( ! isset($path))
      $path = '/';

    // Remove non-directory element from path
    $path = preg_replace('#/[^/]*$#', '', $path);

    // Destroy path if relative url points to root
    if($url[0] == '/')
      $path = '';

    // Dirty absolute URL
    $abs = "$host$path/$url";

    // Replace '//' or '/./' or '/foo/../' with '/'
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {}

    // Absolute URL is ready!
    return $scheme.'://'.$abs;
}

I can sort of read through and see what it does, but I can't explain it very well. So, I'll just leave it at that. So far it has worked fine for me. Maybe some corner cases that are missing, and if there are, please let me know!

💡 What I added to the original function was line 5 and 17. The first to prevent it from crashing if the url is null or empty, and the second to prevent it from crashing if parse_url finds no path. For example if the url is http://www.example.com (no /whatever at the end).

The base tag

A tag that is easy to forget about is the base tag. The above function gets the base path from the URL given as base. For example if you gave it http://www.example.com/directory/file.html as base, it would use http://www.example.com/directory/. However, if file.html included the following base tag:

<base href="http://www.example.com/" />

Then the base path would be http://www.example.com/ instead. Fun, eh?

As long as you know about it, it's not to hard to deal with though. You just need to get a hold of it and provide that as base instead when using the function above.

Works On My Machine™! And if it doesn't on yours, let me know. If it's a mistake in the function, I'd like to fix it!