A PHP Site Mapper.

Introduction.

First I should be honest and mention that this page was originally written a year or so ago. I post it again here now because I am trying to regularise the structure of my software pages, and because I have given it a small makeover: a CSS file of its own, and some initialization variables in the PHP to tailor it to your site.

My initial ambitions were grandiose. I would offer a page that would let anyone create a site map of their own web site, or of anybody else's. At the same time I would be able to generate my own site map in a way that I could verify. The site map would include page addresses, titles, and the content of the description and keywords meta tags.

The Outcome.

OK, after about six iterations, I have lowered my target for the moment, though I have not yet written off the original possibility. In fact if I run the program on the BEV host machine in New Jersey, which is well connected, I can scan sites in the US fairly quickly.

I have something working now that you can see here against the BEV web site. The PHP file is in a state where you could put it on your own web site and get similar results. But it is not yet an effective general-purpose crawler; it just isn't fast enough. In iteration one I used the PHP 5 DOMDocument class, and recursively scanned pages starting from the web site root URL.
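
For the record, the core of that first approach looks something like this. It's a minimal sketch, not the actual iteration-one code:

<?php
// Minimal sketch of the iteration-one approach: load a page into a
// DOMDocument and collect the anchor hrefs for recursive scanning.
// The @ is needed because real-world HTML is rarely valid enough for
// the parser's taste.
function get_links($url)
{
   $doc = new DOMDocument();
   if (@$doc->loadHTMLFile($url) === false)
      return array();
   $links = array();
   foreach ($doc->getElementsByTagName("a") as $a)
   {
      if ($a->hasAttribute("href"))
         $links[] = $a->getAttribute("href");
   }
   return $links;
}
?>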

First Pitfall.

The first thing I learned was that it is very easy to run into circular references. Some of my pages have next/prev links, and a good number of them have a link to Jan 2003. So the first page that had both went into an infinite recursion that at first exceeded the default maximum script execution time. When I extended that, it ran the server out of memory - making it difficult to stop.

So first of all, keep a map of pages that you have already parsed, and skip over the ones you've already visited. This is dead obvious, but if you're starting from scratch ...
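
In PHP the natural tool is an associative array used as a set; this is essentially what the $scanned member does in the source at the bottom of this page. A minimal sketch:

<?php
// Sketch: an associative array used as a set of visited URLs. Checking
// and marking before descending breaks any circular reference chain.
$scanned = array();

function should_scan($url, &$scanned)
{
   if (array_key_exists($url, $scanned))
      return false;       // been here before - skip it
   $scanned[$url] = "";   // mark as visited before descending
   return true;
}
?>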

Speeding it up.

Even when that was fixed, the blunt, immediate, depth-first recursion approach seemed very slow.

So then I tried a staged approach whereby I scanned pages for links, but then just stored the links on a stack for processing later. This made the execution time much more manageable. It's actually a two-tier stack: an array corresponding to pages, each element populated with an array of that page's links.
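
In outline it works something like the sketch below. This is simplified; get_links from the earlier sketch stands in for the real page scanning, and $rooturl for the starting URL:

<?php
// Sketch of the staged approach: instead of recursing into each link
// as it is found, push each page's link array onto a stack and drain
// the stack in a loop.
$stack = array();
$scanned = array();

$stack[] = get_links($rooturl);    // seed with the root page's links
$scanned[$rooturl] = "";

while (($links = array_pop($stack)) !== null)
{
   foreach ($links as $link)
   {
      if (array_key_exists($link, $scanned))
         continue;                   // already seen
      $scanned[$link] = "";
      $stack[] = get_links($link);   // second tier: this page's links
   }
}
?>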

An alternative approach.

My next thought was that the whole thing could be quicker if, instead of reading in pages and creating a DOMDocument - which contains much more information than I want - I simply read the URL text, then used a few regular expressions to pull out the document title, the appropriate meta tags, and the links.
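
The essence of it is just a couple of preg calls. This sketch assumes the name attribute precedes content in the meta tag; the real parse_page below handles the attributes separately:

<?php
// Sketch: read the raw page text and pull out the title and the
// description meta tag with regular expressions - no DOM needed.
$input = file_get_contents($url);
if (preg_match("/<title>(.*)<\/title>/siU", $input, $m))
   $title = $m[1];
if (preg_match("/<meta\s[^>]*name\s*=\s*[\"']description[\"'][^>]*".
               "content\s*=\s*[\"'](.*)[\"']/siU", $input, $m))
   $description = $m[1];
?>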

This works quite well, but did not give a substantial speed improvement, because most of what I was doing was in interpreted PHP, while the DOMDocument machinery is presumably written in C, and runs quite quickly. To get full speed I'd need to write a PHP extension to do just the minimal work required, using a C library that could read remote URLs, and a regex library. As things stand, I'm not keen to go there, at least as far as my own objectives are concerned. Going back to C is not on my list of fun things to do.

Excluding unwanted paths.

So anyway, having got something working that would produce an HTML page describing the structure of my web site, I realized that there were some parts of the site with complex HTML hierarchies (specifically the HTML help for some legacy Windows program) where I did not want the site map to go. So I had to include some exclusion capability. I guess I should at some point make the program look for a robots.txt file and honour that, but for the moment it just uses a query string parameter.
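
For what it's worth, a first cut at honouring robots.txt might look like this sketch. $siteroot is assumed to be the site root URL, and real robots.txt handling also involves User-agent records, which this ignores:

<?php
// Sketch: collect the Disallow paths from robots.txt. A real
// implementation would respect User-agent groupings rather than
// taking every Disallow line in the file.
$disallowed = array();
$txt = @file_get_contents($siteroot."robots.txt");
if ($txt !== false)
{
   foreach (explode("\n", $txt) as $line)
   {
      if (preg_match("/^Disallow:\s*(\S+)/i", trim($line), $m))
         $disallowed[] = $m[1];
   }
}
?>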

A stick for my own back!

A big chunk of work followed when I then observed just how few of my pages had meaningful titles, or any meta tags at all. But that discussion doesn't really belong on this page.

I mention this because if you add the facility to your site, you should probably allow some time to clean up things like that, otherwise the output from the site mapper will look rather crappy.

XML site map.

After that I was able to generate the site map that you can now access from the BEV home page top menu and elsewhere. However I also wanted to be able to create an XML site map as per http://www.sitemaps.org/. The sitemap protocol has optional parameters lastmod, changefreq, and priority. I wanted at least the lastmod date for my own information, and some stab at the change frequency.
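
For reference, each page comes out of the mapper as a <url> element of this shape. The values here are illustrative, and priority is omitted, as it is in the generated output:

<url>
   <loc>http://britseyeview.com/somepage.html</loc>
   <lastmod>2011-06-15</lastmod>
   <changefreq>monthly</changefreq>
</url>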

The last modification date posed a problem. I could get a last-modified header from the PHP stream_get_meta_data function on my development machine, but that wasn't working on the machine where the site is hosted. So I had to add yet another query string parameter to provide a machine-level path to the web root directory. I guess this approach will only work when the PHP is running on the target machine, but for what I wanted, that's not a problem.
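
The two approaches look roughly like this sketch, where $hostpath and $relativepath are placeholders for the machine-level path pieces:

<?php
// Sketch: two ways to get a last-modified date for a page.
// 1) From the HTTP response headers, when the server sends one.
$fp = fopen($url, "r");
$meta = stream_get_meta_data($fp);
fclose($fp);
$lastmod = "";
foreach ($meta["wrapper_data"] as $header)
{
   if (stripos($header, "Last-Modified:") === 0)
      $lastmod = trim(substr($header, 14));
}

// 2) From the file system, when the script runs on the hosting
//    machine and $hostpath maps the web root to a directory path.
$lastmod = date("Y-m-d", filemtime($hostpath.$relativepath));
?>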

The change frequency should really be dealt with by calling a webmaster-provided function, but at the moment I'm just doing it based on the number of joints in the path to a page. This gets me close enough that very little editing of the generated XML is required to pick out pages that I know have an unusual change frequency. Unless it was a very cunning function, I'd probably have to do that anyway.
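
The heuristic amounts to something like this sketch, an approximation of what toXMLString in the source below does:

<?php
// Sketch: derive a change frequency from the number of joints in the
// path - the home page changes daily, shallow directory pages
// monthly, everything deeper yearly.
function guess_changefreq($path)
{
   if ($path === "/")
      return "daily";
   return (count(explode("/", $path)) <= 3) ? "monthly" : "yearly";
}
?>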

PHP style.

For the record, I have to confess that this is the first time I have done any object-oriented programming in PHP (this might well be obvious to aficionados). I congratulate PHP's designers for making that process quite straightforward and obvious to anyone who's done C++.

Customization for your site.

If you'd like to use the site mapper on your site, the necessary files are in this zip file. There are just two files:
  • crawl.php,
  • sitemapper.css.
At the top of the PHP file you'll find a small collection of globals that you can modify to suit your needs:

// Title for <title> tag and for top of page
$title = "Brits Eye View Site Map";
// Path and filename for sitemapper.css
$csspath = "/common/sitemapper.css";
// URL of some php to log visit or undertake other admin tasks. This will be used to create a
// <script> tag, and as such should return some json or a simple list of script variables.
// If empty no script tag is created
$logvisiturl = "/php/hs_logvisit.php?pid=SiteMap&path=/sitemap";
// Path and name of an image to put at the top of the page
$bannerpath = "/common/skyline.jpg";
// Title for same if provided
$bannertitle = "Manhattan seen across the Hudson river from New Jersey.";
Other tweaking can be done by modifying the sitemapper.css file. More savage surgery will have to be done by editing the PHP source.

The link in your page to the site mapper will be of the form:

<a href="/php/crawl.php?url=http://britseyeview.com/">Site map</a>

Note that a full URL is required in the query string. If there are sections of the site that you want to exclude - e.g. admin pages and such - you can add a second query string parameter giving a comma separated list of directories to be ignored:

<a href="/php/crawl.php?url=http://britseyeview.com/&exclude=ptyhelp,admin">Site map</a>

Source code.

If it's of interest to anyone, here's the current source code (globals omitted - use the zip file):

<?php
//          Copyright Steve Teale 2011 - 2012.
// Distributed under the Boost Software License, Version 1.0.
//    (See accompanying file LICENSE_1_0.txt or copy at
//          http://www.boost.org/LICENSE_1_0.txt)

set_time_limit(3600);

// Globals as above

class PgScanInfo
{
   public $title = "";
   public $description = "";
   public $keywords = "";
   public $links = null;
}

class PageInfo
{
   public $file = "";
   public $path = "";
   public $title = "";
   public $description = "";
   public $keywords = "";
   public $referencedby = "";
   public $lmdate = "";

   function toString($indent)
   {
      if ($this->title === "")
         $this->title = $this->path.$this->file;
      // Indentation is done with CSS left padding proportional to the
      // depth of the page in the site tree
      $pad = ($indent*2)."em";


      $s = "<table style='width:100%; padding-left:$pad;'>\n";
      $s .= "<tr><td colspan='2'><a href='".$this->path.$this->file."'>".$this->title."</a></td></tr>\n";
      $s .= "<tr><td style='width:8em;'><span class='desc'>Relative path:
             </span></td><td>".$this->path.$this->file."</td></tr>\n";

      if ($this->description)
      {
         $s .= "<tr><td style='vertical-align:top;'><span class='desc'>Description:
                </span></td><td>".$this->description."</td></tr>\n";
      }
      if ($this->keywords)
      {
         $s .= "<tr><td style='vertical-align:top;'><span class='desc'>Keywords:
                </span></td><td>".$this->keywords."</td></tr>\n";
      }

      $s .= "</table>\n";
      return $s;
   }

   function toXMLString($domain)
   {
      $freq = ($this->path === "/") ? "daily" : "yearly";
      $a = explode("/", $this->path);
      if (count($a) == 3 && $this->file === "")
         $freq = "monthly";
      $s = "   <url>\n";
      $s .= "      <loc>$domain".$this->path.$this->file."</loc>\n";
      $s .= "      <lastmod>".$this->lmdate."</lastmod>\n";
      $s .= "      <changefreq>$freq</changefreq>\n";
      $s .= "   </url>\n";
      return $s;
   }

}

class URLParts
{
   public $entire = "";
   public $domain = "";
   public $structure = null;
   public $file = "";
   public $query = "";
   public $original = "";
   public $nofollow = false;
   public $excluded = false;
   public $title = "";
   public $description = "";
   public $keywords = "";
   public $links = null;
   public $lmdate = "";

   function path()
   {
      if ($this->structure == null)
         return "/";

      $s = array();
      $s[] = "";
      for ($i = 0; $i < count($this->structure); $i++)
         $s[] = $this->structure[$i];
      $s[] = "";
      return join("/", $s);
   }
}

class Node
{
   public $pda;
   public $children;

   function __construct()
   {
      $this->pda = array();
      $this->children = array();
   }
}

class SiteMapper
{
   public $stack = null;
   private $url = "";
   private $url_parts = null;
   private $top = null;
   public $canonical = "";
   private $scanned = null;
   private $exclusions = null;
   private $hostpath = "";
   private $deffile = "index.html";
   private $xml = false;
   private $supplement = "";

   function get_lm_date($path)
   {
      $time = filemtime($path);
      $d = date("Y-m-d", $time);
      return $d;
   }

   private function decompose($url, $context)
   {
      $parts = new URLParts();
      $parts->entire = $url;
      $parts->original = $url;
      $t = str_replace("://", "^", $url);
      $p = strpos($t, "?");
      if ($p !== false)
      {
         $parts->query = substr($t, $p+1);
         $t = substr($t, 0, $p);
      }
      $a = explode("/", $t);
      if (count($a) === 1)
      {
         $t = $context.$t;
         $parts->entire = $t;
         $a = explode("/", $t);
      }
      $parts->domain = ($a[0] === "")? "": str_replace("^", "://", $a[0]);
      $parts->file = $a[count($a)-1];
      if ($parts->file === "index.html" || $parts->file === "index.htm")
         $parts->file = "";
      $parts->structure = array();
      for ($i = 1; $i < count($a)-1; $i++)
      {
         $parts->structure[] = $a[$i];
      }

      if (count($parts->structure))
      {
         if (array_key_exists($parts->structure[count($parts->structure)-1], $this->exclusions))
         {
             $parts->nofollow = true;
             return $parts;
         }
      }

      if (strlen($parts->file))
      {
          $dot = strpos($parts->file, ".");
          if ($dot === false)
             $parts->excluded = true;
          else
          {
             $ext = substr($parts->file, $dot+1);
             if (!($ext === "htm" || $ext === "html"))
                $parts->excluded = true;
          }
      }
      return $parts;
   }

   private function get_canonical($parts)
   {
      if ($parts->domain === "")
         return "";
      return $parts->domain."/";
   }

   function parse_page($url, $canonical, $nofollow)
   {
      // file_get_contents returns false on failure - suppress the
      // warning, broken links are reported separately
      $input = @file_get_contents($url);
      if ($input === false)
         return null;
      $info = new PgScanInfo();

      $titlerex = "<title>(.*)<\/title>";
      if (preg_match("/$titlerex/siU", $input, $tm))
      {
         $info->title = $tm[1];
      }
      $metarex = "<meta(.*)\/?>";
      $namerex = "name\s*=\s*([\"\']??)(.*)\\1";
      $contentrex = "content\s*=\s*([\"\']??)(.*)\\1";

      if (preg_match_all("/$metarex/siU", $input, $matches, PREG_SET_ORDER))
      {
         foreach ($matches as $match)
         {
            $name = "";
            $content = "";
            $s = $match[1];
            if (preg_match_all("/$namerex/siU", $s, $m2, PREG_SET_ORDER))
            {
               if (preg_match_all("/$contentrex/siU", $s, $m3, PREG_SET_ORDER))
                  $content = $m3[0][2];
               $name = $m2[0][2];
               if ($name == "description")
                  $info->description = $content;
               else if ($name == "keywords")
                  $info->keywords = $content;
               if ($info->description && $info->keywords)
                  break;
            }
         }
      }
      if ($nofollow) return $info;

      $comrex = "<!--.*-->";
      $input = preg_replace("/$comrex/siU", "", $input);
      $scriptrex = "<script[^>]*>.*<\/script>";
      $input = preg_replace("/$scriptrex/siU", "", $input);
      $links = array();
      $regexp = "<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>.*<\/a>";
      if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER))
      {
         foreach($matches as $match)
         {
            $href = $match[2];
            $lc = strtolower($href);
            if (strpos($lc, "javascript") === 0)
             continue;
            if (strpos($lc, "mailto") === 0)
             continue;
            if (strpos($lc, "#") === 0)
             continue;

            // Test to see if it is prefixed by the canonical url, or simply by "/"
            $pos = strpos($href, $this->canonical);
            if (substr($href, 0, 1) == "/" || $pos === 0)
            {
             // it's a site internal link
             if ($pos === 0)
                $href = substr($href, strlen($this->canonical)-1);
            }
            else
            {
             // So maybe it's just a bare file name
             if (strpos($href, "/") !== false)
                continue;
            }
            $links[] = $href;
         }
      }
      $info->links = $links;

      return $info;
   }

   public function get_page_info()
   {
      return $this->parse_page($this->url, $this->canonical, false);
   }

   private function scan_page($up)
   {
      $url = $up->entire;
      if ($url[0] == "/")
         $url = $this->canonical.substr($url, 1);
      $psi = $this->parse_page($url, $this->canonical, $up->nofollow);
      if ($this->xml)
      {
         $file = $up->file;
         if (!$file)
            $file = $this->deffile;
         $path = $this->hostpath.$up->path().$file;
         $up->lmdate = $this->get_lm_date($path);
      }

      if ($psi == null)
      {
         $this->scanned[$url] = "";
         $this->supplement .= "Broken link - can't load $url<br>";
         $up->description = "Broken link!";
         return false;
      }

      $title = $psi->title;
      if (strpos($title, "403") !== false)
      {
         $this->scanned[$url] = "";
         $this->supplement .= "Broken link - 403 can't load $url<br>";
         $up->description = "Broken link!";
         return false;
      }
      if (strpos($title, "404") !== false)
      {
         $this->scanned[$url] = "";
         $this->supplement .= "Broken link - 404 can't load $url<br>";
         $up->description = "Broken link!";
         return false;
      }
      if (strpos($title, "301") !== false)
      {
         $this->scanned[$url] = "";
         $this->supplement .= "Redirected link - 301 can't load $url<br>";
         $up->description = "Broken link!";
         return false;
      }

      if ($title === "")
         $title = $up->domain.$up->path();
      $up->title = $title;
      $up->description = $psi->description;
      $up->keywords = $psi->keywords;
      if ($up->nofollow)
         return true;
      $up->links = $psi->links;
      array_push($this->stack, $up);
      return true;
   }

   private function do_build_map($tup)
   {
      $this->scan_page($tup);
      // Mark the root page as scanned so links back to "/" don't rescan it
      $this->scanned[$this->canonical] = "";
      $pd = $this->top->pda[$this->url_parts->file];
      $pd->title = $tup->title;
      $pd->description = $tup->description;
      $pd->keywords = $tup->keywords;
      $pd->lmdate = $tup->lmdate;
      while (true)
      {
         $up = array_pop($this->stack);
         if ($up === null)
            break;
         while (true)
         {
            $link = array_pop($up->links);
            if ($link === null)
               break;
            $lup = $this->decompose($link, $up->path());
            $uri = $lup->path().$lup->file;
            $toscan = $this->canonical.substr($uri, 1);
            if (array_key_exists($toscan, $this->scanned))
               continue;
            if ($lup->excluded)
            {
               $this->scanned[$toscan] = "";
               continue;
            }
            $cn = $this->top;

            $joints = count($lup->structure);
            for ($j = 0; $j < $joints; $j++)
            {
               if (!array_key_exists($lup->structure[$j], $cn->children))
               {
                  $n = new Node();
                  $cn->children[$lup->structure[$j]] = $n;
                  $cn = $n;
               }
               else
                  $cn = $cn->children[$lup->structure[$j]];
            }
            // We should now be at a node that comprises an array of page
            // entries - get the existing one, or create it if it's new
            if (!array_key_exists($lup->file, $cn->pda))
            {
                  $npd = new PageInfo();
                  $npd->file = $lup->file;
                  $npd->path = $lup->path();
                  $cn->pda[$lup->file] = $npd;
            }
            else
                  $npd = $cn->pda[$lup->file];
            $this->scan_page($lup);
            $this->scanned[$toscan] = "";
            $npd->title = $lup->title;
            $npd->description = $lup->description;
            $npd->keywords = $lup->keywords;
            $npd->lmdate = $lup->lmdate;
         }
      }
   }

   public function build_map()
   {
      $this->do_build_map($this->url_parts);
   }

   public function do_report($n, $indent)
   {
      foreach ($n->pda as $pd)
      {
         echo $pd->toString($indent);
      }
      foreach($n->children as $child)
         $this->do_report($child, $indent+1);
   }

   public function report()
   {
      global $title, $csspath, $logvisiturl;
      if ($this->xml)
      {
         $this->report_xml();
      }
      else
      {
         $s =
         "<!DOCTYPE html>\n".
         "<html><head>".
         "<title>$title</title>".
         "<link rel='stylesheet' href='$csspath' type='text/css' />";
         if ($logvisiturl)
            $s .= "<script src='$logvisiturl'></script>";
         $s .="</head><body><div class='background'>".
         "<h1>$title</h1>".
         "This is a dynamic site map generated in much the same way as the information collected by
          web crawlers is obtained. ".
         "If the pages on the site have a link to some other page in the site, then there will be 
         (or should be ;=)) an entry here.<p>";
         echo $s;

         $this->do_report($this->top, 0);

         if ($this->supplement)
         {
            echo "<p><hr><h3>Supplementary Information:</h3>";
            echo $this->supplement;
         }
         echo "</body></html>";
      }
   }

   function do_report_xml($n, $domain)
   {
      foreach ($n->pda as $pd)
      {
         echo $pd->toXMLString($domain);
      }
      foreach($n->children as $child)
         $this->do_report_xml($child, $domain);
   }

   function report_xml()
   {
      header("Content-type: text/xml");
      $domain = substr($this->url, 0, strlen($this->url)-1);
      echo '<?xml version="1.0" encoding="UTF-8"?>'."\n";
      echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'."\n";
      $this->do_report_xml($this->top, $domain);
      echo "</urlset>\n";
   }

   function __construct($url, $exclude, $hostpath = "", $deffile = "index.html")
   {
      $this->url = $url;
      $this->url_parts = $this->decompose($url, "/");

      $this->top = new Node();
      $pd = new PageInfo();
      $pd->file = $this->url_parts->file;
      $pd->path = "/";
      $pd->title = "Home page";
      $this->top->pda[$pd->file] = $pd;

      $this->canonical = $this->get_canonical($this->url_parts);
      $this->scanned = array();
      $this->stack = array();

      if ($exclude)
      {
         $a = explode(",", $exclude);
         $this->exclusions = array();
         foreach ($a as $ex)
         {
            $this->exclusions[$ex] = "";
         }
      }
      if ($hostpath)
      {
         $this->hostpath = $hostpath;
         $this->deffile = $deffile;
         $this->xml = true;
      }
   }

}

// Guard the query string parameters so missing ones don't raise notices
$url = isset($_GET['url']) ? $_GET['url'] : "";
$exclude = isset($_GET['exclude']) ? $_GET['exclude'] : "";
// If you provide 'hostpath', an XML sitemap is assumed
$hostpath = isset($_GET['hostpath']) ? $_GET['hostpath'] : "";
$sm = new SiteMapper($url, $exclude, $hostpath);
$sm->build_map();
$sm->report();
?>