
A PHP Site Mapper

Readers who have been watching the BEV home page and who also look at this one (at this point it's not clear that there are any readers in both categories) will be aware that for a couple of days I have been working on what I hoped would be a tiny bit of software to generate a site map.

My initial ambitions were grandiose. I would offer a page that would let anyone create a site map of their own web site, or of anybody else's. At the same time I would be able to generate my own site map in a way that I could verify. The site map would include page addresses, titles, and the content of the description and keywords meta tags.

OK, after about six iterations, I have lowered my target for the moment, though I have not yet written off the original possibility. In fact if I run the program on the BEV host machine in New Jersey, which is well connected, I can scan sites in the US fairly quickly.

I have something working now that you can see here against the BEV web site. The PHP file is now in a state where you could put it on your own web site and get similar results. But it is not yet an effective crawler; it just isn't fast enough. In iteration one I used the PHP 5 DOMDocument class, and recursively scanned pages starting from the web site root URL.
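For the curious, iteration one was built along these lines - a reconstructed sketch rather than the actual code (DOMDocument and the methods on it are the real PHP 5 API, but visit_page and the link handling are illustrative):

<?php
// Sketch of the recursive approach from iteration one. Note that as
// written, nothing stops it from visiting the same page twice ...
function visit_page($url)
{
   $doc = new DOMDocument();
   @$doc->loadHTMLFile($url);   // @ because real-world HTML is rarely tidy
   foreach ($doc->getElementsByTagName("a") as $anchor)
   {
      $href = $anchor->getAttribute("href");
      // resolve $href relative to $url, then recurse
      visit_page($href);
   }
}
?>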

The first thing I learned was that it is very easy to run into circular references. Some of my pages have next/prev links, and a good number of them have a link to Jan 2003. So the first page that had both went into an infinite recursion that at first exceeded the default maximum script execution time. When I extended that, it ran the server out of memory - making it difficult to stop.

So first of all, keep a map of pages that you have already parsed, and skip over the ones you've already visited. This is dead obvious, but if you're starting from scratch ...
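In PHP the natural structure for that is an associative array keyed by URL, something along these lines (an illustrative fragment; the real code below keeps the map in a member variable):

$scanned = array();
// ... then for each link found:
if (array_key_exists($url, $scanned))
   continue;           // been here before, skip it
$scanned[$url] = "";   // remember it so we never come back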

Even when that was fixed, the blunt, immediate, depth-first recursion approach seemed very slow.

So then I tried a staged approach whereby I scanned pages for links, but then just stored the links on a stack for processing later. This approach made the execution time much more manageable. Actually it's a two-tier stack - an array corresponding to pages, populated with arrays of the pages' links.
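Schematically, it works like this (a simplified fragment - the real code pushes richer objects than shown here, and scan_for_links is a made-up stand-in for the page parsing):

$stack = array();
$home = scan_for_links("/");   // a tier-one entry, carrying tier two:
array_push($stack, $home);     // its own array of outgoing links

while (($page = array_pop($stack)) !== null)
{
   while (($link = array_pop($page->links)) !== null)
   {
      // Parse the linked page. If it has links of its own, the result
      // goes back onto $stack to be dealt with on a later pass.
      array_push($stack, scan_for_links($link));
   }
}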

My next thought was that the whole thing could be quicker if, instead of reading in pages and creating a DOMDocument - which contains much more information than I want - I simply read the URL text, then used a few regular expressions to pull out the document title, the appropriate meta tags, and the links.

This worked quite well, but did not give a substantial speed improvement, because most of what I was doing was in interpreted PHP, while the DOMDocument stuff is presumably a component written in C, which runs quite quickly. To get full speed I'd need to write a PHP extension to do just the minimal stuff required. This would have to use a C library that could read remote URL files, and a RegEx library. As things stand, I'm not keen to go there, at least as far as my own objectives are concerned. Going back to C is not on my list of fun things to do.

Anyway, having got something working that would produce an HTML page describing the structure of my web site, I realized that there were some parts of the site with complex HTML hierarchies (specifically the HTML help for some legacy Windows program) where I did not want the site map to go. So I had to include some exclusion capability. I guess I should at some point make the program look for a robots.txt file and honour that, but for the moment it just uses a query string parameter.
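For example, to leave out a couple of directories you'd call it with something like this (made-up file and directory names; the parameter is a comma-separated list of directory names, as the code at the bottom of this page shows):

sitemapper.php?url=http://www.example.com/&exclude=htmlhelp,oldstuff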

A big chunk of work followed when I then observed just how few of my pages had meaningful titles and any meta tags at all. But that discussion doesn't really belong on this page.

After that I was able to generate the site map that you can now access from the BEV home page top menu and elsewhere. However I also wanted to be able to create an XML site map as per http://www.sitemaps.org/. The sitemap protocol has optional parameters lastmod, changefreq, and priority. I wanted at least the lastmod date for my own information, and some stab at the change frequency.
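For anyone who hasn't met the protocol, the XML comes out looking like this (a hand-written illustration with made-up values, not actual BEV output):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2006-01-01</lastmod>
      <changefreq>daily</changefreq>
   </url>
</urlset>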

The last modification date posed a problem. I could get a last-modified header from the PHP stream_get_meta_data function on my development machine, but that wasn't working on the machine where the site is hosted. So I had to add yet another query string parameter to provide a machine-level path to the web root directory. I guess this approach will only work when the PHP is running on the target machine, but for what I wanted, that's not a problem.
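For reference, the header-based version was along these lines (a sketch - stream_get_meta_data and its wrapper_data element are standard PHP, the scaffolding around them is illustrative):

$fp = @fopen($url, "r");
if ($fp)
{
   // For an HTTP stream, wrapper_data holds the raw response headers
   $meta = stream_get_meta_data($fp);
   foreach ($meta["wrapper_data"] as $header)
   {
      if (stripos($header, "Last-Modified:") === 0)
         $lmdate = date("Y-m-d", strtotime(trim(substr($header, 14))));
   }
   fclose($fp);
}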

The change frequency should really be dealt with by calling a webmaster-provided function, but at the moment I'm just doing it based on the number of joints in the path to a page. This gets me close enough that very little editing of the generated XML is required to pick out pages that I know have an unusual change frequency. Unless it was a very cunning function, I'd probably have to do that anyway.

For the record, I have to confess that this is the first time I have done any object-oriented programming in PHP. I congratulate PHP's designers for making the process quite straightforward and obvious to anyone who's done C++.

If it's of interest to anyone, here's the current source code:


<?php
error_reporting(0);
set_time_limit(3600);

class PgScanInfo
{
   public $title = "";
   public $description = "";
   public $keywords = "";
   public $links = null;
}

class PageInfo
{
   public $file = "";
   public $path = "";
   public $title = "";
   public $description = "";
   public $keywords = "";
   public $referencedby = "";
   public $lmdate = "";
   
   function toString($indent)
   {
      if ($this->title === "")
         $this->title = $this->path.$this->file;
      // Indent the entry by two ems per level of depth in the site tree
      $pad = ($indent*2)."em";

      $s = "<table style='width:100%; padding-left:$pad;'>\n";
      $s .= "<tr><td colspan='2'><a class='heading' href='".$this->path.$this->file."'>".
              $this->title."</a></td></tr>\n";
      $s .= "<tr><td style='width:8em;'><b>Relative path:</b></td><td>".$this->path.
              $this->file."</td></tr>\n";
      
      if ($this->description)
      {
         $s .= "<tr><td style='vertical-align:top;'><b>Description:</b></td><td>".
                 $this->description."</td></tr>\n";
      }
      if ($this->keywords)
      {
         $s .= "<tr><td style='vertical-align:top;'><b>Keywords:</td><td>".
                  $this->keywords."</td></tr>\n";
      }
      
      $s .= "</table>\n";
      return $s;
   }
   
   function toXMLString($domain)
   {
      $freq = ($this->path ==="/")? "daily": "yearly";
      $a = explode("/", $this->path);
      if (count($a) == 3 && $this->file === "")
         $freq = "monthly";
      $s = "   <url>\n";
      $s .= "      <loc>$domain".$this->path.$this->file."</loc>\n";
      $s .= "      <lastmod>".$this->lmdate."</lastmod>\n";
      $s .= "      <changefreq>$freq</changefreq>\n";
      $s .= "   </url>\n";
      return $s;
   }
   
}

class URLParts
{
   public $entire = "";
   public $domain = "";
   public $structure = null;
   public $file = "";
   public $query = "";
   public $original = "";
   public $nofollow = false;
   public $excluded = false;
   public $title = "";
   public $description = "";
   public $keywords ="";
   public $links = null;
   public $lmdate = "";
   
   function path()
   {
      if ($this->structure == null)
         return "/";
      
      $s = array();
      $s[] = "";
      for ($i = 0; $i < count($this->structure); $i++)
         $s[] = $this->structure[$i];
      $s[] = "";
      return join("/", $s);
   }
}

class Node
{
   public $pda;
   public $children;
   
   function __construct()
   {
      $this->pda = array();
      $this->children = array();
   }
}

class SiteMapper
{
   public $stack = null;
   private $url = "";
   private $url_parts = null;
   private $top = null;
   public $canonical = "";
   private $scanned = null;
   private $exclusions = null;
   private $hostpath = "";
   private $deffile = "index.html";
   private $xml = false;
   private $supplement = "";

   function get_lm_date($path)
   {
      $time = filemtime($path);
      $d = date("Y-m-d", $time);
      return $d;
   }

   private function decompose($url, $context)
   {
      $parts = new URLParts();
      $parts->entire = $url;
      $parts->original = $url;
      $t = str_replace("://", "^", $url);
      $p = strpos($t, "?");
      if ($p !== false)
      {
         $parts->query = substr($t, $p+1);
         $t = substr($t, 0, $p);
      }
      $a = explode("/", $t);
      if (count($a) === 1)
      {
         $t = $context.$t;
         $parts->entire = $t;
         $a = explode("/", $t);
      }
      $parts->domain = ($a[0] === "")? "": str_replace("^", "://", $a[0]);
      $parts->file = $a[count($a)-1];
      if ($parts->file === "index.html" || $parts->file === "index.htm")
         $parts->file = "";
      $parts->structure = array();
      for ($i = 1; $i < count($a)-1; $i++)
      {
         $parts->structure[] = $a[$i];
      }

      if (count($parts->structure))
      {      
         if (array_key_exists($parts->structure[count($parts->structure)-1], $this->exclusions))
         {
             $parts->nofollow = true;
             return $parts;
         }
      }

      if (strlen($parts->file))
      {
          $dot = strpos($parts->file, ".");
          if ($dot === false)
             $parts->excluded = true;
          else
          {
             $ext = substr($parts->file, $dot+1);
             if (!($ext === "htm" || $ext == "html"))
                $parts->excluded = true;
          }
      }
      return $parts;
   }

   private function get_canonical($parts)
   {
      if ($parts->domain === "")
         return "";
      return $parts->domain."/";
   }

   function parse_page($url, $canonical, $nofollow = false)
   {
      $input = file_get_contents($url);
      if ($input == null)
         return null;
      $info = new PgScanInfo();

      $titlerex = "<title>(.*)<\/title>";
      if (preg_match("/$titlerex/siU", $input, $tm))
      {
         $info->title = $tm[1];
      }
      $metarex = "<meta(.*)\/?>";
      $namerex = "name\s*=\s*([\"\']??)(.*)\\1";
      $contentrex = "content\s*=\s*([\"\']??)(.*)\\1";

      if (preg_match_all("/$metarex/siU", $input, $matches, PREG_SET_ORDER))
      {
         foreach ($matches as $match)
         {
            $name = "";
            $content = "";
            $s = $match[1];
            if (preg_match_all("/$namerex/siU", $s, $m2, PREG_SET_ORDER))
            {
               if (preg_match_all("/$contentrex/siU", $s, $m3, PREG_SET_ORDER))
                  $content = $m3[0][2];
               $name = $m2[0][2];
               if ($name == "description")
                  $info->description = $content;
               else if ($name == "keywords")
                  $info->keywords = $content;
               if ($info->description && $info->keywords)
                  break;
            }
         }
      }
      if ($nofollow) return $info;

      $comrex = "<!--.*-->";
      $input = preg_replace("/$comrex/siU", "", $input);
      $scriptrex = "<script[^>]*>.*<\/script>";
      $input = preg_replace("/$scriptrex/siU", "", $input);
      $links = array();
      $regexp = "<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>.*<\/a>";
      if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER))
      { 
         foreach($matches as $match) 
         { 
            $href = $match[2];
            $lc = strtolower($href);
            if (strpos($lc, "javascript") === 0)
             continue;
            if (strpos($lc, "mailto") === 0)
             continue;
            if (strpos($lc, "#") === 0)
             continue;
             
            // Test to see if it is prefixed by the canonical url, or simply by "/"
            $pos = strpos($href, $this->canonical);
            if (substr($href, 0, 1) == "/" || $pos === 0)
            {
             // it's a site internal link
             if ($pos === 0)
                $href = substr($href, strlen($this->canonical)-1);
            }
            else
            {
             // So maybe it's just a bare file name
             if (strpos($href, "/") !== false)
                continue;
            }
            $links[] = $href;
         }
      }
      $info->links = $links;

      return $info;
   }

   public function get_page_info()
   {
      return $this->parse_page($this->url, $this->canonical);
   }

   private function scan_page($up)
   {
      $url = $up->entire;
      if ($url[0] == "/")
         $url = $this->canonical.substr($url, 1);
      $psi = $this->parse_page($url, $this->canonical, $up->nofollow);
      if ($this->xml)
      {
         $file = $up->file;
         if (!$file)
            $file = $this->deffile;
         $path = $this->hostpath.$up->path().$file;
         $up->lmdate = $this->get_lm_date($path);
      }
      
      if ($psi == null)
      {
         $this->scanned[$url] = "";
         $this->supplement .= "Broken link - can't load $url<br>";
         $up->description = "Broken link!";
         return false;
      }

      $title = $psi->title;
      if (strpos($title, "403") !== false)
      {
         $this->scanned[$url] = "";
         $this->supplement .= "Broken link - 403 can't load $url<br>";
         $up->description = "Broken link!";
         return false;
      }
      if (strpos($title, "404") !== false)
      {
         $this->scanned[$url] = "";
         $this->supplement .= "Broken link - 404 can't load $url<br>";
         $up->description = "Broken link!";
         return false;
      }
      if (strpos($title, "301") !== false)
      {
         $this->scanned[$url] = "";
         $this->supplement .= "Redirected link - 301 can't load $url<br>";
         $up->description = "Broken link!";
         return false;
      }

      if (title == "")
         $title = $up->domain.$up->path();
      $up->title = $title;
      $up->description = $psi->description;
      $up->keywords = $psi->keywords;
      if ($up->nofollow)
         return;
      $up->links = $psi->links;
      array_push($this->stack, $up);
   }   

   private function do_build_map($tup)
   {
      $this->scan_page($tup);
      // Mark the root page as scanned, so that links back to it do not
      // cause it to be parsed a second time
      $this->scanned[$this->canonical.substr($tup->path().$tup->file, 1)] = "";
      $pd = $this->top->pda[$this->url_parts->file];
      $pd->title = $tup->title;
      $pd->description = $tup->description;
      $pd->keywords = $tup->keywords;
      $pd->lmdate = $tup->lmdate;
      while (true)
      {
         $up = array_pop($this->stack);
         if ($up === null)
            break;
         while (true)
         {
            $link = array_pop($up->links);
            if ($link === null)
               break;
            $lup = $this->decompose($link, $up->path());
            $uri = $lup->path().$lup->file;
            $toscan = $this->canonical.substr($uri, 1);
            if (array_key_exists($toscan, $this->scanned))
               continue;
            if ($lup->excluded)
            {
               $this->scanned[$toscan] = "";
               continue;
            }
            $cn = $this->top;

            $joints = count($lup->structure);
            for ($j = 0; $j < $joints; $j++)
            {
               if (!array_key_exists($lup->structure[$j], $cn->children))
               {
                  $n = new Node();
                  $cn->children[$lup->structure[$j]] = $n;
                  $cn =$n;
               }
               else
                  $cn = $cn->children[$lup->structure[$j]];
            }
            // Hopefully we are now at a node that comprises an array of file names
            // - possibly empty
            if (!array_key_exists($lup->file, $cn->pda))
            {
               $npd = new PageInfo();
               $npd->file = $lup->file;
               $npd->path = $lup->path();
               $cn->pda[$lup->file] = $npd;
            }
            else
               $npd = $cn->pda[$lup->file];
            $this->scan_page($lup);
            $this->scanned[$toscan] = "";
            $npd->title = $lup->title;
            $npd->description = $lup->description;
            $npd->keywords = $lup->keywords;
            $npd->lmdate = $lup->lmdate;
         }
      }
   }
    
   public function build_map()
   {
      $this->do_build_map($this->url_parts);
   }
   

   public function do_report($n, $indent)
   {
      foreach ($n->pda as $pd)
      {
         echo $pd->toString($indent);
      }
      foreach($n->children as $child)
         $this->do_report($child, $indent+1);
   }
   
   public function report()
   {
      if ($this->xml)
      {
         $this->report_xml();
      }
      else
      {
         $s = 
         "<html><head>".
         "<title>Brits Eye View site map.</title>".
         "<link rel='stylesheet' href='/common/bev.css' type='text/css' />".
         "</head><body class='bevstd' style='padding-left: 50px;'>".
         '<div style="width: 802px; margin: 0 auto; text-align: left">'.
         '<img class="tips" style="border:none;" src="/common/skyline.jpg"'.
         'width="795" height="82" '.
         'title="Manhattan seen across the Hudson river from New Jersey." '.
         'alt="New York panorama image" /></a>'.
         "<h1 class='bighead'>Brits Eye View Site Map</h1><hr>".
         "This is a dynamic site map generated in much the same way ".
         "as the information collected by web crawlers is obtained. ".
         "If the pages on the site have a link to some other page in the site, ".
         "then there will be (or should be ;=)) an entry here.<p />";
         echo $s;
         
         $this->do_report($this->top, 0);
         
         if ($this->supplement)
         {
            echo "<p><hr><h3>Supplementary Information:</h3>";
            echo $this->supplement;
         }
         echo "</body></html>";
      }
   }
   
   function do_report_xml($n, $domain)
   {
      foreach ($n->pda as $pd)
      {
         echo $pd->toXMLString($domain);
      }
      foreach($n->children as $child)
         $this->do_report_xml($child, $domain);
   }
   
   function report_xml()
   {
      header("Content-type: text/xml");
      $domain = substr($this->url, 0, strlen($this->url)-1);
      echo '<?xml version="1.0" encoding="UTF-8"?>'."\n";
      echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'."\n";
      $this->do_report_xml($this->top, $domain);
      echo "</urlset>\n";
   }
   
   function __construct($url, $exclude, $hostpath = "", $deffile = "index.html")
   {
      $this->url = $url;
      $this->url_parts = $this->decompose($url, "/");
      
      $this->top = new Node();
      $pd = new PageInfo();
      $pd->file = $this->url_parts->file;
      $pd->path = "/";
      $pd->title = "Home page";
      $this->top->pda[$pd->file] = $pd;
      
      $this->canonical = $this->get_canonical($this->url_parts);
      $this->scanned = array();
      $this->stack = array();
      
      if ($exclude)
      {
         $a = explode(",", $exclude);
         $this->exclusions = array();
         foreach ($a as $ex)
         {
            $this->exclusions[$ex] = "";
         }
      }
      if ($hostpath)
      {
         $this->hostpath = $hostpath;
         $this->deffile = $deffile;
         $this->xml = true;
      }
   }
   
}

$url = isset($_GET["url"])? $_GET["url"]: "";
$exclude = isset($_GET["exclude"])? $_GET["exclude"]: "";
$hostpath = isset($_GET["hostpath"])? $_GET["hostpath"]: "";
// If you provide 'hostpath', XML sitemap is assumed
$sm = new SiteMapper($url, $exclude, $hostpath);
$sm->build_map();
$sm->report();
?>
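
To run it against your own site, the query string looks something like this (the host and file names are of course made up):

http://www.example.com/sitemapper.php?url=http://www.example.com/&exclude=htmlhelp,oldstuff

and adding the hostpath parameter, a file system path to the web root on the server, switches the output to an XML sitemap:

http://www.example.com/sitemapper.php?url=http://www.example.com/&exclude=htmlhelp,oldstuff&hostpath=/var/www/html/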