Quickly extract URLs from an XML sitemap file

The following short PHP script extracts the URLs from a standard WordPress XML sitemap, or from any other website sitemap that adheres to the sitemap schema.


<?php
# extract-urls.php
#
# Extract only the URLs from an XML sitemap.
# Sitemap schema: https://www.sitemaps.org/protocol.html


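if (count($argv) < 2) {
    // Write the error to stderr so it does not end up in redirected output.
    fwrite(STDERR, "Error: Invalid number of arguments. Specify an input XML file." . PHP_EOL);
    exit(1);
}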

$xml_filename = $argv[1];

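if (file_exists($xml_filename)) 
{
    // simplexml_load_file() returns false if the file cannot be parsed.
    $xml = simplexml_load_file($xml_filename);

    if ($xml === false) {
        fwrite(STDERR, 'Failed to parse XML file.' . PHP_EOL);
        exit(1);
    }

    // Only handle regular sitemaps, whose root element is <urlset>.
    if($xml->getName() == 'urlset')
    {
        $children = $xml->children();
        foreach($children as $child) 
        {
            // Each <url> entry holds its address in a <loc> element.
            if($child->getName() == 'url')
            {
                echo $child->loc . PHP_EOL;
            }
        }
    }
} else {
    fwrite(STDERR, 'Failed to open XML file.' . PHP_EOL);
    exit(1);
}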

You can then run it from the command line. The following command extracts the URLs from the example ‘XML-Sitemap.xml’ file and redirects them to a text file.

c:\tools>php extract-urls.php XML-Sitemap.xml > urls.txt
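
If you would rather use the URLs inside another PHP script than redirect them to a file, the same loop can be wrapped in a function that returns an array. The following is only a minimal sketch: the function name extract_sitemap_urls() and the example.com sample sitemap are made up for illustration and are not part of the script above. It assumes PHP 7 or later with the SimpleXML extension enabled.

```php
<?php
// Hypothetical helper (not part of the original script): returns the <loc>
// values of a sitemap as an array of strings instead of echoing them.
function extract_sitemap_urls(string $xml_source): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xml_source);

    // simplexml_load_string() returns false when the XML cannot be parsed.
    // Anything whose root element is not <urlset> is ignored as well.
    if ($xml === false || $xml->getName() !== 'urlset') {
        return [];
    }

    $urls = [];
    foreach ($xml->children() as $child) {
        if ($child->getName() === 'url') {
            // Cast the SimpleXMLElement to a string and trim stray whitespace.
            $urls[] = trim((string) $child->loc);
        }
    }
    return $urls;
}

// Made-up sample sitemap with placeholder example.com addresses.
$sample = <<<'XML'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2020-01-01</lastmod></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>
XML;

$urls = extract_sitemap_urls($sample);
echo implode(PHP_EOL, $urls) . PHP_EOL;
```

Run on the sample above, this prints the two example.com addresses, one per line. Returning an array rather than echoing makes it easy to filter, count, or de-duplicate the URLs before writing them anywhere.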