XML Sitemap parser
An easy-to-use PHP library to parse XML Sitemaps compliant with the Sitemaps.org protocol.
The Sitemaps.org protocol is the leading standard and is supported by Google, Bing, Yahoo, Ask and many others.
Features
- Basic parsing
- Recursive parsing
- String parsing
- Custom User-Agent string
- Proxy support
Formats supported
- XML (.xml)
- Compressed XML (.xml.gz)
- Robots.txt rule sheet (robots.txt)
- Line separated text (disabled by default)
Requirements:
- PHP 5.6 or 7.0+, alternatively HHVM
- PHP extensions:
Installation
The library can be installed via Composer. Add this to your composer.json file:
    {
        "require": {
            "vipnytt/sitemapparser": "^1.0"
        }
    }
Then run composer update.
Getting Started
Basic example
Returns a list of URLs only.
    use vipnytt\SitemapParser;
    use vipnytt\SitemapParser\Exceptions\SitemapParserException;

    try {
        $parser = new SitemapParser();
        $parser->parse('http://php.net/sitemap.xml');
        foreach ($parser->getURLs() as $url => $tags) {
            echo $url . '<br>';
        }
    } catch (SitemapParserException $e) {
        echo $e->getMessage();
    }
Advanced
Returns all available tags, for both Sitemaps and URLs.
    use vipnytt\SitemapParser;
    use vipnytt\SitemapParser\Exceptions\SitemapParserException;

    try {
        $parser = new SitemapParser('MyCustomUserAgent');
        $parser->parse('http://php.net/sitemap.xml');
        foreach ($parser->getSitemaps() as $url => $tags) {
            echo 'Sitemap<br>';
            echo 'URL: ' . $url . '<br>';
            echo 'LastMod: ' . $tags['lastmod'] . '<br>';
            echo '<hr>';
        }
        foreach ($parser->getURLs() as $url => $tags) {
            echo 'URL: ' . $url . '<br>';
            echo 'LastMod: ' . $tags['lastmod'] . '<br>';
            echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
            echo 'Priority: ' . $tags['priority'] . '<br>';
            echo '<hr>';
        }
    } catch (SitemapParserException $e) {
        echo $e->getMessage();
    }
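The loops above read tags such as lastmod, changefreq and priority directly from the $tags array. Those tags are optional in the Sitemaps.org protocol, so a given entry may not carry all of them. The following is a minimal sketch of a defensive lookup, assuming the per-URL tag array may have missing or empty keys. The helper name tagOrDefault and the sample data are hypothetical and not part of the library.

```php
<?php
/**
 * Return the value of a tag, or a fallback when the tag is missing or empty.
 * Hypothetical helper for illustration; not part of the library.
 *
 * @param array  $tags     Tag array for one URL, e.g. ['lastmod' => '2016-01-01']
 * @param string $name     Tag name to look up
 * @param string $fallback Value returned when the tag is absent or empty
 * @return string
 */
function tagOrDefault(array $tags, $name, $fallback = 'n/a')
{
    if (isset($tags[$name]) && $tags[$name] !== '') {
        return (string) $tags[$name];
    }
    return $fallback;
}

// Hypothetical sample data shaped like the arrays iterated above (URL => tags).
$urls = [
    'http://example.com/'      => ['lastmod' => '2016-01-01', 'priority' => '0.8'],
    'http://example.com/about' => [],
];

foreach ($urls as $url => $tags) {
    echo 'URL: ' . $url . PHP_EOL;
    echo 'LastMod: ' . tagOrDefault($tags, 'lastmod') . PHP_EOL;
    echo 'Priority: ' . tagOrDefault($tags, 'priority') . PHP_EOL;
}
```

In the examples above, the same helper could wrap each $tags lookup so that a missing optional tag prints a placeholder instead of triggering an undefined index notice.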
Recursive
Parses any sitemap detected while parsing, to get a complete list of URLs.
    use vipnytt\SitemapParser;
    use vipnytt\SitemapParser\Exceptions\SitemapParserException;

    try {
        $parser = new SitemapParser('MyCustomUserAgent');
        $parser->parseRecursive('http://www.google.com/robots.txt');
        echo '<h2>Sitemaps</h2>';
        foreach ($parser->getSitemaps() as $url => $tags) {
            echo 'URL: ' . $url . '<br>';
            echo 'LastMod: ' . $tags['lastmod'] . '<br>';
            echo '<hr>';
        }
        echo '<h2>URLs</h2>';
        foreach ($parser->getURLs() as $url => $tags) {
            echo 'URL: ' . $url . '<br>';
            echo 'LastMod: ' . $tags['lastmod'] . '<br>';
            echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
            echo 'Priority: ' . $tags['priority'] . '<br>';
            echo '<hr>';
        }
    } catch (SitemapParserException $e) {
        echo $e->getMessage();
    }
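The recursive example starts from a robots.txt file, which can point to sitemaps through "Sitemap:" directives. To illustrate what such a discovery step involves, here is a rough sketch that pulls those directives out of robots.txt content. It is not the library's own implementation; the function name extractSitemapDirectives and the sample content are hypothetical.

```php
<?php
/**
 * Extract the URLs of "Sitemap:" directives from robots.txt content.
 * Illustrative sketch only; not the library's own code.
 *
 * @param string $robotsTxt Raw robots.txt content
 * @return string[] Sitemap URLs in the order they appear
 */
function extractSitemapDirectives($robotsTxt)
{
    $sitemaps = [];
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // The directive name is matched case-insensitively. Lines starting
        // with "#" do not match because the pattern is anchored at line start.
        if (preg_match('/^\s*sitemap\s*:\s*(\S+)/i', $line, $match)) {
            $sitemaps[] = $match[1];
        }
    }
    return $sitemaps;
}

// Hypothetical robots.txt content.
$robotsTxt = "User-agent: *\n"
    . "Disallow: /private/\n"
    . "Sitemap: http://example.com/sitemap.xml\n"
    . "sitemap: http://example.com/news.xml.gz\n";

foreach (extractSitemapDirectives($robotsTxt) as $sitemapUrl) {
    echo $sitemapUrl . PHP_EOL;
}
```

Each URL found this way is itself a sitemap or sitemap index, which is why a recursive parse is needed to reach the complete list of page URLs.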
Parsing of line separated text strings
Note: This is disabled by default to avoid false positives when XML is expected but plain text is fetched instead.
To disable strict mode, pass ['strict' => false] as the second constructor parameter.
    use vipnytt\SitemapParser;
    use vipnytt\SitemapParser\Exceptions\SitemapParserException;

    try {
        $parser = new SitemapParser('MyCustomUserAgent', ['strict' => false]);
        $parser->parse('https://www.xml-sitemaps.com/urllist.txt');
        foreach ($parser->getSitemaps() as $url => $tags) {
            echo $url . '<br>';
        }
        foreach ($parser->getURLs() as $url => $tags) {
            echo $url . '<br>';
        }
    } catch (SitemapParserException $e) {
        echo $e->getMessage();
    }
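To make the false-positive risk more concrete, here is a minimal sketch of what a line-separated parser might do: treat every line that looks like a valid URL as an entry. This is an illustration under that assumption, not the library's actual logic; parseUrlList and the sample text are hypothetical.

```php
<?php
/**
 * Collect every line of a plain text document that is a valid URL.
 * Illustrative sketch only; not the library's own code.
 *
 * @param string $text Line-separated text
 * @return string[] Lines that validate as URLs
 */
function parseUrlList($text)
{
    $urls = [];
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $line = trim($line);
        // Skip blank lines and anything that does not validate as a URL.
        if ($line !== '' && filter_var($line, FILTER_VALIDATE_URL) !== false) {
            $urls[] = $line;
        }
    }
    return $urls;
}

// Hypothetical URL list with one invalid line.
$text = "http://example.com/\n\nnot a url\nhttps://example.com/page\n";

foreach (parseUrlList($text) as $url) {
    echo $url . PHP_EOL;
}
```

A parser this permissive would also accept a stray URL on its own line in an error page or other unrelated text response, which is the kind of false positive the strict default is meant to prevent.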
Additional examples
More examples are available in the examples directory.
Configuration
Available configuration options, with their default values:
    $config = [
        'strict' => true, // (bool) Disallow parsing of line-separated plain text
        'guzzle' => [
            // GuzzleHttp request options
            // http://docs.guzzlephp.org/en/latest/request-options.html
        ],
    ];
    $parser = new SitemapParser('MyCustomUserAgent', $config);
If a User-Agent is also set through the GuzzleHttp request options, it takes priority and replaces the User-Agent passed to the constructor.
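As an illustration of the guzzle key, the sketch below fills it with two standard GuzzleHttp request options, proxy and timeout. It assumes that proxy support is configured through these request options, which the configuration layout suggests but this document does not state outright. The proxy address and timeout value are hypothetical.

```php
<?php
// Hypothetical proxy address and timeout; adjust for your environment.
$config = [
    'strict' => true,
    'guzzle' => [
        'proxy'   => 'http://proxy.example.com:8080', // GuzzleHttp "proxy" request option
        'timeout' => 10,                              // GuzzleHttp "timeout" request option, in seconds
    ],
];

// The parser is only created when the library has been loaded
// (for example through Composer's vendor/autoload.php).
if (class_exists('vipnytt\SitemapParser')) {
    $parser = new \vipnytt\SitemapParser('MyCustomUserAgent', $config);
}
```

Any other option from the GuzzleHttp request options documentation can be added to the same array in the same way.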