DOMDocument Parser Best Practice Questions

@dani

I have been going through these two working code samples of yours and I have some basic questions.

1

<?php

ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(E_ALL);

//Dan's Code.
//Code from: https://www.daniweb.com/programming/web-development/threads/538868/simplehtmldom-failing#post2291972
//Sitemap Protocol: https://www.sitemaps.org/protocol.html

// Initiate ability to manipulate the DOM and load that baby up
$doc = new DOMDocument();

$message = file_get_contents('https://www.daniweb.com/programming/web-development/threads/538868/simplehtmldom-failing#post2288453');

// https://www.php.net/manual/en/function.libxml-use-internal-errors.php
libxml_use_internal_errors(true);

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($message, LIBXML_NOENT|LIBXML_COMPACT);

// https://www.php.net/manual/en/function.libxml-clear-errors.php
libxml_clear_errors();

// Fetch all <a> tags
$links = $doc->getElementsByTagName('a');

// If <a> tags exist ...
if ($links->length > 0)
{
    // For each <a> tag ...
    foreach ($links AS $link)
    {
        $link->setAttribute('class', 'link-style');
    }
}
// Because we are actually manipulating the DOM, DOMDocument will add complete <html><body> tags we need to strip out
$message = str_replace(array('<body>', '</body>'), '', $doc->saveHTML($doc->getElementsByTagName('body')->item(0)));

?>

2

<?php

ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(E_ALL);

//Dan's Code.
//CODE FROM: https://www.daniweb.com/programming/web-development/threads/540121/how-to-extract-meta-tags-using-domdocument
$url = "https://www.daniweb.com/programming/web-development/threads/540013/how-to-find-does-not-contain-or-does-contain";

// https://www.php.net/manual/en/function.file-get-contents
$html = file_get_contents($url);

//https://www.php.net/manual/en/domdocument.construct.php
$doc = new DOMDocument();

// https://www.php.net/manual/en/function.libxml-use-internal-errors.php
libxml_use_internal_errors(true);

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

// https://www.php.net/manual/en/function.libxml-clear-errors.php
libxml_clear_errors();

//EXTRACT METAS
// https://www.php.net/manual/en/domdocument.getelementsbytagname.php
$meta_tags = $doc->getElementsByTagName('meta');

// https://www.php.net/manual/en/domnodelist.item.php
if ($meta_tags->length > 0)
{
    // https://www.php.net/manual/en/class.domnodelist.php
    foreach ($meta_tags as $tag)
    {
        // https://www.php.net/manual/en/domelement.getattribute.php
        $name = $tag->getAttribute('name');
        $content = $tag->getAttribute('content');
        echo 'Name: ' . $name . '<br>';
        echo 'Content: ' . $content . '<br>';
    }
}

//EXAMPLE 1: EXTRACT TITLE
//CODE FROM: https://www.daniweb.com/programming/web-development/threads/540121/how-to-extract-meta-tags-using-domdocument
$title_tag = $doc->getElementsByTagName('title');
if ($title_tag->length>0)
{
    $title = $title_tag->item(0)->textContent;
    echo 'Title: ' . $title . '<br>';
}

?>

Q1.
In the first sample, you wrote new DOMDocument(); before file_get_contents().
In the second sample, you did it the other way round. By my logic the order should not matter. But which order is best practice, and does it affect how fast PHP handles the job?

Q2.
In both samples, you wrote the following (the flags shown here are the ones from the second sample) ...

// https://www.php.net/manual/en/function.libxml-use-internal-errors.php
libxml_use_internal_errors(true);

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

// https://www.php.net/manual/en/function.libxml-clear-errors.php
libxml_clear_errors();

... after the new DOMDocument() and file_get_contents() calls.

Does it have to be in this order, or can I put these three error-handling lines before new DOMDocument() and file_get_contents()?
By my logic the order should not matter. But which order is best practice, and does it affect how fast PHP handles the job?

I would prefer to put them at the top. Is this OK?
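
To make Q2 concrete, here is a minimal sketch of the ordering I have in mind. It uses a made-up inline HTML string instead of file_get_contents() so it runs without a network call, and $previous and $linkCount are my own names, not from your code. As far as I understand, libxml_use_internal_errors(true) only needs to run before loadHTML(), whereas libxml_clear_errors() only does something useful after loadHTML() has filled the error buffer, so I have left that one where it was. Please correct me if that assumption is wrong.

```php
<?php
// Hypothetical inline markup standing in for the file_get_contents() result.
$html = '<html><head><title>Demo</title></head>'
      . '<body><a href="/x">x</a><p>Unclosed</body></html>';

// Moved to the top: buffer libxml errors instead of emitting PHP warnings.
// The call returns the previous setting, which is restored further down.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_COMPACT | LIBXML_NOERROR | LIBXML_NOWARNING);

// Kept after loadHTML(): this empties whatever loadHTML() buffered.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$linkCount = $doc->getElementsByTagName('a')->length;
echo "Links found: {$linkCount}\n";
```

If this ordering is fine, I would use the same layout with the real file_get_contents($url) call in place of the inline string.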

Q3.
In the first sample, you passed these flags to loadHTML() ...

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($message, LIBXML_NOENT|LIBXML_COMPACT);

... while in the second sample, you passed a different set ...

// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

... Why did you do it this way? What is the significance of using different flags?

Q3A. What issues will I face if I swap the two sets of flags?
Q3B. What is the reasoning behind the choice you made in each sample?
Q3C. What is the real difference between the two sets of flags?
Q3D. What do LIBXML_NOENT and LIBXML_COMPACT mean?
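
For Q3, this is a small sketch I put together to compare the two flag sets side by side. From the PHP manual's libxml constants page, my reading is that LIBXML_NOENT substitutes entities, LIBXML_COMPACT turns on a small-node allocation optimization, and LIBXML_NOERROR and LIBXML_NOWARNING suppress error and warning reports. The helper countParseErrors() and the sloppy HTML string are my own inventions for the test, not from your code, and I expect the counts to depend on the installed libxml2 version, so I am not assuming any particular output.

```php
<?php
// Hypothetical helper (not from Dani's code): load $html with the given
// flags and return how many libxml errors ended up in the internal buffer.
function countParseErrors(string $html, int $flags): int
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors(); // start from an empty buffer

    $doc = new DOMDocument();
    $doc->loadHTML($html, $flags);

    $count = count(libxml_get_errors());

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $count;
}

// Made-up, deliberately sloppy markup.
$html = '<html><body><section><a href="/x">x</a><p>Unclosed</body></html>';

$results = [
    'LIBXML_NOENT|LIBXML_COMPACT'
        => countParseErrors($html, LIBXML_NOENT | LIBXML_COMPACT),
    'LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING'
        => countParseErrors($html, LIBXML_COMPACT | LIBXML_NOERROR | LIBXML_NOWARNING),
];

foreach ($results as $label => $count) {
    echo "{$label}: {$count} buffered error(s)\n";
}
```

If the two counts differ on your setup, that would presumably be the practical difference between the two calls as far as error reporting goes.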

Q4. Is there anything else I need to know?