Parsing HTML in PHP

Author: 迹忆客 | Last updated: 2023/03/27

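Parsing HTML lets us turn a document's content or markup into strings, which makes it easier to analyze the page or to generate dynamic HTML files. In more detail, a parser takes the raw HTML code, reads it, builds a DOM tree of objects covering everything from paragraphs to headings, and lets us extract the information we need.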

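We parse HTML files with built-in libraries, and sometimes with third-party libraries, for web scraping or content analysis in PHP. Whatever the method, the goal is to turn the body of the HTML document into strings so that each HTML tag can be extracted.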

This article covers the built-in DOMDocument class and two third-party libraries, simplehtmldom and DiDOM.
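To make the idea of a DOM tree concrete, here is a minimal sketch that loads a small HTML string with the built-in DOMDocument class and lists the element children of `<body>`. The inline sample string and the `$summary` variable are assumptions made for illustration; they are not part of the article's example file.

```php
<?php

// Hypothetical inline sample; any HTML string would do.
$html = '<html><body><h2>Title</h2><p>First paragraph.</p></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Walk the direct children of <body> and record "tag: text" for each element.
$body = $dom->getElementsByTagName('body')->item(0);
$summary = [];

foreach ($body->childNodes as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $summary[] = $node->nodeName . ': ' . trim($node->textContent);
    }
}

print_r($summary);
```

Each entry in `$summary` corresponds to one node of the tree, which is the structure the parsers in the following sections query.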

Parse HTML in PHP Using `DOMDocument`

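Whether the source is a local HTML file or an online web page, the DOMDocument and DOMXPath classes help parse the HTML and make its elements available as strings that we can print or collect in an array.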

Let's write a function that parses the following HTML file and returns its headings, subheadings, and paragraphs.

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8" />
        <meta http-equiv="X-UA-Compatible" content="IE=edge" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <title>Document</title>
    </head>
    <body>
        <h2 class="main">Welcome to the Abode of PHP</h2>
        <p class="special">
            PHP has been the saving grace of the internet from its inception, it
            runs over 70% of website on the internet
        </p>
        <h3>Understanding PHP</h3>
        <p>
            Lorem ipsum dolor, sit amet consectetur adipisicing elit. Eum minus
            eos cupiditate earum et optio culpa, eligendi facilis laborum
            dolore.
        </p>
        <h3>Using PHP</h3>
        <p>
            Lorem ipsum dolor, sit amet consectetur adipisicing elit. Eum minus
            eos cupiditate earum et optio culpa, eligendi facilis laborum
            dolore.
        </p>
        <h3>Install PHP</h3>
        <p>
            Lorem ipsum dolor, sit amet consectetur adipisicing elit. Eum minus
            eos cupiditate earum et optio culpa, eligendi facilis laborum
            dolore.
        </p>
        <h3>Configure PHP</h3>
        <p>
            Lorem ipsum dolor, sit amet consectetur adipisicing elit. Eum minus
            eos cupiditate earum et optio culpa, eligendi facilis laborum
            dolore.
        </p>

        <h2 class="main">Welcome to the Abode of JS</h2>
        <p class="special">
            PHP has been the saving grace of the internet from its inception, it
            runs over 70% of website on the internet
        </p>
        <h3>Understanding JS</h3>
        <p>
            Lorem ipsum dolor, sit amet consectetur adipisicing elit. Eum minus
            eos cupiditate earum et optio culpa, eligendi facilis laborum
            dolore.
        </p>
    </body>
</html>

PHP code:

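<?php

$html = 'index.html';

// Print the text content of every element with the given tag name.
function getRootElement($element, $html)
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();

    // Set before loading; assigning it after loadHTML() has no effect
    // on the tree that has already been parsed.
    $dom->preserveWhiteSpace = false;

    $dom->loadHTML(file_get_contents($html));
    libxml_clear_errors();

    $content = $dom->getElementsByTagName($element);

    foreach ($content as $each) {
        echo $each->nodeValue;
        echo "\n";
    }
}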

echo "The H2 contents are:\n";
getRootElement("h2", $html);
echo "\n";

echo "The H3 contents are:\n";
getRootElement("h3", $html);
echo "\n";

echo "The Paragraph contents include\n";
getRootElement("p", $html);
echo "\n";

The output of the code snippet is:

The H2 contents are:
Welcome to the Abode of PHP
Welcome to the Abode of JS

The H3 contents are:
Understanding PHP
Using PHP
Install PHP
Configure PHP
Understanding JS

The Paragraph contents include

PHP has been the saving grace of the internet from its inception, it
runs over 70% of website on the internet

...
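The DOMXPath class mentioned at the start of this section is not used in the example above. As a minimal sketch, it can select elements by class, which `getElementsByTagName()` alone cannot do. The inline sample and the `textsByClass()` helper are hypothetical names introduced for illustration.

```php
<?php

// Hypothetical inline sample that mirrors part of the article's HTML file.
$html = <<<HTML
<html><body>
<h2 class="main">Welcome to the Abode of PHP</h2>
<p class="special">Intro paragraph</p>
<p>Plain paragraph</p>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Hypothetical helper: return the trimmed text of every $tag element
// whose class attribute contains $class as a whole word.
function textsByClass(DOMXPath $xpath, string $tag, string $class): array
{
    $query = sprintf(
        '//%s[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $tag,
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }

    return $texts;
}

print_r(textsByClass($xpath, 'h2', 'main'));
print_r(textsByClass($xpath, 'p', 'special'));
```

The helper returns an array instead of echoing, so the caller decides whether to print the text or process it further. The tag and class names are interpolated into the query without escaping, so this sketch assumes they come from trusted code rather than user input.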

Parse HTML in PHP Using `simplehtmldom`

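For extra features such as CSS-style selectors, you can use a third-party library called Simple HTML DOM Parser, a simple and fast PHP parser. Download it and include or require its single PHP file.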

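With it you can easily parse all the elements you need. As in the previous section, we read the file contents first, then pass them to the str_get_html() function, which parses the HTML string. The find() method then looks up specific HTML elements or tags.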

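To find elements with a specific class, we check the class property of each element returned by find(). To get the actual text, we read the element's innertext property and store it in an array.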

Using the same HTML file as in the previous section, let's parse it with simplehtmldom.

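<?php

require_once('simple_html_dom.php');

// Print the inner text of every $tag element whose class attribute equals $class.
function getByClass($tag, $class)
{
    $content = [];

    $html_string = file_get_contents('index.html');

    // Parse the HTML string into a simplehtmldom object
    $html = str_get_html($html_string);

    foreach ($html->find($tag) as $element) {
        if ($element->class === $class) {
            // Append to $content (the original pushed to an undefined $heading)
            $content[] = $element->innertext;
        }
    }

    print_r($content);
}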

getByClass("h2", "main");
getByClass("p", "special");

The output of the code snippet is:

Array
(
    [0] => Welcome to the Abode of PHP
    [1] => Welcome to the Abode of JS
)
Array
(
    [0] =>               PHP has been the saving grace of the internet from its inception, it              runs over 70% of website on the internet
    [1] =>               PHP has been the saving grace of the internet from its inception, it              runs over 70% of website on the internet
)
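The paragraph text in the output above still carries the indentation of the source HTML. A minimal sketch of cleaning it up with plain PHP string functions follows; `normalizeWhitespace()` is a hypothetical helper name, not part of simplehtmldom.

```php
<?php

// Hypothetical helper: collapse every run of whitespace (spaces, tabs,
// newlines) into a single space and trim both ends.
function normalizeWhitespace(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Sample text shaped like the inner text of the "special" paragraph.
$raw = "\n            PHP has been the saving grace of the internet from its inception, it\n            runs over 70% of website on the internet\n        ";

echo normalizeWhitespace($raw), "\n";
```

Applying such a helper to each `innertext` value before storing it would give single-line strings without the leading padding.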

Parse HTML in PHP Using `DiDOM`

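This third-party PHP library is installed through Composer, a PHP dependency manager that lets us manage all the libraries and dependencies of a project. DiDOM is available on GitHub and is presented as offering better speed and memory use than other libraries.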

If you do not have Composer yet, install it first. Once it is available, the following command adds the DiDOM library to your project.

composer require imangazaliev/didom

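After that, you can use the code below, which is structured much like the simplehtmldom example and also relies on a find() method. The text() method converts the content of an HTML element into a string that we can use in our code.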

The has() method checks whether the document contains an element or class matching the given selector and returns a boolean.

<?php

use DiDom\Document;

require_once('vendor/autoload.php');

$html = 'index.html';

// The second argument (true) tells DiDOM that $html is a file path, not an HTML string
$document = new Document($html, true);

echo "H3 Element\n";

if ($document->has('h3')) {
    $elements = $document->find('h3');
    foreach ($elements as $element) {
        echo $element->text();
        echo "\n";
    }
}

echo "\nElement with the Class 'main'\n";

if ($document->has('.main')) {
    $elements = $document->find('.main');
    foreach ($elements as $element) {
        echo $element->text();
        echo "\n";
    }
}

The output of the code snippet is:

H3 Element
Understanding PHP
Using PHP
Install PHP
Configure PHP
Understanding JS

Element with the Class 'main'
Welcome to the Abode of PHP
Welcome to the Abode of JS
