Using PhantomJS in Python

Author: 迹忆客 | Last updated: 2023/06/01

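This Python article looks at PhantomJS and how to use it with Selenium, the web automation module for Python. We will also look at why it can be more useful than the other automation web drivers available.

Used together, Selenium and PhantomJS offer distinct advantages for scraping. Some practical code examples follow to make the whole concept easier to understand.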

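Installing PhantomJS

PhantomJS is a headless browser that works with the Selenium web automation module. Unlike Firefox Driver and Chrome Driver, the browser stays completely hidden for the whole run.

In other respects it behaves like any other browser. You can develop your program against Chrome Driver or Firefox Driver and, once it works, switch the web driver to PhantomJS.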
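A minimal sketch of that switch, assuming Selenium 3.x (where `webdriver.PhantomJS` is still available) and browser drivers that Selenium can locate on its own. The helper name `make_browser` and the `SUPPORTED` tuple are hypothetical and exist only for illustration.

```python
SUPPORTED = ("phantomjs", "firefox", "chrome")  # hypothetical list for this sketch


def make_browser(name, phantomjs_path="phantomjs"):
    """Return a Selenium driver for the given browser name."""
    key = name.strip().lower()
    if key not in SUPPORTED:
        raise ValueError("unsupported browser: %r" % name)

    # Imported here so that a bad name fails before Selenium is needed.
    from selenium import webdriver

    if key == "firefox":
        return webdriver.Firefox()
    if key == "chrome":
        return webdriver.Chrome()
    # Headless option; assumes Selenium 3.x, which still ships this class.
    return webdriver.PhantomJS(phantomjs_path)
```

During development, `make_browser("firefox")` opens a visible window that is easy to inspect. Changing the argument to `"phantomjs"` runs the same scraping code without a GUI.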

Because PhantomJS does away with the GUI, it runs faster when executing test runs for our test cases.

Before we can use PhantomJS, we need to install it. On macOS, run the following command.

Example code:

brew install phantomjs

On Windows or Linux, we need to download it from the PhantomJS website and install it ourselves.
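To confirm that the installation worked, something like the following can be used. It is only a sketch: it assumes the executable is named `phantomjs` and is reachable through `PATH`, and the function name `phantomjs_version` is made up for this example.

```python
import shutil
import subprocess


def phantomjs_version(command="phantomjs"):
    """Return the version string PhantomJS reports, or None if it is not on PATH."""
    path = shutil.which(command)
    if path is None:
        return None
    # `phantomjs --version` prints the installed version and exits.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=10,
    )
    return result.stdout.strip()
```

Calling `phantomjs_version()` returns `None` when nothing named `phantomjs` is found, which usually means the install step still needs attention.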

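A Problem Scenario That PhantomJS Solves

Let's look at a sample problem and then try to solve it with PhantomJS and Selenium.

Most modern websites use JavaScript to load their content dynamically.

Consider a site that shows the ATP Singles US Open 2015 tennis results. Once the site has loaded, we can see the scores and match details.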

[Figure: ATP match list]

The content is loaded dynamically, and all of it is done by JavaScript. Now let's disable JavaScript and see what happens.

[Figure: ATP results page with JavaScript disabled]

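With JavaScript disabled in the browser, the content does not load.

What if we want to use Python to download all the matches from this site? Without JavaScript being executed, the page body is incomplete, so the traditional approach of sending a request to the site and parsing the returned HTML will not work.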
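The failure can be illustrated with a small sketch that uses only the standard library. `RAW_HTML` is a hypothetical stand-in for what a plain HTTP request might return from a JavaScript-driven page, not the real response of the site; the `stage-finished` class is the one the scraper later in this article looks for.

```python
from html.parser import HTMLParser

# Hypothetical raw response: the table body stays empty until a script fills it.
RAW_HTML = """
<html><body>
  <table id="results"><tbody></tbody></table>
  <script>loadResults();</script>
</body></html>
"""


class RowCounter(HTMLParser):
    """Count <tr> elements whose class list contains 'stage-finished'."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "tr" and "stage-finished" in classes:
            self.rows += 1


def count_finished_rows(html):
    counter = RowCounter()
    counter.feed(html)
    return counter.rows


print(count_finished_rows(RAW_HTML))  # 0: no rows exist before JavaScript runs
```

A parser only sees the markup it is given. Since the rows are created by a script at load time, something has to execute that script first, which is the job a browser such as PhantomJS takes on.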

Using PhantomJS Together with Selenium

Before we start on the code, we have to set up the environment. To do that, run the following commands:

mkdir scraping_phantomjs && cd scraping_phantomjs
virtualenv venv
source venv/bin/activate
# webdriver.PhantomJS was removed in Selenium 4, so stay on a 3.x release
pip install "selenium<4" beautifulsoup4

With the required packages installed, we move on to the next step and create the Python file that will hold our scraping script.

Example code:

touch scraper.py

With the file created, let's write a script that fetches the HTML of the first match on the website mentioned above.

Example code:

import platform

from bs4 import BeautifulSoup
from selenium import webdriver

# The executable name differs between operating systems, so pick the right one
if platform.system() == 'Windows':
    PHANTOMJS_PATH = './phantomjs.exe'
else:
    PHANTOMJS_PATH = './phantomjs'

# We use the headless PhantomJS browser here, but it can be swapped for
# Firefox as needed (webdriver.PhantomJS requires Selenium 3.x)
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
try:
    browser.get('http://www.scoreboard.com/en/tennis/atp-singles/us-open-2015/results/')
    # Parse the HTML that the browser holds for the page
    soup = BeautifulSoup(browser.page_source, "html.parser")
finally:
    # Always shut down the PhantomJS process, even if loading fails
    browser.quit()

# Find all the finished games listed
games = soup.find_all('tr', {'class': 'stage-finished'})
# Print the HTML for the first game;
# change the index from 0 to 1 in games[] to print the next one
print(games[0].prettify())
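Because the rows are inserted by JavaScript, `browser.page_source` may still be incomplete if it is read too early. Selenium ships its own `WebDriverWait` helper for this situation; the sketch below swaps in a plain polling loop instead, so that it relies on nothing beyond the `page_source` attribute used in the script. The name `wait_for_text` and its default timings are assumptions made for this example.

```python
import time


def wait_for_text(browser, marker, timeout=10.0, poll_interval=0.5):
    """Poll browser.page_source until marker appears.

    Returns True if the marker showed up before the timeout, else False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if marker in browser.page_source:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

In the scraper, a call such as `wait_for_text(browser, 'stage-finished')` would go right after `browser.get(...)` and before the page source is handed to BeautifulSoup.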

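For scraper.py to run, we have to tell it where PhantomJS is located. Make sure to get the PhantomJS build that matches your operating system from the download page mentioned above.

Then unpack the archive; the phantomjs executable is in its bin folder. That file should sit in the same folder as the scraper.py script.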
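A missing or misplaced executable is an easy mistake to make here, so a small check before starting the browser can help. The following is a sketch, assuming the file names `phantomjs` and `phantomjs.exe` used in the script; the helper `local_phantomjs` is hypothetical.

```python
import os
import platform
from pathlib import Path


def local_phantomjs(folder="."):
    """Return the path of the PhantomJS executable in folder, or raise."""
    name = "phantomjs.exe" if platform.system() == "Windows" else "phantomjs"
    candidate = Path(folder) / name
    if not candidate.is_file():
        raise FileNotFoundError("expected PhantomJS at %s" % candidate)
    # On Linux and macOS the file also needs its executable bit set.
    if not os.access(str(candidate), os.X_OK):
        raise PermissionError("%s is not executable" % candidate)
    return str(candidate)
```

The returned string can be passed to `webdriver.PhantomJS(...)` in place of the hard-coded path, and the error message points at the expected location when the file is missing.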

Let's run the script and check whether we get the output we want. To run it, type the following command.

Example code:

python scraper.py

The output is the HTML of the first match. It looks like this:

<tr class="odd no-border-bottom stage-finished" id="g_2_2DtOK9O8">
 <td class="cell_ib icons left ">
 </td>
 <td class="cell_ad time ">
  14.09. 02:20
 </td>
 <td class="cell_ab team-home bold ">
  <span class="padl">
   Djokovic N. (Srb)
  </span>
 </td>
 <td class="cell_ac team-away ">
  <span class="padl">
   Federer R. (Sui)
  </span>
 </td>
 <td class="cell_sa score bold ">
  3 : 1
 </td>
 <td class="cell_ia icons ">
  <span class="icons">
  </span>
 </td>
</tr>
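The row above is still raw HTML. As a possible next step, the sketch below turns such a row into a dictionary using only the standard library. `ROW_HTML` is an abbreviated copy of the output shown above, and `FIELDS`, `RowParser` and `parse_row` are names invented for this example.

```python
from html.parser import HTMLParser

# Abbreviated copy of the row printed by the scraper
ROW_HTML = """
<tr class="stage-finished" id="g_2_2DtOK9O8">
 <td class="cell_ad time ">14.09. 02:20</td>
 <td class="cell_ab team-home bold "><span class="padl">Djokovic N. (Srb)</span></td>
 <td class="cell_ac team-away "><span class="padl">Federer R. (Sui)</span></td>
 <td class="cell_sa score bold ">3 : 1</td>
</tr>
"""

# CSS classes that identify the cells we care about
FIELDS = ("time", "team-home", "team-away", "score")


class RowParser(HTMLParser):
    """Collect the text of each <td> whose class list names a known field."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            classes = (dict(attrs).get("class") or "").split()
            self._field = next((f for f in FIELDS if f in classes), None)

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if self._field and text:
            self.record[self._field] = text


def parse_row(html):
    parser = RowParser()
    parser.feed(html)
    return parser.record


print(parse_row(ROW_HTML)["score"])  # 3 : 1
```

Inside scraper.py, where BeautifulSoup is already available, a lookup such as `games[0].find('td', class_='score').get_text(strip=True)` should give the same score without a custom parser.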