How to Fix the urllib HTTP Error 403 Forbidden Message in Python

Author: 迹忆客 | Last updated: 2023/05/16

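This article explains how to handle the error message (exception) urllib.error.HTTPError: HTTP Error 403: Forbidden, which the urllib.error module raises on behalf of urllib.request when a request hits a forbidden resource.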

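The urllib Module in Python

The urllib module is Python's standard library package for working with URLs over different protocols. It is popular for web scraping, where a program fetches data from a particular website.

urllib contains classes, methods, and functions for operations such as fetching URLs, parsing URLs, and reading robots.txt. It is made up of four submodules: request, error, parse, and robotparser.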

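Check robots.txt to Prevent the urllib HTTP Error 403 Forbidden Message

When we use the urllib.request module to talk to a server, we may run into specific errors. One of them is the HTTP 403 error.

We get the urllib.error.HTTPError: HTTP Error 403: Forbidden message when reading a URL. HTTP 403, the Forbidden error, is an HTTP status code indicating that the server refuses the client access to the requested resource.

So when we see urllib.error.HTTPError: HTTP Error 403: Forbidden, the server understood the request but decided not to process or authorize it.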
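Because the 403 arrives as an exception, a script can catch it and inspect the status code instead of crashing. The following is a minimal sketch of that idea, not a definitive implementation: fetch_status and forbidden_opener are hypothetical names, and forbidden_opener is a stand-in for urlopen that always raises a 403 so the example runs without network access.

```python
import urllib.error
import urllib.request
from email.message import Message


def fetch_status(url, opener=urllib.request.urlopen):
    """Return (status_code, body); body is None on an HTTP error."""
    try:
        with opener(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # err.code is the numeric status, err.reason the text ("Forbidden").
        print(f"Request to {url} failed: {err.code} {err.reason}")
        return err.code, None


def forbidden_opener(url):
    """Hypothetical stand-in for urlopen that always answers 403."""
    raise urllib.error.HTTPError(url, 403, "Forbidden", Message(), None)


status, body = fetch_status("https://example.com/private", opener=forbidden_opener)
print(status, body)
```

Calling fetch_status(url) without the opener argument would use the real urlopen and make an actual request.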

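To understand why the website we are accessing will not process our request, we need to check an important file, robots.txt. Before scraping or otherwise interacting with a website, it is generally advisable to review this file so that we know what to expect and avoid further trouble.

To view it for any website, we can use the following URL format.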

https://<website.com>/robots.txt

For example, check the robots.txt files of Google, Amazon, and Baidu.

https://www.google.com/robots.txt
https://www.amazon.com/robots.txt
https://www.baidu.com/robots.txt

A robots.txt file looks like the following. As its header comment indicates, this listing is the robots.txt file for YouTube.

# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid-'90s wiped out all humans.

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /channel/*/community
Disallow: /comment
Disallow: /get_video
Disallow: /get_video_info
Disallow: /get_midroll_info
Disallow: /live_chat
Disallow: /login
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /user/*/community
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax

Sitemap: https://www.youtube.com/sitemaps/sitemap.xml
Sitemap: https://www.youtube.com/product/sitemap.xml

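We can see many Disallow directives there. Each Disallow directive names an area of the website that crawlers are asked not to access. robots.txt itself is only a statement of policy, but sites often back it up with server-side checks, so requests to such areas may be refused with a 403.

Other robots.txt files may also contain Allow directives. In the listing above, for example, the /comment path is disallowed for every user agent, and that includes requests made with the urllib module.

Let's write code that scrapes data from a website that returns an HTTP 403 error when accessed.

Example code: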

import urllib.request
import re

webpage = urllib.request.urlopen('https://www.cmegroup.com/markets/products.html?redirect=/trading/products/#cleared=Options&sortField=oi').read()
# read() returns bytes, so the patterns are bytes as well.
findrows = re.compile(rb'<tr class="- banding(?:On|Off)">(.*?)</tr>')
findlink = re.compile(rb'<a href =">(.*)</a>')

row_array = re.findall(findrows, webpage)
links = re.findall(findlink, webpage)

print(len(row_array))

Output:

Traceback (most recent call last):
  File "c:\Users\akinl\Documents\Python\index.py", line 7, in <module>
    webpage = urllib.request.urlopen('https://www.cmegroup.com/markets/products.html?redirect=/trading/products/#cleared=Options&sortField=oi').read()
  File "C:\Python310\lib\urllib\request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python310\lib\urllib\request.py", line 525, in open
    response = meth(req, response)
  File "C:\Python310\lib\urllib\request.py", line 634, in http_response
    response = self.parent.error(
  File "C:\Python310\lib\urllib\request.py", line 563, in error
    return self._call_chain(*args)
  File "C:\Python310\lib\urllib\request.py", line 496, in _call_chain
    result = func(*args)
  File "C:\Python310\lib\urllib\request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

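The reason is that we are forbidden from accessing the website. If we check its robots.txt file, we notice that https://www.cmegroup.com/markets/ has no Disallow directive. Further down the same robots.txt file, however, we find the following.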

User-agent: Python-urllib/1.17
Disallow: /

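The text above means that the user agent named Python-urllib is not allowed to crawl any URL on the site. In other words, the site does not permit scraping with Python's urllib module.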
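To see why a rule for Python-urllib applies to our script, it helps to look at the User-Agent that urllib sends when we do not set one. The following is a minimal sketch that makes no network request; the variable names are illustrative.

```python
import urllib.request

# build_opener() returns an OpenerDirector whose addheaders list holds
# the headers urllib adds to every request by default.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The value has the form "Python-urllib/<major>.<minor>".
print(default_headers["User-agent"])
```

A server can recognize this value and reject the request, regardless of which path was requested.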

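Therefore, check or parse robots.txt to learn which resources we may access. We can parse the robots.txt file with the urllib.robotparser module, which helps keep our code from running into the urllib.error.HTTPError: HTTP Error 403: Forbidden message.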
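As a rough illustration of parsing robots.txt, the sketch below feeds a small set of rules to RobotFileParser and asks whether a given user agent may fetch a URL. The rules, the MyCrawler name, and the example.com URLs are assumptions made up for this example, and the rules are inlined so the code runs offline. Note that the sample rule names the agent as Python-urllib without a version number.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: Python-urllib
Disallow: /

User-agent: *
Disallow: /login
"""


def build_parser(robots_text):
    """Parse robots.txt content that is already available as a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# urllib's default agent is blocked everywhere by the first rule group.
print(parser.can_fetch("Python-urllib/3.10", "https://example.com/markets/"))
# Other agents fall under the "*" group, which only blocks /login.
print(parser.can_fetch("MyCrawler/1.0", "https://example.com/markets/"))
print(parser.can_fetch("MyCrawler/1.0", "https://example.com/login"))
```

For a live site, RobotFileParser also offers set_url() and read() to download the file before calling can_fetch().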

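Add a Cookie to the Request Headers to Solve the urllib HTTP Error 403 Forbidden Message

Passing a valid User-Agent as a header parameter often resolves the problem quickly. The website may also use cookies as an anti-scraping measure.

The website may set cookies and require them to be echoed back in order to deter scraping, which may violate its policy.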

from urllib.request import Request, urlopen


def get_page_content(url, head):
    # Attach the custom headers to the request before opening it.
    req = Request(url, headers=head)
    return urlopen(req)


url = 'https://example.com'
head = {
    # Browser-like headers in place of urllib's default User-Agent.
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Referer': 'https://example.com',
    # Replace with the cookie value your browser sends for this page.
    'Cookie': 'your cookie value (you can get it from your web page)',
}

data = get_page_content(url, head).read()
print(data)

Output:

<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta
'
'
'
<p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n</body>\n</html>\n'

In most cases, passing a valid User-Agent as a header parameter is enough to resolve the problem.
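If the site sets cookies and expects them back, one option is to let urllib manage them instead of pasting a cookie string by hand. The following is a minimal sketch using the standard library's http.cookiejar; make_opener and the User-Agent string are illustrative, and nothing here is specific to any one site.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_opener(user_agent):
    """Build an opener that sends user_agent and replays server cookies."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # Replace urllib's default User-agent header with our own.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_opener("Mozilla/5.0")
# No request has been made yet, so the jar is still empty.
print(len(jar))
```

Calling opener.open(url) would then store any cookies the server sets in jar and send them back on later requests made through the same opener.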

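Use a Session Object to Solve the urllib HTTP Error 403 Forbidden Message

Sometimes even a User-Agent header does not prevent this error. In that case, we can use the Session object from the requests module.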

# requests is a third-party package (pip install requests).
import requests

url = "https://stackoverflow.com/search?q=html+error+403"
session_obj = requests.Session()
response = session_obj.get(url, headers={"User-Agent": "Mozilla/5.0"})

print(response.status_code)

The script prints the HTTP status code of the response; 200 means the request succeeded.

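This article covered the causes of urllib.error.HTTPError: HTTP Error 403: Forbidden and ways to handle it. The error typically comes from mod_security or a similar server-side security feature, since different websites use different mechanisms to tell humans apart from automated programs (bots).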
