Web Scraping Pitfalls: an etree.HTML Parsing Exception

Posted by Jack on 2019-08-13
Words 645 and Reading Time 3 Minutes

You are bound to run into all kinds of problems while writing web scrapers. Here I'd like to share one involving an etree.HTML parsing exception.

1. Problem Description
A common scraping pattern is to fetch a page's HTML with requests.get(), parse its structure with etree.HTML from the lxml library, and finally extract the content you need via XPath.

My scraper code can be abstracted roughly as follows:

res = requests.get(url)
html = etree.HTML(res.text)
contents = html.xpath('//div/xxxx')

I then hit the following error:

Traceback (most recent call last):
  File "xxxxxxxx.py", line 157, in <module>
    get_website_title_content(url)
  File "xxxxxxxx.py", line 141, in get_website_title_content
    html = etree.HTML(html_text)
  File "src\lxml\etree.pyx", line 3170, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1872, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

The key part is: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
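To see why lxml complains, note that a Python str is already decoded text, so an XML prolog inside it that declares an encoding could contradict reality. The sketch below illustrates the kind of check involved (this regex is purely illustrative and is not lxml's actual implementation):

```python
import re

# Illustrative check (assumption: not lxml's real code): does the text
# start with an XML declaration that names an encoding?
_ENC_DECL = re.compile(r'^\s*<\?xml[^>]*\bencoding\s*=', re.IGNORECASE)

def has_encoding_declaration(text: str) -> bool:
    """Return True if the string carries an XML encoding declaration."""
    return bool(_ENC_DECL.match(text))

print(has_encoding_declaration('<?xml version="1.0" encoding="utf-8"?><html/>'))  # True
print(has_encoding_declaration('<html><body>hi</body></html>'))                   # False
```

A plain HTML fragment with no declaration would parse fine even as a str; it is the declaration-plus-str combination that lxml rejects.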

2. Solution
After some digging, the root cause turns out to be the difference between res.text and res.content in requests. Looking at how text and content are defined in the requests source (shown below), res.text returns a Unicode str, while res.content returns bytes.

@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        if self._content_consumed:
            raise RuntimeError(
                'The content for this response was already consumed')

        if self.status_code == 0 or self.raw is None:
            self._content = None
        else:
            self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
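The difference is easy to see without any network call. The snippet below mimics the decoding step of the text property above, with a hard-coded byte string standing in for res.content:

```python
# Raw HTTP body bytes, standing in for res.content (assumption: a utf-8 page)
raw = '<?xml version="1.0" encoding="utf-8"?><html><body>你好</body></html>'.encode('utf-8')

# res.content: the bytes as received over the wire
content = raw

# res.text: the same bytes decoded into a unicode str, just as the `text`
# property does with str(self.content, encoding, errors='replace')
text = str(content, 'utf-8', errors='replace')

print(type(content))  # <class 'bytes'>
print(type(text))     # <class 'str'>
```

So passing res.text to etree.HTML hands lxml a str that still contains the page's encoding declaration, which triggers the ValueError above.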

In short, etree.HTML cannot parse a Unicode string that contains an encoding declaration, so the fix is simple. The first option is to pass res.content directly:

res = requests.get(url)
html = etree.HTML(res.content)
contents = html.xpath('//div/xxxx')

The second option is to convert the Unicode string back to bytes:

res = requests.get(url)
html_text = bytes(bytearray(res.text, encoding='utf-8'))
html = etree.HTML(html_text)
contents = html.xpath('//div/xxxx')
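One caveat with this second approach: the hard-coded encoding='utf-8' is only safe when the page really is UTF-8. If the page's declaration names a different charset, say gbk, re-encoding the decoded text as UTF-8 leaves a stale declaration inside the bytes, and a parser that honors that declaration would garble the text. A minimal demonstration of the mismatch, using only str/bytes round-trips:

```python
# A page whose declaration says gbk, but which we re-encode as utf-8
declared = '<?xml version="1.0" encoding="gbk"?><html><body>你好</body></html>'
data = declared.encode('utf-8')

# A parser that trusts the stale declaration would decode the utf-8 bytes
# as gbk, producing mojibake instead of the original text:
garbled = data.decode('gbk', errors='replace')
print('你好' in garbled)  # False
```

A more robust variant is res.text.encode(res.encoding or 'utf-8'), so the bytes match what requests decoded with; or simply prefer the first method, since res.content never round-trips through a str at all.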
