
Python: a scrapy linkextractors usage error

编程小志


1. Environment and versions

python 3.7.1

scrapy 1.5.1

2. The problem and the buggy code

First, the code that triggers the error:

import scrapy
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urljoin


class MatExamplesSpider(scrapy.Spider):
    name = 'mat_examples'
    # allowed_domains = ['matplotlib.org']
    start_urls = ['']

    def parse(self, response):
        le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]/@href')
        links = le.extract_links(response)
        print(response.url)
        print(type(links))
        print(links)

Running it produces the following traceback:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/eric.luo/Desktop/Python/matplotlib_examples/matplotlib_examples/spiders/mat_examples.py", line 14, in parse
    links = le.extract_links(response)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 128, in extract_links
    links = self._extract_links(doc, response.url, response.encoding, base_url)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/linkextractors/__init__.py", line 109, in _extract_links
    return self.link_extractor._extract_links(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 58, in _extract_links
    for el, attr, attr_val in self._iter_links(selector.root):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 46, in _iter_links
    for el in document.iter(etree.Element):
AttributeError: 'str' object has no attribute 'iter'

After the error appeared I rechecked the code and found nothing wrong, and searching online turned up no one with the same problem. Switching to restrict_css did extract data, so I went back to XPath, but this time bypassed LinkExtractor and used response.xpath() directly to get the href attribute of the matching tags. That returned data correctly.

That is, changing:

le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]/@href')

links = le.extract_links(response)

into:

links = response.xpath('//a[contains(@class, "reference internal")]/@href').extract()
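One caveat with this workaround: response.xpath() here returns plain href strings, which on documentation pages are usually relative URLs, so they have to be joined with the page URL before being requested (LinkExtractor normally does this joining for you). A minimal sketch, assuming this lives inside the spider and uses the urljoin already imported above (parse_mat is the callback from the full spider further down):

    def parse(self, response):
        # The extracted hrefs are bare (often relative) strings, so join
        # them with the current page URL before yielding requests.
        hrefs = response.xpath('//a[contains(@class, "reference internal")]/@href').extract()
        for href in hrefs:
            yield scrapy.Request(urljoin(response.url, href), callback=self.parse_mat)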

Looking at the error message again: 'str' object has no attribute 'iter'. A successful call should return links as a list, not a str, so I suspected that restrict_xpaths in LinkExtractor behaves differently from scrapy's own xpath(). Trying variations of the restrict_xpaths rule, I found that extraction works once /@href is removed. The reason: restrict_xpaths is meant to select regions of the document (element nodes) inside which LinkExtractor looks for links by itself. An XPath ending in /@href selects attribute values, i.e. plain strings, and when LinkExtractor tries to walk such a "region" as an lxml element (the document.iter(etree.Element) call in the traceback), the str has no .iter() method.

That is, changing:

le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]/@href')

into:

le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]')
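To see the difference concretely, here is a small sketch using parsel (the selector library underlying scrapy's response.xpath()), with a made-up one-tag HTML document: selecting the <a> element yields an lxml element that LinkExtractor can walk, while selecting @href yields a bare string.

    # Hedged sketch with an assumed one-tag HTML snippet.
    from parsel import Selector

    sel = Selector(text='<a class="reference internal" href="gallery.html">Gallery</a>')
    print(type(sel.xpath('//a')[0].root))        # lxml HtmlElement -- has .iter()
    print(type(sel.xpath('//a/@href')[0].root))  # str -- no .iter(), hence the error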

Re-running the spider, extraction now succeeds.

Finally, the corrected code. (This problem came up while writing a scrapy program that downloads the matplotlib examples; the complete project will appear in the next article.)

import scrapy
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urljoin
from ..items import MatplotlibExamplesItem


class MatExamplesSpider(scrapy.Spider):
    name = 'mat_examples'
    # allowed_domains = ['matplotlib.org']
    start_urls = ['']

    def parse(self, response):
        le = LinkExtractor(restrict_xpaths='//span[contains(@class, "caption-text")]/a[contains(@class, "reference internal")]')
        links = le.extract_links(response)
        for link in links:
            # parse_mat is defined in the complete project (next article)
            yield scrapy.Request(link.url, callback=self.parse_mat)
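As a quick check of what extract_links() actually returns: it yields a list of scrapy.link.Link objects (each with .url, .text, and so on), which is why a plain string was never the expected shape. A hedged sketch of inspecting this in the Scrapy shell, assuming you start it against a page that contains such links (scrapy shell <some page URL>):

    >>> from scrapy.linkextractors import LinkExtractor
    >>> le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]')
    >>> links = le.extract_links(response)
    >>> type(links), type(links[0])
    (<class 'list'>, <class 'scrapy.link.Link'>)
    >>> links[0].url   # an absolute URL, already joined with the page URL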

------ Follows and discussion are welcome. 编程小志, coding in my spare time ------
