A Summary of Python Libraries for Web Page Content Extraction

高效码农 07-04 390


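Introduction

The libraries introduced below all parse the content you want out of a web page automatically, which spares you the heavy work of writing regular expressions or XPath rules for every individual site.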
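To make that workload concrete, here is a minimal sketch of hand-rolled extraction using only the standard library. The class name and the sample HTML are made up for illustration; the code pulls out just a title and a favicon link, which is a small part of what the libraries below return for you.

```python
from html.parser import HTMLParser


class TitleAndIconParser(HTMLParser):
    """Collect the <title> text and any favicon URLs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.icons = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "link" and "icon" in (attrs.get("rel") or "").lower().split():
            # Matches rel="icon" as well as rel="shortcut icon".
            self.icons.append(attrs.get("href"))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Made-up sample page, used only for illustration.
SAMPLE_HTML = """
<html><head>
<title>Example Page</title>
<link rel="shortcut icon" href="/favicon.ico">
</head><body><p>Hello</p></body></html>
"""

parser = TitleAndIconParser()
parser.feed(SAMPLE_HTML)
print(parser.title)
print(parser.icons)
```

Every additional field (main text, images, videos, metadata) would need more rules of this kind, and the rules tend to differ from site to site.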

1. lassie: a human-friendly web content retrieval library

Installation:

pip3 install lassie

Usage:

import lassie

url = '...'  # the URL is missing from the source; use the page you want to fetch
lassie.fetch(url)

Output:

{'images': [{'src': '...', 'type': 'favicon'}], 'videos': [], 'url': '...', 'title': 'Compression Fittings,Manipulative Compression Fittings,Brass Compression Fittings,Compression Fittings Suppliers', 'status_code': 200}
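The result is an ordinary dictionary, so reading it needs no extra API. The sketch below works on a dictionary shaped like the output above; the URLs are hypothetical placeholders (the real ones are missing from the source), and `images_of_type` is a helper name invented for this example.

```python
# A result shaped like the lassie.fetch() output shown above.
# The URLs are hypothetical placeholders, not the original values.
result = {
    "images": [{"src": "https://example.com/favicon.ico", "type": "favicon"}],
    "videos": [],
    "url": "https://example.com/",
    "title": "Compression Fittings,Manipulative Compression Fittings,"
             "Brass Compression Fittings,Compression Fittings Suppliers",
    "status_code": 200,
}


def images_of_type(data, image_type):
    """Return the src of every image entry whose 'type' equals image_type."""
    return [
        img["src"]
        for img in data.get("images", [])
        if img.get("type") == image_type
    ]


# Only trust the extracted fields when the page was fetched successfully.
if result["status_code"] == 200:
    print(result["title"])
    print(images_of_type(result, "favicon"))
```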

2. newspaper: a package built specifically for scraping news content

Installation:

pip3 install newspaper3k

Install newspaper3k, not newspaper. The newspaper package targets Python 2, so pip install newspaper does not install properly; on Python 3, use pip install newspaper3k instead.
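Because the pip name and the import name can differ (newspaper3k is imported as newspaper, and readability-lxml as readability), a quick check of what is actually importable can save some confusion. This is a minimal sketch using only the standard library; the mapping is taken from the install and import lines in this article, and `is_importable` is a helper name made up here.

```python
import importlib.util

# pip package name -> module name used in `import`, as they appear in this article.
PIP_TO_IMPORT = {
    "lassie": "lassie",
    "newspaper3k": "newspaper",
    "goose3": "goose3",
    "readability-lxml": "readability",
    "textract": "textract",
}


def is_importable(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


for pip_name, module_name in PIP_TO_IMPORT.items():
    if is_importable(module_name):
        status = "installed"
    else:
        status = f"missing (pip3 install {pip_name})"
    print(f"{module_name}: {status}")
```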

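Usage:

from newspaper import Article

# article.nlp() needs the NLTK 'punkt' data; uncomment these two lines on first run.
# import nltk
# nltk.download('punkt')

url = '...'  # the URL is missing from the source; use the article you want to fetch
article = Article(url)  # for Chinese pages, pass language='zh'
article.download()  # fetch the HTML
article.parse()     # extract the title, text, and so on
article.nlp()       # compute keywords and a summary
print(article.text)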

3. goose3: an HTML content/article extractor (Python 3)

Installation:

pip3 install goose3

Usage:

from goose3 import Goose

url = '...'  # the URL is missing from the source; use the page you want to fetch
g = Goose()
article = g.extract(url=url)
article.title
# article.meta_description
# article.cleaned_text[:]

Output:

'Compression Fittings,Manipulative Compression Fittings,Brass Compression Fittings,Compression Fittings Suppliers'

4. python-readability: a fast Python port of arc90's readability tool

Installation:

pip3 install readability-lxml

Usage:

import requests
from readability import Document

url = '...'  # the URL is missing from the source; use the page you want to fetch
html = requests.get(url).content
doc = Document(html)
print('title:', doc.title())
print('content:', doc.summary(html_partial=True))

Output:

title: Not Acceptable!
content: <div><body id="readabilityBody"><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></div>
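Note that the output above is a Mod_Security rejection page rather than article content, so the server refused the request and readability simply summarized the error page. One common cause, which is an assumption here and not something the source confirms, is the HTTP client's default User-Agent. The sketch below builds a request with a browser-like header and adds a simple check for the rejection page; `BROWSER_UA`, `build_request`, and `looks_blocked` are names invented for this example, and the URL is a placeholder.

```python
import urllib.request

# Hypothetical browser-like value; any realistic User-Agent string can be used.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build (but do not send) a GET request carrying a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def looks_blocked(html):
    """Heuristic: detect the Mod_Security rejection page seen in the output above."""
    text = html.decode("utf-8", errors="replace") if isinstance(html, bytes) else html
    return "Not Acceptable!" in text and "Mod_Security" in text


req = build_request("https://example.com/")  # placeholder URL
print(req.get_header("User-agent"))  # urllib stores header names in this capitalization

# Uncomment to actually fetch the page and check it before parsing:
# html = urllib.request.urlopen(req, timeout=10).read()
# if looks_blocked(html):
#     print("request was rejected; do not pass this page to Document()")
```

With requests, as used in the snippet above, the same header can be passed as requests.get(url, headers={'User-Agent': BROWSER_UA}). Whether that is enough depends on the server's rules.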

5. textract: extracts text from documents in any format (Word, PowerPoint, PDF, and so on)

Installation:

pip3 install textract

Usage:

import textract

text = textract.process("xxx.pdf")  # replace with a PDF on your own machine
print(text.decode('utf-8'))
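textract.process returns bytes, which is why the snippet above calls decode. If a document yields bytes that are not valid UTF-8, a plain decode raises UnicodeDecodeError. The sketch below shows one defensive option; `to_text` is a helper name made up for this example, and the sample bytes stand in for real extractor output.

```python
def to_text(raw, encoding="utf-8"):
    """Decode extracted bytes, replacing undecodable bytes instead of raising."""
    return raw.decode(encoding, errors="replace")


# Stand-in for extractor output: valid UTF-8 followed by one invalid byte.
sample = "PDF text 文本".encode("utf-8") + b"\xff"
print(to_text(sample))
```

The invalid byte comes out as the Unicode replacement character, so the rest of the text is still usable.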

Article URL: http://www.longkongtuishu.com/caa3dAGsCB1oCCl0.html
