
Two Python crawler examples to easily master requests + regular expressions

娇兮心有之



Scraping the Maoyan Top 100 movie chart with requests + regular expressions


1. First, analyze the page structure

You can see that the URL of the first page and the URL of the second page differ only in the offset value: it is 0 for the first page, 10 for the second, and so on.
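The board URL itself did not survive the copy-paste, but the paging rule is easy to sketch. In the snippet below, BASE is a placeholder for the stripped address; only the offset arithmetic comes from the article:

# BASE is a placeholder; the real Top 100 board address was stripped from the post.
BASE = 'https://<maoyan-board-url>?offset='

# Page 1 -> offset 0, page 2 -> offset 10, ..., page 10 -> offset 90.
page_urls = [BASE + str(page * 10) for page in range(10)]
for u in page_urls:
    print(u)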

2. Structure of the <dd> tag (it holds each movie's information)
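The screenshot of the <dd> markup was lost in the copy, so here is a small hand-written stand-in (the class names are taken from the regular expression in the source code below; the real Maoyan markup carries more attributes), together with a quick check that the pattern pulls out the expected groups:

import re

# Illustrative <dd> fragment only; the real page is richer.
sample = '''
<dd>
  <i class="board-index">1</i>
  <img data-src="poster.jpg">
  <a href="/films/1200486">霸王别姬</a>
  <p class="star">主演:张国荣,张丰毅,巩俐</p>
  <p class="releasetime">上映时间:1993-01-01</p>
  <i class="integer">9.</i><i class="fraction">5</i>
</dd>
'''

pattern = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?href.*?>(.*?)</a>'
    r'.*?class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>'
    r'.*?class="integer">(.*?)</i>.*?class="fraction">(.*?)</i>.*?</dd>', re.S)

print(re.findall(pattern, sample))
# -> [('1', 'poster.jpg', '霸王别姬', '主演:张国荣,张丰毅,巩俐', '上映时间:1993-01-01', '9.', '5')]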

3. Source code

import requests
import re
import json
from requests.exceptions import RequestException


# Fetch the page source
def get_one_page(url, headers):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


# Parse one page of the chart
def parse_one_page(html):
    # Compile the regular expression (re.S lets '.' match newlines)
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?href.*?>(.*?)</a>'
        r'.*?class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>'
        r'.*?class="integer">(.*?)</i>.*?class="fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    # Generator: yield one movie at a time
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],   # drop the "主演:" prefix
            'time': item[4].strip()[5:],    # drop the "上映时间:" prefix
            'score': item[5] + item[6]      # integer part + fractional part
        }


# Append one record to result.txt as a JSON line
def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    # The board URL was stripped from the original post; it should be the
    # Maoyan Top 100 address ending with an offset query parameter.
    url = '' + str(offset)
    # Without a User-Agent header the page cannot be fetched
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'}
    html = get_one_page(url, headers)
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    # Loop ten times, starting from the first page, passing in the offset
    for i in range(10):
        main(i * 10)
Scraping used-car listings from the Guazi second-hand car site with requests + regular expressions


1. Analyze the page structure

Here we filter the listings by city and brand. When we select the city Xiaogan and the car brand Volkswagen, we can see how the URL is put together: the site root is followed by the city abbreviation and then the brand, giving a URL of the form sketched below.
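A minimal sketch of that URL scheme; SITE_ROOT is a placeholder for the stripped Guazi address, and the two slugs are illustrative guesses rather than values from the article:

# SITE_ROOT is a placeholder; the city and brand slugs below are illustrative guesses.
SITE_ROOT = 'https://<guazi-site-root>/'
city = 'xiaogan'    # city abbreviation (Xiaogan)
brand = 'dazhong'   # brand slug (Volkswagen)
listing_url = SITE_ROOT + city + '/' + brand + '/'
print(listing_url)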

2. Analyze the page content

3. Source code

Note that fetching pages from the Guazi site requires a Cookie header; without the cookie the request does not succeed.

Press F12 to open the developer tools, find the Request Headers panel on the right, and copy those headers into your request headers.
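As a convenience, the copied header text can be turned into the dict that requests expects with a few lines. This helper is not part of the original post, and the header values below are shortened placeholders:

# Not in the original post: convert "Name: value" lines copied from the
# F12 Request Headers panel into a dict for requests. Values are placeholders.
raw_headers = """
Accept: text/html,application/xhtml+xml
Accept-Language: zh-CN,zh;q=0.8
Cookie: uuid=...; sessionid=...
User-Agent: Mozilla/5.0 ...
"""

headers = dict(
    line.split(': ', 1)
    for line in raw_headers.strip().splitlines()
)
print(headers)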

import requests
import re
from requests.exceptions import RequestException


# Fetch the page source
def get_page(url, headers, city, logo):
    crawl_url = url + city + '/' + logo + '/'
    print(crawl_url)
    try:
        response = requests.get(crawl_url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


# Parse the listing page
def parse_page(html, url):
    # The closing quote after the title group was missing in the original pattern
    pattern = re.compile('<li data-scroll-track.*?<a title="(.*?)" href="(.*?)" target.*?</li>', re.S)
    result = re.findall(pattern, html)
    # Generator: yield one car listing at a time
    for item in result:
        yield {
            '车型': item[0],              # car model, from the link's title attribute
            'img-url': url + item[1]      # link built from the href attribute
        }


def main():
    # The site root was stripped from the original post; it should be the Guazi homepage URL.
    url = ''
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch, br',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Cookie': 'uuid=2fc3c4a9-0346-4402-922c-cfe50ddc0474; antipas=Xz1U56hc02646759115E8169529; ganji_uuid=8392432802393356853482; financeCityDomain=all; 2fc3c4a9-0346-4402-922c-cfe50ddc0474_views=1; 1e049043-4ff0-4550-92e0-5f40105d0187_views=1; Hm_lvt_e6e64ec34653ff98b12aab73ad895002=1540715608; Hm_lpvt_e6e64ec34653ff98b12aab73ad895002=1540715608; cityDomain=sh; preTime=%7B%22last%22%3A1540717606%2C%22this%22%3A1540711622%2C%22pre%22%3A1540711622%7D; lg=1; sessionid=1e049043-4ff0-4550-92e0-5f40105d0187; clueSourceCode=%2A%2300; cainfo=%7B%22ca_s%22%3A%22dh_360llqmz%22%2C%22ca_n%22%3A%22360llq_mz%22%2C%22ca_i%22%3A%22-%22%2C%22ca_medium%22%3A%22-%22%2C%22ca_term%22%3A%22-%22%2C%22ca_content%22%3A%22-%22%2C%22ca_campaign%22%3A%22-%22%2C%22ca_kw%22%3A%22-%22%2C%22keyword%22%3A%22-%22%2C%22ca_keywordid%22%3A%22-%22%2C%22scode%22%3A%22-%22%2C%22ca_transid%22%3Anull%2C%22platform%22%3A%221%22%2C%22version%22%3A1%2C%22ca_b%22%3A%22-%22%2C%22ca_a%22%3A%22-%22%2C%22display_finance_flag%22%3A%22-%22%2C%22client_ab%22%3A%22-%22%2C%22guid%22%3A%222fc3c4a9-0346-4402-922c-cfe50ddc0474%22%2C%22sessionid%22%3A%221e049043-4ff0-4550-92e0-5f40105d0187%22%7D',
        'Host': '',   # stripped from the original post
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    }
    city = input("请输入城市:")   # prompt: enter the city abbreviation
    logo = input("请输入品牌:")   # prompt: enter the brand slug
    html = get_page(url, headers, city, logo)
    for item in parse_page(html, ''):   # second argument (site root) was stripped from the post
        print(item)


if __name__ == '__main__':
    main()
