Preface:
This project is a set of crawlers for Douban movies, books, groups, albums, "things" (东西), and more.
Code download: send the private message "豆瓣爬虫" and the system will automatically reply with the download address; download links cannot be placed in the article itself, so this is the only way.
### Required services
MongoDB
### Required packages
pip install scrapy
pip install pybloom
pip install pymongo
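The album spider shown later imports a `doubanDB` handle from `misc.store`. That module is not included in the article; judging from the dependencies above, it is probably little more than a pymongo connection to the local MongoDB service. A minimal sketch, assuming a `misc/store.py` path and a database named `douban` (neither is confirmed by the source):

```python
# misc/store.py -- hypothetical sketch; the real module is not shown in the article
from pymongo import MongoClient

# Connect to the local MongoDB service listed under "Required services".
_client = MongoClient("localhost", 27017)

# Database handle used by the spiders, e.g. the doubanDB.album collection.
doubanDB = _client["douban"]  # database name "douban" is an assumption
```

pybloom, the third dependency, does not appear in the code excerpts below; it is presumably used elsewhere in the project for URL deduplication.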
### Running the Douban movie crawler
Enter the douban/movie directory
Run scrapy crawl movie
### Running the Douban album crawler
Enter the douban/album directory
Run scrapy crawl album
Key code (the item definitions first, then the album spider):
# encoding: utf-8
from scrapy import Field, Item


class MovieItem(Item):
    subject_id = Field()
    name = Field()
    year = Field()
    directors = Field()
    actors = Field()
    languages = Field()
    genres = Field()        # genres
    runtime = Field()
    stars = Field()         # counts of 5/4/3/2/1-star ratings, in that order: 5 4 3 2 1
    channel = Field()
    average = Field()       # average rating
    vote = Field()          # number of raters
    tags = Field()
    watched = Field()       # "watched" count
    wish = Field()          # "want to watch" count
    comment = Field()       # number of short comments
    question = Field()      # number of questions
    review = Field()        # number of full reviews
    discussion = Field()    # number of discussions
    image = Field()         # number of images
    countries = Field()     # production countries
    summary = Field()


# Douban album: sample MongoDB document format
AlbumItem = dict(
    from_url = "",
    album_name = "少年听雨歌楼上,壮年画雨客舟中",
    author = dict(
        home_page = "",
        nickname = "等温线",
        avatar = "",
    ),
    photos = [
        dict(
            large_img_url = "",
            like_count = 2,
            recommend_count = 22,
            desc = "李子哒粉蒸排骨!好吃!",
            comments = [
                dict(
                    avatar = "",
                    nickname = "muse",
                    post_datetime = "2014-07-29 08:37:14",
                    content = "看得流口水了",
                    home_page = "",
                ),
            ]
        ),
    ],
    tags = ["美女", "标签", "时尚"],
    recommend_total = 67,
    like_total = 506,
    create_date = "2014-07-21",
    photo_count = 201,
    follow_count = 37,
    desc = "蛇蛇蛇 马马马",
)


class AlbumItem(Item):
    album_name = Field()
    author = Field()
    photos = Field()
    recommend_total = Field()
    like_total = Field()
    create_date = Field()
    from_url = Field()
    photo_count = Field()
    follow_count = Field()
    desc = Field()
    tags = Field()


class PhotoItem(Item):
    large_img_url = Field()
    like_count = Field()
    recommend_count = Field()
    desc = Field()
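Neither the movie spider nor the pipeline that writes `MovieItem` into MongoDB is reproduced in the article. As a rough illustration of how the fields above could end up in the database, here is a minimal scrapy item pipeline; the class name, the `movie` collection, and the use of `subject_id` as the upsert key are assumptions rather than project code:

```python
# Hypothetical pipeline sketch -- not the project's actual pipeline.
from misc.store import doubanDB  # the pymongo handle sketched earlier


class MongoMoviePipeline(object):
    """Upsert each crawled MovieItem into the movie collection, keyed by subject_id."""

    def process_item(self, item, spider):
        if spider.name == "movie":
            doc = dict(item)
            # Upsert so that re-crawling the same subject updates the record in place,
            # mirroring the update(..., upsert=True) calls in the album spider below.
            doubanDB.movie.update({"subject_id": doc["subject_id"]},
                                  {"$set": doc}, upsert=True)
        return item
```

Such a pipeline would be enabled through the ITEM_PIPELINES setting in the project's settings.py.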
The album spider (run from douban/album). Note that allowed_domains, start_urls, and the domain part of the link regexes appear to have been stripped when the article was republished; they need to be filled in with the actual douban.com URLs before running:

# encoding: utf-8
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from misc.store import doubanDB
from parsers import *


class AlbumSpider(CrawlSpider):
    name = "album"
    allowed_domains = [""]
    start_urls = [
        "",
    ]

    rules = (
        # album detail pages
        Rule(LinkExtractor(allow=r"^\.douban\.com/photos/album/\d+/($|\?start=\d+)"),
             callback="parse_album",
             follow=True
        ),

        # photo detail pages
        Rule(LinkExtractor(allow=r"^\.douban\.com/photos/photo/\d+/$"),
             callback="parse_photo",
             follow=True
        ),

        # doulist index pages
        # Rule(LinkExtractor(allow=r"^\.douban\.com/photos/album/\d+/doulists$"),
        #      follow=True
        # ),

        # a single doulist
        Rule(LinkExtractor(allow=r"^\.douban\.com/doulist/\d+/$"),
             follow=True
        ),
    )

    def parse_album(self, response):
        album_parser = AlbumParser(response)
        item = dict(album_parser.item)

        if album_parser.next_page:
            return None
        spec = dict(from_url=item["from_url"])
        doubanDB.album.update(spec, {"$set": item}, upsert=True)

    def parse_photo(self, response):
        single = SinglePhotoParser(response)
        from_url = single.from_url
        if from_url is None:
            return
        doc = doubanDB.album.find_one({"from_url": from_url}, {"from_url": True})

        item = dict(single.item)
        if not doc:
            new_item = {}
            new_item["from_url"] = from_url
            new_item["photos"] = item
            doubanDB.album.save(new_item)
        else:
            spec = {"from_url": from_url}
            doc = doubanDB.album.find_one({"photos.large_img_url": item["large_img_url"]})
            if not doc:
                doubanDB.album.update(spec, {"$push": {"photos": item}})

        cp = CommentParser(response)
        comments = cp.get_comments()
        if not comments:
            return
        large_img_url = item["large_img_url"]
        spec = {"photos.large_img_url": large_img_url}
        doubanDB.album.update(spec, {"$set": {"photos.$.comments": comments}}, upsert=True)
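The spider above relies on `AlbumParser`, `SinglePhotoParser`, and `CommentParser` from a `parsers` module that the article does not show. Judging only from how the spider uses them, their interface looks roughly like the skeleton below; the attribute and method names are taken from the spider code, but the bodies are placeholders, not the project's real parsing logic:

```python
# parsers.py -- interface skeleton inferred from the spider above; bodies are placeholders.


class AlbumParser(object):
    def __init__(self, response):
        self.response = response
        self.next_page = None  # set when the album page has a "next page" link
        self.item = {}         # album fields, including "from_url" (see AlbumItem above)


class SinglePhotoParser(object):
    def __init__(self, response):
        self.response = response
        self.from_url = None   # URL of the album this photo belongs to, or None
        self.item = {}         # photo fields such as "large_img_url" and "desc"


class CommentParser(object):
    def __init__(self, response):
        self.response = response

    def get_comments(self):
        # Return a list of comment dicts (nickname, content, post_datetime, ...).
        return []
```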