
Douban crawler suite: movies, books, groups, and albums (project source code)

Python乐园

Foreword:

Readers have shown a lot of interest in Douban crawlers lately, so this post pulls together the related material in one place. Hopefully it is useful; read on for the details.

Crawlers for Douban movies, books, groups, albums, Dongxi (豆瓣东西), and more.

Code: send the private message “豆瓣爬虫” and you will get an automatic reply with the download link. Download links cannot be placed in the article itself, so this is the workaround.

### Required services

MongoDB

### Required packages

pip install scrapy

pip install pybloom

pip install pymongo
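
The spider code later in the post imports doubanDB from misc.store, and that module is not included in the excerpt. A minimal sketch of what it might contain, assuming a local MongoDB instance and a database named douban (both guesses, not taken from the project):

```python
# misc/store.py -- minimal sketch, not the original file.
import pymongo

# Connection URI and database name are assumptions; point them at your own MongoDB.
_client = pymongo.MongoClient("mongodb://localhost:27017")
doubanDB = _client["douban"]  # used as doubanDB.album, doubanDB.movie, ... by the spiders
```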

### Running the Douban movie crawler

Change into the douban/movie directory

Run scrapy crawl movie

### Running the Douban album crawler

Change into the douban/album directory

Run scrapy crawl album
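
Each crawler is started with the scrapy CLI as shown above. If you would rather launch a spider from a script, a reasonably recent Scrapy release supports roughly the following (run from inside the corresponding project directory; the file name run.py and the settings lookup are my own additions, not part of the project):

```python
# run.py -- optional convenience launcher, assuming a recent Scrapy release.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # picks up the project's settings.py
process.crawl("movie")  # spider name as registered by its `name` attribute, e.g. "movie" or "album"
process.start()         # blocks until the crawl finishes
```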

Main code, starting with the item definitions:

```python
#encoding: utf-8
from scrapy import Field, Item


class MovieItem(Item):
    subject_id = Field()
    name = Field()
    year = Field()
    directors = Field()
    actors = Field()
    languages = Field()
    genres = Field()      # genres
    runtime = Field()
    stars = Field()       # counts of 5/4/3/2/1-star ratings, in that order
    channel = Field()
    average = Field()     # average rating
    vote = Field()        # number of ratings
    tags = Field()
    watched = Field()     # "watched" count
    wish = Field()        # "want to watch" count
    comment = Field()     # number of short comments
    question = Field()    # number of questions
    review = Field()      # number of full reviews
    discussion = Field()  # number of discussions
    image = Field()       # number of images
    countries = Field()   # production countries
    summary = Field()


# Douban album MongoDB document format (sample data); kept under a separate
# name so it does not clash with the AlbumItem class defined below.
album_doc_example = dict(
    from_url = "",
    album_name = "少年听雨歌楼上,壮年画雨客舟中",
    author = dict(
        home_page = "",
        nickname = "等温线",
        avatar = "",
    ),
    photos = [
        dict(
            large_img_url = "",
            like_count = 2,
            recommend_count = 22,
            desc = "李子哒粉蒸排骨!好吃!",
            comments = [
                dict(
                    avatar = "",
                    nickname = "muse",
                    post_datetime = "2014-07-29 08:37:14",
                    content = "看得流口水了",
                    home_page = "",
                ),
            ],
        ),
    ],
    tags = ["美女", "标签", "时尚"],
    recommend_total = 67,
    like_total = 506,
    create_date = "2014-07-21",
    photo_count = 201,
    follow_count = 37,
    desc = "蛇蛇蛇 马马马",
)


class AlbumItem(Item):
    album_name = Field()
    author = Field()
    photos = Field()
    recommend_total = Field()
    like_total = Field()
    create_date = Field()
    from_url = Field()
    photo_count = Field()
    follow_count = Field()
    desc = Field()
    tags = Field()


class PhotoItem(Item):
    large_img_url = Field()
    like_count = Field()
    recommend_count = Field()
    desc = Field()
```
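
As the spider below shows, the project writes to MongoDB directly inside its crawl callbacks. A more conventional Scrapy alternative is an item pipeline; the following is only a sketch of that idea for MovieItem, with the pipeline name, the MONGO_URI setting, and the movie collection all assumed rather than taken from the project:

```python
# pipelines.py -- illustrative sketch, not part of the posted code.
import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri):
        self.mongo_uri = mongo_uri

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI is an assumed custom setting, e.g. "mongodb://localhost:27017".
        return cls(mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client["douban"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert by subject_id (pymongo 3+ API) so re-crawls update existing documents.
        self.db.movie.update_one(
            {"subject_id": item.get("subject_id")},
            {"$set": dict(item)},
            upsert=True,
        )
        return item
```

Enabling it would just be a matter of listing the class under ITEM_PIPELINES in settings.py.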

The album spider (from douban/album):

```python
#encoding: utf-8
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from misc.store import doubanDB
from parsers import *  # provides AlbumParser, SinglePhotoParser, CommentParser


class AlbumSpider(CrawlSpider):
    name = "album"
    # The original post strips URLs, which is why the domain list, start URLs and
    # the leading part of the link patterns below look empty or truncated.
    allowed_domains = [""]
    start_urls = [
        "",
    ]

    rules = (
        # Album detail pages (including ?start= pagination)
        Rule(LinkExtractor(allow=r"^\.douban\.com/photos/album/\d+/($|\?start=\d+)"),
             callback="parse_album",
             follow=True
        ),

        # Single photo pages
        Rule(LinkExtractor(allow=r"^\.douban\.com/photos/photo/\d+/$"),
             callback="parse_photo",
             follow=True
        ),

        # Doulist collections (disabled)
        # Rule(LinkExtractor(allow=r"^\.douban\.com/photos/album/\d+/doulists$"),
        #      follow=True
        # ),

        # Individual doulists
        Rule(LinkExtractor(allow=r"^\.douban\.com/doulist/\d+/$"),
             follow=True
        ),
    )

    def parse_album(self, response):
        album_parser = AlbumParser(response)
        item = dict(album_parser.item)

        # Only write the album document when the parser reports no further pages.
        if album_parser.next_page: return None
        spec = dict(from_url=item["from_url"])
        doubanDB.album.update(spec, {"$set": item}, upsert=True)

    def parse_photo(self, response):
        single = SinglePhotoParser(response)
        from_url = single.from_url
        if from_url is None: return
        doc = doubanDB.album.find_one({"from_url": from_url}, {"from_url": True})

        item = dict(single.item)
        if not doc:
            # Album document does not exist yet: create it with this photo.
            # photos is stored as a list (see the album document format above).
            new_item = {}
            new_item["from_url"] = from_url
            new_item["photos"] = [item]
            doubanDB.album.save(new_item)
        else:
            # Append the photo only if it is not already stored.
            spec = {"from_url": from_url}
            doc = doubanDB.album.find_one({"photos.large_img_url": item["large_img_url"]})
            if not doc:
                doubanDB.album.update(spec, {"$push": {"photos": item}})

        # Attach the photo's comments to the matching embedded photo document.
        cp = CommentParser(response)
        comments = cp.get_comments()
        if not comments: return
        large_img_url = item["large_img_url"]
        spec = {"photos.large_img_url": large_img_url}
        doubanDB.album.update(spec, {"$set": {"photos.$.comments": comments}}, upsert=True)
```
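
pybloom appears in the dependency list but not in the snippets shown here; presumably it backs URL de-duplication somewhere in the project. As a rough illustration of how a Bloom filter can plug into Scrapy, here is a sketch of a custom duplicate filter; the class, settings wiring, and capacity figures are my own assumptions, not the project's code (on Python 3 the maintained fork is pybloom-live):

```python
# bloom_dupefilter.py -- illustrative sketch only.
from pybloom import ScalableBloomFilter  # `from pybloom_live import ...` on the Python 3 fork
from scrapy.dupefilters import BaseDupeFilter  # scrapy.dupefilter in old contrib-era releases
from scrapy.utils.request import request_fingerprint


class BloomDupeFilter(BaseDupeFilter):
    """Skip requests whose fingerprint has (probably) been seen before."""

    def __init__(self):
        # Grows automatically; ~0.1% false-positive rate.
        self.fingerprints = ScalableBloomFilter(initial_capacity=100000, error_rate=0.001)

    def request_seen(self, request):
        fp = request_fingerprint(request)
        if fp in self.fingerprints:
            return True  # treat as a duplicate and drop it
        self.fingerprints.add(fp)
        return False
```

Pointing DUPEFILTER_CLASS at this class in settings.py would use it in place of Scrapy's default fingerprint set.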


Tags: #豆瓣app爬虫 #豆瓣可以爬虫吗