
Python Web Scraping + Data Analysis: Movie Review Analysis

Author: 程序员梓羽同学

Preface:


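This time we use Maoyan (猫眼电影) to analyze review data for the Chinese New Year blockbuster 满江红 (Full River Red). The comments are fetched through the site's dynamic JSON interface. Parsing the static HTML instead would require an extra step to decode the site's custom font; tutorials for that are easy to find online, so feel free to look around or discuss it with me if you are interested.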

[Figure: 满江红 (Full River Red) poster]

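I. Interface analysis

1. Target site: Maoyan H5 (the mobile web version)

[Figure: list of interfaces]

2. Scroll through the comments, or tap a comment to open its reply sub-page and scroll there, and the relevant requests can be captured. (The browser's F12 developer tools only capture the sub-comment interface. To capture the interface for the full comment list, you need a packet-capture tool or have to capture the traffic from a phone.)

[Figure: interface details]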

3. Comment interface (the address is Base64-encoded rather than shown in plain text)

aHR0cHM6Ly9tLm1hb3lhbi5jb20vYXBvbGxvL2Fwb2xsb2FwaS9tbWRiL3JlcGxpZXMvY29tbWVudC8xMTY3MTI5MDg5Lmpzb24/X3ZfPXllcyZvZmZzZXQ9NDA=
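The string above is plain Base64, so Python's standard `base64` module is enough to recover the address. A minimal sketch; the names `ENCODED_URL` and `decode_url` are illustrative and not part of the original script:

```python
import base64

# The Base64-encoded interface address quoted above.
ENCODED_URL = "aHR0cHM6Ly9tLm1hb3lhbi5jb20vYXBvbGxvL2Fwb2xsb2FwaS9tbWRiL3JlcGxpZXMvY29tbWVudC8xMTY3MTI5MDg5Lmpzb24/X3ZfPXllcyZvZmZzZXQ9NDA="


def decode_url(encoded: str) -> str:
    """Decode a Base64 string into the plain-text URL it hides."""
    return base64.b64decode(encoded).decode("utf-8")


url = decode_url(ENCODED_URL)
print(url)
```

The decoded address ends with an `offset` query parameter, which is the value to vary when requesting further pages.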

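II. Response analysis

1. From the sub-comment interface we can work out the relevant fields (nickname, gender, rating, comment text, comment like count, user level, and so on).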

{
    "cmts": [
        {
            "approve": 0,
            "assistAwardInfo": {
                "avatar": "",
                "celebrityId": 0,
                "celebrityName": "",
                "rank": 0,
                "title": ""
            },
            "avatarurl": "…",
            "channelId": 70001,
            "content": "在电影院看真的很有氛围!背景音乐也很加分",
            "deleted": false,
            "id": 1171602285,
            "ipLocName": "福建",
            "nickName": "腿小菇",
            "time": "2023-02-27 10:24",
            "userId": 1322748722,
            "userLevel": 3,
            "vipInfo": "",
            "vipType": 0
        }
    ],
    "ocm": {
        "approve": 8657,
        "approved": false,
        "assistAwardInfo": {
            "avatar": "",
            "celebrityId": 0,
            "celebrityName": "",
            "rank": 0,
            "title": ""
        },
        "authInfo": "",
        "avatarurl": "…",
        "content": "刚看完满江红,真的好看,这是我看过最值的一部电影,反转反转再反转,真的是永远想不到下一步是什么,而且还很搞笑,搞笑又宏伟,真的描述不出来这个电影的好,都给我去看!满江红!入股不亏!!!!",
        "id": 1167129089,
        "ipLocName": "辽宁",
        "isMajor": false,
        "juryLevel": 0,
        "majorType": 0,
        "mvid": 1462626,
        "nick": "Gpc126688235",
        "nickName": "Gpc126688235",
        "oppose": 0,
        "pro": false,
        "reply": 680,
        "score": 5,
        "spoiler": 0,
        "supportComment": true,
        "supportLike": true,
        "sureViewed": 1,
        "tagList": {
            "fixed": [
                {"id": 1, "name": "购票好评"},
                {"id": 4, "name": "购票"},
                {"id": 6, "name": "优质评价"}
            ]
        },
        "time": "2023-01-22 12:19",
        "userId": 3164097169,
        "userLevel": 2,
        "videoDuration": 0,
        "vipInfo": "",
        "vipType": 0
    },
    "total": 60
}

2. Sample response from the full comment interface

{
    "data": {
        "hotIds": [
            1167280609,
            1167187803
        ],
        "total": 16521,
        "comments": [
            {
                "avatarUrl": "…",
                "buyTicket": false,
                "channelId": 3,
                "content": "还行吧,没有看开心 ",
                "delete": false,
                "follow": false,
                "gender": 1,
                "id": 1171756165,
                "imageUrls": [],
                "ipLocName": "山东",
                "likedByCurrentUser": false,
                "major": false,
                "movie": {
                    "id": 0,
                    "sc": 0
                },
                "movieId": 1462626,
                "nick": "淘嘉豪",
                "replyCount": 0,
                "score": 9,
                "showApprove": false,
                "showVote": false,
                "spoiler": false,
                "startTime": "1677923460000",
                "tagList": [
                    {"id": 1, "name": "购票好评"},
                    {"id": 4, "name": "购票"}
                ],
                "time": 1677923460000,
                "ugcType": 11,
                "upCount": 0,
                "userId": 71317227,
                "userLevel": 2,
                "vipType": 0
            }
        ],
        "t2total": 0,
        "myComment": {}
    },
    "paging": {},
    "ts": 1677956823197
}
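To make the field layout of the full comment response concrete, here is a minimal sketch that flattens `data.comments` into one small dict per review. The function name `parse_comments` and the choice of fields are my own illustration. Treating `gender` as optional is an assumption, and `time` is read as a Unix timestamp in milliseconds, as in the sample above.

```python
from datetime import datetime, timezone


def parse_comments(payload: dict) -> list[dict]:
    """Flatten the `data.comments` list into one small dict per review."""
    rows = []
    for item in payload.get("data", {}).get("comments", []):
        # `time` is a Unix timestamp in milliseconds in the sample response.
        posted = datetime.fromtimestamp(item["time"] / 1000, tz=timezone.utc)
        rows.append({
            "nick": item.get("nick"),
            "gender": item.get("gender"),  # assumed to be optional
            "score": item.get("score"),
            "content": item.get("content", "").strip(),
            "userLevel": item.get("userLevel"),
            "time": posted.isoformat(),
        })
    return rows


# A trimmed copy of the sample response shown above.
sample = {
    "data": {
        "total": 16521,
        "comments": [
            {"nick": "淘嘉豪", "gender": 1, "score": 9,
             "content": "还行吧,没有看开心 ", "userLevel": 2,
             "time": 1677923460000},
        ],
    }
}

rows = parse_comments(sample)
print(rows[0])
```

Stripping the comment text at this stage removes the trailing space visible in the sample, which keeps stray whitespace out of the later word segmentation.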
III. Data parsing

1. Build the request headers and simulate the data request
import base64
import csv
import os
from pprint import pprint

import requests

# Base64-encoded interface address from section I (decoded at run time)
ENCODED_URL = 'aHR0cHM6Ly9tLm1hb3lhbi5jb20vYXBvbGxvL2Fwb2xsb2FwaS9tbWRiL3JlcGxpZXMvY29tbWVudC8xMTY3MTI5MDg5Lmpzb24/X3ZfPXllcyZvZmZzZXQ9NDA='


def get_film_data(offset=0, filename="film"):
    # Decode the address and drop its hard-coded query string,
    # so that the offset argument actually controls paging
    url = base64.b64decode(ENCODED_URL).decode('utf-8').split('?')[0]
    params = {'_v_': 'yes', 'offset': offset}
    headers = {
        'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'
    }
    cookies = {
        'uuid_n_v': 'v1',
        'iuuid': '942C12B0DF4311E9ADA9C1C3B540BA45F066B2B3028841B8A0BC3544E4C0AD17',
        'ci': '1%2C%E5%8C%97%E4%BA%AC',
        '_lxsdk_cuid': '16d6c9b401ec8-0c6c86354bd8a9-5b123211-100200-16d6c9b401ec8',
        'webp': 'true',
        '_lxsdk': '942C12B0DF4311E9ADA9C1C3B540BA45F066B2B3028841B8A0BC3544E4C0AD17'
    }
    # Send the request and parse the JSON response
    response = requests.get(url, params=params, headers=headers,
                            cookies=cookies, timeout=10).json()
    # Total number of comments
    total = response['total']
    print(total)
    # List of comment records
    cmts = response['cmts']
    pprint(cmts)
    for comment in cmts:
        # One row per comment; a dict (not a list) so fields can be set by name.
        # score and gender are missing from some records, so read them with .get()
        data = {
            'nickName': comment['nickName'],
            'gender': comment.get('gender'),
            'score': comment.get('score'),
            'content': comment['content'],
            'userId': comment['userId'],
            'userLevel': comment['userLevel'],
        }
        save_data_csv(data, filename)
    return total

2. Data storage (CSV is used here as an example)

def save_data_csv(data, file_name):
    title = ['nickName', 'gender', 'score', 'content', 'userId', 'userLevel']
    # Write the header only once: when the file is missing or still empty
    need_header = not os.path.exists(file_name) or os.path.getsize(file_name) == 0
    with open(file_name, 'a', encoding='utf-8-sig', newline='') as fp:
        writer = csv.writer(fp)
        if need_header:
            writer.writerow(title)
        writer.writerow([data[key] for key in title])
    print('*' * 10 + 'saved' + '*' * 10)

[Figure: scraped review results]
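Because `get_film_data` returns the total number of comments, it can be called repeatedly with a growing offset until that total is covered. A minimal sketch of such a loop follows. `crawl_all` is a hypothetical helper, and the page size of 20, the page cap and the one-second delay are assumptions, not values confirmed by the interface.

```python
import time
from typing import Callable


def crawl_all(fetch_page: Callable[[int], int], page_size: int = 20,
              max_pages: int = 50, delay: float = 1.0) -> int:
    """Call fetch_page(offset) page by page until `total` is covered.

    fetch_page must return the total number of comments reported by the
    interface. Returns the number of pages that were requested.
    """
    offset, pages = 0, 0
    while pages < max_pages:
        total = fetch_page(offset)
        pages += 1
        offset += page_size  # assumed page size; adjust to the real one
        if offset >= total:
            break
        time.sleep(delay)  # pause between requests to go easy on the server
    return pages
```

A call could look like `crawl_all(lambda offset: get_film_data(offset, "film.csv"))`. The `max_pages` cap keeps a wrong page size or an unexpectedly large total from turning into an endless crawl.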

IV. Data visualization

1. Review word segmentation (word cloud)

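import jieba
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
from wordcloud import WordCloud


def wordcloud_analysis(file_name):
    # utf-8-sig strips the BOM that save_data_csv writes at the start of the file
    df = pd.read_csv(file_name, encoding='utf-8-sig')
    # Join the review texts directly; Series.to_string() would also include the row index
    content = ' '.join(df['content'].dropna().astype(str))
    # Segment with jieba in accurate mode to get a list of words
    words = jieba.lcut(content)
    # Join the words with spaces into a single string
    words = ' '.join(words)
    # Read the image that defines the shape of the word cloud
    mask_pic = np.array(Image.open('1.jpg'))
    words_cloud = WordCloud(
        background_color='white',  # background colour of the image
        width=800, height=600,  # canvas size (defaults 400 x 200); ignored when a mask is given
        font_path='msyh.ttf',  # path to a font file that contains Chinese glyphs
        max_words=200,  # maximum number of words (default 200)
        max_font_size=80,  # largest font size (default None: derived from the image height)
        font_step=1,  # font size step (default 1)
        random_state=30,  # seed for the random layout and colours, for reproducible output
        mask=mask_pic  # shape mask (default None: rectangular)
    ).generate(words)  # build the word cloud from the space-joined jieba tokens
    words_cloud.to_file('comment.png')  # save the word cloud as an image
    # Show the word cloud with matplotlib
    plt.imshow(words_cloud, interpolation='bilinear')
    # Hide the axes
    plt.axis('off')
    plt.show()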

[Figure: word cloud of the segmented reviews]
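Word clouds of raw segmentation output tend to be dominated by punctuation and filler words. As a possible refinement, the tokens can be counted and filtered first. This is a minimal sketch using only the standard library: the stop-word list is a made-up example, and the token list stands in for what a segmenter such as `jieba.lcut` might return.

```python
from collections import Counter

# Hypothetical stop-word list; extend it for your own data.
STOPWORDS = {"真的", "就是", "一个", "没有"}


def top_words(tokens, stopwords=STOPWORDS, n=10):
    """Count tokens, skipping stop words, whitespace and single characters."""
    kept = [
        t.strip() for t in tokens
        if len(t.strip()) > 1 and t.strip() not in stopwords
    ]
    return Counter(kept).most_common(n)


# Illustrative tokens, not real scraped data.
tokens = ["电影", "真的", "好看", ",", "反转", "很", "多",
          "电影", "的", "反转", "反转"]
print(top_words(tokens, n=3))
```

The resulting counts, converted with `dict(...)`, can be passed to `WordCloud.generate_from_frequencies` instead of calling `generate` on the joined string.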

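2. Viewer gender and rating share analysis (only part of the data was collected, so the result does not represent the real distribution; please don't read too much into it)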

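import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Assumed meaning of the gender codes; check it against your own data
GENDER_LABELS = {1: 'male', 2: 'female'}


def gender_pie_analysis(file_name):
    df = pd.read_csv(file_name, encoding='utf-8-sig')
    print(df)
    # 1. Viewer gender: count each code, treating missing values as 0 (unknown)
    gender = df['gender'].fillna(0).astype(int).value_counts()
    print(gender)
    # Derive the labels from the index so they always match the counted values
    # (value_counts sorts by frequency, so a fixed label list can be mismatched)
    labels = [GENDER_LABELS.get(code, 'unknown') for code in gender.index]
    # Create the figure and axes
    fig, ax = plt.subplots(figsize=(6, 6), dpi=100)
    size = 0.5
    ax.pie(gender, labels=labels, startangle=90, autopct='%.1f%%',
           colors=sns.color_palette('husl', len(gender)),
           radius=1,  # pie radius (default 1)
           pctdistance=0.75,  # position of the percentage text
           wedgeprops=dict(width=size, edgecolor='w'),  # ring width of the donut
           textprops=dict(fontsize=10)  # label font size
           )
    ax.set_title('Full River Red: viewer gender share', fontsize=15)
    plt.show()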

[Figure: gender share]

[Figure: rating share]
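The rating share can be computed in the same spirit as the gender share. A minimal sketch using only the standard library; the function name `rating_share` and the sample scores are illustrative, not real scraped data:

```python
from collections import Counter


def rating_share(scores):
    """Return {score: percentage} for the non-missing scores, sorted by score."""
    valid = [s for s in scores if s is not None]
    if not valid:
        return {}
    counts = Counter(valid)
    return {
        score: round(100 * count / len(valid), 1)
        for score, count in sorted(counts.items())
    }


# Made-up scores for illustration only.
scores = [5, 5, 4, None, 3, 5, 4, 5]
print(rating_share(scores))
```

When the scores come from the CSV file via pandas, missing values are NaN rather than None, so `df['score'].dropna()` should be applied before passing them in. The values of the returned dict can then be drawn with `ax.pie`, using its keys as labels.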

3. User level analysis

import matplotlib.pyplot as plt
import pandas as pd


def user_level_bar_analysis(file_name):
    df = pd.read_csv(file_name, encoding='utf-8-sig')
    print(df)
    # Number of comments per user level, ordered by level
    userLevel = df['userLevel'].value_counts().sort_index()
    print(userLevel)
    x = userLevel.index
    y = userLevel
    fig, ax = plt.subplots()
    ax.bar(x, y, color='#DE85B5')
    # Bar chart title
    ax.set_title('Number of commenting users by user level')
    ax.grid(True, axis='y', alpha=1)
    # Write each count above its bar
    for i, j in zip(x, y):
        ax.text(i, j, '%d' % j, horizontalalignment='center',
                verticalalignment='bottom')
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    plt.show()

[Figure: distribution of user levels]

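This article only analyzes the data from the angle of reviews and ratings. The analysis could be taken further from other angles, such as film genre, the top films of the year, and box office.

This article comes from my Zhihu account: 梓羽Python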

