龙空技术网

JS Reverse Engineering Tutorial: Scraping Toutiao (今日头条) Videos with Python

Python可乐 218

Preface:


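While scraping article data from Toutiao recently, I found that obtaining the video address is fairly involved. Reading the page source alongside the browser led me to a workable approach, which I record here.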

Contents


Required Python modules
Implementation approach
Code and results

Main text

1. Required Python modules

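   The main modules are requests (or aiohttp) and PyExecJS. The former fetches the article's page source; the latter is a library that lets Python execute JS code, used here mainly to generate the video address.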
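Before going further, it can help to confirm that these packages are importable. The following is a minimal sketch using only the standard library. It assumes the usual import names, in particular that PyExecJS is imported as `execjs` (as in the code later in this article). The helper name `missing_modules` is made up for this illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# PyExecJS installs under the import name "execjs".
required = ["requests", "aiohttp", "execjs"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Note that PyExecJS also needs a JavaScript runtime (for example Node.js) on the machine, since it only hands the JS code over to that runtime.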
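2. Implementation approach

Step 1. The requirement is to replace the video and image addresses in the original article with locally stored addresses, so the resources have to be downloaded. While analysing the video, packet capture showed the video address being requested, yet no matching video address parameter appeared in the page source or in the responses of the related interfaces.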

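Rendering the article source (HTML) in the browser shows that the video tag is generated afterwards and that it does carry the video address, so the tag must be created by JS. Searching the page turns up the script tag that loads the key JS file.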
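One way to check this observation programmatically is to scan the raw HTML for `<video>` tags and list the external scripts that might create them. The sketch below uses the standard-library `html.parser`. The sample HTML and the script path `/static/player.js` are hypothetical placeholders, not taken from the real page.

```python
from html.parser import HTMLParser


class TagScanner(HTMLParser):
    """Collect <script src=...> URLs and note whether a <video> tag is present."""

    def __init__(self):
        super().__init__()
        self.script_srcs = []
        self.has_video = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "script" and attr_map.get("src"):
            self.script_srcs.append(attr_map["src"])
        elif tag == "video":
            self.has_video = True


def scan(html):
    """Return (list of script src values, True if a <video> tag exists)."""
    scanner = TagScanner()
    scanner.feed(html)
    return scanner.script_srcs, scanner.has_video


# Hypothetical raw HTML: no <video> tag yet, only a script that would create it.
raw_html = (
    '<html><body><div tt-videoid="abc"></div>'
    '<script src="/static/player.js"></script></body></html>'
)
srcs, has_video = scan(raw_html)
print(srcs, has_video)
```

If the raw source reports no video tag while the rendered page has one, the listed scripts are the natural place to start looking.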

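Step 2. Analysing the JS at that address reveals a method that generates the video tag, which suggests that the method producing the video address lives in the same file, as follows: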

This makes it clear where the video address we want comes from. The method is:

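Analysing this method shows a key parameter, t. Figure 2 also shows method e being called with the argument v, which reminded me of the main_url field returned by one of the interfaces seen during packet capture:

    var u = o.data.video_list,
        h = u.video_1,
        v = h.main_url,

Step 3. The interface is: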

In the interface response:
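As a rough sketch of reading main_url out of such a response: the `data.video_list.<key>.main_url` shape, the `message` field, and the `video_3` / `video_2` / `video_1` keys come from the article's own code further down, while every value in the sample body is a placeholder and the helper name `pick_main_url` is hypothetical.

```python
import json

# Hypothetical response body; only the structure follows the article,
# all values are placeholders.
sample_body = json.dumps({
    "message": "success",
    "data": {
        "video_list": {
            "video_1": {"main_url": "PLACEHOLDER_1"},
            "video_2": {"main_url": "PLACEHOLDER_2"},
        }
    },
})


def pick_main_url(body, preferred=("video_3", "video_2", "video_1")):
    """Return the main_url of the first available entry, or None."""
    payload = json.loads(body)
    if payload.get("message", "").strip() != "success":
        return None
    video_list = payload.get("data", {}).get("video_list", {})
    for key in preferred:
        entry = video_list.get(key)
        if entry and "main_url" in entry:
            return entry["main_url"]
    return None


print(pick_main_url(sample_body))
```

The preference order simply mirrors the order in which the article's code checks the keys.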

The parameter in the interface URL (v0201f800000bub4vq2vtt9a5oknnlp0) can also be found in the page source and can be extracted with a regular expression.
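A minimal sketch of that extraction, using the same `tt-videoid` and `tt-poster` patterns as the code later in this article. The sample HTML fragment is hypothetical: the video ID is the one quoted above, and the poster URL is a placeholder.

```python
import re

VIDEO_ID_RE = re.compile(r'tt-videoid="(.*?)"')
POSTER_RE = re.compile(r'tt-poster="(.*?)"')


def extract_video_info(page_source):
    """Return (video_id, poster_url), or None if either attribute is absent."""
    video_id = VIDEO_ID_RE.search(page_source)
    poster = POSTER_RE.search(page_source)
    if video_id is None or poster is None:
        return None
    return video_id.group(1), poster.group(1)


# Hypothetical fragment of rendered page source.
sample = (
    '<div tt-videoid="v0201f800000bub4vq2vtt9a5oknnlp0" '
    'tt-poster="https://example.com/cover.jpg"></div>'
)
print(extract_video_info(sample))
```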

It is worth boldly trying to pass the main_url value to the method that generates the video address. The few parameters at the bottom of the JS file also need to be included, namely:

    var c = new Array(
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 62, -1, -1, -1, 63,
        52, 53, 54, 55, 56, 57, 58, 59, 60, 61, -1, -1, -1, -1, -1, -1,
        -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
        15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, -1, -1, -1, -1, -1,
        -1, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
        41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, -1, -1, -1, -1, -1);
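The array c looks like the standard Base64 reverse-lookup table ('+' maps to 62, '/' to 63, digits to 52..61, upper-case letters to 0..25, lower-case letters to 26..51). Assuming that reading is right, the decoding can be sketched in pure Python as below. The table is rebuilt from the standard alphabet rather than copied, the function name `decode_main_url` is hypothetical, and the sample URL is a placeholder. This is an illustration in the spirit of the JS method e, not a verified line-by-line port.

```python
import base64
import string

# Assumption: `c` is the standard Base64 reverse-lookup table, so it is
# rebuilt here from the standard alphabet instead of being copied.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
TABLE = [-1] * 128
for value, char in enumerate(ALPHABET):
    TABLE[ord(char)] = value


def decode_main_url(encoded):
    """Table-driven Base64 decoder that returns one character per decoded byte."""
    sextets = []
    for char in encoded:
        if char == "=":              # padding marks the end of the data
            break
        code = ord(char) & 255
        value = TABLE[code] if code < 128 else -1
        if value != -1:              # characters outside the alphabet are skipped
            sextets.append(value)

    out = []
    # Each group of up to four 6-bit values yields up to three bytes.
    for i in range(0, len(sextets) - 1, 4):
        group = sextets[i:i + 4]
        out.append(group[0] << 2 | (group[1] & 48) >> 4)
        if len(group) > 2:
            out.append((group[1] & 15) << 4 | (group[2] & 60) >> 2)
        if len(group) > 3:
            out.append((group[2] & 3) << 6 | group[3])
    return "".join(chr(byte) for byte in out)


# Placeholder input: a Base64-encoded example URL, not a real main_url.
sample = base64.b64encode(b"http://example.com/video.mp4").decode("ascii")
print(decode_main_url(sample))
```

If the assumption holds, `base64.b64decode` from the standard library should yield the same bytes, which would remove the need for a JS runtime in this step.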

I used a JS debugging tool (convenient for debugging and for checking syntax); other methods work too.

The result is:

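This address is the video address, which confirms the guess above. The address parameters are time-limited, however, so the address has to be generated dynamically each time. You can test this by regenerating it yourself.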

3. Code and results (I used the asynchronous approach)

import re

import aiohttp
import execjs
from aiohttp import TCPConnector
from pyppeteer import launch

# Not defined in the original listing; the values below are assumed.
continue_num = 3                           # maximum number of retries
timeout = aiohttp.ClientTimeout(total=30)  # request timeout

# download_video and download_video_cover are helpers that are not shown here.


async def get_page_source(url):
    """Render the page in headless Chromium and return its HTML, or -1 on failure."""
    browser = None
    page = None
    try:
        browser = await launch(
            headless=True,
            ignoreHTTPSErrors=True,
            handleSIGINT=False,
            handleSIGTERM=False,
            handleSIGHUP=False,
            defaultViewport=None,
            args=['--disable-setuid-sandbox',
                  '--no-sandbox',
                  '--ignore-certificate-errors',
                  '--disable-gpu',
                  '--disable-gpu-sandbox',
                  '--start-maximized'
                  ]
        )
        pages = await browser.pages()
        page = pages[0]
        # Whether JS is enabled; with enabled=False nothing is rendered
        await page.setJavaScriptEnabled(enabled=True)
        await page.setViewport(viewport={'width': 1200, 'height': 800})
        await page.evaluateOnNewDocument(
            '() =>{ Object.defineProperties(navigator,{ webdriver:{ get: () => false } }) }')
        await page.evaluateOnNewDocument(
            "() =>{ Object.defineProperty(navigator, 'plugins', { get: () => [] }) }")
        # Fixed: the closing quote after 'zh' was missing
        await page.evaluateOnNewDocument(
            "() =>{ Object.defineProperty(navigator, 'languages', { get: () => ['zh-CN','zh'] }) }")
        await page.setUserAgent(
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
            'Chrome/83.0.4103.116 Safari/537.36')
        await page.goto(url, {'timeout': 5000, 'waitUntil': 'load'})
        page_source = await page.content()
        return page_source
    except Exception as e:
        # app_logger.error('account: %s, login error: %s' % (username, e))
        print(e)
        return -1
    finally:
        if page is not None:
            # await page.waitFor(1000)
            await page.close()
        if browser is not None:
            await browser.close()


async def get_data(url, continue_number=0):
    """Parse the article source and extract video, text, image and other information."""
    try:
        page_source = await get_page_source(url)
        # Video handling, including the video cover
        video_message_id_ = re.findall('tt-videoid="(.*?)"', page_source)
        video_cover_ = re.findall('tt-poster="(.*?)"', page_source)
        if len(video_message_id_) > 0 and len(video_cover_) > 0:
            video_message_id = video_message_id_[0]
            video_url = await get_video_url_id(video_message_id, url)
            video_cover = await download_video_cover(video_cover_[0], url)
    except Exception as e:
        if continue_number < continue_num:
            print(e)
            # app_logger.error('function get_data error: %s' % e)
            continue_number += 1
            video_address = await get_data(url, continue_number)
            return video_address
        else:
            # app_logger.error('function get_data : %s  exceed maximum retry' % url)
            return -1


async def get_video_url_id(video_id, article_url, continue_number=0):
    """Fetch the interface response and resolve the video's main_url."""
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                            'Chrome/83.0.4103.116 Safari/537.36'}
    # The interface URL prefix is missing here; only the video_id placeholder remains
    data_url = '{}'.format(video_id)
    try:
        # ssl=False replaces the deprecated verify_ssl=False
        async with aiohttp.ClientSession(connector=TCPConnector(ssl=False), timeout=timeout) as session:
            async with session.get(data_url, headers=header) as resp:
                response = await resp.json()
                if response['message'].strip() == "success":
                    data = response['data']['video_list']
                    keys = data.keys()
                    # Fixed: every branch originally read data['video_3']
                    if 'video_3' in keys:
                        main_url = data['video_3']['main_url']
                    elif 'video_2' in keys:
                        main_url = data['video_2']['main_url']
                    else:
                        main_url = data['video_1']['main_url']
                    video_url = await get_video_url(main_url)
                    video_url_oss = await download_video(video_url, article_url)
                    return video_url_oss
    except Exception as e:
        if continue_number < continue_num:
            print(e)
            # app_logger.error('function get_video_url_id error: %s' % e)
            continue_number += 1
            # Fixed: the original retried get_data with an undefined name `url`
            video_address = await get_video_url_id(video_id, article_url, continue_number)
            return video_address
        else:
            # app_logger.error('function get_video_url_id : %s  exceed maximum retry' % article_url)
            return -1


async def get_video_url(main_url, continue_number=0):
    """Get the video address by executing the JS."""
    try:
        tt = """
        var c = new Array(
            -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
            -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
            -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 62, -1, -1, -1, 63,
            52, 53, 54, 55, 56, 57, 58, 59, 60, 61, -1, -1, -1, -1, -1, -1,
            -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
            15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, -1, -1, -1, -1, -1,
            -1, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
            41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, -1, -1, -1, -1, -1);
        function e(t) {
            var e, o, i, r, n, a, s;
            for (a = t.length, n = 0, s = ""; a > n;) {
                do e = c[255 & t.charCodeAt(n++)];
                while (a > n && -1 == e);
                if (-1 == e) break;
                do o = c[255 & t.charCodeAt(n++)];
                while (a > n && -1 == o);
                if (-1 == o) break;
                s += String.fromCharCode(e << 2 | (48 & o) >> 4);
                do {
                    if (i = 255 & t.charCodeAt(n++), 61 == i) return s;
                    i = c[i]
                } while (a > n && -1 == i);
                if (-1 == i) break;
                s += String.fromCharCode((15 & o) << 4 | (60 & i) >> 2);
                do {
                    if (r = 255 & t.charCodeAt(n++), 61 == r) return s;
                    r = c[r]
                } while (a > n && -1 == r);
                if (-1 == r) break;
                s += String.fromCharCode((3 & i) << 6 | r)
            }
            return s
        }"""
        js = execjs.compile(tt)
        result = js.call('e', main_url)
        return result
    except Exception as e:
        if continue_number < continue_num:
            # app_logger.error('function get_video_url error: %s' % e)
            continue_number += 1
            video_address = await get_video_url(main_url, continue_number)
            return video_address
        else:
            # app_logger.error('function get_video_url  exceed maximum retry')
            return -1

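Summary

This project is one form of JS-based anti-scraping. It is relatively simple: there is no JS obfuscation or parameter encryption, and few obstacles when tracing the logic. I will keep sharing updates when I run into more complex cases.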
