Python Web Scraping Beginner Tutorial 17: Scraping the Ku* Music Website

Source: cnblogs　　Author: 有趣的Python　　Date: 2021/2/18 15:43:24

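Preface

The text and images in this article come from the internet and are for learning and exchange only, with no commercial use. If there is any problem, please contact us promptly so it can be handled.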

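Previous articles

Python Web Scraping Beginner Tutorial 01: Scraping Douban Top Movies

Python Web Scraping Beginner Tutorial 02: Scraping Novels

Python Web Scraping Beginner Tutorial 03: Scraping Second-Hand Housing Data

Python Web Scraping Beginner Tutorial 04: Scraping Job Listings

Python Web Scraping Beginner Tutorial 05: Scraping Bilibili Video Danmaku

Python Web Scraping Beginner Tutorial 06: Building a Word Cloud from Scraped Data

Python Web Scraping Beginner Tutorial 07: Scraping Tencent Video Danmaku

Python Web Scraping Beginner Tutorial 08: Scraping CSDN Articles and Saving Them as PDF

Python Web Scraping Beginner Tutorial 09: Multithreaded Scraping of Meme Images

Python Web Scraping Beginner Tutorial 10: Scraping 彼岸 (Bian) Wallpapers

Python Web Scraping Beginner Tutorial 11: Scraping the New Honor of Kings Skin Images

Python Web Scraping Beginner Tutorial 12: Scraping League of Legends Skin Images

Python Web Scraping Beginner Tutorial 13: Scraping High-Quality Desktop Wallpapers

Python Web Scraping Beginner Tutorial 14: Scraping Audiobook Audio

Python Web Scraping Beginner Tutorial 15: Scraping Music Website Data

Python Web Scraping Beginner Tutorial 16: Scraping an Audio Material Website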


Basic development environment

Python 3.6
PyCharm

I. Define the requirements

Scrape the music on all of the site's charts.

II. Analyze the page data

1. First find the music URL

Click play, and a music playback URL appears in the browser's developer tools.

2. Trace where the music URL comes from

https://webfs.yun.kugou.com/202102051451/598a943870c34115e8c290507183a2c9/G188/M06/18/09/_A0DAF34pOiABslMADSv-ykkq2s784.mp3

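A music URL like this has no recognizable pattern, so search the developer tools for the request it comes from.
Both URLs found there are usable; one of them is a backup URL.

These are the request parameters of that data packet. One request alone does not show which parameters vary, so compare it with the request for a second song.

The comparison shows that hash and album_id are the two parameters that change. The last parameter is a timestamp, which can be treated as a constant.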
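
The comparison described above can also be done mechanically. The following is a minimal sketch, assuming the two captured request URLs have been copied out of the developer tools; the function name changed_params and the values HASH_A, HASH_B, 111 and 222 are placeholders invented for illustration, while the endpoint and parameter names are the ones used later in this article.

```python
from urllib.parse import parse_qsl, urlsplit


def changed_params(url_a, url_b):
    """Return {name: (value_in_a, value_in_b)} for query parameters that differ."""
    params_a = dict(parse_qsl(urlsplit(url_a).query))
    params_b = dict(parse_qsl(urlsplit(url_b).query))
    diff = {}
    for name in sorted(set(params_a) | set(params_b)):
        if params_a.get(name) != params_b.get(name):
            diff[name] = (params_a.get(name), params_b.get(name))
    return diff


# Placeholder requests: the hash, album_id and timestamp values are made up.
url_a = ('https://wwwapi.kugou.com/yy/index.php'
         '?r=play/getdata&hash=HASH_A&platid=4&album_id=111&_=1612508120385')
url_b = ('https://wwwapi.kugou.com/yy/index.php'
         '?r=play/getdata&hash=HASH_B&platid=4&album_id=222&_=1612508129999')

for name, (old, new) in changed_params(url_a, url_b).items():
    print(name, old, '->', new)
```

With these placeholder URLs only `_`, `album_id` and `hash` are reported; `r` and `platid` are identical in both requests and drop out.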

3. Find where the hash and album_id request parameters come from

Both parameters are already present in the HTML source of the chart list page.

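The song names in that source are stored as escape sequences and need decoding. Only hash and album_id are needed at this step, so the names do not have to be extracted here, but the decoding is worth a brief mention.

`How to decode a string such as \u591c\u591c\u591c\u6f2b\u957f`

string.encode('utf-8').decode('unicode_escape')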
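
As a small worked example of the one-liner above, the sketch below decodes the escaped title from this article in two ways. The function names are made up for illustration, and the json-based variant is an alternative added here, not something the original article uses; it assumes the text contains no unescaped double quotes.

```python
import json


def decode_unicode_escape(text):
    """Decode literal backslash-u sequences with the codec used in this article."""
    return text.encode('utf-8').decode('unicode_escape')


def decode_via_json(text):
    """Alternative: treat the text as the body of a JSON string literal."""
    return json.loads('"' + text + '"')


# Raw string, so the backslashes stay literal, as they do in the page source.
escaped = r'\u591c\u591c\u591c\u6f2b\u957f'

print(decode_unicode_escape(escaped))  # 夜夜夜漫长
print(decode_via_json(escaped))        # 夜夜夜漫长
```

One difference matters in practice: the unicode_escape codec reads the encoded bytes as Latin-1, so text that already contains non-ASCII characters comes out garbled, whereas the JSON route leaves such characters untouched.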

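Since hash and album_id are both in the page source, all that remains is to collect the URL of each chart category; with those, the music on every chart can be scraped.

Requesting the page directly returns the URLs of all categories.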

III. Code implementation

Get the URL and title of every category

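import parsel

# get_response() is a request helper defined outside this snippet.
def get_type_url(html_url):
    response = get_response(html_url)
    selector = parsel.Selector(response.text)
    # Each <li> under .pc_temp_side is one chart category
    lis = selector.css('.pc_temp_side ul li')
    for li in lis:
        # Category title
        type_title = li.css('a::attr(title)').get()
        # Category URL
        type_url = li.css('a::attr(href)').get()
        print(f'Scraping {type_title}', type_url)
        # Collect hash and album_id for the songs on this chart
        get_music_info(type_url)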
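
The function above calls get_response, a helper this section does not define. What follows is a minimal sketch of what such a helper might look like, assuming it is a thin wrapper around requests.get that sends the same browser user-agent used later in this article. The name DEFAULT_HEADERS, the timeout value and the raise_for_status call are assumptions for illustration, not the author's code.

```python
import requests

# Same browser user-agent string that appears later in this article.
DEFAULT_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}


def get_response(html_url, params=None, timeout=10):
    """Send a GET request and return the requests.Response object."""
    response = requests.get(url=html_url, params=params,
                            headers=DEFAULT_HEADERS, timeout=timeout)
    # Assumption: fail early on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response
```

The returned object exposes `.text` for HTML pages and `.content` for binary data such as the mp3 files, which is how the other functions in this article use it.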

Get the request parameters hash and album_id

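import re

def get_music_info(type_url):
    response = get_response(type_url)
    # The song list is embedded in the page as "global.features = [...]";
    # decode the \uXXXX escapes so the text is readable
    result = re.findall(r'global\.features = \[(.*?)\]', response.text)[0].encode('utf-8').decode('unicode_escape')
    hash_num = re.findall(r'"Hash":"(.*?)"', result)
    album_id = re.findall(r'"album_id":(\d+),"', result)
    # Pair each hash with its album_id
    music_info = zip(hash_num, album_id)
    for index in music_info:
        music_hash = index[0]
        music_id = index[1]
        # Look up the playback URL for this song
        get_music_url(music_hash, music_id)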
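
The two separate regular expressions above rely on zip to line the results up, which silently misaligns if one song lacks a field. As a hedged alternative, the sketch below parses the embedded list as JSON so each hash stays attached to its own album_id. It assumes the page embeds the list as valid JSON terminated by `];`; the sample HTML, its values and the placeholder_field key are invented for illustration.

```python
import json
import re

# Made-up page fragment; only the "Hash" and "album_id" keys come from the article.
SAMPLE_HTML = r'''
<script>
global.features = [{"Hash":"AAAA1111","album_id":101,"placeholder_field":"x"},{"Hash":"BBBB2222","album_id":202,"placeholder_field":"y"}];
</script>
'''


def parse_features(html):
    """Return a list of (hash, album_id) pairs found in the page source."""
    match = re.search(r'global\.features = (\[.*?\]);', html, re.S)
    if match is None:
        return []
    songs = json.loads(match.group(1))
    return [(song['Hash'], song['album_id']) for song in songs]


for music_hash, album_id in parse_features(SAMPLE_HTML):
    print(music_hash, album_id)
```

json.loads also decodes any `\uXXXX` escapes in string values, so no separate unicode_escape step is needed in this variant. Note that album_id comes back as an integer here rather than a string.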

Get the music URL and the song name

import requests

def get_music_url(music_hash, album_id):
    page_url = 'https://wwwapi.kugou.com/yy/index.php'
    # Query parameters copied from the captured request;
    # only hash and album_id change from song to song
    params = {
        'r': 'play/getdata',
        'hash': music_hash,
        'dfid': '3ve7aQ2XyGmN0yE3uv3WcaHs',
        'mid': 'ac3836df72c523f46a85d8a5fd90fe59',
        'platid': '4',
        'album_id': album_id,
        '_': '1612508120385',  # timestamp, treated as a constant here
    }
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=page_url, params=params, headers=headers, timeout=10)
    json_data = response.json()
    music_name = json_data['data']['audio_name']
    music_url = json_data['data']['play_url']
    # Download the audio file
    save(music_name, music_url)
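
The function above hard-codes the timestamp and indexes the JSON directly, so a response without the expected fields raises an exception. The sketch below shows one possible way to build the parameters with a fresh millisecond timestamp and to read the two fields defensively. The helper names build_params and extract_track are hypothetical, the dfid and mid values are copied from this article, and whether the API actually checks the timestamp is not something the article states.

```python
import time
from urllib.parse import urlencode

API_URL = 'https://wwwapi.kugou.com/yy/index.php'


def build_params(music_hash, album_id, now=None):
    """Build the query parameters, using the current time in milliseconds."""
    seconds = time.time() if now is None else now
    return {
        'r': 'play/getdata',
        'hash': music_hash,
        'dfid': '3ve7aQ2XyGmN0yE3uv3WcaHs',          # value copied from the article
        'mid': 'ac3836df72c523f46a85d8a5fd90fe59',   # value copied from the article
        'platid': '4',
        'album_id': album_id,
        '_': str(int(seconds * 1000)),
    }


def extract_track(json_data):
    """Return (audio_name, play_url), or None if either field is missing."""
    data = json_data.get('data')
    if not isinstance(data, dict):
        return None
    name, url = data.get('audio_name'), data.get('play_url')
    if not name or not url:
        return None
    return name, url


# HASH_A and 111 are placeholders, not real song identifiers.
print(API_URL + '?' + urlencode(build_params('HASH_A', '111')))
print(extract_track({'data': {'audio_name': 'demo', 'play_url': 'https://example.com/a.mp3'}}))
```

Returning None instead of raising lets the caller skip a song that has no playable URL and continue with the rest of the chart.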

Save the data locally

import os

def save(music_name, music_url):
    path = 'music'
    # Create the output folder on first use
    if not os.path.exists(path):
        os.mkdir(path)
    music_content = get_response(music_url).content
    with open(os.path.join(path, music_name + '.mp3'), mode='wb') as f:
        f.write(music_content)
        print('Saved:', music_name)
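
Song names can contain characters such as `/`, `:` or `?` that are not valid in file names on some systems, in which case the open call above fails. The following is a minimal sketch of a more defensive save step, assuming that replacing those characters with underscores is acceptable; safe_filename and save_bytes are hypothetical names, and the bytes written in the demo are fake data rather than real audio.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(name):
    """Replace characters that Windows forbids in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', name).strip()
    return cleaned or 'untitled'


def save_bytes(directory, music_name, content):
    """Write content to <directory>/<music_name>.mp3 and return the path."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / (safe_filename(music_name) + '.mp3')
    target.write_bytes(content)
    return target


# Demo in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(tmp, 'Artist/Band - Title: Part?', b'fake audio bytes')
    print(saved.name)
```

Separating the download from the write also makes the file handling easy to test without any network access.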