Basic Usage of XPath, JsonPath, and bs4 in Python
Source: jb51  Date: 2022/7/4 8:50:16

1. XPath

1.1 Using XPath

  • In Chrome, install the XPath plugin ahead of time; press Ctrl + Shift + X to bring up its panel.
  • Install the lxml library: pip install lxml -i https://pypi.douban.com/simple
  • Import etree: from lxml import etree
  • etree.parse() parses a local file: html_tree = etree.parse('XX.html')
  • etree.HTML() parses a server response: html_tree = etree.HTML(response.read().decode('utf-8'))
  • Run an XPath query with html_tree.xpath(xpath_expression) (see the sketch below)
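A minimal sketch tying these steps together (the file name is illustrative; note that etree.parse() defaults to a strict XML parser, so passing etree.HTMLParser() makes it tolerant of real-world HTML):

  from lxml import etree

  # Local file: an explicit HTML parser keeps imperfect markup from raising an error
  html_tree = etree.parse('example.html', etree.HTMLParser())
  print(html_tree.xpath('//title/text()'))

  # Server response: build the tree from the decoded HTML string instead
  # html_tree = etree.HTML(response.read().decode('utf-8'))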

1.2 Basic XPath syntax

1. Path queries

  • // finds all descendant nodes, regardless of depth
  • / finds direct child nodes

2. Predicate queries

  1. //div[@id]
  2. //div[@id="maincontent"]

3. Attribute queries

  1. //@class

4. Fuzzy queries

  1. //div[contains(@id, "he")]
  2. //div[starts-with(@id, "he")]

5. Content queries

  1. //div/h1/text()

6. Logical operators

  1. //div[@id="head" and @class="s_down"]
  2. //title | //price

1.3 Example

xpath.html

  <!DOCTYPE html>
  <html lang="en">
  <head>
      <meta charset="UTF-8"/>
      <title>Title</title>
  </head>
  <body>
      <ul>
          <li id="l1" class="class1">北京</li>
          <li id="l2" class="class2">上海</li>
          <li id="d1">广州</li>
          <li>深圳</li>
      </ul>
  </body>
  </html>
  from lxml import etree

  # XPath parsing
  # Local file: etree.parse
  # Server response data (response.read().decode('utf-8')): etree.HTML()


  tree = etree.parse('xpath.html')

  # Find the li elements under ul
  li_list = tree.xpath('//body/ul/li')
  print(len(li_list))  # 4

  # Get the text content of the tags
  li_list = tree.xpath('//body/ul/li/text()')
  print(li_list)  # ['北京', '上海', '广州', '深圳']

  # Get the li tags that have an id attribute
  li_list = tree.xpath('//ul/li[@id]')
  print(len(li_list))  # 3

  # Get the content of the tag whose id is l1
  li_list = tree.xpath('//ul/li[@id="l1"]/text()')
  print(li_list)  # ['北京']

  # Get the class attribute of the tag whose id is l1
  c1 = tree.xpath('//ul/li[@id="l1"]/@class')
  print(c1)  # ['class1']

  # Get the tags whose id contains "l"
  li_list = tree.xpath('//ul/li[contains(@id, "l")]/text()')
  print(li_list)  # ['北京', '上海']

  # Get the tags whose id starts with "d"
  li_list = tree.xpath('//ul/li[starts-with(@id,"d")]/text()')
  print(li_list)  # ['广州']

  # Get the tag whose id is l2 and whose class is class2
  li_list = tree.xpath('//ul/li[@id="l2" and @class="class2"]/text()')
  print(li_list)  # ['上海']

  # Get the tags whose id is l2 or d1
  li_list = tree.xpath('//ul/li[@id="l2"]/text() | //ul/li[@id="d1"]/text()')
  print(li_list)  # ['上海', '广州']

1.4 Scraping the value of the Baidu search button

  import urllib.request
  from lxml import etree

  url = 'http://www.baidu.com'
  headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
  }
  # Send a browser User-Agent so Baidu returns the full page
  request = urllib.request.Request(url=url, headers=headers)
  response = urllib.request.urlopen(request)
  content = response.read().decode('utf-8')
  tree = etree.HTML(content)
  # The search button is the input with id="su"; grab its value attribute
  value = tree.xpath('//input[@id="su"]/@value')
  print(value)

1.5 Scraping images from 站长素材 (sc.chinaz.com)

  # Goal: download the images from the first N pages
  # Page 1:  https://sc.chinaz.com/tupian/qinglvtupian.html
  # Page n:  https://sc.chinaz.com/tupian/qinglvtupian_n.html
  import urllib.request
  from lxml import etree


  def create_request(page):
      if page == 1:
          url = 'https://sc.chinaz.com/tupian/qinglvtupian.html'
      else:
          url = 'https://sc.chinaz.com/tupian/qinglvtupian_' + str(page) + '.html'
      headers = {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
      }
      request = urllib.request.Request(url=url, headers=headers)
      return request


  def get_content(request):
      response = urllib.request.urlopen(request)
      content = response.read().decode('utf-8')
      return content


  def down_load(content):
      # Download the images
      # urllib.request.urlretrieve('image URL', 'file name')
      tree = etree.HTML(content)
      name_list = tree.xpath('//div[@id="container"]//a/img/@alt')
      # Image-heavy sites usually lazy-load, so the real URL is in src2, not src
      src_list = tree.xpath('//div[@id="container"]//a/img/@src2')
      print(src_list)
      for i in range(len(name_list)):
          name = name_list[i]
          src = src_list[i]
          url = 'https:' + src
          urllib.request.urlretrieve(url=url, filename='./loveImg/' + name + '.jpg')


  if __name__ == '__main__':
      start_page = int(input('Enter the start page: '))
      end_page = int(input('Enter the end page: '))

      for page in range(start_page, end_page + 1):
          # (1) Build the request object
          request = create_request(page)
          # (2) Fetch the page source
          content = get_content(request)
          # (3) Download the images
          down_load(content)
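Two practical caveats before running this script: urllib.request.urlretrieve() does not create missing directories, so ./loveImg/ must exist beforehand, and the lazy-load attribute name (src2 on this site) varies between sites (data-original and data-src are also common), so confirm it in the browser's devtools. A small guard:

  import os

  # urlretrieve raises FileNotFoundError if the target folder does not exist
  os.makedirs('./loveImg', exist_ok=True)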

2. JsonPath

2.1 Installing with pip

  pip install jsonpath

2.2 Using jsonpath

  obj = json.load(open('file.json', 'r', encoding='utf-8'))
  ret = jsonpath.jsonpath(obj, 'jsonpath_expression')

Comparison of JSONPath syntax elements with their XPath counterparts:
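(In the original post this comparison appeared as an image; the table below reconstructs the standard correspondence from Stefan Goessner's JSONPath documentation.)

  XPath   JSONPath            Description
  /       $                   root element / root object
  .       @                   current element / current object
  /       . or []             child operator
  ..      n/a                 parent (not supported in JSONPath)
  //      ..                  recursive descent
  *       *                   wildcard (all elements/members)
  []      []                  subscript operator / predicate
  |       [,]                 union of alternatives
  n/a     [start:end:step]    array slice
  []      ?()                 filter expression
  n/a     ()                  script expression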

Example:

jsonpath.json

  {
      "store": {
          "book": [
              {
                  "category": "修真",
                  "author": "六道",
                  "title": "坏蛋是怎样练成的",
                  "price": 8.95
              },
              {
                  "category": "修真",
                  "author": "天蚕土豆",
                  "title": "斗破苍穹",
                  "price": 12.99
              },
              {
                  "category": "修真",
                  "author": "唐家三少",
                  "title": "斗罗大陆",
                  "isbn": "0-553-21311-3",
                  "price": 8.99
              },
              {
                  "category": "修真",
                  "author": "南派三叔",
                  "title": "星辰变",
                  "isbn": "0-395-19395-8",
                  "price": 22.99
              }
          ],
          "bicycle": {
              "author": "老马",
              "color": "黑色",
              "price": 19.95
          }
      }
  }
  import json
  import jsonpath

  obj = json.load(open('jsonpath.json', 'r', encoding='utf-8'))

  # The authors of all books in the store
  author_list = jsonpath.jsonpath(obj, '$.store.book[*].author')
  print(author_list)  # ['六道', '天蚕土豆', '唐家三少', '南派三叔']

  # All authors in the document
  author_list = jsonpath.jsonpath(obj, '$..author')
  print(author_list)  # ['六道', '天蚕土豆', '唐家三少', '南派三叔', '老马']

  # All elements directly under store
  tag_list = jsonpath.jsonpath(obj, '$.store.*')
  print(tag_list)  # [[{'category': '修真', 'author': '六道', 'title': '坏蛋是怎样练成的', 'price': 8.95}, {'category': '修真', 'author': '天蚕土豆', 'title': '斗破苍穹', 'price': 12.99}, {'category': '修真', 'author': '唐家三少', 'title': '斗罗大陆', 'isbn': '0-553-21311-3', 'price': 8.99}, {'category': '修真', 'author': '南派三叔', 'title': '星辰变', 'isbn': '0-395-19395-8', 'price': 22.99}], {'author': '老马', 'color': '黑色', 'price': 19.95}]

  # The price of everything in the store
  price_list = jsonpath.jsonpath(obj, '$.store..price')
  print(price_list)  # [8.95, 12.99, 8.99, 22.99, 19.95]

  # The third book
  book = jsonpath.jsonpath(obj, '$..book[2]')
  print(book)  # [{'category': '修真', 'author': '唐家三少', 'title': '斗罗大陆', 'isbn': '0-553-21311-3', 'price': 8.99}]

  # The last book
  book = jsonpath.jsonpath(obj, '$..book[(@.length-1)]')
  print(book)  # [{'category': '修真', 'author': '南派三叔', 'title': '星辰变', 'isbn': '0-395-19395-8', 'price': 22.99}]

  # The first two books
  book_list = jsonpath.jsonpath(obj, '$..book[0,1]')
  # book_list = jsonpath.jsonpath(obj, '$..book[:2]')
  print(book_list)  # [{'category': '修真', 'author': '六道', 'title': '坏蛋是怎样练成的', 'price': 8.95}, {'category': '修真', 'author': '天蚕土豆', 'title': '斗破苍穹', 'price': 12.99}]

  # Filter conditions require a ? in front of the parentheses
  # Filter out all books that have an isbn
  book_list = jsonpath.jsonpath(obj, '$..book[?(@.isbn)]')
  print(book_list)  # [{'category': '修真', 'author': '唐家三少', 'title': '斗罗大陆', 'isbn': '0-553-21311-3', 'price': 8.99}, {'category': '修真', 'author': '南派三叔', 'title': '星辰变', 'isbn': '0-395-19395-8', 'price': 22.99}]

  # Which books cost more than 10?
  book_list = jsonpath.jsonpath(obj, '$..book[?(@.price>10)]')
  print(book_list)  # [{'category': '修真', 'author': '天蚕土豆', 'title': '斗破苍穹', 'price': 12.99}, {'category': '修真', 'author': '南派三叔', 'title': '星辰变', 'isbn': '0-395-19395-8', 'price': 22.99}]
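One pitfall worth noting with the jsonpath package installed above: when an expression matches nothing, jsonpath.jsonpath() returns False rather than an empty list, so guard the result before iterating. A minimal sketch (the price threshold is illustrative):

  # No book costs more than 100, so this returns False, not []
  result = jsonpath.jsonpath(obj, '$..book[?(@.price>100)]')
  if result:
      for book in result:
          print(book['title'])
  else:
      print('no match')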

3. BeautifulSoup

3.1 Overview

1. Installation

 pip install bs4 

2. Import

 from bs4 import BeautifulSoup 

3. Creating the object

  • From a server response: soup = BeautifulSoup(response.read().decode(), 'lxml')
  • From a local file: soup = BeautifulSoup(open('1.html'), 'lxml')

Note: on Chinese Windows systems open() defaults to gbk, so specify encoding='utf-8' explicitly when opening a UTF-8 file.

3.2 Node location

1. Finding nodes by tag name

  soup.a            (note: returns only the first <a>)
  soup.a.name
  soup.a.attrs

2. Functions

  (1) find (returns a single object)
      find('a'): returns only the first <a> tag
      find('a', title='...')
      find('a', class_='...')
  (2) find_all (returns a list)
      find_all('a')            finds all <a> tags
      find_all(['a', 'span'])  returns all <a> and <span> tags
      find_all('a', limit=2)   returns only the first two <a> tags
  (3) select (returns node objects matching a CSS selector) [recommended]
      1. element              e.g. p
      2. .class               e.g. .firstname
      3. #id                  e.g. #firstname
      4. attribute selectors
         [attribute]          e.g. li = soup.select('li[class]')
         [attribute=value]    e.g. li = soup.select('li[class="hengheng1"]')
      5. hierarchy selectors
         element element      descendant, e.g. div p
         element>element      child, e.g. div>p
         element,element      union, e.g. div,p or soup.select('a,span')

3.3 Node information

  (1) Getting node content: for tags that nest other tags
      obj.string
      obj.get_text()  [recommended]
  (2) Node attributes
      tag.name returns the tag name
          e.g. tag = find('li'); print(tag.name)
      tag.attrs returns the attribute values as a dictionary
  (3) Getting a node's attributes
      obj.attrs.get('title')  [most common]
      obj.get('title')
      obj['title']

3.4 Usage example

bs4.html

  <!DOCTYPE html>
  <html lang="en">
  <head>
      <meta charset="UTF-8">
      <title>Title</title>
  </head>
  <body>

  <div>
      <ul>
          <li id="l1">张三</li>
          <li id="l2">李四</li>
          <li>王五</li>
          <a href="" id="" class="a1">google</a>
          <span>嘿嘿嘿</span>
      </ul>
  </div>

  <a href="" title="a2">百度</a>

  <div id="d1">
      <span>
          哈哈哈
      </span>
  </div>

  <p id="p1" class="p1">呵呵呵</p>
  </body>
  </html>
  from bs4 import BeautifulSoup

  # The bs4 basics are demonstrated here by parsing a local file
  # open() defaults to the platform encoding (gbk on Chinese Windows), so specify the encoding
  soup = BeautifulSoup(open('bs4.html', encoding='utf-8'), 'lxml')

  # Find a node by tag name
  # Returns the first matching element
  print(soup.a)  # <a class="a1" href="" id="">google</a>

  # Get the tag's attributes and their values
  print(soup.a.attrs)  # {'href': '', 'id': '', 'class': ['a1']}

  # Some bs4 functions
  # (1) find
  # Returns the first matching element
  print(soup.find('a'))  # <a class="a1" href="" id="">google</a>

  # Find the tag object with a matching title value
  print(soup.find('a', title="a2"))  # <a href="" title="a2">百度</a>

  # Find the tag object with a matching class value; note the trailing underscore in class_
  print(soup.find('a', class_="a1"))  # <a class="a1" href="" id="">google</a>

  # (2) find_all: returns a list of all the a tags
  print(soup.find_all('a'))  # [<a class="a1" href="" id="">google</a>, <a href="" title="a2">百度</a>]

  # To match several tag names, pass find_all a list
  print(soup.find_all(['a', 'span']))  # [<a class="a1" href="" id="">google</a>, <span>嘿嘿嘿</span>, <a href="" title="a2">百度</a>, <span>哈哈哈</span>]

  # limit returns only the first n matches
  print(soup.find_all('li', limit=2))  # [<li id="l1">张三</li>, <li id="l2">李四</li>]

  # (3) select (recommended)
  # select returns a list of every match
  print(soup.select('a'))  # [<a class="a1" href="" id="">google</a>, <a href="" title="a2">百度</a>]

  # A leading . selects by class (class selector)
  print(soup.select('.a1'))  # [<a class="a1" href="" id="">google</a>]

  # A leading # selects by id
  print(soup.select('#l1'))  # [<li id="l1">张三</li>]

  # Attribute selectors: find tags by their attributes
  # li tags that have an id
  print(soup.select('li[id]'))  # [<li id="l1">张三</li>, <li id="l2">李四</li>]

  # li tags whose id is l2
  print(soup.select('li[id="l2"]'))  # [<li id="l2">李四</li>]

  # Hierarchy selectors
  # Descendant selector: the li elements anywhere under div
  print(soup.select('div li'))  # [<li id="l1">张三</li>, <li id="l2">李四</li>, <li>王五</li>]

  # Child selector: a tag's direct children
  # Note: bs4 accepts the > combinator with or without surrounding spaces
  print(soup.select('div > ul > li'))  # [<li id="l1">张三</li>, <li id="l2">李四</li>, <li>王五</li>]

  # Union: all the a tags and li tags
  print(soup.select('a,li'))  # [<li id="l1">张三</li>, <li id="l2">李四</li>, <li>王五</li>, <a class="a1" href="" id="">google</a>, <a href="" title="a2">百度</a>]

  # Node information
  # Getting node content
  obj = soup.select('#d1')[0]
  # If the tag contains only text, both string and get_text() work
  # If the tag also contains other tags, string returns None while get_text() still returns the text
  # get_text() is therefore generally recommended
  print(obj.string)  # None
  print(obj.get_text())  # 哈哈哈

  # Node attributes
  obj = soup.select('#p1')[0]
  # name is the tag name
  print(obj.name)  # p
  # attrs returns the attribute values as a dictionary
  print(obj.attrs)  # {'id': 'p1', 'class': ['p1']}

  # Getting a node's attributes
  obj = soup.select('#p1')[0]
  # three equivalent ways to read an attribute
  print(obj.attrs.get('class'))  # ['p1']
  print(obj.get('class'))  # ['p1']
  print(obj['class'])  # ['p1']

3.5 Parsing Starbucks product names

  import urllib.request
  from bs4 import BeautifulSoup

  url = 'https://www.starbucks.com.cn/menu/'
  response = urllib.request.urlopen(url)
  content = response.read().decode('utf-8')

  soup = BeautifulSoup(content, 'lxml')
  # XPath equivalent: //ul[@class="grid padded-3 product"]//strong/text()
  # It is usually easiest to draft the expression in XPath first with the Chrome plugin,
  # then translate it into a CSS selector
  name_list = soup.select('ul[class="grid padded-3 product"] strong')
  for name in name_list:
      print(name.get_text())

This concludes this article on the basic usage of xpath, JsonPath, and bs4 in Python. For more on these topics, search w3xue's earlier articles, and we hope you will continue to support w3xue!
