# Python Web Scraping

## Overview
1. Fetch the page content with Python's `requests` library
2. Parse the HTML with Beautiful Soup
3. Store or analyze the data
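The steps above can be sketched end to end. This is a minimal sketch that parses a tiny inline page so it runs without a network connection; in a real crawler the HTML string would come from `requests.get(...).text`, and the `price_color` class is taken from the books.toscrape.com example used later in these notes:

```python
from bs4 import BeautifulSoup

# Step 1 (stubbed): in a real crawler this would be
#   html = requests.get("http://books.toscrape.com/").text
html = """
<ul>
  <li><p class="price_color">£51.77</p></li>
  <li><p class="price_color">£53.74</p></li>
</ul>
"""

# Step 2: parse the HTML
soup = BeautifulSoup(html, "html.parser")

# Step 3: extract and analyze the data
prices = [float(p.string[1:]) for p in soup.find_all("p", class_="price_color")]
print(prices)  # [51.77, 53.74]
```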
## Sending requests with `requests`

For real crawlers, start from the template: [[python语法(上)#Python requests 模块]]
```python
import requests

# e.g. a User-Agent key-value pair goes here
headers = {"User-Agent": "..."}
response = requests.get("http://books.toscrape.com/", headers=headers)
if response.ok:
    print(response.text)
else:
    print("Request failed")
```

Note: the headers dict must be passed as the keyword argument `headers=`; passed positionally it would be treated as `params`.
## Basic HTML

```html
<h1> <p> <b> <i> <u> <strong> <br>
<img src=""> <a href="">
<div> <span>
<ol> <ul> <li>
<table> <thead> <tbody> <tr> <td>
```

The `class` attribute groups elements, which is what most selectors key on.
## BeautifulSoup

See [[python语法(上)#Python 爬虫(bs4模块)]] for details.
```python
from bs4 import BeautifulSoup
import requests

content = requests.get("http://books.toscrape.com/").text
soup = BeautifulSoup(content, "html.parser")
all_prices = soup.findAll("p", attrs={"class": "price_color"})
for price in all_prices:
    # the '£' mis-decodes into two characters, so slice both off
    print(price.string[2:])
```
## `re`: regular-expression matching

See the notes: [[正则表达式]]
```python
import re

result = re.findall("a", "adadsads")    # list of every match
result = re.findall(r"\d+", "2026")
result = re.finditer("a", "adadsads")   # iterator of Match objects
for item in result:
    print(item.group())
result = re.search("a", "adadsads")     # first match anywhere, or None
result = re.match("d", "adadsads")      # matches only at the start, so None here
print(result)

s = """
<div class='西游记'><span id="10086">中国移动</span></div>
<div class='西游记'><span id="10010">中国联通</span></div>
"""
obj = re.compile(r"""<span id="(?P<id>\d+)">(?P<name>.*?)</span>""")
result = obj.finditer(s)
for item in result:
    id = item.group("id")      # named group "id"
    print(id)
    name = item.group("name")  # named group "name"
    print(name)
```
Passing `re.S` as the second argument to `re.compile()` makes `.` match newlines as well, so a pattern can span multiple lines.
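For example, `.` normally stops at a newline, so a pattern spanning lines only matches once `re.S` is supplied:

```python
import re

s = "<div>\nhello\n</div>"

# without re.S, '.' does not match '\n', so nothing is found
print(re.findall(r"<div>(.*?)</div>", s))        # []

# with re.S (a.k.a. re.DOTALL), '.' matches newlines too
print(re.findall(r"<div>(.*?)</div>", s, re.S))  # ['\nhello\n']
```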
## XPath: locating nodes by tag path

### XML

```python
from lxml import etree

xml = """
<sites>
    <site>
        <name class="菜鸟" id="2026">RUNOOB</name>
        <url>www.runoob.com</url>
    </site>
    <wangzhan>
        <name>Google</name>
        <url>www.google.com</url>
    </wangzhan>
    <site>
        <name>Facebook</name>
        <url>www.facebook.com</url>
    </site>
    <url>www.example.com</url>
</sites>
"""
et = etree.XML(xml)
result = et.xpath("/sites")                   # the root element
result = et.xpath("/sites/site")              # direct <site> children
result = et.xpath("/sites//url")              # every descendant <url>
result = et.xpath("/sites/*/url")             # <url> under any direct child
result = et.xpath("/sites/*/name/text()")[0]  # text content of a node
result = et.xpath("/sites/site/name[@class='菜鸟']")  # filter by attribute value
result = et.xpath("/sites/site/name/@id")     # extract an attribute value
print(result)
```
### HTML

```python
from lxml import etree

html = """
<html>
<head>
    <meta charset="UTF-8" />
    <title>Titles</title>
</head>
<body>
    <ul>
        <li><a href="http://www.google.com">谷歌</a></li>
        <li><a href="http://sogou.com">搜狗</a></li>
    </ul>
    <ol>
        <li><a href="feiji">飞机</a></li>
        <li><a href="dapao">大炮</a></li>
        <li><a href="huoche">火车</a></li>
    </ol>
    <div class="job">李嘉诚</div>
    <div class="common">胡辣汤</div>
</body>
</html>
"""
et = etree.HTML(html)
li_list = et.xpath("//li")            # every <li>, however deep
for li in li_list:
    href = li.xpath("./a/@href")[0]   # relative path from the current node
    text = li.xpath("./a/text()")[0]
    print(href, text)
```
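Beyond exact matches like `[@class='...']`, XPath predicates also accept functions such as `contains()`, which helps when an element carries several classes. A small sketch (the second div and its text are made up for illustration):

```python
from lxml import etree

html = """
<div class="job">李嘉诚</div>
<div class="job top">首富</div>
<div class="common">胡辣汤</div>
"""
et = etree.HTML(html)

# an exact match only hits the div whose class is exactly "job"
print(et.xpath("//div[@class='job']/text()"))             # ['李嘉诚']

# contains() also hits the div whose class list includes "job"
print(et.xpath("//div[contains(@class, 'job')]/text()"))  # ['李嘉诚', '首富']
```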
## pyquery: CSS selectors

### Querying

```python
from pyquery import PyQuery

html = """
<ol>
    <li class="a" id="2026"><a href="https://www.example1.com">bs</a></li>
    <li class="b"><a href="https://www.example2.com">xpath</a></li>
    <li class="c"><a href="https://www.example3.com">pyquery</a></li>
</ol>
"""
p = PyQuery(html)
li = p("ol")("li")        # chained selection
li = p("ol li a")         # nested CSS selector
li = p("a").attr("href")  # attribute of the first match
li = p("a").text()        # text of all matches, joined

for it in p("a").items(): # iterate over matches one by one
    li = it.attr("href")
    lii = it.text()
    print(li)
    print(lii)
```
### Modification

Purpose: edit the HTML to normalize the page structure, which simplifies later scraping.
```python
from pyquery import PyQuery

html = """
<ol>
    <li class="a" id="2025"><a href="https://www.example1.com">bs</a></li>
    <li class="b"><a href="https://www.example2.com">xpath</a></li>
    <li class="c"><a href="https://www.example3.com">pyquery</a></li>
</ol>
"""
p = PyQuery(html)
# insert a new sibling after li.a
p("li.a").after("""<li class="d"><a href="https://www.example4.com">rerere</a></li>\n""")
# append a child inside li.a
p("li.a").append("""<span>新手专用</span>""")
p("li.a").attr("id", "2026")         # set/overwrite an attribute
p("li.b")("a").attr("type", "text")  # add an attribute
p("li.d")("a").remove_attr("href")   # delete an attribute
p("li.d")("a").remove()              # delete the element
print(p)
```
## Comparing the four libraries

| Library / tool | Core role | Syntax style | Best suited for | Speed |
| --- | --- | --- | --- | --- |
| BeautifulSoup | Friendly HTML parser | Tag names + attributes (Pythonic) | Rapid development, malformed HTML, learning/teaching | ⭐⭐ (slower) |
| `re` (regex) | Plain-text pattern matching | Regex metacharacters | Extracting simple text, cleaning data, pulling strings out of JS/text | ⭐⭐⭐ (fast) |
| XPath | Node-path query language | Path expressions + functions (`//`, `@`, `contains`) | Complex structure, precise parent/child/sibling targeting, dynamic attributes | ⭐⭐⭐ (fastest) |
| pyquery | jQuery-style parser | CSS selectors (familiar to front-end devs) | Front-end developers, chained calls, concise code | ⭐⭐⭐ (fast) |
## Sessions, anti-hotlinking, proxies

```python
import requests

url = "https://www.baidu.com"
proxy = {
    "http": "http://82.156.171.147:3128",
    "https": "https://82.156.171.147:3128",
}
resp = requests.get(url, proxies=proxy)  # route the request through the proxy
resp.encoding = "utf-8"
print(resp.text)
```
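The snippet above only shows proxies; a minimal sketch of the other two topics in the heading, using a hypothetical target URL. A `Session` stores cookies from earlier responses and replays them on later requests, and setting a `Referer` header defeats simple anti-hotlink checks that verify where the request claims to come from. `prepare_request` is used here only to inspect the merged headers without hitting the network:

```python
import requests

session = requests.Session()
# headers set on the session are sent with every request made through it
session.headers.update({
    "User-Agent": "Mozilla/5.0",
    # anti-hotlink checks compare Referer against the page the resource
    # is normally embedded in, so we claim to come from that page
    "Referer": "https://www.example.com/",
})

prepared = session.prepare_request(
    requests.Request("GET", "https://www.example.com/video.mp4")
)
print(prepared.headers["Referer"])  # https://www.example.com/
```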