Web Crawler Basics | 司南's Blog

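Python Web Crawlers

1. Task Overview

2. A First Look at Crawlers

[Diagram: how a web crawler (web spider) works]

[Diagram: how a search engine works]

3. Basic Workflow

3.1 Preparation

3.1.1 Analyzing the Page

3.1.2 Coding Conventions

3.1.3 Importing Modules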

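Module: in most cases, a module is a file with a .py suffix.

A module can be thought of as a utility class: it lets code be shared and hides implementation details. Grouping related code in one module makes the code easier to use and understand, and lets the programmer concentrate on the high-level logic.

A module can define functions, classes, and variables, and can also contain executable code. Modules come from three sources:

① modules built into Python (the standard library);

② third-party modules;

③ user-defined modules.

Package: to avoid module name clashes, Python organizes modules by directory; such a directory is called a package. A package is a folder that contains Python modules.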
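The difference between a module and a package can be seen in a small sketch. The package name demo_pkg, the module greet, and the function hello are made up for illustration; the sketch builds them in a temporary folder and imports them next to a standard-library module.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: a folder that contains Python modules.
# "demo_pkg", "greet" and "hello" are hypothetical names used only here.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")          # marks the folder as a package
(pkg_dir / "greet.py").write_text(
    "def hello(name):\n"
    "    return 'hello, ' + name\n"
)

sys.path.insert(0, str(root))    # make the temporary folder importable
importlib.invalidate_caches()

from demo_pkg import greet       # a user-defined module inside a package
import json                      # a standard-library module

message = greet.hello("crawler")
print(message)
print(greet.__name__)
print(json.dumps({"ok": True}))
```

The dotted name demo_pkg.greet is what keeps two modules called greet in different packages from clashing.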

3.2 Fetching Data

Supplement: the urllib module

The most basic way to make requests

urllib is an HTTP request library built into Python, so nothing extra has to be installed. You only need to think about the URL and the parameters of the request, and the library also provides powerful parsing tools.

urllib.request: request module

urllib.error: exception-handling module

urllib.parse: parsing module

Usage

A simple GET request

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

A simple POST request

import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'hello': 'world'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

Handling timeouts

import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

import socket
import urllib.error
import urllib.request
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # check the cause of the error
        print('time out!')

Printing the response type, status code, and response headers

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)  # status code: tells whether the request succeeded
print(response.getheaders())  # response headers: a list of tuples
print(response.getheader('Server'))  # one specific response header
print(response.read().decode('utf-8'))  # response body: a byte stream that must be decoded as UTF-8

urlopen on its own gives no way to attach extra parameters such as headers, so we need a way around this

We declare a Request object and attach the parameters to it

import urllib.request
request = urllib.request.Request('https://python.org')  # urlopen cannot take the extra parameters, so declare a Request object
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

We can also build strings, dictionaries, and so on separately and pass them into the Request object

from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Host': 'httpbin.org'
}
form_data = {
    'name': 'jay'
}
data = bytes(parse.urlencode(form_data), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

We can also keep adding headers to an existing Request object with the add_header method

from urllib import request, parse
url = 'http://httpbin.org/post'
form_data = {
    'name': 'cq'
}
data = bytes(parse.urlencode(form_data), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('user-agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Printing cookie information

In the program below, the declared cookie object is filled in automatically once the response has been received

import http.cookiejar
import urllib.request
# Print every cookie set by the server
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')  # the cookie jar is filled in automatically once the response arrives
for item in cookie:
    print(item.name + '=' + item.value)

Saving cookies to a file (two formats)

import http.cookiejar
import urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')  # the cookie jar is filled in automatically once the response arrives
cookie.save(ignore_discard=True, ignore_expires=True)  # write cookie.txt

import http.cookiejar
import urllib.request
filename = 'cookie2.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')  # the cookie jar is filled in automatically once the response arrives
cookie.save(ignore_discard=True, ignore_expires=True)  # write cookie2.txt

Using the text file to maintain a login state

import http.cookiejar
import urllib.request
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

For exception handling, there are two classes to know, URLError and HTTPError; URLError is the parent class and HTTPError is its subclass.

# Parent class: has only a reason attribute
from urllib import request, error
try:
    response = request.urlopen('http://www.bai.com/index.html')
except error.URLError as e:
    print(e.reason)

# Subclass: has more attributes
from urllib import request, error
try:
    response = request.urlopen('http://abc.123/index.html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:  # catch the parent class last, e.g. when the host cannot be reached at all
    print(e.reason)

Parsing: breaking a URL into its parts

from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result)  # scheme, netloc, path, params, query, fragment
print(type(result))

from urllib.parse import urlparse
# The URL has no scheme, so the scheme argument is used as the default
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

from urllib.parse import urlparse
# The URL already has a scheme, so the scheme argument is ignored
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)  # the fragment is merged into the query
print(result)

from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)  # with no query, the fragment is merged into the path
print(result)

from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

Joining URLs

from urllib.parse import urljoin
# Join two URLs
# The second URL takes precedence: the parts it has are kept, the parts it lacks are filled in from the first
print(urljoin('http://www.baidu.com', 'HAA.HTML'))
print(urljoin('https://wwww.baidu.com', 'https://www.baidu.com/index.html;question=2'))

# Turn a dictionary directly into URL parameters
from urllib.parse import urlencode
params = {
    'name': 'germey',
    'age': '122'
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

For more details, see: https://www.cnblogs.com/qikeyishu/p/10748497.html

3.3 Parsing Content

3.3.1 Parsing Tags

Supplement: the BeautifulSoup module

Reference: http://www.jsphp.net/python/show-24-214-1.html

1. Introduction to BeautifulSoup4

Like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise to parse HTML/XML and extract data from it.

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, Python's built-in parser is used. The lxml parser is more powerful and faster, so it is the recommended choice.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it automatically; in that case you only need to state the original encoding.

2. The main parsers for BeautifulSoup4 and their pros and cons:

3. Basic usage of BeautifulSoup4

Suppose we have an HTML file with the following content:

<!DOCTYPE html>
<html>
<head>
    <meta content="text/html;charset=utf-8" http-equiv="content-type" />
    <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
    <meta content="always" name="referrer" />
    <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
    <title>百度一下，你就知道</title>
</head>
<body link="#0000cc">
  <div id="wrapper">
    <div id="head">
      <div class="head_wrapper">
        <div id="u1">
          <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>
          <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
          <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a>
          <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a>
          <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a>
          <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a>
        </div>
      </div>
    </div>
  </div>
</body>
</html>

Creating the BeautifulSoup4 object:

from bs4 import BeautifulSoup
file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
# Pretty-print the document with indentation
print(bs.prettify())
# Everything in the title tag
print(bs.title)
# The name of the title tag
print(bs.title.name)
# The text content of the title tag
print(bs.title.string)
# Everything in the head tag
print(bs.head)
# Everything in the first div tag
print(bs.div)
# The value of the id attribute of the first div tag
print(bs.div["id"])
# Everything in the first a tag
print(bs.a)
# Everything in all of the a tags
print(bs.find_all("a"))

# The tag with id="u1"
print(bs.find(id="u1"))
# Get all a tags and print the href value of each
for item in bs.find_all("a"):
    print(item.get("href"))
# Get all a tags and print the text of each
for item in bs.find_all("a"):
    print(item.get_text())

4. The Four Kinds of Objects in BeautifulSoup4

BeautifulSoup4 turns a complex HTML document into a complex tree structure in which every node is a Python object. All objects fall into four kinds:

Tag

NavigableString

BeautifulSoup

Comment

4.1 Tag

Put simply, a Tag is one of the tags in the HTML, for example the title, head, and a tags of the sample document.

We can get the contents of these tags simply by writing the soup object followed by the tag name; the resulting objects have type bs4.element.Tag. Note, however, that this finds only the first matching tag in the whole document.

from bs4 import BeautifulSoup
file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
# Everything in the title tag
print(bs.title)
# Everything in the head tag
print(bs.head)
# Everything in the first a tag
print(bs.a)
# Its type
print(type(bs.a))

A Tag has two important attributes, name and attrs:

from bs4 import BeautifulSoup
file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
# [document]: the bs object itself is a special case, its name is [document]
print(bs.name)
# head: for the other tags, the value printed is the name of the tag itself
print(bs.head.name)
# Prints all attributes of the a tag; the result is a dictionary
print(bs.a.attrs)
# The get method can also be used with the attribute name; the two forms are equivalent
print(bs.a['class'])  # same as bs.a.get('class')
# Attributes and content can be modified
bs.a['class'] = "newClass"
print(bs.a)
# An attribute can also be deleted
del bs.a['class']
print(bs.a)

4.2 NavigableString

Now that we have the tag itself, how do we get the text inside it? Simply use .string, for example:

from bs4 import BeautifulSoup
file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
print(bs.title.string)
print(type(bs.title.string))

4.3 BeautifulSoup

A BeautifulSoup object represents the content of a whole document. Most of the time it can be treated as a Tag object; it is a special kind of Tag. We can get its type, name, and attributes, for example:

from bs4 import BeautifulSoup
file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
print(type(bs.name))
print(bs.name)
print(bs.attrs)

4.4 Comment

A Comment object is a special type of NavigableString; its output does not include the comment markers.

from bs4 import BeautifulSoup
file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
print(bs.a)
# For this example the a tag must contain no spaces or line breaks and must look like this:
# <a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--新闻--></a>
print(bs.a.string)  # 新闻
print(type(bs.a.string))  # <class 'bs4.element.Comment'>

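5. Traversing the Document Tree

5.1 .contents: gets all child nodes of a Tag, returned as a list

5.2 .children: gets all child nodes of a Tag, returned as a generator

5.3 .descendants: gets all descendant nodes of a Tag

5.4 .strings: if a Tag contains several strings, that is, its descendants have text, this yields them so they can be iterated over

5.5 .stripped_strings: used like .strings, but strips out the surplus whitespace

5.6 .parent: gets the parent node of a Tag

5.7 .parents: walks up recursively through all ancestor nodes, returned as a generator

5.8 .previous_sibling: gets the node just before the current Tag. It is usually a string or whitespace; in practice the result is often the punctuation and line break that sit between the current tag and the previous one

5.9 .next_sibling: gets the node just after the current Tag. It is usually a string or whitespace; in practice the result is often the punctuation and line break that sit between the current tag and the next one

5.10 .previous_siblings: gets all sibling nodes before the current Tag, returned as a generator

5.11 .next_siblings: gets all sibling nodes after the current Tag, returned as a generator

5.12 .previous_element: gets the object (string or tag) that was parsed just before this one; it may be the same as previous_sibling but usually is not

5.13 .next_element: gets the object (string or tag) that was parsed just after this one; it may be the same as next_sibling but usually is not

5.14 .previous_elements: returns a generator for walking backwards through the parsed content of the document

5.15 .next_elements: returns a generator for walking forwards through the parsed content of the document

5.16 .has_attr: checks whether a Tag has a given attribute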
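As a rough illustration of a few of the attributes listed above, the sketch below parses a small made-up snippet. The markup, the link text, and the variable names are assumptions chosen for the example, and the tags are written without whitespace between them so that every sibling is a tag rather than a blank string.

```python
from bs4 import BeautifulSoup

# A small made-up document; ids, classes and text are illustrative only.
html = (
    '<div id="u1">'
    '<a class="mnav" href="http://news.baidu.com">news</a>'
    '<a class="mnav" href="http://map.baidu.com">map</a>'
    '<a class="bri" href="//www.baidu.com/more/">more</a>'
    '</div>'
)
bs = BeautifulSoup(html, "html.parser")
div = bs.find(id="u1")

child_count = len(div.contents)                    # .contents is a list
child_names = [c.name for c in div.children]       # .children is iterated lazily
texts = list(div.stripped_strings)                 # text of all descendants
first = div.a
parent_id = first.parent["id"]                     # .parent goes one level up
next_text = first.next_sibling.get_text()          # the next node on the same level
later_classes = [s["class"][0] for s in first.next_siblings]
has_href = first.has_attr("href")

print(child_count, child_names, texts)
print(parent_id, next_text, later_classes, has_href)
```

With real pages the siblings of a tag are often whitespace strings, so code that walks .next_sibling usually has to skip over them.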

# 5.1: the .contents attribute of a tag returns its child nodes as a list
print(bs.head.contents)
# Use a list index to get one particular element
print(bs.head.contents[1])

# 5.2: .children yields the child nodes one at a time
for child in bs.body.children:
    print(child)

6. Searching the Document Tree

6.1 find_all(name, attrs, recursive, text, **kwargs)

The examples above used find_all briefly; here are more of its uses, namely filters. Filters run through the whole search API and can be applied to a tag's name, a node's attributes, and so on.

(1) The name parameter:

String filter: finds content that matches the string exactly

a_list = bs.find_all("a")
print(a_list)

Regular-expression filter: if a regular expression is passed in, BeautifulSoup4 matches with search()

from bs4 import BeautifulSoup
import re
file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
t_list = bs.find_all(re.compile("a"))
for item in t_list:
    print(item)

List: if a list is passed in, BeautifulSoup4 returns the nodes that match any element of the list

t_list = bs.find_all(["meta", "link"])
for item in t_list:
    print(item)

Function: pass in a function, and matching is done according to it

from bs4 import BeautifulSoup
file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
def name_is_exists(tag):
    return tag.has_attr("name")
t_list = bs.find_all(name_is_exists)
for item in t_list:
    print(item)

(2) The kwargs parameter:

from bs4 import BeautifulSoup
import re
file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
# Find the tag with id=head
t_list = bs.find_all(id="head")
print(t_list)
# Find tags whose href attribute contains http://news.baidu.com
t_list = bs.find_all(href=re.compile("http://news.baidu.com"))
print(t_list)
# Find all tags that have a class attribute (class is a Python keyword, so an underscore is appended)
t_list = bs.find_all(class_=True)
for item in t_list:
    print(item)

(3) The attrs parameter:

Not every attribute can be searched in the way shown above, for example HTML data-* attributes:

t_list = bs.find_all(data-foo="value")

Running this code raises an error. Instead, use the attrs parameter and pass a dictionary to search for tags with such special attributes:

t_list = bs.find_all(attrs={"data-foo": "value"})
for item in t_list:
    print(item)

(4) The text parameter:

The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, or a list

from bs4 import BeautifulSoup
import re
file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
t_list = bs.find_all(attrs={"data-foo": "value"})
for item in t_list:
    print(item)
t_list = bs.find_all(text="hao123")
for item in t_list:
    print(item)
t_list = bs.find_all(text=["hao123", "地图", "贴吧"])
for item in t_list:
    print(item)
t_list = bs.find_all(text=re.compile(r"\d"))
for item in t_list:
    print(item)

When searching the text for something more specific, a function can be passed in as well:

# A function used as the text filter: keep strings whose length is 2
def length_is_two(text):
    return text and len(text) == 2
t_list = bs.find_all(text=length_is_two)
for item in t_list:
    print(item)

(5) The limit parameter:

A limit parameter can be passed to restrict the number of results. If the search would find 5 items and limit=2 is set, only the first 2 are returned

from bs4 import BeautifulSoup
import re
file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
t_list = bs.find_all("a", limit=2)
for item in t_list:
    print(item)

Besides the regular forms above, find_all also has some shorthand forms:

# The two forms in each pair are equivalent
# t_list = bs.find_all("a") => t_list = bs("a")
t_list = bs("a")
# t_list = bs.a.find_all(text="新闻") => t_list = bs.a(text="新闻")
t_list = bs.a(text="新闻")

6.2 find()

find() returns the first Tag that matches. When only one Tag is needed, find() is the method to use. The same can be done by calling find_all() with limit=1 and taking the first element, but that is more cumbersome.

from bs4 import BeautifulSoup
import re
file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
# Returns a list with a single result
t_list = bs.find_all("title", limit=1)
print(t_list)
# Returns the single value
t = bs.find("title")
print(t)
# Returns None if nothing is found
t = bs.find("abc")
print(t)

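As the results show, find_all still returns a list even when limit=1 is passed, so when only one value is needed it is far less convenient than find. Note, though, that find returns None when nothing matches.

When introducing BeautifulSoup4 above, we saw that bs.div gets the first div tag. To get the first div inside the first div, we can write:

t = bs.div.div
# which is equivalent to
t = bs.find("div").find("div")

7. CSS Selectors

BeautifulSoup supports most CSS selectors. Pass a string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:

7.1 Searching by tag name

print(bs.select('title'))
print(bs.select('a'))

7.2 Searching by class name

print(bs.select('.mnav'))

7.3 Searching by id

print(bs.select('#u1'))

7.4 Combined search

print(bs.select('div .bri'))

7.5 Searching by attribute

print(bs.select('a[class="bri"]'))
print(bs.select('a[href="http://tieba.baidu.com"]'))

7.6 Searching for direct child tags

t_list = bs.select("head > title")
print(t_list)

7.7 Searching for sibling tags

t_list = bs.select(".mnav ~ .bri")
print(t_list)

7.8 Getting the content

t_list = bs.select("title")
print(bs.select('title')[0].get_text())

3.3.2 Extracting with Regular Expressions

Supplement: the re module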

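A regular expression can carry optional flags that control how matching is done. A flag is given as an optional argument.

Several flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and the M flag.

Flag: Description

re.I: makes matching case-insensitive

re.L: makes matching locale-aware

re.M: multi-line matching; affects ^ and $

re.S: makes . match any character, including a newline

re.U: interprets characters according to the Unicode character set; affects \w, \W, \b, \B

re.X: allows a more flexible layout so that the regular expression is easier to read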
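A minimal sketch of combining two flags with |, assuming a made-up two-line string; the variable names are chosen for the example only.

```python
import re

text = "Title: Python\ntitle: crawler"

# No flags: ^ matches only at the very start of the string, and case matters.
plain = re.findall(r"^title: (\w+)", text)

# re.I ignores case; re.M lets ^ match at the start of every line.
flagged = re.findall(r"^title: (\w+)", text, re.I | re.M)

print(plain)
print(flagged)
```

The first call finds nothing because the string starts with a capital T; with both flags set, each line contributes one match.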

Functions in the re module

compile(pattern): creates a pattern object

import re
pat = re.compile("A")
m = pat.search("CBA")
# equivalent to re.search("A", "CBA")
print(m)
# <re.Match object; span=(2, 3), match='A'> means a match was found
m = pat.search("CBD")
print(m)  # None means no match

search(pattern, string): looks for the pattern anywhere in the string

import re
m = re.search("asd", "ASDasd")
print(m)
# <re.Match object; span=(3, 6), match='asd'>: a match was found and a match object (truthy) is returned
m = re.search("asd", "ASDASD")
print(m)  # no match, returns None (falsy)

match(pattern, string): matches the pattern at the start of the string

# With a compiled pattern, equivalent to calling re.match("a", ...) directly
pat = re.compile("a")
print(pat.match("Aasd"))
# prints None
print(pat.match("aASD"))
# prints <re.Match object; span=(0, 1), match='a'>

# The return values of the functions above can be tested directly in an if statement:
if pat.search("asd"):
    print("OK")  # prints OK: a match was found
if re.search("a", "ASD"):
    print("OK")  # prints nothing: no match

split(pattern, string): splits the string by the pattern and returns a list

>>> re.split(',', 'a,s,d,asd')
['a', 's', 'd', 'asd']  # returns a list
>>> pat = re.compile(',')
>>> pat.split('a,s,d,asd')
['a', 's', 'd', 'asd']  # returns a list
>>> re.split('[, ]+', 'a , s ,d ,,,,,asd')  # the regex [, ]+ is explained later
['a', 's', 'd', 'asd']
>>> re.split('[, ]+', 'a , s ,d ,,,,,asd', maxsplit=2)  # maxsplit: maximum number of splits
['a', 's', 'd ,,,,,asd']
>>> pat = re.compile('[, ]+')  # the regex [, ]+ is explained later
>>> pat.split('a , s ,d ,,,,,asd', maxsplit=2)  # maxsplit: maximum number of splits
['a', 's', 'd ,,,,,asd']

findall(pattern, string): returns the matches as a list

import re
print(re.findall("a", "ASDaDFGAa"))
# ['a', 'a']: the matched strings are returned as a list
pat = re.compile("a")
print(pat.findall("ASDaDFGAa"))
# ['a', 'a']: the matched strings are returned as a list
pat = re.compile("[A-Z]+")  # the regex [A-Z]+ is explained later
print(pat.findall("ASDcDFGAa"))
# ['ASD', 'DFGA']: the matched strings

>>> pat = re.compile('[A-Z]')  # [A-Z] without + matches one uppercase letter at a time
>>> pat.findall('ASDcDFGAa')
['A', 'S', 'D', 'D', 'F', 'G', 'A']
>>> pat = re.compile('[A-Za-z]')  # [A-Za-z] matches one letter at a time
>>> pat.findall('ASDcDFGAa')
['A', 'S', 'D', 'c', 'D', 'F', 'G', 'A', 'a']

sub(pat, repl, string): replaces what pat matches with repl

(Mnemonic: repl, the text that is kept, is the middle argument.)

>>> re.sub('a', 'A', 'abcasd')  # replace every a with A; see below for use together with groups
'AbcAsd'
>>> pat = re.compile('a')
>>> pat.sub('A', 'abcasd')
'AbcAsd'

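>>> pat = re.compile(r'www\.(.*)\..{3}')  # the regular expression

# Putting r in front of a Python string tells the interpreter that it is a raw string, so the backslash '\' is not treated as an escape.
# For example, in a raw string \n is two characters, \ and n, rather than a newline.
# Because regular expressions and \ can conflict, it is best to put r in front of any string that holds a regular expression.

# Like most programming languages, regular expressions use "\" as the escape character, which can lead to a pile-up of backslashes.
# To match a literal "\" in text, a regular expression written as an ordinary string needs four backslashes, "\\\\":
# the first pair and the second pair are each turned into one backslash by the language, and the resulting two backslashes are then read by the regex engine as one escaped backslash.
# Python's raw strings solve this neatly: the regular expression in this example can be written r"\\".
# Likewise, "\\d", which matches a digit, can be written r"\d".
# With raw strings you no longer need to worry about a missing backslash, and the expressions are easier to read.

# This does not mean that r switches off escaping inside the regular expression itself. Simply remember one rule:
# when a string holds a regular expression, put r in front of it.

>>> pat.match('www.dxy.com').group(1)
'dxy'
>>> re.sub(r'www\.(.*)\..{3}', r'\1', 'hello,www.dxy.com')
'hello,dxy'
>>> pat.sub(r'\1', 'hello,www.dxy.com')
'hello,dxy'
# r'\1' means group 1
# The regex finds "www.dxy.com", and the whole match is replaced with the string captured by group 1.

>>> pat = re.compile(r'(\w+) (\w+)')  # the regular expression
>>> s = 'hello world! hello hz!'
>>> pat.findall('hello world! hello hz!')
[('hello', 'world'), ('hello', 'hz')]
>>> pat.sub(r'\2 \1', s)  # the regex captures group 1 (hello) and group 2 (world), and sub swaps their positions
'world hello! hz hello!'

escape(string): escapes the special characters in a string

>>> re.escape('www.dxy.cn')
'www\\.dxy\\.cn'  # escaped; printed as www\.dxy\.cn

Of the functions above, only match and search return an object that has a group method; the others do not.

Methods of the match object

group: gets the text matched by a subpattern (group)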

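start: the start position of the text matched by the given group

end: the end position of the text matched by the given group

span: the start and end positions of the text matched by the given group

>>> pat = re.compile(r'www\.(.*)\.(.*)')  # each () is one group; there are 2 groups
>>> m = pat.match('www.dxy.com')
>>> m.group()  # defaults to 0, which is the whole match
'www.dxy.com'
>>> m.group(1)  # the substring matched by group 1
'dxy'
>>> m.group(2)
'com'
>>> m.start(2)  # index where group 2 starts
8
>>> m.end(2)  # index where group 2 ends
11
>>> m.span(2)  # start and end index of group 2
(8, 11)

Regular Expressions

Metacharacters

".": wildcard, any single character except a newline

>>> pat = re.compile('.')
>>> pat.match('abc')
<re.Match object; span=(0, 1), match='a'>
>>> pat.match('abc').group()
'a'  # matched the first character
>>> pat.search('abc').group()
'a'
>>> pat.match('\n').group()  # a newline does not match, so this fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

"\": escape character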

>>> pat = re.compile(r'\.')
>>> pat.search('abc.efg').group()  # matches the literal .
'.'
>>> pat.findall('abc.efg')  # no group() needed, returns a list
['.']

"[…]": character set, matches any one of the characters inside

>>> pat = re.compile('[abc]')
>>> pat.match('axbycz').group()
'a'
>>> pat.search('axbycz').group()
'a'
>>> pat.findall('axbycz')
['a', 'b', 'c']

"\d": a digit

>>> pat = re.compile(r'\d')
>>> pat.search('ax1by2cz3').group()  # returns the first digit found: 1
'1'
>>> pat.match('ax1by2cz3').group()  # match only looks at the start of the string; the first character is not a digit, so match returns None and .group() fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> pat.findall('ax1by2cz3')  # all digits, returned as a list
['1', '2', '3']

"\D": a non-digit

>>> pat = re.compile(r'\D')
>>> pat.match('ax1by2cz3').group()
'a'
>>> pat.search('ax1by2cz3').group()
'a'
>>> pat.findall('ax1by2cz3')
['a', 'x', 'b', 'y', 'c', 'z']
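The character classes above can be combined to pick simple values out of text. The strings below are made up for this sketch and only the metacharacters introduced so far are used.

```python
import re

date = "2020-01-15"                     # hypothetical sample values
rating_line = "Rating: 9.7 / 10"

digits = "".join(re.findall(r"\d", date))              # keep only the digits
separators = re.findall(r"\D", date)                   # everything that is not a digit
score = re.search(r"\d\.\d", rating_line).group()      # digit, literal dot, digit
vowels = re.findall(r"[aeiou]", "crawler")             # any one character from the set

print(digits, separators, score, vowels)
```

Note the escaped dot in the score pattern: an unescaped . would match any character between the two digits.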
"\s": a whitespace character: \t, \r, \n, or a space

>>> pat = re.compile(r'\s')
>>> pat.findall('\rax1 \nby2 \tcz3')
['\r', ' ', '\n', ' ', '\t']
>>> pat.search('\rax1 \nby2 \tcz3').group()
'\r'
>>> pat.match('\rax1 \nby2 \tcz3').group()
'\r'

"\S": a non-whitespace character

>>> pat = re.compile(r'\S')
>>> pat.search('\rax1 \nby2 \tcz3').group()
'a'
>>> pat.findall('\rax1 \nby2 \tcz3')
['a', 'x', '1', 'b', 'y', '2', 'c', 'z', '3']

"\w": a single word character: a letter, a digit, or an underscore, [A-Za-z0-9_]

>>> pat = re.compile(r'\w')
>>> pat.search('1a2b3c').group()
'1'
>>> pat.findall('1a2b3c')
['1', 'a', '2', 'b', '3', 'c']
>>> pat.match('1a2b3c').group()
'1'

"\W": a non-word character, anything other than letters, digits, and the underscore

>>> pat = re.compile(r'\W')
>>> pat.findall('1a2我b3c')  # in Python 3, \w also matches Chinese characters, so nothing here counts as a non-word character
[]
>>> pat.findall('1a2 b3,c')
[' ', ',']
>>> pat.search('1a2 b3,c').group()
' '

Quantifiers

"*": 0 or more times

(Mnemonic: multiplying by 0 gives 0, so * allows zero occurrences.)

>>> pat = re.compile('[abc]*')
>>> pat.match('abcabcdefabc').group()
'abcabc'  # abc twice
>>> pat.search('abcabcdefabc').group()
'abcabc'  # abc twice
>>> pat.findall('abcabcdefabc')
['abcabc', '', '', '', 'abc', '']  # abc twice and abc once; because 0 times is allowed, empty strings match as well

"+": 1 or more times

>>> pat = re.compile('[abc]+')
>>> pat.match('abcdefabcabc').group()
'abc'
>>> pat.search('abcdefabcabc').group()
'abc'
>>> pat.findall('abcdefabcabc')
['abc', 'abcabc']

"?": 0 or 1 time. With it, match and search do not return None here; they return '' instead (because 0 times also counts as a match)

The 0 or 1 time applies to any single character from the set [xxx], not to the set as a whole

>>> pat = re.compile('[abc]?')
>>> pat.match('defabc').group()  # 0 times
''
>>> pat.match('abcdefabc').group()
'a'
>>> pat.search('defabc').group()  # 0 times
''
>>> pat.findall('defabc')  # 0 times and 1 time
['', '', '', 'a', 'b', 'c', '']  # an empty match is always added at the end

"quantifier?": non-greedy mode, which matches as little as possible; the default is greedy mode, which matches as much as possible

>>> pat = re.compile('[abc]+')  # greedy mode
>>> pat.match('abcdefabcabc').group()  # matches as much as possible: abc
'abc'
>>> pat.match('bbabcdefabcabc').group()
'bbabc'
>>> pat.search('dbbabcdefabcabc').group()
'bbabc'
>>> pat.findall('abcdefabcabc')
['abc', 'abcabc']

>>> pat = re.compile('[abc]+?')  # non-greedy mode: +?
>>> pat.match('abcdefabcabc').group()  # matches as little as possible: a, b, c
'a'
>>> pat.search('dbbabcdefabcabc').group()
'b'
>>> pat.findall('abcdefabcabc')
['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']

"{m}": matches exactly m repetitions

>>> pat = re.compile('[op]{2}')  # {m} example: o or p, 2 times
>>> pat.search('abcooapp').group()  # the first occurrence; oo comes before pp
'oo'
>>> pat.findall('abcooapp')  # all occurrences, returned as a list
['oo', 'pp']

"{m,n}": matches m to n repetitions

>>> pat = re.compile('[op]{2,4}')  # o or p, 2 to 4 times
>>> pat.match('pppabcooapp').group()  # matches at the start
'ppp'
>>> pat.search('pppabcooapp').group()  # the first occurrence
'ppp'
>>> pat.findall('pppabcooapp')  # all occurrences
['ppp', 'oo', 'pp']

Boundaries

"^": matches the start of the string or of a line

>>> pat = re.compile('^[abc]')  # starts with one of a, b, c
>>> pat.search('defabc')  # returns None
>>> pat.match('defabc')  # returns None; neither call finds a match
>>> pat.findall('defabc')
[]
>>> pat.search('adefabc').group()
'a'
>>> pat.match('adefabc').group()  # starts with one of a, b, c
'a'
>>> pat.findall('adefabc')
['a']
>>> pat = re.compile('^[abc]+')  # starts with one or more of a, b, c; greedy: matches several
>>> pat.findall('cbadefab')
['cba']
>>> pat = re.compile(r'^[abc]+?')  # starts with one or more of a, b, c; non-greedy: matches one
>>> pat.findall('cbadefab')
['c']

"$": matches the end of the string or of a line

>>> pat = re.compile('[abc]$')
>>> pat.match('adefAbc')  # match anchors at the start of the string, so with a pattern ending in $ it returns None here
>>> pat.search('adefAbc').group()  # ends with one of a, b, c
'c'
>>> pat.findall('adefAbc')
['c']
>>> pat = re.compile('[abc]+$')
>>> pat.search('adefAbc').group()  # ends with one or more of a, b, c; greedy: matches several
'bc'
>>> pat.findall('adefAbc')
['bc']

"\A": matches the start of the string

>>> pat = re.compile(r'\A[abc]+')
>>> pat.findall('cbadefab')
['cba']
>>> pat.search('cbadefab').group()
'cba'

"\Z": matches the end of the string

>>> pat = re.compile(r'[abc]+\Z')
>>> pat.search('cbadefab').group()
'ab'
>>> pat.findall('cbadefab')
['ab']

Groups

(…): a group; groups are numbered from left to right, adding 1 for each ( encountered, and a quantifier may follow a group

>>> pat = re.compile(r'(a)\w(c)')  # \w: a single digit or letter
>>> pat.match('abcdef').group()
'abc'
>>> pat = re.compile('(a)b(c)')  # 2 groups, both unnamed
>>> pat.match('abcdef').group()  # by default returns the whole matched string
'abc'
>>> pat.match('abcdef').group(1)  # group 1; works with search as well
'a'
>>> pat.match('abcdef').group(2)  # group 2; works with search as well
'c'
>>> pat.match('abcdef').groups()  # all groups, returned as a tuple
('a', 'c')

"\<number>": refers to the string matched by the group with that number

"(?P<name>…)": a named group; groups are written with () in the pattern and are useful for extracting particular parts of the target string.

>>> pat = re.compile(r'www\.(.*)\..{3}')
>>> pat.match('www.dxy.com').group(1)
'dxy'

>>> pat = re.compile(r'(?P<K>a)\w(c)')  # 2 groups: one named, one unnamed
>>> pat.search('abcdef').groups()  # all groups, returned as a tuple
('a', 'c')
>>> pat.search('abcdef').group(1)  # group 1; works with match as well
'a'
>>> pat.search('abcdef').group(2)  # group 2; works with match as well
'c'
>>> pat.search('abcdef').group()  # by default returns the whole matched string
'abc'
>>> pat.search('abcdef').groupdict()  # only named groups appear in the dictionary; unnamed groups do not
{'K': 'a'}

"(?P=name)": refers to the string matched by the group named name

>>> pat = re.compile(r'(?P<K>a)\w(c)(?P=K)')  # (?P=K) refers to the value of group 1, which is a
>>> pat.search('abcdef').group()  # no match: the full pattern is a\wca, so the 4th character must be a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> pat.search('abcadef').group()  # matches: the 4th character equals group 1, whose value is a
'abca'
>>> pat.search('abcadef').groups()
('a', 'c')
>>> pat.search('abcadef').group(1)
'a'
>>> pat.search('abcadef').group(2)
'c'

"\<number>": refers to a group by its number:

>>> pat = re.compile(r'(?P<K>a)\w(c)(?P=K)\2')  # \2 refers to the value of group 2, which is c
>>> pat.findall('Aabcadef')  # no match: the full pattern is a\wcac, so the 5th character must be c
[]
>>> pat.findall('Aabcacdef')  # matches: the 5th character equals group 2, whose value is c
[('a', 'c')]
>>> pat.search('Aabcacdef').groups()
('a', 'c')
>>> pat.search('Aabcacdef').group()
'abcac'
>>> pat.search('Aabcacdef').group(1)
'a'
>>> pat.search('Aabcacdef').group(2)
'c'

Special Constructs

"(?=…)": succeeds if … matches what comes next, and returns what precedes it; the check always looks ahead, at the text that follows

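(?:…): the non-grouping version of (…); used with | or followed by a quantifier

(?iLmsux): each of the characters i, L, m, s, u, x stands for a matching mode; may only be used at the start of the regular expression, and several can be combined

(?#…): the text after # is treated as a comment

(?=…): the text that follows must match the expression for the match to succeed

(?!…): the text that follows must not match the expression for the match to succeed

(?<=…): the text that precedes must match the expression for the match to succeed

(?<!…): the text that precedes must not match the expression for the match to succeed

(?(id/name)yes|no): if the group with number id or name name matched, yes must match, otherwise no must match; no may be omitted

"(?:…)": a () that contains ?: is not a capturing group

>>> pat = re.compile(r'a(?:bc)')
>>> pat.findall('abc')
['abc']
>>> pat.match('abc').groups()  # no groups to show
()

>>> pat = re.compile(r'\w(?=\d)')  # lookahead for \d: returns the character just before a digit; \w is a word character
>>> pat.findall('abc1 def1 xyz1')
['c', 'f', 'z']
>>> pat.findall('zhoujy20130628hangzhou')  # the character before each digit, returned as a list
['y', '2', '0', '1', '3', '0', '6', '2']
>>> pat = re.compile(r'\w+(?=\d)')
>>> pat.findall('abc1,def1,xyz1')  # the string before the last digit, returned as a list
['abc', 'def', 'xyz']
>>> pat.findall('abc21,def31,xyz41')
['abc2', 'def3', 'xyz4']
>>> pat.findall('zhoujy20130628hangzhou')
['zhoujy2013062']
>>> pat = re.compile(r'[A-Za-z]+(?=\d)')  # [A-Za-z] matches letters; other regex forms would work too
>>> pat.findall('zhoujy20130628hangzhou123')  # strings that are followed by a digit, returned as a list
['zhoujy', 'hangzhou']
>>> pat.findall('abc21,def31,xyz41')
['abc', 'def', 'xyz']

"(?!…)": succeeds if … does not match what comes next, and returns what precedes it; the check looks ahead

>>> pat = re.compile(r'[A-Za-z]+(?!\d)')  # [A-Za-z] matches letters; other regex forms would work too
>>> pat.findall('zhoujy20130628hangzhou123,12,binjiang310')  # strings that are not followed by a digit, returned as a list
['zhouj', 'hangzho', 'binjian']
>>> pat.findall('abc21,def31,xyz41')
['ab', 'de', 'xy']

"(?<=…)": succeeds if … matches what comes before, and returns what follows it; the check always looks behind, at the preceding text

>>> pat = re.compile(r'(?<=\d)[A-Za-z]+')  # letters preceded by a digit
>>> pat.findall('abc21,def31,xyz41')
[]
>>> pat.findall('1abc21,2def31,3xyz41')
['abc', 'def', 'xyz']
>>> pat.findall('zhoujy20130628hangzhou123,12,binjiang310')
['hangzhou']

"(?<!…)": succeeds if … does not match what comes before, and returns what follows it; the check always looks behind, at the preceding text

>>> pat = re.compile(r'(?<!\d)[A-Za-z]+')  # letters not preceded by a digit
>>> pat.findall('abc21,def31,xyz41')
['abc', 'def', 'xyz']
>>> pat.findall('zhoujy20130628hangzhou123,12,binjiang310')
['zhoujy', 'angzhou', 'binjiang']

"(?(id/name)yes|no)": tests whether a group matched; if it did, yes is used for the rest of the match

>>> pat = re.compile(r'a(\d)?bc(?(1)\d)')  # no is omitted; the full form is a\dbc\d, for example a2bc3: 5 characters, where the 2nd is an optional digit and the 5th is a digit
>>> pat.findall('abc9')  # returns group 1, but the 2nd character (group 1) is absent, so an empty string is returned
['']
>>> pat.findall('a8bc9')  # the full pattern; returns group 1
['8']
>>> pat.match('a8bc9').group()
'a8bc9'
>>> pat.match('a8bc9').group(1)
'8'
>>> pat.findall('a8bc')  # the 5th character is missing, so there is no match
[]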

"(?iLmsux)": only the i flag is shown here, which makes matching case-insensitive

>>> pat = re.compile(r'abc')
>>> pat.findall('abc')
['abc']
>>> pat.findall('ABC')
[]
>>> pat = re.compile(r'(?i)abc')  # (?i): ignore case
>>> pat.findall('ABC')
['ABC']
>>> pat.findall('abc')
['abc']
>>> pat.findall('aBc')
['aBc']
>>> pat.findall('aBC')
['aBC']
>>> pat = re.compile(r'abc', re.I)  # passing re.I as an argument; this form is recommended
>>> pat.findall('aBC')
['aBC']
>>> pat.findall('abc')
['abc']
>>> pat.findall('ABC')
['ABC']

3.3.3 Extracting Data
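As a sketch of how the pieces above come together when extracting data, the snippet below pulls a few fields out of one entry of a listing page with compiled patterns. The markup, class names, URLs, and pattern names are all assumptions made for illustration and are not taken from any real page.

```python
import re

# Made-up markup standing in for one entry of a listing page.
item = '''
<div class="item">
  <a href="https://example.com/movie/1"><img src="https://example.com/p1.jpg"></a>
  <span class="title">Example Movie</span>
  <span class="rating_num">9.7</span>
</div>
'''

# Compile each pattern once; (.*?) is a non-greedy group around the wanted text.
find_link = re.compile(r'<a href="(.*?)">')
find_img = re.compile(r'<img src="(.*?)"', re.S)   # re.S lets . cross line breaks
find_title = re.compile(r'<span class="title">(.*?)</span>')
find_score = re.compile(r'<span class="rating_num">(.*?)</span>')

record = [
    find_link.findall(item)[0],
    find_img.findall(item)[0],
    find_title.findall(item)[0],
    find_score.findall(item)[0],
]
print(record)
```

Because each pattern has exactly one group, findall returns the captured text rather than the whole match, so the first element is the field value itself.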

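3.4 Saving Data

3.4.1 Storing in an Excel Sheet

Supplement: the xlwt module

Basic use of xlwt

import xlwt  # import the module
workbook = xlwt.Workbook(encoding='utf-8')  # create a Workbook object
worksheet = workbook.add_sheet('sheet1')  # create a worksheet
worksheet.write(0, 0, 'hello')  # write to the sheet: the first argument is the row, the second the column, the third the content
workbook.save('students.xls')  # save the workbook as students.xls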

Writing the 9×9 multiplication table into a sheet, one formula per cell

import xlwt
workbook = xlwt.Workbook(encoding='utf-8')  # create a Workbook object
worksheet = workbook.add_sheet('sheet1')  # create a worksheet
for i in range(0, 9):
    for j in range(0, i + 1):
        worksheet.write(i, j, "%d * %d = %d" % (i + 1, j + 1, (i + 1) * (j + 1)))
# worksheet.write(0, 0, 'hello')  # write to the sheet: row, column, content
workbook.save('students.xls')  # save the workbook as students.xls

See also:

https://www.jianshu.com/p/fc97dd7e822c

https://www.cnblogs.com/caesar-id/p/11802440.html

3.4.2 Storing in a Database

See the 菜鸟教程 (Runoob) tutorials:

Python and SQLite: https://www.runoob.com/sqlite/sqlite-python.html

SQLite and SQL statements: https://www.runoob.com/sqlite/sqlite-tutorial.html

1. Importing the sqlite3 library

import sqlite3

2. Initializing the database

demo:

import sqlite3
conn = sqlite3.connect('test.db')
print("Opened database successfully")

Creating a table

import sqlite3
conn = sqlite3.connect('test.db')
print("Opened database successfully")
c = conn.cursor()
c.execute('''CREATE TABLE COMPANY
       (ID INT PRIMARY KEY NOT NULL,
       NAME TEXT NOT NULL,
       AGE INT NOT NULL,
       ADDRESS CHAR(50),
       SALARY REAL);''')
print("Table created successfully")
conn.commit()
conn.close()

Project source:

def init_db(dbpath):
    sql = '''
        create table movie250 (
            id integer primary key autoincrement,
            info_link text,
            pic_link text,
            cname varchar,
            ename varchar,
            score numeric,
            rated numeric,
            introduction text,
            info text
        )
    '''  # SQL that creates the table
    conn = sqlite3.connect(dbpath)  # connect to the database, creating it if needed
    cursor = conn.cursor()  # get a cursor
    cursor.execute(sql)  # execute the SQL statement: create the table
    conn.commit()  # commit the transaction so the change takes effect
    cursor.close()  # close the cursor
    conn.close()  # close the connection

3. Storing data in the database

Inserting data: the insert operation
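As an alternative to building SQL strings by hand, insert statements can be written with ? placeholders, which lets sqlite3 handle the quoting. The sketch below is not the project's code: it uses an in-memory database and the COMPANY table from the demo, and it also shows update and delete, which follow the same pattern.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE COMPANY ("
    "ID INT PRIMARY KEY NOT NULL, NAME TEXT NOT NULL, AGE INT NOT NULL, "
    "ADDRESS CHAR(50), SALARY REAL)"
)

rows = [
    (1, "Paul", 32, "California", 20000.00),
    (2, "Allen", 25, "Texas", 15000.00),
    (3, "Teddy", 23, "Norway", 20000.00),
]
# Each ? is filled from the tuple; no manual quoting is needed.
cur.executemany("INSERT INTO COMPANY VALUES (?, ?, ?, ?, ?)", rows)

# update and delete use placeholders in the same way
cur.execute("UPDATE COMPANY SET SALARY = ? WHERE ID = ?", (25000.00, 1))
cur.execute("DELETE FROM COMPANY WHERE ID = ?", (2,))
conn.commit()

remaining = cur.execute(
    "SELECT ID, NAME, SALARY FROM COMPANY ORDER BY ID"
).fetchall()
print(remaining)
conn.close()
```

Placeholders also avoid the problems that arise when a scraped value itself contains a quote character.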

import sqlite3
conn = sqlite3.connect('test.db')
c = conn.cursor()
print("Opened database successfully")

c.execute("INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
      VALUES (1, 'Paul', 32, 'California', 20000.00)")

c.execute("INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
      VALUES (2, 'Allen', 25, 'Texas', 15000.00)")

c.execute("INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
      VALUES (3, 'Teddy', 23, 'Norway', 20000.00)")

c.execute("INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
      VALUES (4, 'Mark', 25, 'Rich-Mond ', 65000.00)")

conn.commit()
print("Records created successfully")
conn.close()

Project source:

# datalist is the prepared data; dbpath is the full path of the database file
def saveData2(datalist, dbpath):
    init_db(dbpath)  # create the table
    conn = sqlite3.connect(dbpath)  # connect to the database
    cur = conn.cursor()  # get a cursor
    for data in datalist:  # process each row
        for index in range(len(data)):  # index is the position within the row
            data[index] = '"' + data[index] + '"'  # wrap each value in double quotes
        sql = '''INSERT INTO movie250 (info_link,pic_link,cname,ename,score,rated,introduction,info)
                 VALUES (%s)''' % ",".join(data)  # build the insert statement, joining the items of data with commas
        cur.execute(sql)  # execute the SQL statement: insert the row
        conn.commit()  # commit the transaction so the change takes effect
    cur.close()  # close the cursor
    conn.close()  # close the connection

4. Querying the database (used later in the project)

The select operation:

The following Python program shows how to fetch and display records from the COMPANY table created earlier:

import sqlite3
conn = sqlite3.connect('test.db')
c = conn.cursor()
print("Opened database successfully")

cursor = c.execute("SELECT id, name, address, salary from COMPANY")
for row in cursor:
    print("ID = ", row[0])
    print("NAME = ", row[1])
    print("ADDRESS = ", row[2])
    print("SALARY = ", row[3], "\n")

print("Operation done successfully")
conn.close()

5. Update operation: update (omitted)

6. Delete operation: delete (omitted)

3.4.3 Displaying the Data (for reference)
