Special Characters for Character Sets in Python Regular Expressions

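In Python regular expressions, the character set is a core concept, and the special characters associated with it account for much of the expressive power of regular expressions. Knowing how each one behaves makes text matching and processing far more efficient.

Dot (`.`)

The dot is one of the most frequently used special characters in regular expressions. By default it matches any single character except the newline (`\n`).

Example 1: Matching any character (except newline)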

import re

pattern = r"a.c"
text = "abc"
match = re.search(pattern, text)
if match:
    print(match.group())

text = "a1c"
match = re.search(pattern, text)
if match:
    print(match.group())

text = "a\nc"
match = re.search(pattern, text)
if match:
    print(match.group())
else:
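    print("No match")

In the code above, the pattern `a.c` matches an a, then any single character except a newline, then a c. Both abc and a1c therefore match, while a\nc does not.

To make the dot match every character, including the newline, pass the re.DOTALL flag.

Example 2: Using `re.DOTALL` to match every character, including newlines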

import re

pattern = r"a.c"
text = "a\nc"
match = re.search(pattern, text, re.DOTALL)
if match:
    print(match.group())

With the re.DOTALL flag, the dot also matches the newline, so a\nc now matches.
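The same behaviour can be requested in two other ways, sketched below: the short alias re.S, and the inline flag `(?s)` written at the start of the pattern. The variable names are illustrative only.

```python
import re

text = "a\nc"

# Without DOTALL the dot refuses to match the newline.
default_match = re.search(r"a.c", text)

# (?s) at the start of the pattern is the inline spelling of re.DOTALL.
inline_match = re.search(r"(?s)a.c", text)

# re.S is the short alias for re.DOTALL.
alias_match = re.search(r"a.c", text, re.S)

print(default_match)               # None
print(repr(inline_match.group()))  # 'a\nc'
print(repr(alias_match.group()))   # 'a\nc'
```

The inline form is handy when the pattern is stored as a plain string (for example in a configuration file) and no flags argument can be passed alongside it.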

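Character classes (square brackets `[]`)

A character class defines a set of characters; the regular expression matches any single character from that set.

Example 3: Matching characters from a given set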

import re

pattern = r"[aeiou]"
text = "hello"
matches = re.findall(pattern, text)
print(matches)

text = "world"
matches = re.findall(pattern, text)
print(matches)

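In this example, `[aeiou]` matches any single vowel. In hello it finds e and o, and in world it finds the single vowel o.

Inside a character class, a hyphen (`-`) between two characters denotes a range.

Example 4: Matching a character range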

import re

pattern = r"[a-z]"
text = "Hello123"
matches = re.findall(pattern, text)
print(matches)

pattern = r"[0-9]"
matches = re.findall(pattern, text)
print(matches)

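`[a-z]` matches any lowercase ASCII letter and `[0-9]` matches any ASCII digit. In Hello123, the first pattern finds the lowercase letters e, l, l, o, and the second finds the digits 1, 2, 3.

A caret (`^`) placed as the first character inside the brackets negates the class, so it matches any character that is not in the set.

Example 5: Negating a character class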

import re

pattern = r"[^a-z]"
text = "Hello123"
matches = re.findall(pattern, text)
print(matches)

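Here `[^a-z]` matches any character other than a lowercase letter, so it finds H, 1, 2, 3. Spaces inside the brackets are significant: writing the range with spaces around the hyphen produces a different class.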
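Most special characters lose their special meaning inside square brackets, and the position of `-` and `^` decides whether they are literal. The following is a small sketch of those rules; the sample text and variable names are made up for illustration.

```python
import re

text = "a-z: 1+1=2, 2*3=6, x^2."

# Inside [], '.', '+' and '*' are ordinary characters.
operators = re.findall(r"[.+*]", text)

# A hyphen placed first (or last) in the class is a literal hyphen, not a range.
hyphen_or_az = re.findall(r"[-az]", text)

# A caret that is not the first character is a literal caret, not negation.
caret_or_digit = re.findall(r"[0-9^]", text)

print(operators)       # ['+', '*', '.']
print(hyphen_or_az)    # ['a', '-', 'z']
print(caret_or_digit)  # ['1', '1', '2', '2', '3', '6', '^', '2']
```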

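Predefined character classes

Python regular expressions provide several predefined character classes, which are shorthand for commonly used character sets.

`\d`

`\d` matches any single digit character. For str patterns it is Unicode-aware and matches any Unicode decimal digit; with the re.ASCII flag (or in bytes patterns) it is equivalent to `[0-9]`.

Example 6: Using `\d` to match digits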

import re

pattern = r"\d"
text = "abc123"
matches = re.findall(pattern, text)
print(matches)

In the code above, `\d` matches 1, 2, 3 in abc123.
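The difference between the Unicode-aware default and the ASCII-only behaviour can be seen with a digit from another script. This is a minimal sketch; the sample string is an assumption chosen for the demonstration.

```python
import re

# "\u0663" is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit.
text = "4\u0663"

# Default for str patterns: \d matches any Unicode decimal digit.
unicode_digits = re.findall(r"\d", text)

# With re.ASCII, \d is restricted to the ten characters 0-9.
ascii_digits = re.findall(r"\d", text, re.ASCII)

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['4']
```

When validating input that must contain only the characters 0 to 9, passing re.ASCII or writing `[0-9]` explicitly avoids accepting digits from other scripts.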

`\D`

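`\D` is the negation of `\d`: it matches any single character that is not a digit. With re.ASCII it is equivalent to `[^0-9]`.

Example 7: Using `\D` to match non-digit characters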

import re

pattern = r"\D"
text = "abc123"
matches = re.findall(pattern, text)
print(matches)

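In abc123, `\D` matches a, b, c.

`\w`

`\w` matches any single letter, digit, or underscore. For str patterns it is Unicode-aware, so letters and digits from other scripts also match; with re.ASCII it is equivalent to `[a-zA-Z0-9_]`.

Example 8: Using `\w` to match letters, digits, or underscores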

import re

pattern = r"\w"
text = "hello_123"
matches = re.findall(pattern, text)
print(matches)

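`\w` matches every character in hello_123, because each one is a letter, a digit, or an underscore.

`\W`

`\W` is the negation of `\w`: it matches any single character that is not a letter, digit, or underscore. With re.ASCII it is equivalent to `[^a-zA-Z0-9_]`.

Example 9: Using `\W` to match characters other than letters, digits, or underscores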

import re

pattern = r"\W"
text = "hello_123!"
matches = re.findall(pattern, text)
print(matches)

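In hello_123!, `\W` matches only the !.

`\s`

`\s` matches any single whitespace character, including the space, tab (`\t`), newline (`\n`), carriage return (`\r`), form feed (`\f`), and vertical tab (`\v`). For str patterns it also matches other Unicode whitespace such as the no-break space; with re.ASCII it is equivalent to `[ \t\n\r\f\v]`.

Example 10: Using `\s` to match whitespace characters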

import re

pattern = r"\s"
text = "hello world\n"
matches = re.findall(pattern, text)
print(matches)

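Here `\s` matches the space and the newline in hello world\n.

`\S`

`\S` is the negation of `\s`: it matches any single non-whitespace character. With re.ASCII it is equivalent to `[^ \t\n\r\f\v]`.

Example 11: Using `\S` to match non-whitespace characters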

import re

pattern = r"\S"
text = "hello world\n"
matches = re.findall(pattern, text)
print(matches)

`\S` matches every non-whitespace character in hello world\n, that is, each letter of hello and world.
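The shorthand classes above can also be used inside square brackets, where they combine with other characters or with negation. A brief sketch, with sample strings invented for illustration:

```python
import re

# \w inside brackets contributes letters, digits and underscore;
# '.' and the trailing '-' are literal characters here.
tokens = re.findall(r"[\w.-]+", "user.name-1@example.com")

# A negated class can list shorthands too: anything that is
# neither a digit nor whitespace.
letters = re.findall(r"[^\d\s]+", "a1 b22 c")

print(tokens)   # ['user.name-1', 'example.com']
print(letters)  # ['a', 'b', 'c']
```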

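Boundary-matching special characters

Caret (`^`)

Outside a character class, `^` matches at the start of the string. With the re.MULTILINE flag it also matches immediately after each newline.

Example 12: Matching the start of a string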

import re

pattern = r"^hello"
text1 = "hello world"
text2 = "world hello"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
if match1:
    print(match1.group())
if match2:
    print(match2.group())
else:
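    print("text2: no match")

In the code above, `^hello` matches only when hello is at the start of the string, so text1 matches and text2 does not.

Dollar sign (`$`)

`$` matches at the end of the string, or just before a newline that is the last character of the string. With the re.MULTILINE flag it also matches just before each newline.

Example 13: Matching the end of a string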

import re

pattern = r"world$"
text1 = "hello world"
text2 = "world hello"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
if match1:
    print(match1.group())
if match2:
    print(match2.group())
else:
    print("text2: no match")

Here `world$` matches only when world is at the end of the string, so text1 matches and text2 does not.
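For multi-line text, the re.MULTILINE flag changes both anchors so that they work per line rather than per string. The sketch below compares the two modes on a made-up three-line string.

```python
import re

text = "hello world\nhello python\ngoodbye world"

# Default: ^ anchors only at the very start, $ only at the very end.
starts_default = re.findall(r"^hello", text)
ends_default = re.findall(r"world$", text)

# re.MULTILINE: ^ and $ also anchor at the start and end of every line.
starts_multiline = re.findall(r"^hello", text, re.MULTILINE)
ends_multiline = re.findall(r"world$", text, re.MULTILINE)

print(starts_default, ends_default)      # ['hello'] ['world']
print(starts_multiline, ends_multiline)  # ['hello', 'hello'] ['world', 'world']
```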

`\b`

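`\b` denotes a word boundary: a position where a word character (as matched by `\w`) is adjacent to a non-word character, or to the start or end of the string. It matches a position, not a character.

Example 14: Using `\b` to match word boundaries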

import re

pattern = r"\bhello\b"
text1 = "hello world"
text2 = "helloworld"
text3 = "world hello"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)
if match1:
    print(match1.group())
if match2:
    print(match2.group())
else:
    print("text2: no match")
if match3:
    print(match3.group())

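In hello world and world hello, hello has a word boundary on both sides (a space, or the start or end of the string), so the pattern matches. In helloworld, hello is followed directly by w, so there is no boundary after it and the pattern does not match.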
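Two practical points about `\b` are worth a quick sketch: it is a convenient way to find whole words with re.findall, and the pattern has to be a raw string, because in an ordinary Python string literal `\b` is the backspace character. The sample text is invented for illustration.

```python
import re

text = "cat concatenate bobcat cat's"

# Raw string: \b reaches the regex engine as a word-boundary assertion.
whole_words = re.findall(r"\bcat\b", text)

# Ordinary string: "\b" is the backspace character (\x08), so the
# pattern looks for literal backspaces around "cat" and finds nothing.
backspace = re.findall("\bcat\b", text)

print(whole_words)  # ['cat', 'cat']
print(backspace)    # []
```

The two hits are the standalone cat and the cat in cat's, where the apostrophe is a non-word character.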

`\B`

`\B` is the negation of `\b`: it matches at any position that is not a word boundary.

Example 15: Using `\B` to match non-word-boundary positions

import re

pattern = r"\Bhello\B"
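text1 = "xhelloworld"
text2 = "hello world"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
if match1:
    print(match1.group())
if match2:
    print(match2.group())
else:
    print("text2: no match")

In xhelloworld, hello sits between x and w, so neither side of it is a word boundary and the pattern matches. In hello world, hello starts at the beginning of the string and is followed by a space; both positions are word boundaries, so the pattern does not match. A bare helloworld would not match either, because the start of the string followed by a word character is itself a word boundary.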

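Quantifier special characters

Asterisk (`*`)

`*` means the preceding character or group may occur zero or more times.

Example 16: Using `*` to match zero or more occurrences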

import re

pattern = r"ab*"
text1 = "a"
text2 = "ab"
text3 = "abb"
text4 = "abbb"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)
match4 = re.search(pattern, text4)
if match1:
    print(match1.group())
if match2:
    print(match2.group())
if match3:
    print(match3.group())
if match4:
    print(match4.group())

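In the code above, `ab*` means an a followed by zero or more b characters, so a, ab, abb, and abbb all match.

Plus sign (`+`)

`+` means the preceding character or group must occur one or more times.

Example 17: Using `+` to match one or more occurrences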

import re

pattern = r"ab+"
text1 = "a"
text2 = "ab"
text3 = "abb"
text4 = "abbb"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)
match4 = re.search(pattern, text4)
if match1:
    print(match1.group())
else:
    print("text1: no match")
if match2:
    print(match2.group())
if match3:
    print(match3.group())
if match4:
    print(match4.group())

Here `ab+` requires an a followed by at least one b, so a does not match, while ab, abb, and abbb do.

Question mark (`?`)

`?` means the preceding character or group may occur zero times or once.

Example 18: Using `?` to match zero or one occurrence

import re

pattern = r"ab?"
text1 = "a"
text2 = "ab"
text3 = "abb"
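# re.fullmatch requires the whole string to match the pattern;
# re.search would still find "ab" at the start of "abb".
match1 = re.fullmatch(pattern, text1)
match2 = re.fullmatch(pattern, text2)
match3 = re.fullmatch(pattern, text3)
if match1:
    print(match1.group())
if match2:
    print(match2.group())
if match3:
    print(match3.group())
else:
    print("text3: no match")

`ab?` means an a followed by zero or one b. Matched against the whole string with re.fullmatch, a and ab match while abb does not. With re.search, abb would still produce the match ab, because re.search only needs the pattern to match somewhere in the string.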

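Curly braces (`{m,n}`)

`{m,n}` means the preceding character or group must occur at least m times and at most n times. Omitting m gives a lower bound of zero, and omitting n removes the upper bound.

Example 19: Using `{m,n}` to match a range of repetition counts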

import re

pattern = r"ab{2,4}"
text1 = "abb"
text2 = "abbb"
text3 = "abbbb"
text4 = "ab"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)
match4 = re.search(pattern, text4)
if match1:
    print(match1.group())
if match2:
    print(match2.group())
if match3:
    print(match3.group())
if match4:
    print(match4.group())
else:
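    print("text4: no match")

In this example, `ab{2,4}` means an a followed by two to four b characters, so abb, abbb, and abbbb match, while ab does not.

Writing only `{m}` means the preceding character or group must occur exactly m times.

Example 20: Using `{m}` to match an exact number of occurrences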

import re

pattern = r"ab{3}"
text1 = "abbb"
text2 = "ab"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
if match1:
    print(match1.group())
if match2:
    print(match2.group())
else:
    print("text2: no match")

Here `ab{3}` means an a followed by exactly three b characters, so abbb matches and ab does not.
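All of the quantifiers above are greedy: they consume as many repetitions as they can. Appending `?` to a quantifier (`*?`, `+?`, `??`, `{m,n}?`) makes it lazy, so it consumes as few as possible. The sketch below shows the difference on a made-up HTML-like string.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last '>' in the string.
greedy = re.findall(r"<.+>", text)

# Lazy: .+? stops at the first '>' that lets the pattern succeed.
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```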

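Grouping special characters

Parentheses (`()`)

Parentheses create groups. A group serves two main purposes: it lets several characters be treated as a single unit (for example, as the target of a quantifier), and it captures the text it matched so that it can be retrieved later.

Example 21: Treating a group as a single unit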

import re

pattern = r"(ab)+"
text1 = "ab"
text2 = "abab"
text3 = "a"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)
if match1:
    print(match1.group())
if match2:
    print(match2.group())
if match3:
    print(match3.group())
else:
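    print("text3: no match")

In the code above, `(ab)+` means ab as a unit occurring one or more times, so ab and abab match, while a does not.

Example 22: Capturing the text matched by a group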

import re

pattern = r"(\d{3})-(\d{4})"
text = "123-4567"
match = re.search(pattern, text)
if match:
    print(match.group())
    print(match.group(1))
    print(match.group(2))

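In `(\d{3})-(\d{4})`, `(\d{3})` and `(\d{4})` are two groups. match.group() returns the entire matched string (123-4567), match.group(1) returns the text matched by the first group (123), and match.group(2) returns the text matched by the second group (4567).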
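When a pattern has several groups, numbered access becomes hard to read. Python also supports named groups with the `(?P<name>...)` syntax. The following sketch reuses the pattern from the example above; the group names prefix and line are arbitrary labels chosen for illustration.

```python
import re

pattern = r"(?P<prefix>\d{3})-(?P<line>\d{4})"
match = re.search(pattern, "call 123-4567 now")

if match:
    prefix = match.group("prefix")  # access by name
    line = match.group("line")
    all_groups = match.groups()     # tuple of all captured groups
    by_name = match.groupdict()     # dict keyed by group name
    print(prefix, line)  # 123 4567
    print(all_groups)    # ('123', '4567')
    print(by_name)       # {'prefix': '123', 'line': '4567'}
```

Named groups can still be read by number, so match.group(1) returns the same text as match.group("prefix").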

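Pipe (`|`)

The pipe expresses alternation ("or"). It joins several alternative patterns, and the whole expression matches if any one of them matches. Alternatives are tried from left to right.

Example 23: Using the pipe to express "or"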

import re

pattern = r"apple|banana"
text1 = "I like apples"
text2 = "I like bananas"
text3 = "I like oranges"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)
if match1:
    print(match1.group())
if match2:
    print(match2.group())
if match3:
    print(match3.group())
else:
    print("text3: no match")

In this example, `apple|banana` matches either apple or banana, so text1 and text2 match (on apple and banana respectively), while text3 does not.
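The pipe has the lowest precedence of all regex operators, which matters as soon as anchors or other elements surround the alternatives. The sketch below contrasts an ungrouped alternation with one wrapped in a non-capturing group `(?:...)`; the word list is made up for illustration.

```python
import re

words = ["apple", "apple pie", "green banana", "banana"]

# Reads as (^apple) OR (banana$): each anchor belongs to only one branch.
loose = [w for w in words if re.search(r"^apple|banana$", w)]

# The non-capturing group limits the alternation, so both anchors apply
# to both alternatives.
strict = [w for w in words if re.search(r"^(?:apple|banana)$", w)]

print(loose)   # ['apple', 'apple pie', 'green banana', 'banana']
print(strict)  # ['apple', 'banana']
```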

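Escape character (`\`)

Many characters have a special meaning in regular expressions. To match one of these characters literally, put the escape character `\` in front of it.

Example 24: Matching special characters literally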

import re

pattern = r"\."
text = "www.example.com"
matches = re.findall(pattern, text)
print(matches)

pattern = r"\["
text = "This is a [test]"
matches = re.findall(pattern, text)
print(matches)

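In the code above, `\.` matches a literal dot and `\[` matches a literal left square bracket. Without the backslash, the dot would match any character, and an unescaped `[` would start a character class (here an unterminated one, which raises re.error).

Conversely, some ordinary characters gain a special meaning when preceded by a backslash, such as the predefined character classes `\d` and `\s` described earlier.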
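When the text to be matched literally comes from a variable, escaping it by hand is error-prone. The standard library function re.escape adds the necessary backslashes. A minimal sketch, with an invented sample string:

```python
import re

literal = "1+1=2 (really?)"
text = "Is 1+1=2 (really?) true?"

# Used directly, '+', '(', ')' and '?' keep their special meanings,
# so the pattern no longer describes the literal text.
unescaped = re.findall(literal, text)

# re.escape backslash-escapes the special characters.
escaped = re.findall(re.escape(literal), text)

print(unescaped)  # []
print(escaped)    # ['1+1=2 (really?)']
```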

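With a solid grasp of these special characters, regular expressions become an effective tool for text processing, data extraction, and input validation. In practice, combine the special characters according to the specific requirement at hand to build patterns that are both accurate and efficient.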
