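Using and Matching Regular Expressions in Python

Python Regular Expression Basics

Overview of Regular Expressions

A regular expression (often shortened to regex or regexp) is a compact notation for describing, matching, and processing text patterns. Python supports regular expressions through the built-in re module. A regular expression is essentially a character pattern that defines a set of strings. For example, the pattern \d+ matches one or more consecutive digit characters.

Importing the `re` Module

To use regular expressions in Python, first import the re module.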

import re

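Simple Character Matching

Literal characters: in a regular expression, ordinary characters (letters, digits, most punctuation) match themselves exactly. For example, the pattern abc matches the text abc but not abd. Note that re.search looks for the pattern anywhere in the text, so it also finds abc inside aabc; use re.fullmatch when the whole string must equal the pattern.

import re
pattern = 'abc'
text = 'abc'
match = re.search(pattern, text)
if match:
    print('Match succeeded')
else:
    print('Match failed')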
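
To make the difference between "found somewhere" and "matches the whole string" concrete, here is a minimal sketch comparing re.search, re.match, and re.fullmatch on the same literal pattern. The variable names are illustrative only.

```python
import re

pattern = 'abc'

# re.search scans the whole string for the first place the pattern matches.
found_anywhere = re.search(pattern, 'aabc') is not None

# re.match only tries to match at the very start of the string.
found_at_start = re.match(pattern, 'aabc') is not None

# re.fullmatch requires the entire string to match the pattern.
whole_string = re.fullmatch(pattern, 'abc') is not None
whole_string_longer = re.fullmatch(pattern, 'aabc') is not None

print(found_anywhere, found_at_start, whole_string, whole_string_longer)
# True False True False
```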

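Metacharacters: some characters have a special meaning in a regular expression and are called metacharacters. For example, ^ anchors the match at the start of the string and $ anchors it at the end.

import re
pattern_start = '^hello'
text_start = 'hello world'
match_start = re.search(pattern_start, text_start)
if match_start:
    print('Starts with hello: match succeeded')
else:
    print('Does not start with hello: match failed')

pattern_end = 'world$'
text_end = 'hello world'
match_end = re.search(pattern_end, text_end)
if match_end:
    print('Ends with world: match succeeded')
else:
    print('Does not end with world: match failed')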
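
By default ^ and $ anchor only at the start and end of the whole string. The sketch below, using a made-up three-line string, shows how the re.MULTILINE flag lets ^ also match at the start of every line.

```python
import re

# Hypothetical multi-line input used only for illustration.
log_text = 'hello world\nhello again\ngoodbye'

# Without flags, ^ matches only at the very start of the string.
default_hits = re.findall('^hello', log_text)

# With re.MULTILINE, ^ also matches right after each newline.
multiline_hits = re.findall('^hello', log_text, flags=re.MULTILINE)

print(default_hits)    # ['hello']
print(multiline_hits)  # ['hello', 'hello']
```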

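Character Classes

Bracketed character classes: square brackets [] define a character class that matches any single character listed inside them. For example, [abc] matches one character that is a, b, or c.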

import re
pattern_char_class = '[abc]'
text_char_class = 'abcdef'
matches = re.findall(pattern_char_class, text_char_class)
print(matches)

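Ranges: inside square brackets, a hyphen between two characters denotes a range. For example, [a-z] matches any lowercase ASCII letter and [0-9] matches any ASCII digit. Do not put spaces around the hyphen; inside a class a space is itself a literal character.

import re
pattern_range = '[a-z]'
text_range = 'Hello 123 World'
matches_range = re.findall(pattern_range, text_range)
print(matches_range)  # ['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']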

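Negated character classes: a ^ placed immediately after the opening bracket negates the class, so it matches any character that is not listed. For example, [^a-z] matches any character other than a lowercase ASCII letter.

import re
pattern_negation = '[^a-z]'
text_negation = 'Hello 123 World'
matches_negation = re.findall(pattern_negation, text_negation)
print(matches_negation)  # ['H', ' ', '1', '2', '3', ' ', 'W']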

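Repetition

Quantifiers

The * quantifier: matches the preceding element zero or more times. For example, a* matches the empty string, a, aa, aaa, and so on. Because a* can match the empty string, re.findall also reports an empty match at every position where no a is present.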

import re
pattern_star = 'a*'
text_star = 'aaaa bbb'
matches_star = re.findall(pattern_star, text_star)
print(matches_star)

The + quantifier: matches the preceding element one or more times. For example, a+ matches a, aa, aaa, and so on, but never the empty string.

import re
pattern_plus = 'a+'
text_plus = 'aaaa bbb'
matches_plus = re.findall(pattern_plus, text_plus)
print(matches_plus)

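The ? quantifier: matches the preceding element zero times or once. For example, a? matches either the empty string or a single a, so on aaaa it yields four separate one-character matches rather than one long match.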

import re
pattern_question = 'a?'
text_question = 'aaaa bbb'
matches_question = re.findall(pattern_question, text_question)
print(matches_question)

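The {n} quantifier: matches the preceding element exactly n times. For example, a{3} matches only aaa; on the text aaaa it matches the first three characters.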

import re
pattern_n = 'a{3}'
text_n = 'aaaa bbb'
matches_n = re.findall(pattern_n, text_n)
print(matches_n)

The {n,} quantifier: matches the preceding element at least n times. For example, a{3,} matches aaa, aaaa, aaaaa, and so on.

import re
pattern_n_plus = 'a{3,}'
text_n_plus = 'aaaa bbb'
matches_n_plus = re.findall(pattern_n_plus, text_n_plus)
print(matches_n_plus)

The {n,m} quantifier: matches the preceding element at least n times and at most m times. For example, a{2,4} matches aa, aaa, or aaaa.

import re
pattern_n_m = 'a{2,4}'
text_n_m = 'aaaa bbb'
matches_n_m = re.findall(pattern_n_m, text_n_m)
print(matches_n_m)

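Greedy and Non-Greedy Matching

Greedy matching: quantifiers in Python regular expressions are greedy by default, meaning they consume as many characters as possible. For example, a.*b matches from the first a through the last b on the same line, so the example below returns the entire text as a single match.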

import re
pattern_greedy = 'a.*b'
text_greedy = 'a123b 456a789b'
matches_greedy = re.findall(pattern_greedy, text_greedy)
print(matches_greedy)

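Non-greedy matching: appending ? to a quantifier makes it non-greedy (lazy), so it consumes as few characters as possible. For example, a.*?b matches from an a to the nearest following b, so the example below returns two matches, a123b and a789b.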

import re
pattern_non_greedy = 'a.*?b'
text_non_greedy = 'a123b 456a789b'
matches_non_greedy = re.findall(pattern_non_greedy, text_non_greedy)
print(matches_non_greedy)

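Grouping and Capturing

Groups

Basic groups: parentheses () combine several characters into one group that a quantifier can apply to. For example, (ab)+ matches ab repeated one or more times. Note that when a pattern contains a single capturing group, re.findall returns the text captured by that group rather than the whole match, so the example below prints ['ab'], not ['ababab'].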

import re
pattern_group = '(ab)+'
text_group = 'ababab'
matches_group = re.findall(pattern_group, text_group)
print(matches_group)
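
If the goal is to get the whole repeated run back from re.findall, one option is a non-capturing group, written (?:...). The following sketch contrasts the two forms on an illustrative string.

```python
import re

# Illustrative input containing two runs of 'ab'.
text = 'ababab cd ab'

# Capturing group: findall returns only the last repetition captured per match.
captured = re.findall('(ab)+', text)

# Non-capturing group: findall returns the full text of each match.
whole = re.findall('(?:ab)+', text)

print(captured)  # ['ab', 'ab']
print(whole)     # ['ababab', 'ab']
```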

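Group numbering: groups are numbered starting from 1, in the order of their opening parentheses. For example, in (a(bc)), the outer (a(bc)) is group 1 and the inner (bc) is group 2. Group 0 always refers to the entire match.

import re
pattern_group_num = '(a(bc))'
text_group_num = 'abc'
match_group_num = re.search(pattern_group_num, text_group_num)
if match_group_num:
    print('Full match:', match_group_num.group(0))
    print('Group 1:', match_group_num.group(1))
    print('Group 2:', match_group_num.group(2))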

Named Groups

Syntax: a group can be given a name with the form (?P<name>pattern). For example, (?P<first>a)(?P<second>b) defines two groups named first and second. Named groups still receive numbers, so they can be accessed either way.

import re
pattern_named_group = '(?P<first>a)(?P<second>b)'
text_named_group = 'ab'
match_named_group = re.search(pattern_named_group, text_named_group)
if match_named_group:
    print('Full match:', match_named_group.group(0))
    print('Group first:', match_named_group.group('first'))
    print('Group second:', match_named_group.group('second'))
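
Named groups become more useful when a pattern has several fields. As a small sketch, the date format and the sample sentence below are assumptions chosen for illustration; Match.groupdict() returns all named groups as a dictionary.

```python
import re

# Assumed ISO-style date layout (YYYY-MM-DD), for illustration only.
date_pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
sample = 'Released on 2024-03-15'

match = re.search(date_pattern, sample)
# groupdict() maps each group name to the text it captured.
fields = match.groupdict() if match else {}

print(fields)  # {'year': '2024', 'month': '03', 'day': '15'}
```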

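Backreferences

Numbered backreferences: inside a pattern, \n (where n is a group number) refers to the text previously captured by group n. For example, (a)\1 matches two consecutive a characters. Write the pattern as a raw string (r'...'); in an ordinary string literal, '\1' is interpreted by Python as the character with code 1 and the backreference is lost.

import re
pattern_backref = r'(a)\1'  # raw string keeps \1 as a regex backreference
text_backref = 'aa'
match_backref = re.search(pattern_backref, text_backref)
if match_backref:
    print('Match succeeded')
else:
    print('Match failed')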
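
A common practical use of a numbered backreference is spotting an accidentally repeated word. The pattern and the sample sentence below are an illustrative sketch, not a complete proofreading tool.

```python
import re

# \b(\w+) captures a word; \s+\1\b requires the same word to follow it.
repeated_word = r'\b(\w+)\s+\1\b'
sentence = 'this is is a test test case'

# With one capturing group, findall returns the captured word for each hit.
duplicates = re.findall(repeated_word, sentence)

print(duplicates)  # ['is', 'test']
```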

Named backreferences: a named group can be referenced with the form (?P=name). For example, (?P<letter>a)(?P=letter) also matches two consecutive a characters.

import re
pattern_named_backref = '(?P<letter>a)(?P=letter)'
text_named_backref = 'aa'
match_named_backref = re.search(pattern_named_backref, text_named_backref)
if match_named_backref:
    print('Match succeeded')
else:
    print('Match failed')

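Predefined Character Classes

Common Predefined Character Classes

\d: matches any digit character. For str patterns in Python 3 this includes Unicode decimal digits; it is equivalent to [0-9] only when the re.ASCII flag is used. The patterns in this section are written as raw strings so that Python does not try to interpret the backslash itself.

import re
pattern_digit = r'\d'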
text_digit = 'abc123def'
matches_digit = re.findall(pattern_digit, text_digit)
print(matches_digit)

\D: matches any character that is not a digit; with re.ASCII it is equivalent to [^0-9].

import re
pattern_non_digit = r'\D'
text_non_digit = 'abc123def'
matches_non_digit = re.findall(pattern_non_digit, text_non_digit)
print(matches_non_digit)

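\w: matches any letter, digit, or underscore. For str patterns this covers Unicode word characters; with re.ASCII it is equivalent to [a-zA-Z0-9_].

import re
pattern_word = r'\w'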
text_word = 'abc_123'
matches_word = re.findall(pattern_word, text_word)
print(matches_word)

\W: matches any character that is not a letter, digit, or underscore; with re.ASCII it is equivalent to [^a-zA-Z0-9_].

import re
pattern_non_word = r'\W'
text_non_word = 'abc_123!@#'
matches_non_word = re.findall(pattern_non_word, text_non_word)
print(matches_non_word)

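\s: matches any whitespace character, including space, tab, and newline. With re.ASCII it is equivalent to [ \t\n\r\f\v]; without that flag it also matches other Unicode whitespace.

import re
pattern_whitespace = r'\s'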
text_whitespace = 'abc 123\ndef'
matches_whitespace = re.findall(pattern_whitespace, text_whitespace)
print(matches_whitespace)

\S: matches any character that is not whitespace; with re.ASCII it is equivalent to [^ \t\n\r\f\v].

import re
pattern_non_whitespace = r'\S'
text_non_whitespace = 'abc 123\ndef'
matches_non_whitespace = re.findall(pattern_non_whitespace, text_non_whitespace)
print(matches_non_whitespace)
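
The predefined classes are Unicode-aware for str patterns, which can matter when input is not plain ASCII. The sketch below uses two sample strings chosen for illustration (an Arabic-Indic digit and an accented letter) to show the effect of the re.ASCII flag.

```python
import re

# 'a1' followed by ARABIC-INDIC DIGIT THREE (U+0663), a Unicode decimal digit.
mixed_digits = 'a1\u0663'
# 'café' with a non-ASCII letter (U+00E9).
accented_word = 'caf\u00e9'

unicode_digits = re.findall(r'\d', mixed_digits)
ascii_digits = re.findall(r'\d', mixed_digits, flags=re.ASCII)

unicode_words = re.findall(r'\w+', accented_word)
ascii_words = re.findall(r'\w+', accented_word, flags=re.ASCII)

print(len(unicode_digits), ascii_digits)  # 2 ['1']
print(unicode_words == [accented_word], ascii_words)  # True ['caf']
```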

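Boundary Matching

Word Boundaries

\b: matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string). For example, \bcat\b matches the word cat but not the cat inside category. The pattern must be a raw string, because in an ordinary string literal '\b' is the backspace character.

import re
pattern_word_boundary = r'\bcat\b'  # raw string: \b is a word boundary, not backspace
text_word_boundary = 'I have a cat'
match_word_boundary = re.search(pattern_word_boundary, text_word_boundary)
if match_word_boundary:
    print('Match succeeded')
else:
    print('Match failed')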

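\B: matches a position that is not a word boundary, that is, between two word characters or between two non-word characters. For example, \Bcat\B does not match the standalone word cat, nor the cat at the start of category (the start of the string is a word boundary), but it does match the cat inside concatenate.

import re
pattern_non_word_boundary = r'\Bcat\B'
text_non_word_boundary = 'concatenate'  # 'cat' has word characters on both sides
match_non_word_boundary = re.search(pattern_non_word_boundary, text_non_word_boundary)
if match_non_word_boundary:
    print('Match succeeded')
else:
    print('Match failed')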

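String Boundaries

Recap of ^ and $: ^ matches at the start of the string and $ matches at the end. For example, ^hello$ matches the complete string hello. Strictly speaking, $ also matches just before a single trailing newline, so ^hello$ accepts 'hello\n' as well.

import re
pattern_start_end = '^hello$'
text_start_end = 'hello'
match_start_end = re.search(pattern_start_end, text_start_end)
if match_start_end:
    print('Match succeeded')
else:
    print('Match failed')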
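
When input may end with a newline, the choice of anchor matters. This short sketch compares $, \Z, and re.fullmatch on a string with a trailing newline; the input value is only an example.

```python
import re

with_newline = 'hello\n'

# $ matches at the end of the string and also just before a final newline.
dollar_ok = re.search('^hello$', with_newline) is not None

# \Z matches only at the very end of the string.
strict_ok = re.search(r'\Ahello\Z', with_newline) is not None

# re.fullmatch requires the pattern to cover the entire string.
full_ok = re.fullmatch('hello', with_newline) is not None

print(dollar_ok, strict_ok, full_ok)  # True False False
```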

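Advanced Uses of Regular Expressions

Substitution

The re.sub() function: replaces the parts of a string that match a regular expression. For example, the code below replaces every digit in the string with X.

import re
text_sub = 'abc123def456'
pattern_sub = r'\d'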
replacement = 'X'
new_text = re.sub(pattern_sub, replacement, text_sub)
print(new_text)

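Replacing with a function: the repl argument of re.sub() can also be a function. It is called with the match object for each match and must return the replacement string. For example, the code below doubles every number in the string, producing abc246def912.

import re
def double_number(match):
    # Convert the matched digits to an int, double it, and return it as text.
    num = int(match.group(0))
    return str(num * 2)
text_sub_func = 'abc123def456'
pattern_sub_func = r'\d+'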
new_text_func = re.sub(pattern_sub_func, double_number, text_sub_func)
print(new_text_func)
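
Between a fixed replacement string and a full callback there is a middle option: the replacement string may refer to captured groups with \1, \2, and so on. A minimal sketch, assuming a YYYY-MM-DD date that should be rewritten as DD/MM/YYYY:

```python
import re

# Three capturing groups: year, month, day (assumed input layout).
iso_date = r'(\d{4})-(\d{2})-(\d{2})'
note = 'Due 2024-03-15'

# In the replacement, \3, \2, \1 insert the text of groups 3, 2, and 1.
reordered = re.sub(iso_date, r'\3/\2/\1', note)

# The optional count argument limits how many matches are replaced.
first_digit_only = re.sub(r'\d', 'X', 'abc123', count=1)

print(reordered)         # Due 15/03/2024
print(first_digit_only)  # abcX23
```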

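Splitting

The re.split() function: splits a string wherever the regular expression matches. For example, the code below splits on any run of spaces, commas, or semicolons and produces ['a', 'b', 'c', 'd'].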

import re
text_split = 'a, b; c  d'
pattern_split = '[,; ]+'
parts = re.split(pattern_split, text_split)
print(parts)
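
Two further behaviours of re.split() are often useful: the maxsplit argument limits the number of splits, and a capturing group in the pattern keeps the separators in the result. The inputs below are illustrative.

```python
import re

# maxsplit=1 stops after the first separator; the rest is returned unsplit.
head_and_rest = re.split('[,; ]+', 'a, b; c  d', maxsplit=1)

# A capturing group around the separator includes it in the output list.
with_separators = re.split('([,;])', 'a,b;c')

print(head_and_rest)    # ['a', 'b; c  d']
print(with_separators)  # ['a', ',', 'b', ';', 'c']
```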

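Compiling Regular Expressions

The re.compile() function: compiles a regular expression into a Pattern object whose methods (search, findall, sub, and so on) can be called directly. This is convenient when the same expression is used many times. The re module already caches recently compiled patterns, so the speed gain is usually modest; the main benefits are skipping the cache lookup in hot loops and giving the pattern a reusable name.

import re
pattern_compile = re.compile(r'\d+')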
text_compile = 'abc123def456'
matches_compile = pattern_compile.findall(text_compile)
print(matches_compile)
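
As a sketch of how a compiled pattern is typically reused, the example below applies one Pattern object to several strings and uses finditer to obtain match positions. The list of lines is made-up sample data.

```python
import re

number_pattern = re.compile(r'\d+')

# Hypothetical input lines for illustration.
lines = ['abc123def456', 'no digits', '7 up']

# The same compiled object is reused for every line.
numbers_per_line = [number_pattern.findall(line) for line in lines]

# finditer yields match objects, which expose the position of each match.
spans = [m.span() for m in number_pattern.finditer('abc123def456')]

print(numbers_per_line)  # [['123', '456'], [], ['7']]
print(spans)             # [(3, 6), (9, 12)]
```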

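Handling More Complex Text

Parsing HTML/XML

A simple example: regular expressions are not recommended for parsing complex HTML or XML, because nested and malformed markup quickly defeats them, but they can be adequate for simple, well-controlled cases. For example, the code below extracts the text inside a single <p> tag.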

import re
html_text = '<p>Hello, World!</p>'
pattern_html = '<p>(.*?)</p>'
match_html = re.search(pattern_html, html_text)
if match_html:
    print(match_html.group(1))
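
One limitation of the pattern above is that . does not match a newline by default, so a paragraph that spans lines is skipped. The following sketch, using a small made-up snippet, shows the effect of the re.DOTALL flag; it remains a simplification and not a substitute for a real HTML parser.

```python
import re

# Hypothetical snippet: the second paragraph contains a line break.
html_snippet = '<p>First</p>\n<p>Second\nline</p>'
paragraph = '<p>(.*?)</p>'

# Default: '.' stops at newlines, so only the single-line paragraph is found.
single_line_only = re.findall(paragraph, html_snippet)

# re.DOTALL lets '.' match newlines as well.
all_paragraphs = re.findall(paragraph, html_snippet, flags=re.DOTALL)

print(single_line_only)  # ['First']
print(all_paragraphs)    # ['First', 'Second\nline']
```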

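Extracting Email Addresses

Designing the pattern: the goal is a regular expression that extracts email addresses. An address generally consists of a user name, the @ sign, and a domain name. The pattern below is a practical approximation rather than a full validator of every legal address.

import re
text_email = 'My email is john@example.com and jane@example.org'
pattern_email = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'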
matches_email = re.findall(pattern_email, text_email)
print(matches_email)

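Validating Password Strength

Password rules: assume a password must be at least 8 characters long and contain an uppercase letter, a lowercase letter, a digit, and a special character.

import re
def validate_password(password):
    # Reject passwords shorter than 8 characters.
    if len(password) < 8:
        return False
    # Each search returns a match object if the class is present, else None.
    has_upper = re.search('[A-Z]', password)
    has_lower = re.search('[a-z]', password)
    has_digit = re.search(r'\d', password)
    has_special = re.search('[!@#$%^&*(),.?":{}|<>]', password)
    # all() turns the four results into a plain True/False.
    return all((has_upper, has_lower, has_digit, has_special))

password = 'Abc123!@#'
if validate_password(password):
    print('Password meets the strength requirements')
else:
    print('Password does not meet the strength requirements')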

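This article has covered the use of regular expressions in Python: the basic concepts, the main matching constructs, advanced operations such as substitution, splitting, and compilation, and several more complex text-processing scenarios. It should help developers apply regular expressions to text-processing problems in everyday work.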