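Processing Text Formats with Python Regular Expressions

1. Python Regular Expression Basics

1.1 What Is a Regular Expression

A regular expression is a powerful tool for matching and processing strings. Using a dedicated syntax, it can describe complex character patterns. In text processing, regular expressions are used to search, replace, and split strings. For example, they come in handy when you want to find every email address in a piece of text or rewrite every number in a particular format.

1.2 The re Module in Python

Python supports regular expressions through the re module. Import it before use: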

import re

The re module provides a set of functions, such as search(), match(), findall(), and sub(), each implementing a different regular expression operation.
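To make the difference between these functions concrete, here is a minimal sketch. The sample string and variable names are made up for illustration; the functions themselves are the standard re module calls named above.

```python
import re

text = "order 66 shipped, order 67 pending"

# match() only succeeds if the pattern matches at position 0.
at_start = re.match(r"\d+", text)      # None: the text starts with "order"

# search() scans forward and returns the first match anywhere in the string.
first = re.search(r"\d+", text)        # Match object for "66"

# findall() returns every non-overlapping match as a list of strings.
numbers = re.findall(r"\d+", text)     # ['66', '67']

# sub() returns a new string with every match replaced.
masked = re.sub(r"\d+", "#", text)     # 'order # shipped, order # pending'

print(at_start, first.group(), numbers, masked)
```

A common source of confusion is using match() where search() was intended: match() is anchored to the start of the string even without a leading ^.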

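1.3 Basic Character Matching

Ordinary characters: in a regular expression, an ordinary character matches itself. For example, the pattern 'abc' matches the substring 'abc' in the string 'abcdef'.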

import re
text = "abcdef"
pattern = "abc"
match = re.search(pattern, text)
if match:
    print(match.group())

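Metacharacters: some characters have a special meaning in a regular expression and are called metacharacters. Common ones include ., ^, $, *, +, ?, {}, [], (), and |.
- **`.`**: matches any single character except the newline `\n` (unless the re.DOTALL flag is set). For example, the pattern `'a.c'` matches `'abc'` and `'aec'`, but not `'a\nc'`.

text = "abc aec a\nc"  # "\n" here is a real newline character
pattern = "a.c"
matches = re.findall(pattern, text)
print(matches)  # ['abc', 'aec']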

- **`^`**: matches the start of the string. For example, the pattern `'^abc'` only matches strings that begin with `'abc'`, such as `'abcdef'`, and does not match `'defabc'`.

text1 = "abcdef"
text2 = "defabc"
pattern = "^abc"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
if match1:
    print(match1.group())  # abc
if not match2:
    print("No match")

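- **`$`**: matches the end of the string (or the position just before a trailing newline). For example, the pattern `'abc$'` only matches strings that end with `'abc'`, such as `'defabc'`, and does not match `'abcdef'`.

text1 = "abcdef"
text2 = "defabc"
pattern = "abc$"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
if match2:
    print(match2.group())  # abc
if not match1:
    print("No match")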
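The behavior of ., ^, and $ described above is the default. As a small sketch (the sample strings are invented for illustration), the standard re.DOTALL and re.MULTILINE flags change it as follows:

```python
import re

# re.DOTALL lets "." match the newline character as well.
text = "abc a\nc"
default_dot = re.findall(r"a.c", text)                   # ['abc']
dotall_dot = re.findall(r"a.c", text, flags=re.DOTALL)   # ['abc', 'a\nc']

# re.MULTILINE makes "^" and "$" match at the start and end of every line,
# not only at the start and end of the whole string.
lines = "abc one\nxyz two\nabc three"
default_caret = re.findall(r"^abc", lines)                        # ['abc']
multiline_caret = re.findall(r"^abc", lines, flags=re.MULTILINE)  # ['abc', 'abc']
default_dollar = re.findall(r"\w+$", lines)                       # ['three']
multiline_dollar = re.findall(r"\w+$", lines, flags=re.MULTILINE) # ['one', 'two', 'three']

print(default_dot, dotall_dot)
print(default_caret, multiline_caret)
print(default_dollar, multiline_dollar)
```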

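2. Character Classes and Repetition

2.1 Character Classes `[]`

A character class [] matches any single character listed inside the brackets. For example, [abc] matches one character that is 'a', 'b', or 'c'.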

text = "abcdef"
pattern = "[abc]"
matches = re.findall(pattern, text)
print(matches)

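A hyphen - inside the brackets denotes a character range. For example, [a-z] matches any lowercase letter and [0-9] matches any digit. Do not put spaces around the hyphen: inside a character class a space is itself a character to match.

text = "a1b2c3"
pattern1 = "[a-z]"
pattern2 = "[0-9]"
matches1 = re.findall(pattern1, text)
matches2 = re.findall(pattern2, text)
print(matches1)  # ['a', 'b', 'c']
print(matches2)  # ['1', '2', '3']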

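2.2 Negated Character Classes `[^]`

A negated character class [^] matches any single character that is not listed inside the brackets. For example, [^abc] matches any character other than 'a', 'b', and 'c'.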

text = "abcdef"
pattern = "[^abc]"
matches = re.findall(pattern, text)
print(matches)

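2.3 Repetition

*: matches the preceding element zero or more times. For example, 'a*' can match the empty string as well as one or more 'a' characters. Because it can match nothing, findall() also reports an empty match at every position where no 'a' is found.

text = "a aa aaa"
pattern = "a*"
matches = re.findall(pattern, text)
print(matches)  # ['a', '', 'aa', '', 'aaa', ''] on Python 3.7+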

+: matches the preceding element one or more times. For example, 'a+' needs at least one 'a' and never matches the empty string.

text = "a aa aaa"
pattern = "a+"
matches = re.findall(pattern, text)
print(matches)

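?: matches the preceding element zero or one time. For example, 'a?' matches either the empty string or a single 'a'.

text = "a aa"
pattern = "a?"
matches = re.findall(pattern, text)
print(matches)  # ['a', '', 'a', 'a', ''] on Python 3.7+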

{n}: matches the preceding element exactly n times. For example, 'a{3}' matches only three consecutive 'a' characters.

text = "a aa aaa"
pattern = "a{3}"
matches = re.findall(pattern, text)
print(matches)

{n,}: matches the preceding element at least n times. For example, 'a{2,}' matches two or more consecutive 'a' characters.

text = "a aa aaa"
pattern = "a{2,}"
matches = re.findall(pattern, text)
print(matches)

{n,m}: matches the preceding element at least n and at most m times. For example, 'a{2,4}' matches two to four consecutive 'a' characters.

text = "a aa aaa aaaa"
pattern = "a{2,4}"
matches = re.findall(pattern, text)
print(matches)

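3. Grouping and Capturing

3.1 Groups `()`

Parentheses () enclose part of a regular expression to form a group. Groups serve two main purposes. First, they change precedence: '(a|b)c' matches 'ac' or 'bc', whereas 'a|bc' matches 'a' or 'bc'. Second, they capture the text matched inside the group.

text = "ac bc"
pattern1 = "(a|b)c"
pattern2 = "a|bc"
matches1 = re.findall(pattern1, text)
matches2 = re.findall(pattern2, text)
print(matches1)  # ['a', 'b'] (with one group, findall returns the group's text)
print(matches2)  # ['a', 'bc']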

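3.2 Capture Groups

When parentheses () are used for grouping, the text matched by each group is captured. In Python, the group() method of the Match object retrieves it. For example, the pattern '(a)(bc)' has two capture groups: 'a' is the first and 'bc' is the second.

text = "abc"
pattern = "(a)(bc)"
match = re.search(pattern, text)
if match:
    print(match.group(0))  # the entire match: abc
    print(match.group(1))  # the first capture group: a
    print(match.group(2))  # the second capture group: bc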

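3.3 Named Capture Groups

Python also lets you name a capture group with the form (?P<name>pattern), where name is the group's name and pattern is the expression inside the group. Retrieving captured text by name is easier to read than retrieving it by index.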

text = "abc"
pattern = "(?P<first>a)(?P<second>bc)"
match = re.search(pattern, text)
if match:
    print(match.group("first"))
    print(match.group("second"))
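As a slightly more realistic sketch of named groups, the snippet below pulls the parts of a date out of a sentence. The pattern, the group names, and the sample text are assumptions chosen for illustration; groupdict() is the standard Match method that returns all named groups at once.

```python
import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
match = re.search(pattern, "released on 2023-10-01")

if match:
    # groupdict() maps each group name to the text it captured.
    print(match.groupdict())   # {'year': '2023', 'month': '10', 'day': '01'}
    # A named group is still reachable by its position.
    print(match.group("year"), match.group(1))   # 2023 2023
```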

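4. Zero-Width Assertions

4.1 Positive Lookahead `(?=pattern)`

A positive lookahead (?=pattern) asserts that pattern can be matched immediately after the current position, without consuming any characters. For example, 'a(?=b)' matches an 'a' that is followed by 'b', but the 'b' in 'ab' is not part of the match.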

text = "ab ac"
pattern = "a(?=b)"
matches = re.findall(pattern, text)
print(matches)

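4.2 Negative Lookahead `(?!pattern)`

A negative lookahead (?!pattern) asserts that pattern cannot be matched immediately after the current position. For example, 'a(?!b)' matches an 'a' that is not followed by 'b'.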

text = "ab ac"
pattern = "a(?!b)"
matches = re.findall(pattern, text)
print(matches)

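4.3 Positive Lookbehind `(?<=pattern)`

A positive lookbehind (?<=pattern) asserts that pattern can be matched immediately before the current position, again without consuming characters. Python's re module supports lookbehind as long as the inner pattern matches strings of a fixed length: abc or a|b is allowed, but a* and a{3,4} are not. For example, '(?<=a)b' matches a 'b' that is preceded by 'a'.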

import re
text = "ab cb"
pattern = "(?<=a)b"
matches = re.findall(pattern, text)
print(matches)

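4.4 Negative Lookbehind `(?<!pattern)`

A negative lookbehind (?<!pattern) asserts that pattern cannot be matched immediately before the current position. For example, '(?<!a)b' matches a 'b' that is not preceded by 'a'.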

import re
text = "ab cb"
pattern = "(?<!a)b"
matches = re.findall(pattern, text)
print(matches)
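The single-letter examples above show the mechanics. As a hedged illustration of where lookarounds are typically useful, the sketch below extracts numbers based on what surrounds them. The price list is invented sample data.

```python
import re

text = "apple $30, pear €25, plum $7"

# Positive lookbehind: digits preceded by "$"; the "$" is not part of the match.
dollar_amounts = re.findall(r"(?<=\$)\d+", text)     # ['30', '7']

# Negative lookbehind: digits preceded by neither "$" nor another digit.
other_amounts = re.findall(r"(?<![$\d])\d+", text)   # ['25']

# Positive lookahead: digits followed by a comma; the comma is not consumed.
before_comma = re.findall(r"\d+(?=,)", text)         # ['30', '25']

print(dollar_amounts, other_amounts, before_comma)
```

The extra \d in the negative lookbehind matters: without it, the engine could start a match at the '0' of '30', because that position is not preceded by '$'.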

5. Regular Expressions in Text Processing

5.1 Searching Text

re.search(): finds the first matching substring in a string. It returns a Match object, or None if nothing matches.

text = "hello world"
pattern = "world"
match = re.search(pattern, text)
if match:
    print(match.group())

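re.findall(): finds all non-overlapping matches in a string and returns them as a list. If the pattern contains capture groups, the list holds the group contents rather than the whole matches.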

text = "apple banana apple"
pattern = "apple"
matches = re.findall(pattern, text)
print(matches)

re.finditer(): finds all non-overlapping matches in a string and returns an iterator in which each element is a Match object.

text = "apple banana apple"
pattern = "apple"
for match in re.finditer(pattern, text):
    print(match.group())
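Because finditer() yields Match objects rather than plain strings, it also gives access to where each match was found. A minimal sketch using the same sample text (the variable name positions is just for illustration):

```python
import re

text = "apple banana apple"

# start() and end() give the slice of the original string covered by each match.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r"apple", text)]
print(positions)  # [('apple', 0, 5), ('apple', 13, 18)]
```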

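5.2 Replacing Text

re.sub(): replaces matching substrings in a string. Its basic form is re.sub(pattern, repl, string, count=0), where pattern is the regular expression, repl is the replacement string or function, string is the text to process, and count is the maximum number of replacements. The default count of 0 means every match is replaced.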

text = "apple banana apple"
pattern = "apple"
new_text = re.sub(pattern, "orange", text)
print(new_text)
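To illustrate the count parameter described above, here is a short sketch on the same sample text. It also shows re.subn(), the standard companion function that additionally reports how many replacements were made.

```python
import re

text = "apple banana apple"

# count=1 replaces only the first match.
first_only = re.sub(r"apple", "orange", text, count=1)
print(first_only)  # orange banana apple

# subn() returns a (new_string, number_of_replacements) tuple.
replaced, n = re.subn(r"apple", "orange", text)
print(replaced, n)  # orange banana orange 2
```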

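Replacing with a function: the repl argument can also be a function. It receives a Match object and returns the replacement string.

def replace_with_length(match):
    # Replace each word with the number of characters it contains.
    return str(len(match.group()))

text = "apple banana"
pattern = "[a-z]+"
new_text = re.sub(pattern, replace_with_length, text)
print(new_text)  # 5 6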

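5.3 Splitting Text

re.split(): splits a string at each match and returns a list. For example, the pattern '[ ,.]' splits on a space, a comma, or a period.

text = "apple,banana. orange"
pattern = "[ ,.]"
parts = re.split(pattern, text)
print(parts)  # ['apple', 'banana', '', 'orange'] (empty string between "." and " ")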
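Splitting on a single-character class yields an empty string wherever two separators are adjacent, as with the ". " in the sample text. The sketch below shows a few common variations; the variable names are illustrative, while maxsplit and the capture-group behavior are standard re.split() features.

```python
import re

text = "apple,banana. orange"

# Adding "+" treats a run of separators as a single delimiter.
parts = re.split(r"[ ,.]+", text)
print(parts)      # ['apple', 'banana', 'orange']

# maxsplit limits the number of splits performed.
limited = re.split(r"[ ,.]+", text, maxsplit=1)
print(limited)    # ['apple', 'banana. orange']

# Wrapping the pattern in a capture group keeps the separators in the result.
with_seps = re.split(r"([ ,.]+)", text)
print(with_seps)  # ['apple', ',', 'banana', '. ', 'orange']
```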

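6. Handling Complex Text Formats

6.1 Matching HTML Tags

HTML tags have a regular structure, so simple cases can be matched with a regular expression. For example, a pattern like '<div.*?>.*?</div>' matches a <div> tag together with its content. Here .*? is a non-greedy match: it consumes as few characters as possible, so the match stops at the first closing </div>. This approach does not handle nested <div> elements, and by default . does not cross line breaks.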

html = "<div>content</div> <div>another content</div>"
pattern = "<div.*?>.*?</div>"
matches = re.findall(pattern, html)
print(matches)

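6.2 Parsing XML Data

XML data can also be given a first-pass parse with regular expressions. For example, '<\w+.*?>' matches an opening tag: \w+ matches the element name and .*? matches the attributes. Write the pattern as a raw string so that Python leaves the backslash for the regex engine.

xml = "<book id='1'><title>Python Programming</title></book>"
pattern = r"<\w+.*?>"
matches = re.findall(pattern, xml)
print(matches)  # ["<book id='1'>", '<title>']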

6.3 Processing Log Files

Log files usually follow a fixed format with a timestamp, a log level, and a message. Assuming the format [YYYY-MM-DD HH:MM:SS] [INFO/ERROR] message, a regular expression can extract each part.

log = "[2023-10-01 12:00:00] [INFO] Application started"
pattern = r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)"
match = re.search(pattern, log)
if match:
    print("Time:", match.group(1))
    print("Level:", match.group(2))
    print("Message:", match.group(3))
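Real log files contain many lines, some of which may not follow the expected format. The following sketch extends the idea to a list of lines. It assumes timestamps written without spaces around the hyphens; LOG_LINE, log_lines, and the sample entries are hypothetical names and data introduced only for illustration.

```python
import re

# Same structure as the pattern above, with named groups for readability.
LOG_LINE = re.compile(
    r"\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"\[(?P<level>\w+)\] "
    r"(?P<message>.*)"
)

# Hypothetical sample input.
log_lines = [
    "[2023-10-01 12:00:00] [INFO] Application started",
    "[2023-10-01 12:00:05] [ERROR] Database connection failed",
    "malformed line without a timestamp",
]

records = []
for line in log_lines:
    match = LOG_LINE.fullmatch(line)
    if match is None:
        continue  # skip lines that do not follow the expected format
    records.append(match.groupdict())

errors = [r["message"] for r in records if r["level"] == "ERROR"]
print(errors)  # ['Database connection failed']
```

Using fullmatch() rather than search() is a design choice here: it rejects lines that merely contain something log-like in the middle.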

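7. Optimizing Regular Expression Performance

7.1 Reducing Backtracking

Backtracking can make matching expensive. When a greedy quantifier such as * or + consumes too much and the rest of the pattern then fails, the engine backs up and tries shorter alternatives. Non-greedy quantifiers (*?, +?) take as little as possible and extend only when needed. They change which match is returned, but they still involve retries and are not automatically faster. A more dependable way to limit backtracking is to make the pattern more specific, for instance with a negated character class instead of .*. The example below shows how the greedy and non-greedy forms select different matches.

text = "aabaaa"
pattern1 = "a.*a"  # greedy
pattern2 = "a.*?a"  # non-greedy
match1 = re.search(pattern1, text)
match2 = re.search(pattern2, text)
print(match1.group())  # aabaaa
print(match2.group())  # aa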
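One common way to keep backtracking under control is to replace .* with a negated character class that cannot run past the delimiter in the first place. The sketch below compares three ways of extracting quoted values; the sample string is invented for illustration.

```python
import re

text = 'name="alice" role="admin"'

# Greedy: ".*" runs to the end of the string, then backs up to the last quote.
greedy = re.findall(r'".*"', text)
print(greedy)   # ['"alice" role="admin"']

# Non-greedy: extends one character at a time until a closing quote is found.
lazy = re.findall(r'".*?"', text)
print(lazy)     # ['"alice"', '"admin"']

# Negated class: "[^"]*" can never cross a quote, so no retreat is needed.
negated = re.findall(r'"[^"]*"', text)
print(negated)  # ['"alice"', '"admin"']
```

The last two patterns return the same result here, but the negated class states the intent directly and tends to behave more predictably on long inputs.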

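7.2 Precompiling Regular Expressions

If the same regular expression is used many times, it can be precompiled with re.compile(). The resulting pattern object offers the same operations as methods (search(), findall(), sub(), and so on). The module-level functions already cache recently compiled patterns, so the gain is usually modest, but a precompiled pattern skips that lookup on each call and keeps reused patterns in one place.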

pattern = re.compile("a+")
text1 = "aaa"
text2 = "aaaa"
match1 = pattern.search(text1)
match2 = pattern.search(text2)
print(match1.group())
print(match2.group())

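7.3 Avoiding Unnecessary Capture Groups

Capture groups add overhead because the engine has to record what each group matched. If the captured text is not needed, drop the parentheses where they are not required for grouping, and use a non-capturing group (?:...) where they are. For example, '(a|b)' on its own can be written as 'a|b', but '(a|b)c' cannot be written as 'a|bc', which means 'a' or 'bc'. Write '(?:a|b)c' instead.

text = "ab ac"
pattern1 = "(a|b)c"    # capturing group
pattern2 = "(?:a|b)c"  # non-capturing group, matches the same text
matches1 = re.findall(pattern1, text)
matches2 = re.findall(pattern2, text)
print(matches1)  # ['a'] (findall returns the captured group)
print(matches2)  # ['ac']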

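8. Pitfalls and Caveats

8.1 Escaping Special Characters

In a regular expression, a special character must be escaped to match it literally. For example, use '\.' to match a '.' character and '\(' to match a '(' character. Writing patterns as raw strings (r'...') keeps Python from interpreting the backslashes first.

text = "a.b (c)"
pattern = r"\.|\("  # a literal "." or a literal "("
matches = re.findall(pattern, text)
print(matches)  # ['.', '(']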
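When the text to match comes from a variable rather than being typed into the pattern, escaping by hand is error-prone. The standard re.escape() function does it automatically. In the sketch below, user_input and the test strings are hypothetical values chosen for illustration.

```python
import re

user_input = "a.b (c)"

# Used as-is, "." matches any character and "(c)" becomes a group matching "c".
unescaped_hit = re.search(user_input, "axb c")
print(unescaped_hit is not None)  # True, probably not what was intended

# re.escape() turns every special character into its literal form.
escaped = re.escape(user_input)
escaped_hit = re.search(escaped, "axb c")
literal_hit = re.search(escaped, "see a.b (c) here")
print(escaped_hit is None, literal_hit.group())  # True a.b (c)
```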

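8.2 Handling Boundary Conditions

Pay attention to boundary conditions when processing text. When anchoring a match to the start or end of a string with ^ and $, make sure the logic is what you intend. When using repetition quantifiers, consider how the pattern behaves on an empty string or a single character.

text1 = "abc"
text2 = ""
pattern1 = "^abc$"
pattern2 = "a?"
match1 = re.search(pattern1, text1)
match2 = re.search(pattern1, text2)
matches3 = re.findall(pattern2, text1)
matches4 = re.findall(pattern2, text2)
if match1:
    print(match1.group())  # abc
if not match2:
    print("No match")
print(matches3)  # ['a', '', '', '']
print(matches4)  # ['']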
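Two boundary cases are easy to overlook: $ also matches just before a trailing newline, and a pattern that can match the empty string never fails. A minimal sketch, with illustrative variable names:

```python
import re

# "$" matches at the very end and also just before a trailing newline.
dollar = re.search(r"^abc$", "abc\n")
print(dollar is not None)   # True

# \Z matches only at the very end of the string.
strict = re.search(r"^abc\Z", "abc\n")
print(strict is None)       # True

# fullmatch() requires the entire string to match the pattern.
full = re.fullmatch(r"abc", "abc\n")
print(full is None)         # True

# "a*" can match nothing, so search() succeeds even when there is no "a".
empty = re.search(r"a*", "xyz")
print(repr(empty.group()))  # ''
```

When validating input such as a username or an identifier, fullmatch() or \Z is usually the safer choice than a trailing $.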

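8.3 Cross-Platform Compatibility

Python's re module behaves essentially the same on every platform, but syntax incompatibilities can appear when patterns are shared with other languages or tools. For example, the syntax for backreferences differs slightly between regex flavors. Test patterns thoroughly whenever they cross such boundaries.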
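As a concrete illustration of the backreference point, the sketch below shows the forms that Python's re module accepts, and what happens with the $1 style used by some other tools. The sample text is invented for illustration.

```python
import re

text = "hello hello world"

# Inside a pattern, Python refers back to a group with \1 or (?P=name).
by_number = re.search(r"\b(\w+) \1\b", text)
by_name = re.search(r"\b(?P<word>\w+) (?P=word)\b", text)
print(by_number.group(), "|", by_name.group())  # hello hello | hello hello

# Inside a replacement string, Python accepts \1, \g<1>, and \g<name>.
deduped = re.sub(r"\b(?P<word>\w+) (?P=word)\b", r"\g<word>", text)
print(deduped)       # hello world

# "$1" has no special meaning to re.sub(); it is inserted literally.
dollar_style = re.sub(r"\b(\w+) \1\b", "$1", text)
print(dollar_style)  # $1 world
```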