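Ruby Regular Expression Matching and Replacement Techniques

Ruby Regular Expression Basics

In Ruby, regular expressions are a powerful text-processing tool for matching, searching, and replacing text. A regular expression is made up of ordinary characters (such as the letters a through z) and special characters (metacharacters).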

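Defining Regular Expressions

Ruby offers two ways to define a regular expression: the literal form and the Regexp class.

Literal form: enclose the pattern in slashes (/). For example, to match a digit in a string, write /\d/. Here \d is a metacharacter that stands for any single digit.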

regex = /\d/
string = "abc123"
match = string.match(regex)
puts match

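The code above defines a regular expression that matches any digit and applies it to the string "abc123". match returns a MatchData object for the first match, so puts prints 1.

Using the Regexp class: create a regular expression object with Regexp.new. The pattern is passed as a string, so write it in single quotes or double the backslash; inside double quotes, "\d" collapses to a plain d. For example:

regex = Regexp.new('\d')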
string = "abc123"
match = string.match(regex)
puts match

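The two forms are equivalent. The literal form is more concise, while Regexp.new is more useful when the pattern has to be built dynamically (literals also support #{} interpolation), for example:

digit = 5
regex = Regexp.new("\\d#{digit}")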
string = "abc1235"
match = string.match(regex)
puts match

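Here the regular expression is built from the value of the variable digit, producing /\d5/, which matches a digit followed by 5 (35 in this string).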
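When the dynamic part comes from arbitrary text rather than a known-safe value, metacharacters in it would change the meaning of the pattern. A minimal sketch of one way to handle that, using Regexp.escape and Regexp.union from the core library; the helper names literal_pattern and any_of are hypothetical and exist only for this illustration:

```ruby
# Build a pattern that matches `text` literally.
# Regexp.escape backslash-escapes characters such as "." and "+".
def literal_pattern(text)
  Regexp.new(Regexp.escape(text))
end

# Build an alternation that matches any of the given literal words.
def any_of(words)
  Regexp.union(words)
end

PRICE = literal_pattern("1.50")
puts PRICE.match?("cost: 1.50")   # true
puts PRICE.match?("cost: 1x50")   # false, the dot is matched literally

FRUIT = any_of(%w[apple cherry])
puts "sweet cherry pie"[FRUIT]    # cherry
```

Without the escape step, the dot in "1.50" would match any character, so "1x50" would be accepted as well.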

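Metacharacters

Character classes:
- **`[abc]`**: matches any one of the characters inside the brackets. For example, `/[abc]/` matches an `a`, `b`, or `c` in a string.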

regex = /[abc]/
string1 = "apple"
string2 = "banana"
string3 = "cherry"
puts string1.match(regex)
puts string2.match(regex)
puts string3.match(regex)

- **`[^abc]`**: matches any single character that is not listed inside the brackets. For example, `/[^abc]/` matches any character other than `a`, `b`, or `c`.

regex = /[^abc]/
string = "def"
puts string.match(regex)

- **`[a-z]`**: matches any character in the given range. For example, `/[a-z]/` matches any lowercase letter.

regex = /[a-z]/
string = "Hello"
puts string.match(regex)

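Predefined character classes:
- **`\d`**: matches any digit; equivalent to `[0-9]`.
- **`\D`**: matches any non-digit character; equivalent to `[^0-9]`.
- **`\w`**: matches any word character (letter, digit, or underscore); equivalent to `[a-zA-Z0-9_]`.
- **`\W`**: matches any non-word character; equivalent to `[^a-zA-Z0-9_]`.
- **`\s`**: matches any whitespace character (space, tab, newline, and so on); equivalent to `[ \t\r\n\f\v]`.
- **`\S`**: matches any non-whitespace character; equivalent to `[^ \t\r\n\f\v]`.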

regex_digit = /\d/
regex_non_digit = /\D/
regex_word = /\w/
regex_non_word = /\W/
regex_whitespace = /\s/
regex_non_whitespace = /\S/

string = "Hello 123, World!"
puts string.match(regex_digit)
puts string.match(regex_non_digit)
puts string.match(regex_word)
puts string.match(regex_non_word)
puts string.match(regex_whitespace)
puts string.match(regex_non_whitespace)
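Because \d is defined as [0-9], it only covers ASCII digits. The sketch below contrasts it with the POSIX bracket expression [[:digit:]], which Ruby treats as Unicode-aware. The helper names ascii_digit? and unicode_digit? are hypothetical, and the full-width digit is written as an escape so the example does not depend on the file's encoding:

```ruby
FULLWIDTH_THREE = "\uFF13"   # full-width digit three, a non-ASCII decimal digit

# True when the string contains an ASCII digit 0-9.
def ascii_digit?(str)
  str.match?(/\d/)
end

# True when the string contains any Unicode decimal digit.
def unicode_digit?(str)
  str.match?(/[[:digit:]]/)
end

puts ascii_digit?("7")                  # true
puts ascii_digit?(FULLWIDTH_THREE)      # false
puts unicode_digit?(FULLWIDTH_THREE)    # true
```

Which of the two is appropriate depends on whether non-ASCII digits should count as digits in the input being validated.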

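Boundary matching:
- **`^`**: matches the start of a line. For example, `/^Hello/` matches only where a line begins with `Hello`. In Ruby, `^` matches at the start of every line, not only at the start of the string; use `\A` to anchor to the start of the string.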

regex_start = /^Hello/
string1 = "Hello World"
string2 = "World Hello"
puts string1.match(regex_start)
puts string2.match(regex_start)

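- **`$`**: matches the end of a line. For example, `/World$/` matches only where a line ends with `World`. In Ruby, `$` matches at the end of every line; use `\z` to anchor to the end of the string (or `\Z`, which also allows one trailing newline).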

regex_end = /World$/
string1 = "Hello World"
string2 = "World Hello"
puts string1.match(regex_end)
puts string2.match(regex_end)

- **`\b`**: matches a word boundary. For example, `/\bHello\b/` matches only the standalone word `Hello`, not the `Hello` inside `HelloWorld`.

regex_word_boundary = /\bHello\b/
string1 = "Hello"
string2 = "HelloWorld"
puts string1.match(regex_word_boundary)
puts string2.match(regex_word_boundary)

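- **`\B`**: matches a position that is not a word boundary. For example, `/Hello\B/` matches the `Hello` in `HelloWorld`, where another word character follows, but not a standalone `Hello`.

regex_non_word_boundary = /Hello\B/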
string1 = "Hello"
string2 = "HelloWorld"
puts string1.match(regex_non_word_boundary)
puts string2.match(regex_non_word_boundary)
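The difference between line anchors and string anchors matters most when validating input that may contain newlines. A small sketch of that difference, with constant names chosen for this illustration only:

```ruby
LINE_ANCHORED   = /^\d+$/     # some line consists only of digits
STRING_ANCHORED = /\A\d+\z/   # the whole string consists only of digits

MULTILINE_INPUT = "123\nabc"

puts MULTILINE_INPUT.match?(LINE_ANCHORED)     # true, the first line is "123"
puts MULTILINE_INPUT.match?(STRING_ANCHORED)   # false
puts "123".match?(STRING_ANCHORED)             # true

# \Z tolerates a single trailing newline, \z does not.
puts "123\n".match?(/\A\d+\Z/)                 # true
puts "123\n".match?(/\A\d+\z/)                 # false
```

For whole-value validation, \A and \z are usually the safer choice, since ^ and $ accept a value as soon as any one line matches.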

Quantifiers:
- **`*`**: matches the preceding element zero or more times. For example, `/a*/` matches the empty string as well as any run of `a`.

regex_star = /a*/
string1 = ""
string2 = "a"
string3 = "aaa"
puts string1.match(regex_star)
puts string2.match(regex_star)
puts string3.match(regex_star)

- **`+`**: matches the preceding element one or more times. For example, `/a+/` requires at least one `a`.

regex_plus = /a+/
string1 = ""
string2 = "a"
string3 = "aaa"
puts string1.match(regex_plus)
puts string2.match(regex_plus)
puts string3.match(regex_plus)

- **`?`**: matches the preceding element zero or one time. For example, `/a?/` matches the empty string or a single `a`.

regex_question = /a?/
string1 = ""
string2 = "a"
string3 = "aaa"
puts string1.match(regex_question)
puts string2.match(regex_question)
puts string3.match(regex_question)

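- **`{n}`**: matches the preceding element exactly `n` times. For example, `/a{3}/` matches three consecutive `a`; in a longer run such as `aaaa` it matches the first three.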

regex_exact = /a{3}/
string1 = "aa"
string2 = "aaa"
string3 = "aaaa"
puts string1.match(regex_exact)
puts string2.match(regex_exact)
puts string3.match(regex_exact)

- **`{n,}`**: matches the preceding element at least `n` times. For example, `/a{3,}/` matches three or more consecutive `a`.

regex_min = /a{3,}/
string1 = "aa"
string2 = "aaa"
string3 = "aaaa"
puts string1.match(regex_min)
puts string2.match(regex_min)
puts string3.match(regex_min)

- **`{n,m}`**: matches the preceding element at least `n` and at most `m` times. For example, `/a{3,5}/` matches three to five consecutive `a`.

regex_range = /a{3,5}/
string1 = "aa"
string2 = "aaa"
string3 = "aaaa"
string4 = "aaaaa"
string5 = "aaaaaa"
puts string1.match(regex_range)
puts string2.match(regex_range)
puts string3.match(regex_range)
puts string4.match(regex_range)
puts string5.match(regex_range)
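Two details of the quantifiers above are easy to miss: an upper bound caps how much of a long run is consumed, and a quantifier that allows zero repetitions succeeds with an empty match at the very first position. A short sketch, where first_match is a hypothetical helper wrapping String#[]:

```ruby
# Return the text of the first match, or nil when nothing matches.
def first_match(str, regex)
  str[regex]
end

p first_match("aaaaaa", /a{3,5}/)   # "aaaaa", only five of the six are consumed
p first_match("baaa", /a*/)         # "", an empty match at index 0, before the "b"
p first_match("baaa", /a+/)         # "aaa"
p first_match("aa", /a{3}/)         # nil
```

The second call is the reason `*` and `?` patterns rarely make useful search patterns on their own: they can always match nothing.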

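Grouping: parentheses () group part of a pattern so that several characters are treated as a unit; groups can also be used for backreferences. For example, /(ab)+/ matches one or more consecutive ab.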

regex_group = /(ab)+/
string1 = "ab"
string2 = "abab"
string3 = "aabb"
puts string1.match(regex_group)
puts string2.match(regex_group)
puts string3.match(regex_group)

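Backreferences: within a regular expression, \1, \2, and so on refer back to the text captured by earlier groups. For example, /(.)\1/ matches two identical consecutive characters.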

regex_backref = /(.)\1/
string1 = "aa"
string2 = "ab"
puts string1.match(regex_backref)
puts string2.match(regex_backref)
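Backreferences are not limited to single characters. As an illustration, the sketch below uses one to find accidentally repeated words; the constant DOUBLED_WORD and the helper doubled_words are names invented for this example:

```ruby
# A word, whitespace, then the same word again as a whole word.
DOUBLED_WORD = /\b(\w+)\s+\1\b/i

# Return each word that appears twice in a row.
def doubled_words(text)
  text.scan(DOUBLED_WORD).flatten
end

p doubled_words("This is is a test of the the parser")   # ["is", "the"]
p doubled_words("the theory holds")                      # []
```

The trailing \b keeps "the theory" from counting as a repetition, because the backreference must end at a word boundary.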

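Regular Expression Matching

Ruby provides several methods for matching regular expressions.

`match` Method

The match method searches a string for a regular expression and returns a MatchData object, or nil if there is no match. The MatchData object holds the details of the match, such as the matched text and the capture groups.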

regex = /\d+/
string = "There are 123 apples"
match = string.match(regex)
if match
  puts "Matched: #{match[0]}"
else
  puts "No match"
end

Here match[0] is the entire matched text. If the regular expression contains groups, match[1], match[2], and so on hold the text of each capture group.

regex = /(\d+)\s+(\w+)/
string = "There are 123 apples"
match = string.match(regex)
if match
  puts "Matched: #{match[0]}"
  puts "Group 1: #{match[1]}"
  puts "Group 2: #{match[2]}"
else
  puts "No match"
end
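Numbered groups become hard to follow as patterns grow. Ruby also supports named groups, written (?<name>...), which MatchData exposes by name. A minimal sketch under that assumption; INVENTORY and parse_inventory are hypothetical names for this example:

```ruby
INVENTORY = /(?<count>\d+)\s+(?<item>\w+)/

# Return the parsed pieces of the first "<number> <word>" pair, or nil.
def parse_inventory(text)
  m = INVENTORY.match(text)
  return nil unless m

  {
    count:  m[:count].to_i,   # named group, converted to an integer
    item:   m[:item],
    before: m.pre_match,      # text before the match
    after:  m.post_match      # text after the match
  }
end

p parse_inventory("There are 123 apples left")
p parse_inventory("No numbers here")   # nil
```

Named groups can still be read by index (m[1], m[2]), so switching to names does not break existing index-based code.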

`=~` Operator

The =~ operator searches a string for a regular expression and returns the index at which the match begins, or nil if there is no match.

regex = /\d+/
string = "There are 123 apples"
position = string =~ regex
if position
  puts "Match found at position #{position}"
else
  puts "No match"
end
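Since a match at index 0 is a valid result, the return value of =~ should be tested for nil rather than compared with a number. The sketch below shows this, along with match?, which only answers yes or no and does not populate $~ or $1. The helper names are hypothetical:

```ruby
# Index of the first run of digits, or nil when there is none.
def digit_position(str)
  str =~ /\d+/
end

# Boolean check that does not set $~ or $1.
def has_digits?(str)
  str.match?(/\d+/)
end

p digit_position("There are 123 apples")   # 10
p digit_position("123 apples")             # 0, still a successful match
p digit_position("There are apples")       # nil
p has_digits?("There are apples")          # false

# After a successful =~, the capture groups are available as $1, $2, ...
if "There are 123 apples" =~ /(\d+)/
  puts $1   # 123
end
```

When only a true or false answer is needed, match? is a reasonable default because it skips building a MatchData object.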

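`!~` Operator

The !~ operator is the opposite of =~: it tests whether a string does not match a regular expression. It returns true if there is no match and false otherwise.

regex = /\d+/
string = "There are apples"
result = string !~ regex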
if result
  puts "No match"
else
  puts "Match found"
end

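`scan` Method

The scan method finds every substring that matches a regular expression and returns them in an array. If the pattern contains capture groups, each element of the returned array is itself an array holding the text of each group.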

regex = /\d+/
string = "There are 123 apples and 456 oranges"
matches = string.scan(regex)
puts matches

regex = /(\d+)\s+(\w+)/
string = "There are 123 apples and 456 oranges"
matches = string.scan(regex)
puts matches.inspect

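Here matches is a two-dimensional array in which each sub-array holds the text of the groups for one match: [["123", "apples"], ["456", "oranges"]].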
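The nested array returned by scan is convenient to reshape. As a sketch, the hypothetical helper count_items below turns each [number, word] pair into a hash entry:

```ruby
# Map each "<number> <word>" pair in the text to word => number.
def count_items(text)
  pairs = text.scan(/(\d+)\s+(\w+)/)            # [["123", "apples"], ...]
  pairs.map { |count, item| [item, count.to_i] }.to_h
end

p count_items("There are 123 apples and 456 oranges")
# {"apples"=>123, "oranges"=>456}
```

Destructuring the block parameters (|count, item|) works because each element yielded by map is one of the two-element sub-arrays.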

`grep` Method

The grep method finds every element of an array that matches a regular expression and returns a new array containing those elements.

array = ["apple", "123", "banana", "456"]
regex = /\d+/
result = array.grep(regex)
puts result
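grep also accepts a block that transforms each matching element, and grep_v returns the elements that do not match. A brief sketch, using hypothetical helper names and \A...\z so that only fully numeric strings qualify:

```ruby
ITEMS = ["apple", "123", "banana", "456", "7up"].freeze

# Keep the purely numeric strings and convert each one to an Integer.
def numbers_in(list)
  list.grep(/\A\d+\z/) { |s| s.to_i }
end

# Keep everything that is not purely numeric.
def non_numbers_in(list)
  list.grep_v(/\A\d+\z/)
end

p numbers_in(ITEMS)       # [123, 456]
p non_numbers_in(ITEMS)   # ["apple", "banana", "7up"]
```

With the unanchored /\d+/ from the example above, "7up" would also be selected, since it contains a digit.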

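Regular Expression Replacement

Ruby can use regular expressions to replace text in strings.

`sub` Method

The sub method replaces the first substring that matches a regular expression.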

regex = /\d+/
string = "There are 123 apples"
new_string = string.sub(regex, "many")
puts new_string

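Here the first substring matching \d+ (a run of digits) is replaced with "many". sub returns a new string and leaves the original unchanged.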

`gsub` Method

The gsub method replaces every substring that matches a regular expression.

regex = /\d+/
string = "There are 123 apples and 456 oranges"
new_string = string.gsub(regex, "many")
puts new_string

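Referencing Capture Groups in the Replacement String

The replacement string can refer to the capture groups of the regular expression. For example, to convert a date from YYYY-MM-DD to DD/MM/YYYY:

regex = /(\d{4})-(\d{2})-(\d{2})/
string = "2023-05-10"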
new_string = string.gsub(regex, '\3/\2/\1')
puts new_string

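Here \1, \2, and \3 refer to the three capture groups of the regular expression. The replacement is written in single quotes so that the backslashes reach gsub unchanged; in double quotes they would need to be doubled ("\\3/\\2/\\1").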
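Replacement strings can also refer to named groups with \k<name>, which tends to read better than numbered references once there are several groups. A sketch of the same date conversion under that approach; ISO_DATE and to_day_first are names made up for this example:

```ruby
ISO_DATE = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/

# Rewrite every YYYY-MM-DD date in the text as DD/MM/YYYY.
def to_day_first(text)
  text.gsub(ISO_DATE, '\k<day>/\k<month>/\k<year>')
end

puts to_day_first("Released 2023-05-10, patched 2023-06-01")
# Released 10/05/2023, patched 01/06/2023
```

The single quotes matter here as well: the \k sequences are interpreted by gsub, not by the string literal.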

Replacing with a Block

gsub also accepts a block, which computes the replacement from the matched text. For example, to double every number in a string:

regex = /\d+/
string = "There are 123 apples and 456 oranges"
new_string = string.gsub(regex) { |match| (match.to_i * 2).to_s }
puts new_string

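Inside the block, match is the substring currently matched; it is converted to an integer, doubled, and converted back to a string, and that string becomes the replacement.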
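Two further forms of gsub are worth knowing: passing a Hash maps each matched substring to its replacement, and inside a block the capture groups are reachable through Regexp.last_match. The following sketch illustrates both; HTML_ESCAPES, escape_html, and double_counts are hypothetical names:

```ruby
HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Replace each special character with the value stored under it in the hash.
def escape_html(text)
  text.gsub(/[&<>]/, HTML_ESCAPES)
end

# Double the number in each "<number> <word>" pair, keeping the word.
def double_counts(text)
  text.gsub(/(\d+) (\w+)/) do
    m = Regexp.last_match            # MatchData for the current match
    "#{m[1].to_i * 2} #{m[2]}"
  end
end

puts escape_html("a < b && c > d")
# a &lt; b &amp;&amp; c &gt; d
puts double_counts("123 apples and 456 oranges")
# 246 apples and 912 oranges
```

The hash form suits fixed lookup tables, while the block form is the more flexible choice when the replacement has to be computed.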

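Regular Expression Modifiers

In Ruby, modifiers placed after the closing slash change how a regular expression matches.

`i` Modifier

The i modifier makes the regular expression case-insensitive. For example, /hello/i matches Hello, HELLO, and so on.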

regex = /hello/i
string1 = "Hello"
string2 = "hello"
string3 = "HELLO"
puts string1.match(regex)
puts string2.match(regex)
puts string3.match(regex)

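`m` Modifier

In Ruby, the m modifier makes the dot (.) match newline characters as well. It does not affect ^ and $, which always match at line boundaries in Ruby, with or without m. For example, /Hello.World/m matches a string in which Hello and World are separated by a newline; without m, the same pattern would not match.

regex = /Hello.World/m
string = "Hello\nWorld"
match = string.match(regex)
puts match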

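`x` Modifier

The x modifier lets you add whitespace and comments inside a regular expression to improve readability. Whitespace in the pattern is ignored, so a literal space has to be written as \s or escaped. For example:

regex = /
  \d+ # one or more digits
  \s+ # one or more whitespace characters
  \w+ # one or more word characters
/x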
string = "123 apples"
match = string.match(regex)
puts match

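`o` Modifier

The o modifier makes Ruby perform the #{} interpolation in a regular expression literal only once, the first time the literal is evaluated; later evaluations reuse the same Regexp object even if the interpolated values have changed. This can help when a pattern is built dynamically but never changes afterwards. The modifier applies to literals only and cannot be passed to Regexp.new. For example:

n = 5
regex = /\d{#{n}}/o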
string = "12345"
match = string.match(regex)
puts match
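The once-only behavior is easiest to see when the same literal is evaluated with different values. In the sketch below, digits_once and digits_each_time are hypothetical helpers that differ only in the o modifier:

```ruby
# With /o the literal is interpolated the first time this line runs;
# every later call returns the same Regexp, whatever n is.
def digits_once(n)
  /\d{#{n}}/o
end

# Without /o the literal is rebuilt on every call.
def digits_each_time(n)
  /\d{#{n}}/
end

first  = digits_once(2)
second = digits_once(5)

puts first.source                  # \d{2}
puts second.source                 # \d{2}, the new value of n is ignored
puts digits_each_time(5).source    # \d{5}
```

This is why o is only safe when the interpolated value really is constant for the lifetime of the program.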

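`u` Modifier

The u modifier sets the encoding of the regular expression to UTF-8. In current Ruby versions, source files and string literals are UTF-8 by default, so the pattern below generally works without it as well. For example, to match Chinese characters: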

regex = /[\u4e00-\u9fff]/u
string = "你好世界"
match = string.match(regex)
puts match
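As an alternative to spelling out a code-point range, Ruby's regular expressions support Unicode script properties such as \p{Han}. A minimal sketch, assuming UTF-8 source and input; han_runs is a hypothetical helper:

```ruby
# Return each run of consecutive Han (Chinese) characters in the text.
def han_runs(text)
  text.scan(/\p{Han}+/)
end

p han_runs("Ruby 你好, regex 世界")   # ["你好", "世界"]
p han_runs("ASCII only")              # []
```

A script property covers the characters Unicode assigns to that script, including ones outside the \u4e00-\u9fff block, so the two approaches are not exactly equivalent.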

Complex Regular Expression Examples

Validating an email format:

email_regex = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
email1 = "test@example.com"
email2 = "test.example.com"
puts email1.match(email_regex)
puts email2.match(email_regex)
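A pattern like this is a plausibility check rather than a full validator. The sketch below wraps it in a hypothetical helper, plausible_email?, and shows one input it accepts even though it is not a usable address:

```ruby
EMAIL = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i

# True when the address has the rough shape local@domain.tld.
def plausible_email?(address)
  address.match?(EMAIL)
end

p plausible_email?("test@example.com")      # true
p plausible_email?("test.example.com")      # false, no @
p plausible_email?("test@example.com\nx")   # false, \z rejects the extra line
p plausible_email?("test@example..com")     # true, consecutive dots slip through
```

Ruby's standard library also ships a ready-made pattern, URI::MailTo::EMAIL_REGEXP (available after require "uri"), which may be a better starting point than a hand-written one.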

Matching an HTML tag:

html_tag_regex = /<(\w+)[^>]*>(.*?)<\/\1>/m
html = "<div class='test'>content</div>"
match = html.match(html_tag_regex)
if match
  puts "Tag: #{match[1]}"
  puts "Content: #{match[2]}"
end

Extracting a URL:

url_regex = /\b(https?:\/\/\S+)\b/
text = "Visit my website at https://example.com"
match = text.match(url_regex)
if match
  puts "URL: #{match[1]}"
end

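Performance Optimization

Reduce backtracking: backtracking occurs when a match attempt fails and the engine goes back to try other combinations. Complex quantifiers and nested structures can cause heavy backtracking and hurt performance. A greedy .* first consumes as much as it can and then gives characters back one at a time; where it fits, a non-greedy .*? or, often better, a negated character class such as [^>]* keeps the engine from overshooting. Note that greedy and non-greedy quantifiers can also match different text.

# Greedy: .* runs to the end of the string and backtracks to the last >, so the whole string matches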
regex_bad = /<.*>/
string = "<div>content</div><div>more content</div>"
match_bad = string.match(regex_bad)

# Non-greedy: stops at the first >, so only <div> matches
regex_good = /<.*?>/
match_good = string.match(regex_good)
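To make the difference concrete, the sketch below prints what each variant actually matches on the same input, and adds the negated-character-class form, which cannot run past the closing bracket in the first place. SNIPPET is a name introduced for this illustration:

```ruby
SNIPPET = "<div>content</div><div>more content</div>"

p SNIPPET[/<.*>/]       # the entire string, from the first < to the last >
p SNIPPET[/<.*?>/]      # "<div>"
p SNIPPET[/<[^>]*>/]    # "<div>"

# With scan, the negated class picks out each tag in turn.
p SNIPPET.scan(/<[^>]*>/)   # ["<div>", "</div>", "<div>", "</div>"]
```

The last two patterns give the same result here; the negated class simply states the intent (anything except >) more directly.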

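Precompile regular expressions: when the same pattern is used many times in a loop, create the Regexp object once, outside the loop, instead of rebuilding it on every iteration. A literal without interpolation is compiled only once anyway; for a literal with #{} interpolation, the o modifier avoids rebuilding it.

regex = Regexp.new('\d+')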
1000.times do |i|
  string = "Number: #{i}"
  match = string.match(regex)
  # process the match result
end

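Avoid unnecessary groups: capture groups add complexity and bookkeeping to a match. If the captured text is not needed, leave the parentheses out; if grouping is needed only for a quantifier or an alternation, use a non-capturing group (?:...).

# Unnecessary group
regex_unnecessary = /(\d+)/
# Better
regex_better = /\d+/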
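Beyond the bookkeeping cost, capture groups also change what scan returns, which is a common source of surprises. A short sketch comparing a capturing and a non-capturing group on the same text; SAMPLE is a name introduced for this example:

```ruby
SAMPLE = "abab cd ab"

# Capturing group: scan returns the groups, and a repeated group
# only keeps its last repetition.
p SAMPLE.scan(/(ab)+/)     # [["ab"], ["ab"]]

# Non-capturing group: scan returns the full matches.
p SAMPLE.scan(/(?:ab)+/)   # ["abab", "ab"]
```

When the parentheses exist only so that + applies to ab as a unit, the non-capturing form usually gives the result that was intended.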

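Common Problems and Solutions

Unexpected match results: check the syntax and logic of the regular expression carefully, and make sure metacharacters, quantifiers, and groups are used correctly. An online regular expression tester can help with debugging.
Performance problems: see the performance optimization section above, and check for heavy backtracking and for patterns that are rebuilt on every use.
Handling Unicode characters: make sure both the pattern and the strings are UTF-8 (the u modifier forces this for the pattern), and learn the relevant Unicode ranges and notations so that Unicode strings are matched and processed correctly.

With a solid understanding of Ruby's regular expression matching and replacement techniques, you can handle a wide range of text-processing tasks efficiently, including data cleaning, text analysis, and formatting. Hopefully this article helps you use regular expressions more effectively in practice.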
