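Data Cleaning and Preprocessing in Ruby

Overview of Data Cleaning and Preprocessing

Raw data often has problems such as missing values, duplicate values, and noise. Data cleaning and preprocessing address these problems and turn raw data into a cleaner form that is better suited to analysis and modeling. Ruby offers several tools and techniques for this.

Handling Missing Values

Identifying Missing Values

The first step is to identify missing values. In arrays and hashes, a missing value usually appears as nil.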

data_array = [1, nil, 3]
data_hash = { key1: 'value1', key2: nil }

The following method counts the missing values in an array:

def count_nil_in_array(arr)
  arr.count { |element| element.nil? }
end
puts count_nil_in_array(data_array)

For a hash, count the nil values like this:

def count_nil_in_hash(hash)
  hash.values.count { |value| value.nil? }
end
puts count_nil_in_hash(data_hash)
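
Real datasets are often arrays of records, not a single array or hash. As a minimal sketch, assuming each record is a hash with the same fields, the helper below counts missing values per field. The name missing_counts and the sample records are illustrative only.

```ruby
# Count nil values per field across an array of hash records.
# `missing_counts` and the sample data are hypothetical, for illustration.
def missing_counts(records)
  counts = Hash.new(0)
  records.each do |record|
    record.each { |field, value| counts[field] += 1 if value.nil? }
  end
  counts
end

records = [
  { name: 'Ann', age: 31 },
  { name: nil,   age: 27 },
  { name: 'Bob', age: nil },
  { name: nil,   age: 40 }
]
p missing_counts(records) # name is missing twice, age once
```

Fields that never contain nil do not appear in the result, which keeps the report short for wide records.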

Filling Missing Values

Constant filling

For numeric data, missing values can be replaced with a fixed value such as 0 or the mean of the values that are present.

data_array = [1, nil, 3]
present = data_array.compact
mean = present.sum / present.length.to_f
filled_array = data_array.map { |element| element.nil? ? mean : element }
puts filled_array.inspect

For non-numeric data, use a default string such as "unknown".

string_array = ['apple', nil, 'banana']
filled_string_array = string_array.map { |element| element.nil? ? 'unknown' : element }
puts filled_string_array.inspect

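Filling with a gem

A dedicated imputation library can make missing-value handling more convenient. The example below assumes a gem named imputation that exposes Imputation::MeanImputer; check the gem's documentation to confirm the exact class and method names for the version you install. First install it:

gem install imputation

Then use it to fill the missing values: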

require 'imputation'
data_array = [1, nil, 3]
imputed_array = Imputation::MeanImputer.new(data_array).impute
puts imputed_array.inspect
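
If adding a dependency is not desirable, mean imputation is short enough to write by hand. The sketch below uses only core Ruby; fill_with_mean is a hypothetical helper name, and returning the input unchanged when every value is missing is a design assumption, not a rule.

```ruby
# Replace nil entries with the mean of the non-nil entries.
# `fill_with_mean` is a hypothetical helper, not part of any gem.
def fill_with_mean(values)
  present = values.compact
  return values.dup if present.empty? # nothing to average, leave as-is

  mean = present.sum.to_f / present.length
  values.map { |value| value.nil? ? mean : value }
end

p fill_with_mean([1, nil, 3])
p fill_with_mean([nil, nil])
```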

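Handling Duplicate Values

Identifying Duplicates

Array#uniq removes duplicates but does not report them. To list the values that occur more than once, group equal elements and keep the groups with more than one member.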

data_array = [1, 2, 2, 3]
duplicates = data_array.group_by { |element| element }.select { |_, group| group.length > 1 }.keys
puts duplicates.inspect

For a hash, values that appear under more than one key can be found the same way:

data_hash = { key1: 'value1', key2: 'value1', key3: 'value2' }
duplicate_values = data_hash.group_by { |_, value| value }.select { |_, group| group.length > 1 }.keys
puts duplicate_values.inspect
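
On Ruby 2.7 or later, Enumerable#tally offers a more direct way to express the same idea, since it returns each element with its occurrence count. The helper name duplicates below is illustrative.

```ruby
# List the values that occur more than once, using Enumerable#tally
# (available since Ruby 2.7). `duplicates` is a hypothetical helper name.
def duplicates(values)
  values.tally.select { |_value, count| count > 1 }.keys
end

p duplicates([1, 2, 2, 3])

data_hash = { key1: 'value1', key2: 'value1', key3: 'value2' }
p duplicates(data_hash.values)
```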

Removing Duplicates

Array deduplication

Call uniq to remove duplicate values from an array.

data_array = [1, 2, 2, 3]
unique_array = data_array.uniq
puts unique_array.inspect

Hash deduplication

To drop key-value pairs whose value has already been seen, keeping the first key for each value:

data_hash = { key1: 'value1', key2: 'value1', key3: 'value2' }
unique_hash = data_hash.each_with_object({}) do |(key, value), result|
  result[key] = value unless result.values.include?(value)
end
puts unique_hash.inspect
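
The loop above rescans result.values for every pair, which grows quadratically with the size of the hash. As a sketch of an alternative, uniq accepts a block that defines what counts as "the same", and it keeps the first occurrence. The helper names below are hypothetical.

```ruby
# Deduplicate with a block passed to uniq; the first occurrence wins.
# Both helper names are illustrative.
def dedupe_values(hash)
  hash.uniq { |_key, value| value }.to_h
end

def dedupe_by(records, field)
  records.uniq { |record| record[field] }
end

p dedupe_values({ key1: 'value1', key2: 'value1', key3: 'value2' })
p dedupe_by([{ id: 1, email: 'a@x.io' }, { id: 2, email: 'a@x.io' }], :email)
```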

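Handling Noisy Data

Outlier Detection and Handling

Detecting outliers statistically

For numeric data, the mean and standard deviation can be used to detect outliers. Assuming the data is roughly normally distributed, a point more than 3 standard deviations from the mean can be treated as an outlier. The sample must be large enough for this to work: with the population standard deviation used here, no point in a sample of n values can lie more than sqrt(n - 1) standard deviations from the mean, so the rule never fires on 10 or fewer values.

data_array = [1, 2, 3, 2, 1, 2, 3, 2, 1, 2, 3, 2, 100]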
mean = data_array.sum / data_array.length.to_f
std_dev = Math.sqrt(data_array.map { |x| (x - mean) ** 2 }.sum / data_array.length)
lower_bound = mean - 3 * std_dev
upper_bound = mean + 3 * std_dev
outliers = data_array.select { |x| x < lower_bound || x > upper_bound }
puts outliers.inspect

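Handling outliers

A detected outlier can be replaced with the mean. Note that the mean used here still includes the outlier itself.

data_array = [1, 2, 3, 2, 1, 2, 3, 2, 1, 2, 3, 2, 100]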
mean = data_array.sum / data_array.length.to_f
std_dev = Math.sqrt(data_array.map { |x| (x - mean) ** 2 }.sum / data_array.length)
lower_bound = mean - 3 * std_dev
upper_bound = mean + 3 * std_dev
cleaned_array = data_array.map do |x|
  if x < lower_bound || x > upper_bound
    mean
  else
    x
  end
end
puts cleaned_array.inspect
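
The 3-standard-deviation cutoff is a convention, and on very small samples it cannot flag anything. The sketch below wraps the same mean and population standard deviation calculation in a helper with a configurable threshold; the names outliers and threshold are illustrative. On the four-value sample [1, 2, 3, 100], the value 100 sits only about 1.73 standard deviations from the mean, so it is missed at 3 and caught at 1.5.

```ruby
# Z-score outlier detection with a configurable threshold.
# `outliers` and `threshold:` are hypothetical names for illustration.
def outliers(values, threshold: 3.0)
  mean = values.sum.to_f / values.length
  std_dev = Math.sqrt(values.sum { |x| (x - mean)**2 } / values.length)
  return [] if std_dev.zero? # all values identical, nothing stands out

  values.select { |x| ((x - mean) / std_dev).abs > threshold }
end

small_sample = [1, 2, 3, 100]
p outliers(small_sample)                 # nothing exceeds 3 standard deviations
p outliers(small_sample, threshold: 1.5) # a looser cutoff flags 100
```

Choosing the threshold is a judgment call that depends on sample size and on how costly a false positive is.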

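Data Smoothing

Moving average

The moving average is a simple smoothing method that is often applied to time series. Given an array that represents a time series, the code below averages each run of window_size consecutive values.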

data_array = [1, 2, 3, 4, 5]
window_size = 3
smoothed_array = []
(0..data_array.length - window_size).each do |i|
  sub_array = data_array[i, window_size]
  smoothed_value = sub_array.sum / window_size.to_f
  smoothed_array << smoothed_value
end
puts smoothed_array.inspect
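
The index arithmetic in the loop above can be avoided with Enumerable#each_cons, which yields every run of consecutive elements of a given length. A minimal sketch, with moving_average as an illustrative helper name:

```ruby
# Simple moving average using each_cons to produce the sliding windows.
# `moving_average` is a hypothetical helper name.
def moving_average(values, window_size)
  values.each_cons(window_size).map { |window| window.sum.to_f / window_size }
end

p moving_average([1, 2, 3, 4, 5], 3) # => [2.0, 3.0, 4.0]
```

The result has values.length - window_size + 1 entries, the same as the explicit loop.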

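Weighted moving average

The weighted moving average assigns a different weight to each position in the window, typically giving more weight to the most recent points.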

data_array = [1, 2, 3, 4, 5]
weights = [0.2, 0.3, 0.5]
smoothed_array = []
(0..data_array.length - weights.length).each do |i|
  sub_array = data_array[i, weights.length]
  weighted_sum = sub_array.zip(weights).map { |value, weight| value * weight }.sum
  smoothed_value = weighted_sum / weights.sum
  smoothed_array << smoothed_value
end
puts smoothed_array.inspect

Data Standardization and Normalization

Standardization (Z-score)

Z-score standardization rescales the data so that it has a mean of 0 and a standard deviation of 1.

data_array = [1, 2, 3, 4, 5]
mean = data_array.sum / data_array.length.to_f
std_dev = Math.sqrt(data_array.map { |x| (x - mean) ** 2 }.sum / data_array.length)
standardized_array = data_array.map { |x| (x - mean) / std_dev }
puts standardized_array.inspect
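
When every value is identical, the standard deviation is 0 and the division above produces NaN. The sketch below adds a guard for that case; mapping constant data to 0.0 is an assumption made here, and standardize is an illustrative name.

```ruby
# Z-score standardization with a guard for zero variance.
# `standardize` is a hypothetical helper; returning 0.0 for constant
# input is a design assumption.
def standardize(values)
  mean = values.sum.to_f / values.length
  std_dev = Math.sqrt(values.sum { |x| (x - mean)**2 } / values.length)
  return values.map { 0.0 } if std_dev.zero?

  values.map { |x| (x - mean) / std_dev }
end

z_scores = standardize([1, 2, 3, 4, 5])
p z_scores
p standardize([7, 7, 7])
```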

Normalization (Min-max)

Min-max normalization rescales the data to the interval [0, 1].

data_array = [1, 2, 3, 4, 5]
min_value = data_array.min
max_value = data_array.max
normalized_array = data_array.map { |x| (x - min_value).to_f / (max_value - min_value) }
puts normalized_array.inspect
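
The same formula extends to any target interval. As a sketch, the helper below scales into [low, high] and guards against constant input, where max equals min and the division is undefined. The names min_max_scale, low, and high are illustrative, and mapping constant data to low is an assumption.

```ruby
# Min-max scaling into an arbitrary [low, high] interval.
# `min_max_scale` is a hypothetical helper name.
def min_max_scale(values, low: 0.0, high: 1.0)
  min, max = values.minmax
  span = (max - min).to_f
  return values.map { low } if span.zero? # constant data: assumed to map to low

  values.map { |x| low + (x - min) / span * (high - low) }
end

p min_max_scale([1, 2, 3, 4, 5])                        # => [0.0, 0.25, 0.5, 0.75, 1.0]
p min_max_scale([1, 2, 3, 4, 5], low: -1.0, high: 1.0)  # => [-1.0, -0.5, 0.0, 0.5, 1.0]
```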

Data Encoding

Label Encoding

For categorical data, label encoding maps each category to a unique integer.

categories = ['apple', 'banana', 'apple', 'cherry']
category_hash = categories.uniq.each_with_index.to_h
encoded_categories = categories.map { |category| category_hash[category] }
puts encoded_categories.inspect
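
In practice the mapping usually needs to be kept so that codes can be turned back into labels later. A minimal sketch with hypothetical helper names fit_labels, encode_labels, and decode_labels:

```ruby
# Label encoding with a reusable mapping and a way to decode.
# All three helper names are illustrative.
def fit_labels(categories)
  categories.uniq.each_with_index.to_h
end

def encode_labels(categories, mapping)
  categories.map { |category| mapping.fetch(category) } # raises on unseen labels
end

def decode_labels(codes, mapping)
  inverse = mapping.invert
  codes.map { |code| inverse.fetch(code) }
end

categories = %w[apple banana apple cherry]
mapping = fit_labels(categories)
codes = encode_labels(categories, mapping)
p codes
p decode_labels(codes, mapping)
```

Using fetch instead of [] makes an unseen category fail loudly instead of silently encoding as nil.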

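One-Hot Encoding

One-hot encoding converts each category into a binary vector with a single 1 in the position for that category.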

require 'matrix'
categories = ['apple', 'banana', 'apple', 'cherry']
unique_categories = categories.uniq
encoded_matrix = Matrix.build(categories.length, unique_categories.length) do |i, j|
  categories[i] == unique_categories[j] ? 1 : 0
end
puts encoded_matrix.inspect
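
When the result does not need matrix operations, plain nested arrays are enough and avoid the Matrix dependency. The sketch below also returns the column labels so that each position in a row can be interpreted; one_hot is an illustrative name.

```ruby
# One-hot encoding with plain arrays. Returns the column labels together
# with one row per input value. `one_hot` is a hypothetical helper name.
def one_hot(categories)
  labels = categories.uniq
  rows = categories.map do |category|
    labels.map { |label| label == category ? 1 : 0 }
  end
  [labels, rows]
end

labels, rows = one_hot(%w[apple banana apple cherry])
p labels
rows.each { |row| p row }
```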

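Cleaning and Preprocessing Text Data

Text Cleanup

Removing special characters

A regular expression can strip punctuation and other special characters from text.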

text = "Hello, world! (This is a test.)"
cleaned_text = text.gsub(/[[:punct:]]/, '')
puts cleaned_text

Converting to lowercase

Lowercase the text so that later steps treat words uniformly.

text = "Hello, World!"
lowercase_text = text.downcase
puts lowercase_text
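
These cleanup steps are usually chained. As a sketch, the helper below lowercases the text, removes punctuation, and collapses runs of whitespace; normalize_text is an illustrative name, and the order of the steps is one reasonable choice among several.

```ruby
# Chain the cleanup steps: lowercase, strip punctuation, collapse whitespace.
# `normalize_text` is a hypothetical helper name.
def normalize_text(text)
  text.downcase
      .gsub(/[[:punct:]]/, '')
      .gsub(/\s+/, ' ')
      .strip
end

puts normalize_text("  Hello,   World! (This is a TEST.) ")
# prints: hello world this is a test
```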

Tokenization

Tokenizing with split

Called without arguments, split breaks the text on whitespace.

text = "Hello world this is a test"
words = text.split
puts words.inspect
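
Splitting on whitespace leaves punctuation attached to words ("world!" stays one token). One alternative, sketched below, is to scan for runs of word characters instead. The pattern keeps apostrophes so that contractions survive; that choice, and the name tokenize, are assumptions for illustration.

```ruby
# Tokenize by scanning for runs of letters, digits, and apostrophes.
# `tokenize` is a hypothetical helper; the pattern is one possible choice.
def tokenize(text)
  text.downcase.scan(/[[:alnum:]']+/)
end

p tokenize("Hello, world! It's a test.")
# => ["hello", "world", "it's", "a", "test"]
```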

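Extracting text from HTML with nokogiri before tokenizing

For HTML input, nokogiri can strip the markup so that only the text content is tokenized. First install nokogiri:

gem install nokogiri

Then extract the text and split it: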

require 'nokogiri'
html_text = "<p>Hello, world! This is a <b>test</b>.</p>"
doc = Nokogiri::HTML(html_text)
text_content = doc.text
words = text_content.split
puts words.inspect

Removing Stop Words

Using a custom stop-word list

stopwords = ['a', 'an', 'the', 'is', 'are']
text = "The dog is running"
words = text.split
filtered_words = words.reject { |word| stopwords.include?(word.downcase) }
puts filtered_words.inspect
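
For longer stop-word lists, a Set gives constant-time lookups where Array#include? scans the whole list. The sketch below also compares in lowercase so that "The" matches "the". DEFAULT_STOPWORDS and remove_stopwords are hypothetical names, and the word list is only a small sample.

```ruby
require 'set'

# A tiny illustrative stop-word list; real lists are much longer.
DEFAULT_STOPWORDS = Set.new(%w[a an the is are]).freeze

# Drop stop words, comparing case-insensitively.
# `remove_stopwords` is a hypothetical helper name.
def remove_stopwords(words, stopwords = DEFAULT_STOPWORDS)
  words.reject { |word| stopwords.include?(word.downcase) }
end

p remove_stopwords('The dog is running'.split) # => ["dog", "running"]
```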

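Using the stopwords gem

Install stopwords:

gem install stopwords

The constant Stopwords::EN below is taken as given; stop-word gems differ in the API they expose, so confirm the name against the documentation of the version you install.

require 'stopwords'
text = "The dog is running"
words = text.split
filtered_words = words.reject { |word| Stopwords::EN.include?(word.downcase) }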
puts filtered_words.inspect

Handling Date and Time Data

Parsing Dates and Times

Using the Date and Time classes

require 'date'
require 'time'

date_str = '2023-10-01'
date = Date.parse(date_str)
puts date.inspect

time_str = '2023-10-01 12:30:00'
time = Time.parse(time_str)
puts time.inspect

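Handling other date formats

When the input is not in a format that Date.parse recognizes unambiguously, pass the format explicitly to Date.strptime.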

date_str = '10/01/2023'
date = Date.strptime(date_str, '%m/%d/%Y')
puts date.inspect
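
When a column mixes several formats, one approach is to try each known format in turn and return nil if none fits. This is a sketch under the assumption that the candidate formats are known in advance; DATE_FORMATS and parse_date are hypothetical names. Order matters for ambiguous inputs: a string such as 03/04/2023 is read as March 4 here because the month-first format is the only slash format listed.

```ruby
require 'date'

# Candidate formats, tried in order. The list is illustrative.
DATE_FORMATS = ['%Y-%m-%d', '%m/%d/%Y', '%d.%m.%Y'].freeze

# Return a Date for the first matching format, or nil if none matches.
# `parse_date` is a hypothetical helper name.
def parse_date(string, formats = DATE_FORMATS)
  formats.each do |format|
    begin
      return Date.strptime(string, format)
    rescue ArgumentError # Date::Error is a subclass of ArgumentError
      next
    end
  end
  nil
end

p parse_date('10/01/2023')
p parse_date('2023-10-01')
p parse_date('not a date')
```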

Standardizing Dates and Times

Unifying the date format

Convert dates from different input formats to YYYY-MM-DD.

date1 = Date.strptime('10/01/2023', '%m/%d/%Y')
date2 = Date.strptime('2023-10-01', '%Y-%m-%d')
puts date1.strftime('%Y-%m-%d')
puts date2.strftime('%Y-%m-%d')

Handling time zones

If the time data carries time-zone information, ActiveSupport::TimeZone can be used to work with named zones (this requires the activesupport gem).

gem install activesupport

require 'active_support/time'
time_without_zone = Time.parse('2023-10-01 12:30:00')
time_in_ny = ActiveSupport::TimeZone['America/New_York'].local(time_without_zone.year, time_without_zone.month, time_without_zone.day, time_without_zone.hour, time_without_zone.min, time_without_zone.sec)
puts time_in_ny.inspect
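
For simple cases where the UTC offset is already known, core Ruby can convert between offsets without any gem. The sketch below uses Time#getlocal with a fixed offset; to_offset is a hypothetical helper name. A fixed offset does not follow daylight-saving changes, so this is not a substitute for a named zone: -04:00 matches New York on this date only because daylight time was in effect.

```ruby
# Convert a Time to a fixed UTC offset using core Ruby only.
# `to_offset` is a hypothetical helper name.
def to_offset(time, offset)
  time.getlocal(offset)
end

utc_time = Time.utc(2023, 10, 1, 16, 30, 0)
eastern = to_offset(utc_time, '-04:00')
puts eastern.strftime('%Y-%m-%d %H:%M:%S %z')
# prints: 2023-10-01 12:30:00 -0400
```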

Data Integration and Merging

Merging Arrays

Simple merge

Concatenate two arrays into one.

array1 = [1, 2]
array2 = [3, 4]
merged_array = array1 + array2
puts merged_array.inspect

Merging by a rule

For example, add the elements of two arrays index by index.

array1 = [1, 2]
array2 = [3, 4]
merged_array = array1.zip(array2).map { |pair| pair.sum }
puts merged_array.inspect

Merging Hashes

Simple merge

Merge two hashes; when both contain the same key, the value from the second hash wins.

hash1 = { key1: 'value1', key2: 'value2' }
hash2 = { key2: 'new_value2', key3: 'value3' }
merged_hash = hash1.merge(hash2)
puts merged_hash.inspect

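Merging by a rule

For example, sum the values that share a key (assuming numeric values). The block passed to merge is called only for keys present in both hashes.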

hash1 = { key1: 1, key2: 2 }
hash2 = { key1: 3, key3: 4 }
merged_hash = hash1.merge(hash2) do |key, old_value, new_value|
  if key == :key1
    old_value + new_value
  else
    new_value
  end
end
puts merged_hash.inspect
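
When every shared key should be summed, not just one, the block can ignore the key entirely. As a sketch, the helper below folds any number of hashes together this way; sum_merge is an illustrative name.

```ruby
# Merge any number of hashes, summing the values of keys that collide.
# `sum_merge` is a hypothetical helper name.
def sum_merge(*hashes)
  hashes.reduce({}) do |merged, hash|
    merged.merge(hash) { |_key, old_value, new_value| old_value + new_value }
  end
end

p sum_merge({ key1: 1, key2: 2 }, { key1: 3, key3: 4 })
```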

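Processing Large Datasets

Chunked Processing

With a large dataset, processing the data in chunks avoids loading everything into memory at once. For example, read a large file line by line and handle the lines in batches.

file_path = 'large_file.txt'
chunk_size = 1000
chunk = []
File.foreach(file_path) do |line|
  # Per-line processing can go here
  chunk << line.chomp
  # Once chunk_size lines have accumulated, process them as a batch
  if chunk.length == chunk_size
    # Batch-processing logic
    chunk.clear
  end
end
# Process the remaining lines (fewer than chunk_size)
unless chunk.empty?
  # Logic for the remaining data
end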
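
The buffering can also be delegated to Enumerable#each_slice. Called without a block, File.foreach returns an enumerator that reads lazily, so each_slice yields one batch at a time and hands over the final short batch automatically. The sketch below uses a temporary file as sample input; process_in_chunks is a hypothetical helper name.

```ruby
require 'tempfile'

# Yield the lines of a file in batches of at most chunk_size.
# `process_in_chunks` is a hypothetical helper name.
def process_in_chunks(path, chunk_size)
  File.foreach(path, chomp: true).each_slice(chunk_size) do |lines|
    yield lines
  end
end

chunk_sizes = []
Tempfile.create('sample') do |file|
  2500.times { |i| file.puts("row #{i}") }
  file.flush
  process_in_chunks(file.path, 1000) { |lines| chunk_sizes << lines.length }
end
p chunk_sizes # => [1000, 1000, 500]
```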

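Using a Database

A large dataset can also be stored in a database so that the database's query and processing capabilities do the work. Using SQLite as an example:

Install the sqlite3 gem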

gem install sqlite3

Create a table and insert data

require 'sqlite3'
db = SQLite3::Database.new('test.db')
db.execute('CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY, value TEXT)')
data = [['value1'], ['value2']]
data.each do |row|
  db.execute('INSERT INTO data (value) VALUES (?)', row)
end

Query and process the data

results = db.execute('SELECT * FROM data')
results.each do |row|
  puts row.inspect 
end

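With the methods above, data cleaning and preprocessing can be done effectively in Ruby, laying a solid foundation for later analysis and modeling. Whether the input is simple arrays and hashes, text, dates and times, or a large dataset, a suitable approach is available.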