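Batch File Processing Techniques in Python

Fundamentals of Batch File Processing in Python

Basic File Operation Functions

In Python, basic file operations go through built-in functions. The open() function opens a file; it takes a file name and an open mode as arguments. For example, to open a file in read-only mode: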

file = open('example.txt', 'r')

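Here, 'r' means read-only mode. Other common modes include 'w' (write; truncates any existing content), 'a' (append; writes are added at the end of the file), and 'x' (exclusive creation; raises FileExistsError if the file already exists). When you are done, call file.close() to release the underlying resources.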
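To make the difference between the modes concrete, here is a minimal sketch that exercises 'w', 'a', 'x', and 'r' on a single path. The function name demo_open_modes and the explicit encoding='utf-8' argument are illustrative choices, not part of the original example.

```python
import os
import tempfile


def demo_open_modes(path):
    """Exercise 'w', 'a', 'x' and 'r' on one path and report what happened."""
    with open(path, 'w', encoding='utf-8') as f:   # 'w' creates or truncates
        f.write('first\n')
    with open(path, 'a', encoding='utf-8') as f:   # 'a' appends at the end
        f.write('second\n')
    try:
        with open(path, 'x', encoding='utf-8'):    # 'x' refuses an existing file
            pass
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True
    with open(path, 'r', encoding='utf-8') as f:   # 'r' only reads
        content = f.read()
    return content, exclusive_failed


if __name__ == '__main__':
    # Work in a throwaway directory so no real file is touched.
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_open_modes(os.path.join(tmp, 'example.txt')))
```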

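These basic operations are the building blocks for more complex batch logic. For example, to read the contents of a set of text files, first obtain the list of files, then open and read each one in read-only mode.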

import os

file_dir = 'your_file_directory'
for filename in os.listdir(file_dir):
    if filename.endswith('.txt'):
        file_path = os.path.join(file_dir, filename)
        with open(file_path, 'r') as file:
            content = file.read()
            print(f"Content of {filename}: {content}")

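The code above uses os.listdir() to get the names of all files and folders in the given directory, builds each full path with os.path.join(), and opens the file with a with open() statement. The with statement closes the file automatically when the block exits, even if an exception is raised, so there is no need to call close() manually. Because no encoding is passed, open() uses the platform's default text encoding; passing encoding='utf-8' explicitly makes the behavior consistent across systems.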
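The same loop can be wrapped in a reusable function. The sketch below is one possible shape: the name read_text_files, its parameters, and the choice to return a dictionary are assumptions made for illustration.

```python
import os


def read_text_files(directory, suffix='.txt', encoding='utf-8'):
    """Return {filename: content} for every regular file ending in `suffix`."""
    contents = {}
    # sorted() gives a predictable order; os.listdir() itself guarantees none.
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        # Skip other extensions and directories that happen to match the suffix.
        if filename.endswith(suffix) and os.path.isfile(path):
            with open(path, 'r', encoding=encoding) as f:
                contents[filename] = f.read()
    return contents
```

Returning the contents instead of printing them lets later steps (searching, replacing, counting) reuse the result.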

File Reading Methods

1. read() method: reads the entire contents of the file at once and returns it as a single string. For example:

with open('example.txt', 'r') as file:
    all_content = file.read()
    print(all_content)

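This approach suits small files; for a very large file it can consume a lot of memory.

2. readline() method: reads the file one line at a time. Each call to readline() reads one line, and the returned string includes the trailing newline '\n'. For example: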

with open('example.txt', 'r') as file:
    line = file.readline()
    while line:
        print(line.strip())  # strip() removes surrounding whitespace, including the trailing newline
        line = file.readline()

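3. readlines() method: reads every line of the file and returns them as a list of strings, one element per line. Like read(), it loads the whole file into memory.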

with open('example.txt', 'r') as file:
    lines = file.readlines()
    for line in lines:
        print(line.strip())

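When batch processing text files, choose the reading method according to file size and processing needs. Log files, for example, tend to be large and are analyzed line by line, so readline() or iterating over the file object directly (for line in file) is the better fit.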
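Iterating over the file object is mentioned above but not shown. A minimal sketch follows, assuming the task is to count lines that contain a keyword; count_matching_lines and its parameters are hypothetical names.

```python
def count_matching_lines(path, keyword, encoding='utf-8'):
    """Count the lines containing `keyword` without loading the whole file."""
    matches = 0
    # errors='replace' keeps the scan going if a log contains undecodable bytes.
    with open(path, 'r', encoding=encoding, errors='replace') as f:
        for line in f:  # the file object yields one line at a time
            if keyword in line:
                matches += 1
    return matches
```

Only one line is held in memory at a time, so the function behaves the same on a small file and on a multi-gigabyte log.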

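Batch File Processing Scenarios and Implementations

Batch Renaming Files

Day-to-day work sometimes calls for renaming a large number of files, for example renaming a set of image files according to a fixed rule. Python's os module provides the rename() function for this.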

import os

file_dir = 'image_directory'
prefix = 'new_name_'
count = 1
for filename in os.listdir(file_dir):
    if filename.endswith(('.jpg', '.png')):
        new_filename = f"{prefix}{count}{os.path.splitext(filename)[1]}"
        old_file_path = os.path.join(file_dir, filename)
        new_file_path = os.path.join(file_dir, new_filename)
        os.rename(old_file_path, new_file_path)
        count += 1

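The code above walks through the image files in the given directory, generates a new name for each in the form new_name_<number><extension>, and renames the file with os.rename(). Two caveats apply. First, os.listdir() returns entries in arbitrary order, so sort the names if the numbering must follow a particular sequence. Second, on POSIX systems os.rename() silently replaces an existing file that has the target name, so rerunning the script on a directory that already contains new_name_* files can overwrite data.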
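One way to guard against accidental overwrites is to compute the full rename plan first and refuse to run if any target name is already taken. The following is a sketch under that assumption; plan_renames and apply_renames are hypothetical helpers, and the up-front check is deliberately conservative.

```python
import os


def plan_renames(directory, prefix='new_name_', extensions=('.jpg', '.png')):
    """Return a list of (old_name, new_name) pairs in sorted order."""
    plan = []
    count = 1
    for filename in sorted(os.listdir(directory)):
        ext = os.path.splitext(filename)[1]
        if ext.lower() in extensions:  # also accepts .JPG, .PNG
            plan.append((filename, f"{prefix}{count}{ext}"))
            count += 1
    return plan


def apply_renames(directory, plan):
    """Apply a plan, refusing to start if any target name already exists."""
    for old, new in plan:
        if old != new and os.path.exists(os.path.join(directory, new)):
            raise FileExistsError(f"refusing to overwrite {new!r}")
    for old, new in plan:
        if old != new:
            os.rename(os.path.join(directory, old),
                      os.path.join(directory, new))
```

Printing the plan before calling apply_renames() doubles as a dry run.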

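Batch Copying Files

Sometimes a batch of files needs to be copied from one directory to another. Python's shutil module offers high-level file and directory operations, including copy(), which copies a file.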

import shutil
import os

source_dir = 'source_directory'
destination_dir = 'destination_directory'
for filename in os.listdir(source_dir):
    if os.path.isfile(os.path.join(source_dir, filename)):
        shutil.copy(os.path.join(source_dir, filename), destination_dir)

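The code above iterates over the regular files in the source directory and copies each one into the destination directory with shutil.copy(). The destination directory must already exist: if it does not, shutil.copy() treats the destination path as a file name, and each copy overwrites the previous one. Create the directory first with os.makedirs():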

import shutil
import os

source_dir = 'source_directory'
destination_dir = 'destination_directory'
os.makedirs(destination_dir, exist_ok=True)  # create the destination if it is missing
for filename in os.listdir(source_dir):
    if os.path.isfile(os.path.join(source_dir, filename)):
        shutil.copy(os.path.join(source_dir, filename), destination_dir)

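Batch Deleting Files

Batch deletion is another common need, for example when cleaning up temporary files. In the os module, remove() deletes a file and rmdir() deletes an empty directory. To delete a non-empty directory together with everything in it, use rmtree() from the shutil module.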

import os
import shutil

temp_dir = 'temp_directory'
if os.path.isfile(temp_dir):
    os.remove(temp_dir)
elif os.path.isdir(temp_dir):
    if os.listdir(temp_dir):
        shutil.rmtree(temp_dir)
    else:
        os.rmdir(temp_dir)

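The code above first checks whether the target path is a file or a directory. A file is deleted directly with os.remove(). For a directory, the code checks whether it is empty: an empty directory is removed with os.rmdir() and a non-empty one with shutil.rmtree(). Note that shutil.rmtree() also removes empty directories, so the emptiness check is optional, and that all of these deletions are permanent and bypass any recycle bin.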
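For the temporary-file cleanup scenario mentioned earlier, a per-file variant may be more useful than removing a whole path. The sketch below assumes temporary files are identified by a '.tmp' suffix; the function name and the dry_run flag are illustrative.

```python
import os


def delete_files_with_suffix(directory, suffix='.tmp', dry_run=True):
    """Delete regular files ending in `suffix`; return the affected names.

    With dry_run=True (the default) nothing is deleted, so the returned
    list can be reviewed before running the real cleanup.
    """
    affected = []
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if filename.endswith(suffix) and os.path.isfile(path):
            if not dry_run:
                os.remove(path)
            affected.append(filename)
    return affected
```

Defaulting to a dry run is a design choice: because deletion cannot be undone, the destructive behavior has to be requested explicitly.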

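Batch Content Processing for Text Files

Batch Replacing Text Content

Replacing a particular string across many text files is a common text-processing task. It can be done by combining file reads and writes.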

import os

file_dir = 'text_files_directory'
old_text = 'old_string'
new_text = 'new_string'
for filename in os.listdir(file_dir):
    if filename.endswith('.txt'):
        file_path = os.path.join(file_dir, filename)
        with open(file_path, 'r') as file:
            content = file.read()
        new_content = content.replace(old_text, new_text)
        with open(file_path, 'w') as file:
            file.write(new_content)

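The code above iterates over all text files in the given directory, reads each file's contents, replaces the target string with replace(), and writes the new contents back. Note that opening a file in write mode truncates it immediately, so if the write fails partway the original contents are lost; keep a backup when the data matters.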
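A common way to reduce the risk of losing data is to write the new contents to a temporary file in the same directory and then swap it into place with os.replace(). The sketch below illustrates that idea; replace_in_file is a hypothetical helper, and UTF-8 is assumed as the file encoding.

```python
import os
import tempfile


def replace_in_file(path, old_text, new_text, encoding='utf-8'):
    """Replace old_text with new_text in one file; return True if it changed."""
    with open(path, 'r', encoding=encoding) as f:
        content = f.read()
    if old_text not in content:
        return False  # nothing to do, leave the file untouched
    new_content = content.replace(old_text, new_text)

    # Write next to the original so os.replace() stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding=encoding) as tmp:
            tmp.write(new_content)
        os.replace(tmp_path, path)  # swap the finished file into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

One trade-off of this sketch: the replacement file gets the permissions of the temporary file, not those of the original, so restore them separately if they matter.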

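Batch Extracting Text Information

Suppose we have a batch of log files and need to extract information in a particular format, such as IP addresses. Regular expressions can do this.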

import os
import re

log_dir = 'log_files_directory'
ip_pattern = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
for filename in os.listdir(log_dir):
    if filename.endswith('.log'):
        file_path = os.path.join(log_dir, filename)
        with open(file_path, 'r') as file:
            content = file.read()
            ips = ip_pattern.findall(content)
            for ip in ips:
                print(f"IP found in {filename}: {ip}")

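The code above compiles the regular expression with re.compile() and then searches the contents of each log file for matching IP addresses. findall() returns a list of all matching strings. The pattern only checks the dotted four-number shape, so it also matches out-of-range strings such as 999.999.999.999; validate the matches if that matters.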
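One way to filter out such false positives is to pass each candidate through the standard-library ipaddress module. The sketch below reuses the same pattern and works on any iterable of lines, so it can be fed an open file object; extract_valid_ips is a hypothetical name.

```python
import ipaddress
import re

IP_PATTERN = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def extract_valid_ips(lines):
    """Return the syntactically valid IPv4 addresses found in `lines`."""
    found = []
    for line in lines:
        for candidate in IP_PATTERN.findall(line):
            try:
                # Raises ValueError for out-of-range octets such as 999.
                ipaddress.IPv4Address(candidate)
            except ValueError:
                continue
            found.append(candidate)
    return found
```

Because the function consumes lines one at a time, calling it as extract_valid_ips(file) on an open log file avoids reading the whole file into memory.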

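Batch Processing Binary Files

Batch Processing Image Files

Python's Pillow library (a fork of PIL) provides powerful image-processing features and can be used for batch resizing, format conversion, and similar operations.

1. Batch-resize images: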

from PIL import Image
import os

image_dir = 'image_directory'
new_width = 800
new_height = 600
for filename in os.listdir(image_dir):
    if filename.endswith(('.jpg', '.png')):
        image_path = os.path.join(image_dir, filename)
        with Image.open(image_path) as img:
            resized_img = img.resize((new_width, new_height))
            resized_img.save(os.path.join(image_dir, f'resized_{filename}'))

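The code above opens each image with Image.open(), resizes it with resize(), and saves the result with save(). Because resize() scales to exactly the given width and height, images whose aspect ratio differs from 800×600 will be distorted.

2. Batch-convert image formats: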

from PIL import Image
import os

image_dir = 'image_directory'
for filename in os.listdir(image_dir):
    if filename.endswith('.jpg'):
        image_path = os.path.join(image_dir, filename)
        with Image.open(image_path) as img:
            new_filename = os.path.splitext(filename)[0] + '.png'
            img.save(os.path.join(image_dir, new_filename), 'PNG')

The code above converts every .jpg image in the given directory to PNG format.
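The resize example earlier in this section scales every image to a fixed 800×600. If the aspect ratio should be preserved instead, the target size can be computed per image and then passed to resize(). The helper below is a pure-Python sketch of that calculation; fit_within is a hypothetical name, and it does not depend on Pillow.

```python
def fit_within(size, max_size):
    """Scale (width, height) to fit inside max_size, keeping the aspect ratio."""
    width, height = size
    max_width, max_height = max_size
    # Use the tighter of the two constraints so both dimensions fit.
    scale = min(max_width / width, max_height / height)
    # Never return a zero dimension for extremely thin images.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With Pillow, the result could be used as img.resize(fit_within(img.size, (800, 600))). Note that this sketch also enlarges images smaller than the bounding box; Pillow's Image.thumbnail() is an alternative that preserves the aspect ratio and only shrinks.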

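Batch Processing Audio Files

The pydub library can process audio files, for example batch trimming or merging. Install it first with pip install pydub. To read and write MP3 files, pydub also needs ffmpeg (or libav) installed on the system.

1. Batch-trim audio: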

from pydub import AudioSegment
import os

audio_dir = 'audio_directory'
start_time = 10000  # 10 seconds in milliseconds
end_time = 20000  # 20 seconds in milliseconds
for filename in os.listdir(audio_dir):
    if filename.endswith('.mp3'):
        audio_path = os.path.join(audio_dir, filename)
        audio = AudioSegment.from_mp3(audio_path)
        cropped_audio = audio[start_time:end_time]
        cropped_audio.export(os.path.join(audio_dir, f'cropped_{filename}'), format='mp3')

The code above cuts the segment from 10 to 20 seconds out of each MP3 file in the given directory and saves it as a new file.

2. Batch-merge audio:

from pydub import AudioSegment
import os

audio_dir = 'audio_directory'
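# Sort for a predictable merge order and skip the output of a previous run.
audio_files = sorted(
    os.path.join(audio_dir, filename)
    for filename in os.listdir(audio_dir)
    if filename.endswith('.mp3') and filename != 'combined.mp3'
)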
combined_audio = AudioSegment.empty()
for file in audio_files:
    audio = AudioSegment.from_mp3(file)
    combined_audio += audio
combined_audio.export(os.path.join(audio_dir, 'combined.mp3'), format='mp3')

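The code above merges all MP3 files in the given directory into a single file.

Optimizing Batch File Processing with Multithreading and Multiprocessing

Processing Files with Multiple Threads

For I/O-bound batch file operations, multithreading can improve throughput. Python's threading module creates and manages threads.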

import threading
import os
import shutil

source_dir = 'source_directory'
destination_dir = 'destination_directory'


def copy_file(filename):
    if os.path.isfile(os.path.join(source_dir, filename)):
        shutil.copy(os.path.join(source_dir, filename), destination_dir)


threads = []
for filename in os.listdir(source_dir):
    thread = threading.Thread(target=copy_file, args=(filename,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

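The code above creates one thread per copy operation so that the copies run concurrently. With a very large number of files this starts an equally large number of threads, so a bounded thread pool is usually preferable. Also note that Python's global interpreter lock (GIL) limits how much multithreading can speed up CPU-bound tasks.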
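A bounded pool is available in the standard library as concurrent.futures.ThreadPoolExecutor. The sketch below applies it to the same copy task; copy_files_threaded and the default of 8 workers are illustrative assumptions, not tuned values.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def copy_files_threaded(source_dir, destination_dir, max_workers=8):
    """Copy the regular files in source_dir using a bounded thread pool."""
    os.makedirs(destination_dir, exist_ok=True)
    filenames = [
        name for name in sorted(os.listdir(source_dir))
        if os.path.isfile(os.path.join(source_dir, name))
    ]

    def copy_one(name):
        shutil.copy(os.path.join(source_dir, name), destination_dir)
        return name

    # At most max_workers copies run at once; the with block waits for all.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() consumes the results and re-raises any worker exception.
        return list(executor.map(copy_one, filenames))
```

Unlike the one-thread-per-file version, the number of threads stays fixed no matter how many files the directory contains.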

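Processing Files with Multiple Processes

For CPU-bound batch file processing, multiprocessing is the better choice, and Python's multiprocessing module provides the support. The example below reuses the file-copy task for continuity, although copying is I/O-bound; the same pattern applies to CPU-heavy work such as compression or image transformation.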

import multiprocessing
import os
import shutil

source_dir = 'source_directory'
destination_dir = 'destination_directory'


def copy_file(filename):
    if os.path.isfile(os.path.join(source_dir, filename)):
        shutil.copy(os.path.join(source_dir, filename), destination_dir)


if __name__ == '__main__':
    pool = multiprocessing.Pool()
    pool.map(copy_file, os.listdir(source_dir))
    pool.close()
    pool.join()

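The code above creates a process pool with multiprocessing.Pool and uses map() to distribute the copy tasks across the worker processes. The if __name__ == '__main__': guard is required on platforms that start workers with the spawn method (Windows, and macOS by default since Python 3.8), because each worker re-imports the main module; without the guard, every worker would try to create its own pool on import.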

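Improving Batch File Processing with Helper Libraries

Progress Bars with `tqdm`

When processing a large number of files, it is useful to see how far along the job is. The tqdm library adds a progress bar to the command line with little effort. Install it first with pip install tqdm.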

import os
import shutil
from tqdm import tqdm

source_dir = 'source_directory'
destination_dir = 'destination_directory'
file_list = os.listdir(source_dir)
for filename in tqdm(file_list, desc='Copying files'):
    if os.path.isfile(os.path.join(source_dir, filename)):
        shutil.copy(os.path.join(source_dir, filename), destination_dir)

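The code above displays a progress bar while the files are copied; the desc parameter sets the bar's description text.

Simplifying Path Handling with `pathlib`

pathlib, part of the standard library since Python 3.4, offers an object-oriented way to work with file paths that is often more intuitive than the os.path module.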

import shutil  # needed for shutil.copy() below
from pathlib import Path

source_dir = Path('source_directory')
destination_dir = Path('destination_directory')
destination_dir.mkdir(exist_ok=True)
for file in source_dir.iterdir():
    if file.is_file():
        shutil.copy(file, destination_dir)

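The code above represents paths with the Path class. mkdir() creates the destination directory, and exist_ok=True suppresses the error if it already exists. iterdir() iterates over every file and folder in the source directory, and is_file() checks whether an entry is a regular file. The copy itself is still done by shutil.copy(), which accepts Path objects.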
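Beyond iterdir(), Path objects cover several tasks that earlier sections handled with os.path. The sketch below shows two of them: a recursive search with rglob() and building a derived file name from stem and suffix. Both helper names are hypothetical.

```python
from pathlib import Path


def list_files_by_suffix(root, suffix='.txt'):
    """Recursively list regular files under `root` that end in `suffix`."""
    return sorted(p for p in Path(root).rglob(f'*{suffix}') if p.is_file())


def backup_name(path):
    """Return a sibling path such as report_backup.txt for report.txt."""
    path = Path(path)
    # stem is the name without its extension; suffix is the extension itself.
    return path.with_name(path.stem + '_backup' + path.suffix)
```

For example, backup_name('data/report.txt') gives Path('data/report_backup.txt'), which replaces the os.path.splitext() plus os.path.join() combination used in the renaming example.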

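With the techniques above, Python can handle a wide range of batch file processing tasks, from simple text-file operations to more involved binary-file processing. In practice, choosing the method and library that fit the task at hand can noticeably improve productivity.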