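A Deep Dive into Elasticsearch Document Operations

Basic Concepts of Elasticsearch Document Operations

In Elasticsearch, a document is the basic unit of data: a self-contained JSON object made up of one or more fields. Documents are stored in an index, which is a collection of related documents. An index is often compared to a database in a relational system, although now that mapping types have been removed the closer analogy is a table.

Every document has a unique identifier (ID) within its index, which Elasticsearch can generate automatically or the user can supply. Document structure is flexible: unlike a relational database, no table schema has to be defined up front, because Elasticsearch can infer field mappings dynamically. This makes Elasticsearch well suited to semi-structured or unstructured data.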

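Creating Documents

Creating a document in Elasticsearch is simple: send the document as JSON in an HTTP POST or PUT request to the target index. Older versions also required a mapping type in the request. Types were deprecated in Elasticsearch 7.0 and removed in 8.0, so requests on 7.x and later go through the _doc endpoint instead.

The following example creates a document with the Java client (the 7.x High Level REST Client):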

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;

public class DocumentCreationExample {
    public static void main(String[] args) throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")));

        IndexRequest request = new IndexRequest("my_index");
        request.id("1");
        request.source("{\"title\":\"ElasticSearch Tutorial\",\"content\":\"Learn ElasticSearch basics\"}", XContentType.JSON);

        IndexResponse response = client.index(request, RequestOptions.DEFAULT);

        System.out.println(response.getResult());

        client.close();
    }
}

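The code above creates an IndexRequest for the index my_index, sets the document ID to 1, and supplies the document body. It then executes the request through RestHighLevelClient and prints the result from the returned IndexResponse (CREATED for a new document, UPDATED if ID 1 already existed).

With the REST API, the same document can be created with cURL: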

curl -X POST "localhost:9200/my_index/_doc/1" -H 'Content-Type: application/json' -d'
{
    "title": "ElasticSearch Tutorial",
    "content": "Learn ElasticSearch basics"
}'

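This cURL command indexes a document with ID 1 into my_index through the _doc endpoint. Here _doc is a fixed part of the URL rather than a user-defined type. If a document with that ID already exists, it is overwritten.

Automatically Generated IDs

When no document ID is specified, Elasticsearch generates a unique one. With the Java client: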

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;

public class AutoGenerateIdDocumentCreationExample {
    public static void main(String[] args) throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")));

        IndexRequest request = new IndexRequest("my_index");
        request.source("{\"title\":\"Another ElasticSearch Article\",\"content\":\"Explore more features\"}", XContentType.JSON);

        IndexResponse response = client.index(request, RequestOptions.DEFAULT);

        System.out.println("Generated ID: " + response.getId());

        client.close();
    }
}

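The code above never calls request.id(), so Elasticsearch generates the ID. IndexResponse.getId() returns the generated value.

With the REST API, a document with an auto-generated ID is created as follows (this form requires POST, not PUT):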

curl -X POST "localhost:9200/my_index/_doc" -H 'Content-Type: application/json' -d'
{
    "title": "Another ElasticSearch Article",
    "content": "Explore more features"
}'

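Elasticsearch responds with the details of the new document, including the generated ID in the _id field.

Reading Documents

Reading documents is a common operation. A specific document can be fetched by its ID.

Reading a Document with the Java Client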

import org.apache.http.HttpHost;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

import java.io.IOException;

public class DocumentReadingExample {
    public static void main(String[] args) throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")));

        GetRequest request = new GetRequest("my_index", "1");

        GetResponse response = client.get(request, RequestOptions.DEFAULT);

        if (response.isExists()) {
            System.out.println(response.getSourceAsString());
        } else {
            System.out.println("Document not found");
        }

        client.close();
    }
}

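The code above creates a GetRequest for the index my_index and the document ID 1, then executes it through RestHighLevelClient. GetResponse.isExists() reports whether the document was found; if it was, getSourceAsString() returns its content as a JSON string.

Reading a Document with the REST API

The equivalent cURL command is: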

curl -X GET "localhost:9200/my_index/_doc/1"

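If the document exists, the response contains its metadata and its content (under _source). If it does not exist, Elasticsearch returns HTTP 404 with "found": false.

Updating Documents

Elasticsearch does not modify a document in place. An update marks the old document as deleted and indexes a new one, because Elasticsearch is built on Lucene, and a document written to a Lucene segment is immutable.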
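
The mark-deleted-then-append behavior described above can be modeled in a few lines. This is only an in-memory sketch, not how Lucene is implemented: the class AppendOnlyUpdateSketch and its methods are hypothetical names, a list stands in for a segment, and a BitSet stands in for the record of deleted entries.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class AppendOnlyUpdateSketch {
    // Entries are never modified once appended, like documents in a segment.
    private final List<String[]> entries = new ArrayList<>(); // {id, source}
    // Deletions are tracked separately as a set of entry positions.
    private final BitSet deleted = new BitSet();

    // Index or update: mark any live entry with this ID as deleted, then append.
    void index(String id, String source) {
        for (int i = 0; i < entries.size(); i++) {
            if (!deleted.get(i) && entries.get(i)[0].equals(id)) {
                deleted.set(i);
            }
        }
        entries.add(new String[] {id, source});
    }

    // Return the source of the live entry for this ID, or null if there is none.
    String get(String id) {
        for (int i = 0; i < entries.size(); i++) {
            if (!deleted.get(i) && entries.get(i)[0].equals(id)) {
                return entries.get(i)[1];
            }
        }
        return null;
    }

    // Number of entries physically stored, including deleted ones.
    int physicalEntries() {
        return entries.size();
    }

    public static void main(String[] args) {
        AppendOnlyUpdateSketch store = new AppendOnlyUpdateSketch();
        store.index("1", "{\"title\":\"old\"}");
        store.index("1", "{\"title\":\"new\"}");
        System.out.println(store.get("1"));
        System.out.println(store.physicalEntries());
    }
}
```

After the second index call, a read sees only the new source, yet both entries are still stored. In Elasticsearch the space held by deleted documents is reclaimed later, when segments are merged.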

Full Update

A full update replaces the old document entirely with new content. This is done by indexing a new document under the same ID with the index API. The update API with a doc body is not a full update: it merges the supplied fields into the existing document and keeps every field that is not mentioned. With the Java client:

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;

public class DocumentFullUpdateExample {
    public static void main(String[] args) throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")));

        // Indexing under an existing ID replaces the whole document
        IndexRequest request = new IndexRequest("my_index");
        request.id("1");
        request.source("{\"title\":\"Updated ElasticSearch Tutorial\",\"content\":\"New content for the tutorial\"}", XContentType.JSON);

        IndexResponse response = client.index(request, RequestOptions.DEFAULT);

        // Prints UPDATED when document 1 already existed
        System.out.println(response.getResult());

        client.close();
    }
}

In the code above, the IndexRequest names the index my_index and the document ID 1, and source supplies the complete new document. Because a document with that ID already exists, the index operation replaces it, and IndexResponse.getResult() reports UPDATED.

With the REST API, the full update looks like this in cURL:

curl -X PUT "localhost:9200/my_index/_doc/1" -H 'Content-Type: application/json' -d'
{
    "title": "Updated ElasticSearch Tutorial",
    "content": "New content for the tutorial"
}'

This command replaces the document with ID 1 in my_index. The request body is the complete new document; any field of the old document that is absent from the body is gone afterwards.

Partial Update

A partial update changes only some fields of a document rather than the whole document. With the Java client:

import org.apache.http.HttpHost;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.action.update.UpdateResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;

public class DocumentPartialUpdateExample {
    public static void main(String[] args) throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")));

        UpdateRequest request = new UpdateRequest("my_index", "1");
        request.doc("{\"content\":\"Add some more details to the content\"}", XContentType.JSON);

        UpdateResponse response = client.update(request, RequestOptions.DEFAULT);

        System.out.println(response.getResult());

        client.close();
    }
}

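The code above updates only the content field: the doc method receives just the fields to change, and Elasticsearch merges them into the existing document.

With the REST API, the partial update looks like this in cURL: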

curl -X POST "localhost:9200/my_index/_update/1" -H 'Content-Type: application/json' -d'
{
    "doc": {
        "content": "Add some more details to the content"
    }
}'

This command updates only the content field; the rest of the document is unchanged.
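
To make the merge behavior of a partial update concrete, here is a minimal sketch that treats documents as nested maps. It assumes that nested objects are merged recursively while any other value is simply replaced; PartialMergeSketch and merge are hypothetical names used for illustration, not Elasticsearch classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartialMergeSketch {
    // Merge the fields of 'partial' into a copy of 'existing'.
    // Nested objects are merged recursively; other values are replaced.
    @SuppressWarnings("unchecked")
    static Map<String, Object> merge(Map<String, Object> existing, Map<String, Object> partial) {
        Map<String, Object> result = new LinkedHashMap<>(existing);
        for (Map.Entry<String, Object> e : partial.entrySet()) {
            Object oldValue = result.get(e.getKey());
            Object newValue = e.getValue();
            if (oldValue instanceof Map && newValue instanceof Map) {
                result.put(e.getKey(),
                        merge((Map<String, Object>) oldValue, (Map<String, Object>) newValue));
            } else {
                result.put(e.getKey(), newValue);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> existing = new LinkedHashMap<>();
        existing.put("title", "ElasticSearch Tutorial");
        existing.put("content", "Learn ElasticSearch basics");

        Map<String, Object> partial = new LinkedHashMap<>();
        partial.put("content", "Add some more details to the content");

        // title is kept, content is replaced
        System.out.println(merge(existing, partial));
    }
}
```

The merged result then goes through the same path as any other write: the old document is marked as deleted and the merged one is indexed in its place.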

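Deleting Documents

Deleting a document is straightforward: the document ID is enough to identify what to remove.

Deleting a Document with the Java Client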

import org.apache.http.HttpHost;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.delete.DeleteResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

import java.io.IOException;

public class DocumentDeletionExample {
    public static void main(String[] args) throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")));

        DeleteRequest request = new DeleteRequest("my_index", "1");

        DeleteResponse response = client.delete(request, RequestOptions.DEFAULT);

        System.out.println(response.getResult());

        client.close();
    }
}

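In the code above, the DeleteRequest names the index my_index and the document ID 1. After RestHighLevelClient executes the request, DeleteResponse.getResult() returns the outcome (DELETED, or NOT_FOUND if no such document existed).

Deleting a Document with the REST API

The equivalent cURL command is: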

curl -X DELETE "localhost:9200/my_index/_doc/1"

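If the document exists, it is removed from my_index and the response reports the result of the delete.

Document Versioning

Elasticsearch keeps a version number for every document. A newly created document has version 1, and the version is incremented each time the document is changed.

What Versioning Is For

Versioning exists mainly to keep data consistent under concurrent updates. When several processes try to update the same document at once, the version number prevents one write from silently overwriting another. For example, suppose process A and process B both read the document at version 5. Process A updates first, and the version becomes 6. When process B then submits its update based on version 5, the value no longer matches the current version 6, so the update is rejected. This is optimistic concurrency control. Since Elasticsearch 6.7 the check is expressed with the document's _seq_no and _primary_term (the if_seq_no and if_primary_term parameters) rather than _version, and 7.x no longer accepts the internal version for this purpose; the principle is the same.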
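
The process A and process B scenario above can be reproduced with a small in-memory model. This is a sketch of the compare-before-write idea only, with a single version counter per document; VersionCheckSketch and its methods are hypothetical names and do not correspond to any Elasticsearch API.

```java
import java.util.HashMap;
import java.util.Map;

public class VersionCheckSketch {
    private final Map<String, String> contents = new HashMap<>();
    private final Map<String, Long> versions = new HashMap<>();

    // Create or overwrite a document unconditionally and bump its version.
    synchronized long index(String id, String content) {
        long next = versions.getOrDefault(id, 0L) + 1;
        contents.put(id, content);
        versions.put(id, next);
        return next;
    }

    // Current version of the document, or -1 if it does not exist.
    synchronized long versionOf(String id) {
        return versions.getOrDefault(id, -1L);
    }

    // Apply the update only if the caller saw the current version.
    synchronized boolean updateIfVersion(String id, String content, long expectedVersion) {
        Long current = versions.get(id);
        if (current == null || current != expectedVersion) {
            return false; // version conflict: someone else wrote first
        }
        contents.put(id, content);
        versions.put(id, current + 1);
        return true;
    }

    public static void main(String[] args) {
        VersionCheckSketch store = new VersionCheckSketch();
        store.index("1", "original");

        // Process A and process B both read the same version.
        long seen = store.versionOf("1");

        boolean a = store.updateIfVersion("1", "from A", seen);
        boolean b = store.updateIfVersion("1", "from B", seen);

        System.out.println("A succeeded: " + a);
        System.out.println("B succeeded: " + b);
    }
}
```

Process A succeeds and process B is rejected, because B's expected version is stale by the time it writes. A rejected writer would normally re-read the document and retry.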

Version Control with the Java Client

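import org.apache.http.HttpHost;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.action.update.UpdateResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;

public class DocumentVersionControlExample {
    public static void main(String[] args) throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")));

        // Read the document's current sequence number and primary term
        GetRequest getRequest = new GetRequest("my_index", "1");
        GetResponse getResponse = client.get(getRequest, RequestOptions.DEFAULT);
        long seqNo = getResponse.getSeqNo();
        long primaryTerm = getResponse.getPrimaryTerm();

        UpdateRequest updateRequest = new UpdateRequest("my_index", "1");
        updateRequest.doc("{\"content\":\"Update content with version control\"}", XContentType.JSON);
        // Apply the update only if the document is unchanged since the read
        updateRequest.setIfSeqNo(seqNo);
        updateRequest.setIfPrimaryTerm(primaryTerm);

        UpdateResponse updateResponse = client.update(updateRequest, RequestOptions.DEFAULT);

        System.out.println(updateResponse.getResult());

        client.close();
    }
}

The code above first reads the document with a GetRequest and keeps its sequence number and primary term, then sets both on the UpdateRequest. The 7.x client does not accept the internal version on an update request, which is why the example uses setIfSeqNo and setIfPrimaryTerm instead of version(). If no other process changed the document between the read and the update, the update succeeds; otherwise the values no longer match and the update fails with a version conflict (HTTP 409).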

Version Control with the REST API

The same conditional update can be issued with cURL:

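# Read the document's current _seq_no and _primary_term
curl -s -X GET "localhost:9200/my_index/_doc/1?pretty" | grep -E '"_(seq_no|primary_term)"'

# Update the document only if it is unchanged since the read
curl -X POST "localhost:9200/my_index/_update/1?if_seq_no=3&if_primary_term=1" -H 'Content-Type: application/json' -d'
{
    "doc": {
        "content": "Update content with version control"
    }
}'

The first command uses ?pretty so that each field is printed on its own line and grep can pick out _seq_no and _primary_term. Assuming it reports a sequence number of 3 and a primary term of 1, the second command passes those values as if_seq_no=3 and if_primary_term=1. If they still match the stored document, the update succeeds; otherwise it fails with a version conflict.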

Document Routing

Routing is the mechanism that assigns a document to a specific shard. By default, Elasticsearch hashes the document's ID and uses the hash to choose the shard.
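
The hash-then-pick-a-shard step can be sketched as follows. This is a simplified illustration: Elasticsearch uses its own hash function internally, and CRC32 is used here only as a stand-in from the Java standard library. RoutingSketch, shardFor, and the shard count of 5 are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class RoutingSketch {
    // Map a routing value (by default the document ID) to a shard number.
    // CRC32 is a stand-in hash, not the function Elasticsearch uses.
    static int shardFor(String routing, int numPrimaryShards) {
        CRC32 crc = new CRC32();
        crc.update(routing.getBytes(StandardCharsets.UTF_8));
        // getValue() is non-negative, so the remainder is a valid shard number
        return (int) (crc.getValue() % numPrimaryShards);
    }

    public static void main(String[] args) {
        int shards = 5; // assumed number of primary shards
        System.out.println("routing \"1\" -> shard " + shardFor("1", shards));
        System.out.println("routing \"2\" -> shard " + shardFor("2", shards));
        System.out.println("routing \"tenant1\" -> shard " + shardFor("tenant1", shards));
    }
}
```

Because the result depends on the number of primary shards, the same routing value always maps to the same shard only while that number stays fixed.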

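Custom Routing

Sometimes documents should be placed on a shard according to business logic. In a multi-tenant application, for example, keeping each tenant's data on a single shard lets that tenant's queries touch one shard instead of all of them.

With the Java client, custom routing is set as follows: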

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;

public class DocumentRoutingExample {
    public static void main(String[] args) throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")));

        IndexRequest request = new IndexRequest("my_index");
        request.id("1");
        request.source("{\"title\":\"Document with custom routing\",\"content\":\"This document has a custom routing\"}", XContentType.JSON);
        request.routing("tenant1");

        IndexResponse response = client.index(request, RequestOptions.DEFAULT);

        System.out.println(response.getResult());

        client.close();
    }
}

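In the code above, request.routing("tenant1") sets the custom routing value to tenant1. The shard is then chosen from the hash of tenant1 rather than the document ID, so all documents that share this routing value land on the same shard.

With the REST API, custom routing looks like this in cURL: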

curl -X POST "localhost:9200/my_index/_doc/1?routing=tenant1" -H 'Content-Type: application/json' -d'
{
    "title": "Document with custom routing",
    "content": "This document has a custom routing"
}'

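This command sets the custom routing value with ?routing=tenant1 when the document is created.

How Routing Affects Queries

Once documents are stored with a custom routing value, searches should pass the same value so that they run only on the relevant shard, which makes them more efficient. Get, update, and delete requests by ID must pass it as well; otherwise Elasticsearch looks on the shard derived from the ID and may not find the document.

With the Java client, a search with routing looks like this: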

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

import java.io.IOException;

public class DocumentQueryWithRoutingExample {
    public static void main(String[] args) throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")));

        SearchRequest request = new SearchRequest("my_index");
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        sourceBuilder.query(QueryBuilders.matchAllQuery());
        request.source(sourceBuilder);
        request.routing("tenant1");

        SearchResponse response = client.search(request, RequestOptions.DEFAULT);

        System.out.println(response.getHits().getTotalHits().value);

        client.close();
    }
}

In the code above, routing("tenant1") on the SearchRequest restricts the search to the shard that the routing value tenant1 maps to.

With the REST API, a search with routing looks like this in cURL:

curl -X GET "localhost:9200/my_index/_search?routing=tenant1" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match_all": {}
    }
}'

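This command passes ?routing=tenant1 so that the search runs only on the shard for tenant1. Routing narrows which shards are searched; it does not filter documents. Other routing values can hash to the same shard, so a match_all query may also return their documents. To return only one tenant's data, add a filter on a tenant field in the query as well.

Document Metadata

Besides the user-defined data, every Elasticsearch document carries metadata that provides additional information about it.

Common Metadata Fields

_index: the name of the index the document belongs to. A document created in my_index has _index set to my_index.
_type: the document's mapping type. Types were deprecated in Elasticsearch 7.0 and removed in 8.0; on 7.x the value is _doc.
_id: the document's unique identifier, either supplied by the user or generated by Elasticsearch.
_version: the document's version number, incremented each time the document is changed.
_seq_no and _primary_term: the sequence number and primary term assigned to the most recent change to the document. They are used for optimistic concurrency control.

Retrieving Document Metadata

With the Java client, document metadata is read as follows: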

import org.apache.http.HttpHost;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

import java.io.IOException;

public class DocumentMetadataExample {
    public static void main(String[] args) throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")));

        GetRequest request = new GetRequest("my_index", "1");

        GetResponse response = client.get(request, RequestOptions.DEFAULT);

        if (response.isExists()) {
            System.out.println("Index: " + response.getIndex());
            System.out.println("Type: " + response.getType());
            System.out.println("ID: " + response.getId());
            System.out.println("Version: " + response.getVersion());
            System.out.println("Seq No: " + response.getSeqNo());
            System.out.println("Primary Term: " + response.getPrimaryTerm());
        } else {
            System.out.println("Document not found");
        }

        client.close();
    }
}

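The code above reads each metadata field through the corresponding GetResponse getter. getType() is deprecated in the 7.x client along with mapping types.

With the REST API, the metadata is returned by an ordinary get request: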

curl -X GET "localhost:9200/my_index/_doc/1"

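The response includes metadata fields such as _index, _id, _version, _seq_no, and _primary_term, plus _type on 7.x and earlier.

Optimizing Document Storage and Retrieval

Sensible storage and retrieval strategies improve the performance and efficiency of document operations in Elasticsearch.

Storage Optimization

Field mapping: define field types deliberately and avoid indexing fields that do not need it. A field that is only displayed and never searched can be mapped with index: false, which reduces index size and maintenance cost.
Document size: avoid very large documents. They increase storage and transfer overhead and slow down updates and deletes. Split a large document into several smaller ones, or store only the key information and fetch the details with a follow-up lookup.
Compression: Elasticsearch compresses stored data. The index.codec setting selects the codec: the default uses LZ4, and best_compression trades slower stored-field reads for a smaller footprint on disk.

Retrieval Optimization

Choose the right query: pick the query type that fits the need, such as match for full-text search and term for exact matching. Avoid needlessly complex or inefficient queries; a wildcard query on a large text field, for example, can cause performance problems.
Cache query results: for data that changes rarely, query results can be cached. Elasticsearch provides several caches, including the node query cache, the shard request cache, and the field data cache; configuring them sensibly reduces response time.
Pagination: avoid large from and size values. By default a request is rejected once from + size exceeds 10,000 (the index.max_result_window setting). To page through large result sets, use the Scroll API or cursor-style paging with search_after.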
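
The cost of deep from/size paging mentioned in the list above comes from how a distributed search is assembled: each shard has to produce its own top from + size hits before the coordinating node can sort them and keep the requested page. The arithmetic below is a rough sketch of that growth; DeepPagingSketch, hitsCollected, and the shard count of 5 are assumptions for illustration.

```java
public class DeepPagingSketch {
    // Hits the coordinating node has to collect and sort when every shard
    // returns its own top (from + size) hits.
    static long hitsCollected(int shards, int from, int size) {
        return (long) shards * ((long) from + size);
    }

    public static void main(String[] args) {
        int shards = 5; // assumed number of shards searched

        // First page: from=0, size=10
        System.out.println(hitsCollected(shards, 0, 10));    // 50

        // Page 1000: from=9990, size=10
        System.out.println(hitsCollected(shards, 9990, 10)); // 50000
    }
}
```

Both requests return ten hits to the caller, but the second one sorts a thousand times as many candidates. Cursor-style paging with search_after avoids this growth because each request starts after the last hit of the previous page.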

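With storage and retrieval tuned in these ways, Elasticsearch can stay fast and stable while handling large numbers of documents.

This concludes the deep dive into Elasticsearch document operations. It covered creating, reading, updating, and deleting documents, as well as versioning, routing, metadata, and storage and retrieval optimization, and should help readers understand and use Elasticsearch document operations more effectively.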