龙空技术网

elasticsearch索引并查询文档

王显锋 699

前言:

现时看官们对“es查询所有索引的java接口”大致比较关切,我们都需要分析一些“es查询所有索引的java接口”的相关知识。那么小编在网络上搜集了一些有关“es查询所有索引的java接口””的相关资讯,希望小伙伴们能喜欢,咱们快快来了解一下吧!

为什么有这篇文章

这篇文档99%copy的官方文档,经过我的验证。

虽然有官方文档,但是头条不知道有没有,搬过来让需要的人看看。

添加了我搭建的服务器,可以开箱即用,不需要再搭建服务器,方便学习

版权归官方所有,我只是个搬运工

我的服务器地址

如果需要用户名密码,请输入elastic/changeme

在线文档地址

索引文档

在Elasticsearch集群中添加3条雇员信息:

PUT /megacorp/employee/1{ "first_name" : "John", "last_name" : "Smith", "age" : 25, "about" : "I love to go rock climbing", "interests": [ "sports", "music" ]}PUT /megacorp/employee/2{ "first_name" : "Jane", "last_name" : "Smith", "age" : 32, "about" : "I like to collect rock albums", "interests": [ "music" ]}PUT /megacorp/employee/3{ "first_name" : "Douglas", "last_name" : "Fir", "age" : 35, "about": "I like to build cabinets", "interests": [ "forestry" ]}
检索文档

检索到单个雇员的数据

GET /megacorp/employee/1

将 HTTP 命令由 PUT 改为 GET 可以用来检索文档,同样的,可以使用 DELETE 命令来删除文档,以及使用 HEAD 指令来检查文档是否存在。如果想更新已存在的文档,只需再次 PUT 。

返回信息如下:

{ "_index" : "megacorp", "_type" : "employee", "_id" : "1", "_version" : 1, "found" : true, "_source" : { "first_name" : "John", "last_name" : "Smith", "age" : 25, "about" : "I love to go rock climbing", "interests": [ "sports", "music" ] }}
轻量搜索

搜索所有雇员

GET /megacorp/employee/_search

可以看到,我们仍然使用索引库 megacorp 以及类型 employee,但与指定一个文档 ID 不同,这次使用_search 。返回结果包括了所有三个文档,放在数组 hits 中。一个搜索默认返回十条结果。

返回信息如下:

{ "took": 6, "timed_out": false, "_shards": { ... }, "hits": { "total": 3, "max_score": 1, "hits": [ { "_index": "megacorp", "_type": "employee", "_id": "3", "_score": 1, "_source": { "first_name": "Douglas", "last_name": "Fir", "age": 35, "about": "I like to build cabinets", "interests": [ "forestry" ] } }, { "_index": "megacorp", "_type": "employee", "_id": "1", "_score": 1, "_source": { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": [ "sports", "music" ] } }, { "_index": "megacorp", "_type": "employee", "_id": "2", "_score": 1, "_source": { "first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I like to collect rock albums", "interests": [ "music" ] } } ] }}

搜索姓氏为”Smith”的雇员

为此,我们将使用一个 高亮 搜索,很容易通过命令行完成。这个方法一般涉及到一个 查询字符串 (query-string) 搜索,因为我们通过一个URL参数来传递查询信息给搜索接口:

GET /megacorp/employee/_search?q=last_name:Smith

我们仍然在请求路径中使用 _search 端点,并将查询本身赋值给参数 q= 。返回结果给出了所有的 Smith:

{ ... "hits": { "total": 2, "max_score": 0.30685282, "hits": [ { ... "_source": { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": [ "sports", "music" ] } }, { ... "_source": { "first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I like to collect rock albums", "interests": [ "music" ] } } ] }}
使用查询表达式搜索

Query-string 搜索通过命令非常方便地进行临时性的即席搜索 ,但它有自身的局限性(参见 轻量 搜索 )。Elasticsearch 提供一个丰富灵活的查询语言叫做 查询表达式 , 它支持构建更加复杂和健壮的查询。

领域特定语言 (DSL), 指定了使用一个 JSON 请求。我们可以像这样重写之前的查询所有 Smith 的搜索 :

GET /megacorp/employee/_search{ "query" : { "match" : { "last_name" : "Smith" } }}

返回结果与之前的查询一样,但还是可以看到有一些变化。其中之一是,不再使用 query-string 参数,而是一个请求体替代。这个请求使用 JSON 构造,并使用了一个 match 查询(属于查询类型之一)

更复杂的搜索

同样搜索姓氏为 Smith 的雇员,但这次我们只需要年龄大于 30 的。查询需要稍作调整,使用过滤器 filter ,它支持高效地执行一个结构化查询:

GET /megacorp/employee/_search

{

"query": {

"bool": {

"must": [

{

"match": {

"last_name": "Smith"

}

}

],

"filter": {

"range": {

"age": {

"gte": 30

}

}

}

}

}

}

查询结果:

{ "took": 10, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 0.2876821, "hits": [ { "_index": "megacorp", "_type": "employee", "_id": "2", "_score": 0.2876821, "_source": { "first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I like to collect rock albums", "interests": [ "music" ] } } ] }}
全文搜索

截止目前的搜索相对都很简单:单个姓名,通过年龄过滤。现在尝试下稍微高级点儿的全文搜索——一项 传统数据库确实很难搞定的任务。

搜索下所有喜欢攀岩(rock climbing)的雇员:

GET /megacorp/employee/_search{ "query" : { "match" : { "about" : "rock climbing" } }}

返回结果:

{ "took": 13, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 2, "max_score": 0.53484553, "hits": [ { "_index": "megacorp", "_type": "employee", "_id": "1", "_score": 0.53484553, "_source": { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": [ "sports", "music" ] } }, { "_index": "megacorp", "_type": "employee", "_id": "2", "_score": 0.26742277, "_source": { "first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I like to collect rock albums", "interests": [ "music" ] } } ] }}

Elasticsearch 默认按照相关性得分排序,即每个文档跟查询的匹配程度。第一个最高得分的结果很明显:John Smith 的 about 属性清楚地写着 “rock climbing” 。

但为什么 Jane Smith 也作为结果返回了呢?原因是她的 about 属性里提到了 “rock” 。因为只有 “rock” 而没有 “climbing” ,所以她的相关性得分低于 John 的。

这是一个很好的案例,阐明了 Elasticsearch 如何 在 全文属性上搜索并返回相关性最强的结果。Elasticsearch中的 相关性 概念非常重要,也是完全区别于传统关系型数据库的一个概念,数据库中的一条记录要么匹配要么不匹配。

短语搜索

找出一个属性中的独立单词是没有问题的,但有时候想要精确匹配一系列单词或者短语 。 比如, 我们想执行这样一个查询,仅匹配同时包含 “rock” 和 “climbing” ,并且 二者以短语 “rock climbing” 的形式紧挨着的雇员记录。

为此对 match 查询稍作调整,使用一个叫做 match_phrase 的查询:

GET /megacorp/employee/_search{ "query": { "match_phrase": { "about": "rock climbing" } }}

毫无悬念,返回结果仅有 John Smith 的文档:

{ "took": 19, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 0.53484553, "hits": [ { "_index": "megacorp", "_type": "employee", "_id": "1", "_score": 0.53484553, "_source": { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": [ "sports", "music" ] } } ] }}
高亮搜索

许多应用都倾向于在每个搜索结果中 高亮 部分文本片段,以便让用户知道为何该文档符合查询条件。在 Elasticsearch 中检索出高亮片段也很容易。

再次执行前面的查询,并增加一个新的 highlight 参数:

GET /megacorp/employee/_search{ "query": { "match_phrase": { "about": "rock climbing" } }, "highlight": { "fields": { "about": {} } }}

当执行该查询时,返回结果与之前一样,与此同时结果中还多了一个叫做 highlight 的部分。这个部分包含了 about 属性匹配的文本片段,并以 HTML 标签 封装:

{ "took": 144, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 0.53484553, "hits": [ { "_index": "megacorp", "_type": "employee", "_id": "1", "_score": 0.53484553, "_source": { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": [ "sports", "music" ] }, "highlight": { "about": [ "I love to go <em>rock</em> <em>climbing</em>" ] } } ] }}

关于高亮搜索片段,可以在 highlighting reference documentation 了解更多信息。

分析

终于到了最后一个业务需求:支持管理者对雇员目录做分析。 Elasticsearch 有一个功能叫聚合(aggregations),允许我们基于数据生成一些精细的分析结果。聚合与 SQL 中的 GROUP BY 类似但更强大。

举个例子,挖掘出雇员中最受欢迎的兴趣爱好:

GET /megacorp/employee/_search{ "aggs": { "all_interests": { "terms": { "field": "interests" } } }}

这个时候报错,返回结果如下:

{ "error": { "root_cause": [ { "type": "illegal_argument_exception", "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead." } ], "type": "search_phase_execution_exception", "reason": "all shards failed", "phase": "query", "grouped": true, "failed_shards": [ { "shard": 0, "index": "megacorp", "node": "_7Llp2cwTx-o5QnDaf5hww", "reason": { "type": "illegal_argument_exception", "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead." } } ] }, "status": 400}

聚合这些操作用单独的数据结构(fielddata)缓存到内存里了,需要单独开启,官方解释在此fielddata,简单来说就是在聚合前执行如下操作:

PUT megacorp/_mapping/employee/{ "properties": { "interests": { "type": "text", "fielddata": true } }}

返回结果:

{ "acknowledged": true}

再次执行聚合查询,返回结果如下:

{ "took": 40, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 3, "max_score": 1, "hits": [ { "_index": "megacorp", "_type": "employee", "_id": "2", "_score": 1, "_source": { "first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I like to collect rock albums", "interests": [ "music" ] } }, { "_index": "megacorp", "_type": "employee", "_id": "1", "_score": 1, "_source": { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": [ "sports", "music" ] } }, { "_index": "megacorp", "_type": "employee", "_id": "3", "_score": 1, "_source": { "first_name": "Douglas", "last_name": "Fir", "age": 35, "about": "I like to build cabinets", "interests": [ "forestry" ] } } ] }, "aggregations": { "all_interests": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "music", "doc_count": 2 }, { "key": "forestry", "doc_count": 1 }, { "key": "sports", "doc_count": 1 } ] } }}

可以看到,两位员工对音乐感兴趣,一位对林地感兴趣,一位对运动感兴趣。这些聚合并非预先统计,而是从匹配当前查询的文档中即时生成。如果想知道叫 Smith 的雇员中最受欢迎的兴趣爱好,可以直接添加适当的查询来组合查询:

GET /megacorp/employee/_search{ "query": { "match": { "last_name": "smith" } }, "aggs": { "all_interests": { "terms": { "field": "interests" } } }}

返回结果:

{ "took": 8, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 2, "max_score": 0.2876821, "hits": [ { "_index": "megacorp", "_type": "employee", "_id": "2", "_score": 0.2876821, "_source": { "first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I like to collect rock albums", "interests": [ "music" ] } }, { "_index": "megacorp", "_type": "employee", "_id": "1", "_score": 0.2876821, "_source": { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": [ "sports", "music" ] } } ] }, "aggregations": { "all_interests": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "music", "doc_count": 2 }, { "key": "sports", "doc_count": 1 } ] } }}

聚合还支持分级汇总 。比如,查询特定兴趣爱好员工的平均年龄:

GET /megacorp/employee/_search{ "aggs" : { "all_interests" : { "terms" : { "field" : "interests" }, "aggs" : { "avg_age" : { "avg" : { "field" : "age" } } } } }}

返回结果:

{ "took": 11, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 3, "max_score": 1, "hits": [ { "_index": "megacorp", "_type": "employee", "_id": "2", "_score": 1, "_source": { "first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I like to collect rock albums", "interests": [ "music" ] } }, { "_index": "megacorp", "_type": "employee", "_id": "1", "_score": 1, "_source": { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": [ "sports", "music" ] } }, { "_index": "megacorp", "_type": "employee", "_id": "3", "_score": 1, "_source": { "first_name": "Douglas", "last_name": "Fir", "age": 35, "about": "I like to build cabinets", "interests": [ "forestry" ] } } ] }, "aggregations": { "all_interests": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "music", "doc_count": 2, "avg_age": { "value": 28.5 } }, { "key": "forestry", "doc_count": 1, "avg_age": { "value": 35 } }, { "key": "sports", "doc_count": 1, "avg_age": { "value": 25 } } ] } }}

输出基本是第一次聚合的加强版。依然有一个兴趣及数量的列表,只不过每个兴趣都有了一个附加的 avg_age 属性,代表有这个兴趣爱好的所有员工的平均年龄。

标签: #es查询所有索引的java接口