Featured image of post ElasticSearch學習筆記

ElasticSearch學習筆記

職場就是能躲則躲,能飄則飄

本文會用到的連結

Beginner’s Crash Course to Elastic Stack - Part 1: Intro to Elasticsearch and Kibana

image-20231029160640473

我的主¡管底下有三個員工,我是其中一位,今天主管說有一個新需求要交代,派給了另外兩個同事,但沒派給我,為了展現本人一心向學的個性(快打考績了,要狗腿一點,並且希望可以苟過適用期),我便很假掰的問了一下

「不好意思,請問等等的會議,雖然我沒有被分配到工作,但我也想參加,不知道可以嗎?」

中午參加會議完後,主管便打電話給我

「Hoxton,剛剛的會議你有參加齁?現在有多了一個新的需求OOO,再請你處理一下」

ㄏㄏ ,結果新的需求會用到Elastic Search 我完全不懂,死定

at-that-point-i-knew-it-was-over-reece-harle-gif

安裝Elastic Search Kibana

本人強烈建議,一率安裝8.1.0的版本!!!

Elastic Search是一個程式,而Kibana則是它的GUI

image-20231029185147131

相關的安裝可以到Udemy上查看,這也是我最推薦的方式,安裝的部分是免費可以看的

教學連結Udemy

image-20231029153825942

image-20231029153808666

到這邊下載對應版本的Elastic Search,再次強調,安裝8.1.0

https://www.elastic.co/downloads/elasticsearch

image-20231029165245653

2024/4/27補充

下載完後,進到根目錄中,本人強烈建議編輯 config/elasticsearch.yml 中的兩個值

關閉ssl認證

1
2
xpack.security.http.ssl:
  enabled: false

關閉帳號密碼認證

1
xpack.security.enabled: false

關閉這兩個會讓你少很多痛苦,相信我

解壓鎖完後進入到Elastic Search的根目錄中,輸入以下指令,就可以啟用ElasticSearch了

1
bin/elasticsearch

image-20231029165700163

在啟動完成後,ElasticSearch會為我們創建一個超級使用者,並會把它的密碼輸出在terminal中

image-20231029165859153

如果你不幸忘記了密碼,可以輸入以下指令來重置

1
bin/elasticsearch-reset-password -u elastic 

還會產生一組Token用作Kibana的連接,這組Token會存續30分鐘

image-20231029170138155

當然,如果你不幸又忘記了,可以輸入以下指令來重新獲取

1
bin/elasticsearch-create-enrollment-token --scope kibana

接著安裝Kibana

再次再次強調,請安裝8.1.0的版本

image-20231029172152420

安裝解壓縮完,到Kibana的根目錄下執行

1
bin/kibana

就可以啟動Kibana,並且訪問對應的頁面

URL : http://localhost:5601/app/home#/

image-20231029185814041

接著按照他的要求,把Elastic Search所提供的token填入Kibana中

image-20231029185917645

然後帳號密碼的部分也是請查看Elastic Search的Terminal

image-20231029185934237

image-20231029190008273

這樣就完成登入了!真的是操你媽的8.11版本,我搞超久,後來降成8.1.0就沒問題了,我真的是幹你媽的Elastic Search

image-20231029190020589

如何使用Kibana

URL : http://localhost:5601/app/home#/

進入Kibana後點選這邊

image-20231029191647939

按下這個賤,就可以送出請求囉!

image-20231030222038203

Elastic Search 介紹

image-20231030222238463

常見的使用場景有:Uber在搜尋駕駛、搜尋附近的餐廳、遊戲的數據搜集、Tinder配對、火星好奇心號的數據收集、Log紀錄、安全性分析等等

並不是一個Database,他更像是一個搜尋與分析的工具

Index

Document

Field

ElasticSearch的Request格式

本章節參考至

Beginner’s Crash Course to Elastic Stack - Part 1: Intro to Elasticsearch and Kibana

文章內容來自:Part-1-Intro-to-Elasticsearch-and-Kibana

編輯一下elastic/congig/elasticsearch.yml的檔案,將這cluster跟node命名如下

image-20231031014132180

查看cluster狀態

1
GET _API/參數

大概是長這樣,比如說以下的格式就是去查cluster的健康狀況

1
GET _cluster/health

回應的結果就是這樣

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 11,
  "active_shards" : 11,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

image-20231031015146828

查看node狀態

再發一個請求,查看node的狀態,確認我們對node的重新命名有生效

1
GET _nodes/stats
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "yz_GM-vITti54G1nS6GiEg" : {
      "timestamp" : 1698688524532,
      "name" : "curd_node",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "roles" : [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ...以下略
}

image-20231031015657944

從這個Response我們也可以確定,我們對elasticsearch.yml的修改(更改cluster以及node名稱,確實是有生效的)

創建index

1
PUT {NameOfTheIndex}
1
PUT favorite_candy

image-20231031020205194

image-20231031021009493

儲存Document到index中(Create)

有兩種方式,一種是透過PUT、另一種是透過POST,兩者的差別如下

  1. 當使用POST時,elastic search會自動為你的document創建id
  2. 使用PUT時,代表你要自己指定document的id是什麼

Post方式

Reqeust

1
2
3
4
POST favorite_candy/_doc
{"first_name":"Lisa",
"candy":"Sour Skittles"
}

Response

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
{
  "_index" : "favorite_candy",
  "_id" : "RXvLgYsBNuMe8RFD4lGF",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

image-20231031021612251

Put方式

那個1代表我要Assign的Id名稱

1
2
3
4
PUT favorite_candy/_doc/1
{"first_name":"Lisa",
"candy":"Sour Skittles"
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
{
  "_index" : "favorite_candy",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}

image-20231031023505205

我們在試著多塞一些資料進去,因為我們等等要retrive這些資料

1
2
3
4
5
6
7
8
9
PUT favorite_candy/_doc/2
{"first_name":"Rachel",
"candy":"Rolos"
}

PUT favorite_candy/_doc/3
{"first_name":"Tom",
"candy":"Sweet Tarts"
}

這邊特別注意一下,像現在的情況,id 1已經有一組資料了,如果重複PUT資料到id 1的情況,返回的結果會是這樣,version會變成2,result會變成update

image-20231031024625583

也有另一種方式,類似SQL中的 craete if not exist…的用法,就是這樣

1
2
3
4
PUT favorite_candy/_create/1
{"first_name":"Rachel",
"candy":"Rolos"
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
{
  "error" : {
    "root_cause" : [
      {
        "type" : "version_conflict_engine_exception",
        "reason" : "[1]: version conflict, document already exists (current version [2])",
        "index_uuid" : "PscoViAvQF6UvIEvN35Cpg",
        "shard" : "0",
        "index" : "favorite_candy"
      }
    ],
    "type" : "version_conflict_engine_exception",
    "reason" : "[1]: version conflict, document already exists (current version [2])",
    "index_uuid" : "PscoViAvQF6UvIEvN35Cpg",
    "shard" : "0",
    "index" : "favorite_candy"
  },
  "status" : 409
}

這樣子,如果id已經存在,就不會overwrite這個document

image-20231031033202475

查看Ducument(Retrive)

GET {index名稱}/_doc/{id}

1
GET favorite_candy/_doc/1

image-20231031024427798

更新Document(Update)

1
2
3
4
5
6
7
POST Name-of-the-Index/_update/id-of-the-document-you-want-to-update
{
  "doc": {
    "field1": "value",
    "field2": "value",
  }
} 
1
2
3
4
5
6
POST favorite_candy/_update/1
{
  "doc": {
    "candy": "M&M's"
  }
}

image-20231031034236909

刪除Document(Delete)

1
DELETE Name-of-the-Index/_doc/id-of-the-document-you-want-to-delete
1
DELETE favorite_candy/_doc/1

image-20231031034703949

Precision And Recall

使用實際資料來練習Elastic Search

Part-2-Understanding-the-relevance-of-your-search-with-Elasticsearch-and-Kibana

資料下載連結:Kaggle

資料大概長這樣,就是一個新聞網站的json資料,

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
[
  {
    "link": "https://www.huffpost.com/entry/covid-boosters-uptake-us_n_632d719ee4b087fae6feaac9",
    "headline": "Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters",
    "category": "U.S. NEWS",
    "short_description": "Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.",
    "authors": "Carla K. Johnson, AP",
    "date": "2022-09-23"
  },
  {
    "link": "https://www.huffpost.com/entry/american-airlines-passenger-banned-flight-attendant-punch-justice-department_n_632e25d3e4b0e247890329fe",
    "headline": "American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video",
    "category": "U.S. NEWS",
    "short_description": "He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.",
    "authors": "Mary Papenfuss",
    "date": "2022-09-23"
  },
  {
    "link": "https://www.huffpost.com/entry/funniest-tweets-cats-dogs-september-17-23_n_632de332e4b0695c1d81dc02",
    "headline": "23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)",
    "category": "COMEDY",
    "short_description": "\"Until you have a dog you don't understand what could be eaten.\"",
    "authors": "Elyse Wanshel",
    "date": "2022-09-23"
  }
]

匯入資料

index name : news_headlines

image-20240427155756322

查看有哪些index

1
GET /_stats/indexing

image-20240427155727987

查看Index底下有哪些Document

1
GET {index}/_search

如果數量太多,會顯示預設的10個,relation也會顯示 gte 代表 great than

image-20231101003947771

為了在大型數據集上提高響應速度,Elasticsearch默認限制了總計數為10,000。如果您想要確切的總命中數,請使用以下查詢。

1
2
3
4
GET news_headlines/_search
{
  "track_total_hits": true
}

image-20231101003957441

兩種不同的搜尋 Query & Aggregation

Query用來搜尋一些符合特定指標的Document

Aggregation是將數據總結為指標、統計數據和其他分析的過程,更接近分析數據

使用Query依照關鍵字來做搜尋

match中搭配filedname即可達到搜尋的效果-

1
2
3
4
5
6
7
8
GET user/_search
{
  "query": {
    "match": {
      "username": "abc123"
    }
  }
}
1
2
3
4
5
6
GET news_headlines/_search
{
  "query": {"match": {
    "headline": "Khloe Kardashian Kendall Jenner"
  }}
}

值得注意的是,這個不像是SQL的

1
select * from news_headlines where headline = 'Khloe Kardashian Kendall Jenner'

而是像這樣,match query是一種全文檢索(Fulltext query)

1
select * from news_headlines where headline like '%Khloe%' or headline like '%Kardashian%' or headline like '%Kendall%' or headline like '%Jenner%' 

image-20240427184337171

使用operator 提升 Query精準度

如果我們是希望提升精准度,要找出包含 Khloe、Kardashian、Kendall、Jenner的,我們可以在搜尋加上operator

1
2
3
4
5
6
7
8
9
GET news_headlines/_search
{
  "query": {"match": {
    "headline": {
      "query": "Khloe Kardashian Kendall Jenner",
	    "operator":"AND"
    }
  }}
}

image-20240427184737362

使用minium_should_match

1
2
3
4
5
6
7
8
9
GET news_headlines/_search
{
  "query": {"match": {
    "headline": {
      "query": "Khloe Kardashian Kendall Jenner",
	    "minium_should_match": 3
    }
  }}
}

使用Query依照時間來做搜尋

範例:

gte : greater than

lte : less than

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET enter_name_of_the_index_here/_search
{
  "query": {
    "Specify the type of query here": {
      "Enter name of the field here": {
        "gte": "Enter lowest value of the range here",
        "lte": "Enter highest value of the range here"
      }
    }
  }
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET news_headlines/_search
{
  "query": {
    "range": {
      "date": {
        "gte": "2015-06-20",
        "lte": "2015-09-22"
      }
    }
  }
}

image-20231101004148376

使用Query依照時間來做搜尋並排序

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
GET news_headlines/_search
{
  "query": {
    "range": {
      "date": {
        "gte": "2015-06-20",
        "lte": "2015-09-22"
      }
    }
  },
  "sort": [
    {
      "date": {
        "order": "asc"  
      }
    }
  ]
}

各種不同的Match 語法

Match_Phrase

假設我們要找一首歌「Shape of You」,如果只用以下的搜尋,會出現很多不相關的結果,因為Match Quert會把 Shape 、 of 、 You 當成三個不同的詞彙去找,想要避免這樣,就需要使用

1
2
3
4
5
6
7
8
GET news_headlines/_search
{
  "query": {
    "match": {
      "headline": "shape of you"
    }
  }
}

image-20240429234135693

1
2
3
4
5
6
7
8
GET news_headlines/_search
{
  "query": {
    "match_phrase": {
      "headline": "shape of you"
    }
  }
}

image-20240429234249361

Multi_Match

在不同的Field中搜尋,效果就有點像SQL的

Select * from news_headlines where headline like ‘%Michelle Obama%’ or short_description like ‘%Michelle Obama%’ or authors like like ‘%Michelle Obama%’

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
GET news_headlines/_search
{
  "query": {
    "multi_match": {
      "query": "Michelle Obama",
      "fields": [
        "headline",
        "short_description",
        "authors"
      ]
    }
  }
}

但這樣其實有一些問題,我們可能最主要是想找Michelle Obama的文章,但只要short_description裡面有提到Michelle Obama就會被包含進來,我們可能更 Focus 在 headline 的權重上面

加權型Multi_Match

就是在Field後多加一個次方,就能將 headline 有 Michelle Obama 的結果先列出來了(Per Field Boosting)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
GET news_headlines/_search
{
  "query": {
    "multi_match": {
      "query": "Michelle Obama",
      "fields": [
        "headline^2",
        "short_description",
        "authors"
      ]
    }
  }
}

將Multi-Match跟Match_Pharse結合

集百家之長,結合出來的搜尋

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
GET news_headlines/_search
{
  "query": {
    "multi_match": {
      "query": "Michelle Obama",
      "fields": [
        "headline^2",
        "short_description",
        "authors"
      ],
      "type":"phrase"
    }
  }
}

使用Bool Query來進行搜尋

TimeCode標好了 Beginner’s Crash Course to Elastic Stack - Part 3: Full text queries

文章內容 :https://github.com/LisaHJung/Part-3-Running-full-text-queries-and-combined-queries-with-Elasticsearch-and-Kibana

所謂的bool query其實就是將不同條件整合再一起的一種搜尋,相當於SQL不是會有那種 where id = 1 and status=1的那種where and or語句嗎,bool query就是在做這個部分

常見的bool query有以下幾種

  • must:相當於AND 代表一定要有
  • filter:表示過濾條件,類似where的用法
  • should:不會影響到搜尋的結果,但會影響到排序,符合的會靠上
  • Must_not:表示不匹配的,相當於NOT
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
GET name_of_index/_search
{
  "query": {
    "bool": {
      "must": [
        {One or more queries can be specified here. A document MUST match all of these queries to be considered as a hit.}
      ],
      "must_not": [
        {A document must NOT match any of the queries specified here. It it does, it is excluded from the search results.}
      ],
      "should": [
        {A document does not have to match any queries specified here. However, it if it does match, this document is given a higher score.}
      ],
      "filter": [
        {These filters(queries) place documents in either yes or no category. Ones that fall into the yes category are included in the hits. }
      ]
    }
  }
}

這些bool也會跟基本查詢

  • Term

  • Match

  • Range

  • Terms

做搭配

比如說像這樣

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Search" }},
        { "match": { "content": "Elasticsearch" }}
      ],
      "filter": [
        { "term":  { "status": "published" }},
        { "range": { "publish_date": { "gte": "2019-01-01" }}} 
      ],
      "should": [
        { "match": { "author": "John" }},
        { "match": { "author": "Doe" }}
      ],
      "must_not": [
        { "match": { "category": "Marketing" }}
      ]
    }
  }
}

而一般Query是長這樣,可以稍微比較感覺到Query 跟Bool Query的差異

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
{
  "query": {
    "range": {
      "date": {
        "gte": "2015-06-20",
        "lte": "2015-09-22"
      }
    }
  }
}

以下示範一些常見的Bool Query的寫法

Bool Must查詢

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "username": "Hoxton"
          }
        }
      ]
    }
  }
}

Bool Filter過濾

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "status": "1"
          }
        },{
          "range":{
            "loginTime":{"gte":"2023-01-01","lte":"2024-12-31"}
          }
        }
      ]
    }
  }
}

Bool Should

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
GET /my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "Elasticsearch" }},
        { "match": { "content": "search" }}
      ]
    }
  }
}

Bool MustNot

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
{
 "query": {
   "bool": {
     "must_not": [
       { "match": { "title": "Elasticsearch" }},
       { "match": { "content": "search" }}
     ]
   }
 }
}

Bool 多重搜尋(Match)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
{
   "query":{
      "bool":{
         "must":[
            {
               "match":{
                  "userName":"Hoxton"
               }
            },
            {
               "match":{
                  "gender":1
               }
            }
         ]
      }
   }
}

Bool Terms 搜尋

2024/05/25更新

找出某個Filed中符合條件的document

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
GET news_headlines/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "category": [
              "U.S. NEWS",
              "COMEDY"
            ]
          }
        }
      ]
    }
  }
}

使用Aggregation做搜尋

Kaggle 網址 :Kaggle E-Commerce Data

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET enter_name_of_the_index_here/_search
{
  "aggs": {
    "name your aggregation here": {
      "specify aggregation type here": {
        "field": "name the field you want to aggregate here",
        "size": state how many buckets you want returned here
      }
    }
  }
}

Aggs: 代表你要送一個Aggregation

By_category: 你Aggregation出來的東西要叫什麼

Terms: 以字段做分析

Filed:字段具體的key是什麼

Size:Filed最大上限,假設現在你的原始資料,key其實有1000個,但你只想要看100個,就可以在這邊限制,底下我有附圖片,可以看差距

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET news_headlines/_search
{
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category",
        "size": 100
      }
    }
  }
}

原始資料

image-20240427161300394

聚合出來的結果

image-20231101005948322

Size參數的影響

image-20231101010010756

image-20231101010340003

又使用Query 又使用Aggregation做搜尋

搜尋某一類別中最重要的詞語

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
GET enter_name_of_the_index_here/_search
{
  "query": {
    "match": {
      "Enter the name of the field": "Enter the value you are looking for"
    }
  },
  "aggregations": {
    "Name your aggregation here": {
      "significant_text": {
        "field": "Enter the name of the field you are searching for"
      }
    }
  }
}

以下的搜尋可以這樣讀

「我想要找到,在ENTERTAINMENT這個目錄中,headline的中,最具重要性詞語有哪些

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
GET news_headlines/_search
{
  "query": {
    "match": {
      "category": "ENTERTAINMENT"
    }
  },
  "aggregations": {
    "popular_in_entertainment": {
      "significant_text": {
        "field": "headline"
      }
    }
  }
}

image-20231101012720758

Aggregation 實戰

Kaggle 網址 :Kaggle E-Commerce Data

Youtube 網址 :Beginner’s Crash Course to Elastic Stack - Part 4: Aggregations

資料的樣子

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
[
      {
        "_index": "ecommerce_data",
        "_id": "U3hHL48B6YD2b_RB4N-A",
        "_score": 1,
        "_source": {
          "UnitPrice": 0.55,
          "Description": "AGED GLASS SILVER T-LIGHT HOLDER",
          "Quantity": 144,
          "Country": "United Kingdom",
          "InvoiceNo": "543456",
          "InvoiceDate": "2/8/2011 12:41",
          "CustomerID": 15753,
          "StockCode": "21326"
        }
      },
      {
        "_index": "ecommerce_data",
        "_id": "VHhHL48B6YD2b_RB4N-A",
        "_score": 1,
        "_source": {
          "UnitPrice": 1.25,
          "Description": "SET OF SALT AND PEPPER TOADSTOOLS",
          "Quantity": 12,
          "Country": "United Kingdom",
          "InvoiceNo": "543457",
          "InvoiceDate": "2/8/2011 12:47",
          "CustomerID": 17428,
          "StockCode": "22892"
        }
      },
  ]

三種Aggregation

  1. Metric Aggregation:用來計算數字型的資料,比如說計算最大值、最小值、平均值等等
  2. Bucket Aggregation:想要聚合複數Subset的資料時,使用Bucket Aggregation (一桶一桶的)
  3. Combined Aggregation

Metric Aggregation-求總和

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET ecommerce_data/_search
{
  "size":0,
  "aggs": {
    "sum_unit_price": {
      "sum": {
        "field": "UnitPrice"
      }
    }
  }
}

Size:0 影響的是會不會顯示hint的資料,如果0就只會顯示聚合結果

image-20240430221519932

Metric Aggregation-求最低

Size:0 影響的是會不會顯示hint的資料,如果0就只會顯示聚合結果

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET ecommerce_data/_search
{
  "size":0,
  "aggs": {
    "sum_unit_price": {
      "min": {
        "field": "UnitPrice"
      }
    }
  }
}

image-20240430222032330

Metric Aggregation-求最高

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET ecommerce_data/_search
{
  "size": 0,
  "aggs": {
    "highest_unit_price": {
      "max": {
        "field": "UnitPrice"
      }
    }
  }
}

Size:0 影響的是會不會顯示hint的資料,如果0就只會顯示聚合結果

image-20240430222128024

Metric Aggregation-求平均

Size:0 影響的是會不會顯示hint的資料,如果0就只會顯示聚合結果

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET ecommerce_data/_search
{
  "size": 0,
  "aggs": {
    "average_unit_price": {
      "avg": {
        "field": "UnitPrice"
      }
    }
  }
}

image-20240430222313108

Mertic Aggregation - 求所有(總和、最低、最高、平均)

Size:0 影響的是會不會顯示hint的資料,如果0就只會顯示聚合結果

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET ecommerce_data/_search
{
  "size": 0,
  "aggs": {
    "all_stats_unit_price": {
      "stats": {
        "field": "UnitPrice"
      }
    }
  }
}

image-20240430222557689

使用Query條件來限制Aggregation的範圍

限制只統計Country為Germany的平均單位售價

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
GET ecommerce_data/_search
{
  "size": 0,
  "query": {
    "match": {
      "Country": "Germany"
    }
  },
  "aggs": {
    "germany_average_unit_price": {
      "avg": {
        "field": "UnitPrice"
      }
    }
  }
}

Cardinality Aggregation

關鍵字:去重複、去重

離散數學中的Cardinality,指的是一個集合中,不重複元素的個數,例如集合A={1,2,7,7,7},那麼A的Cardinality就是3,寫作|A|=3。

在ES中的Cardinality Aggregation也是類似的概念

統計不重複的 CustomerID

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET ecommerce_data/_search
{
  "size": 0,
  "aggs": {
    "number_unique_customers": {
      "cardinality": {
        "field": "CustomerID"
      }
    }
  }
}

共有4359個不重複的CustomerId

image-20240430223830936

Bucket Aggregation

四種不同的Bucket Aggregation

  1. Date Historgram Aggregation:使用時間區間來 Group 資料
  2. Histogram Aggregation
  3. Range Aggregation
  4. Terms Aggregation

Historgram :直方圖

image-20240430225507690

Date Historgram Aggregation

按InvoiceDate,把資料每八個小時分成一桶

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
GET ecommerce_data/_search
{
  "size": 0,
  "aggs": {
    "transactions_by_8_hrs": {
      "date_histogram": {
        "field": "InvoiceDate",
        "fixed_interval": "8h"
      }
    }
  }
}

精確與召回(Precision And Recall)

幹,這真的好難翻,精確就有點像是AND,召回就有點像是OR,使用召回,有可能會找到不太相干的資料,比如找臘腸,卻找到臘腸狗那樣,搜尋結果較多,但較不準確

Increase Recall

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
GET enter_name_of_index_here/_search
{
  "query": {
    "match": {
      "Specify the field you want to search": {
        "query": "Enter search terms"
      }
    }
  }
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
GET news_headlines/_search
{
  "query": {
    "match": {
      "headline": {
        "query": "Khloe Kardashian Kendall Jenner"
      }
    }
  }
}

image-20231101013712366

Increase Precision

這個搜尋是精確搜尋,會找到更符合預期的結果,語法其實就是多增加一個operator and就可以了啦

唉,好累,不知道為啥10/31半夜還在這邊卷

sleepy-tired-gif

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET enter_name_of_index_here/_search
{
  "query": {
    "match": {
      "Specify the field you want to search": {
        "query": "Enter search terms",
        "operator": "and"
      }
    }
  }
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET news_headlines/_search
{
  "query": {
    "match": {
      "headline": {
        "query": "Khloe Kardashian Kendall Jenner",
        "operator": "and"
      }
    }
  }
}

image-20231101015222615

最小匹配搜尋(minimum_should_match)

minimum_should_match是用來微調(fined tuning)匹配量的,我們的搜尋最初是包含Khloe Kardashian Kendall Jenner這四個詞的搜尋,但我們可以選擇到底要匹配幾個,比如說可以匹配3個,那就是C4取3,可以看下面的範例

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET enter_name_of_index_here/_search
{
  "query": {
    "match": {
      "headline": {
        "query": "Enter search term here",
        "minimum_should_match": Enter a number here
      }
    }
  }
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
GET news_headlines/_search
{
  "query": {
    "match": {
      "headline": {
        "query": "Khloe Kardashian Kendall Jenner",
        "minimum_should_match": 3
      }
    }
  }
}

不同最小匹配量的值所影響到的結果

image-20231101015715026

image-20231101015740997

image-20231101015832703

Licensed under CC BY-NC-SA 4.0