Links used in this article

Beginner’s Crash Course to Elastic Stack - Part 1: Intro to Elasticsearch and Kibana

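My manager has three people on the team, and I am one of them. Today the manager announced a new requirement and handed it to my two colleagues, but not to me. To show how eager I am to learn (performance reviews are coming up, so I need to suck up a little and hopefully survive my probation period), I put on a show and asked:

"Excuse me, about the meeting later: I wasn't assigned any of the work, but I would like to join too. Would that be okay?"

After the meeting wrapped up at noon, my manager called me:

"Hoxton, you were in that meeting just now, right? There is now one more new requirement, OOO. Please take care of it."

Heh. It turns out the new requirement uses Elasticsearch, which I know nothing about. I'm doomed.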


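Installing Elasticsearch and Kibana

I strongly recommend installing version 8.1.0 across the board!

Elasticsearch is the program itself, and Kibana is its GUI.

For installation steps, check the course on Udemy, which is the approach I recommend most. The installation part is free to watch.

Tutorial link: Udemy

Download the matching version of Elasticsearch here. Once again, install 8.1.0.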

https://www.elastic.co/downloads/elasticsearch

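Update (2024/4/27)

After downloading, go to the root directory. I strongly recommend editing two values in config/elasticsearch.yml.

Disable SSL

xpack.security.http.ssl:
  enabled: false

Disable username and password authentication

xpack.security.enabled: false

Disabling these two will save you a lot of pain, trust me.

After extracting the archive, go to the Elasticsearch root directory and run the following command to start Elasticsearch.

bin/elasticsearch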

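Once startup finishes, Elasticsearch creates a superuser for us and prints its password in the terminal.

If you unfortunately forget the password, run the following command to reset it.

bin/elasticsearch-reset-password -u elastic

An enrollment token for connecting Kibana is generated as well. This token is valid for 30 minutes.

Of course, if you unfortunately lose that too, run the following command to get a new one.

bin/elasticsearch-create-enrollment-token --scope kibana

Next, install Kibana.

Once again, please install version 8.1.0.

After installing and extracting it, run the following from the Kibana root directory:

bin/kibana

This starts Kibana, and you can then open the corresponding page.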

URL : http://localhost:5601/app/home#/

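Then, as it asks, paste the token provided by Elasticsearch into Kibana.

For the username and password, check the Elasticsearch terminal as well.

That completes the login! That damn 8.11 version cost me forever. After I downgraded to 8.1.0 everything just worked. Seriously, screw you, Elasticsearch.

How to use Kibana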

URL : http://localhost:5601/app/home#/

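After entering Kibana, click here

Press this button to send the request!

Introduction to Elasticsearch

Common use cases include: Uber finding drivers, searching for nearby restaurants, game data collection, Tinder matching, data collection for the Mars Curiosity rover, logging, security analytics, and more.

It is not really a Database; it is more of a search and analytics tool.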

Index

Document

Field

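Elasticsearch request format

This section is based on

Beginner’s Crash Course to Elastic Stack - Part 1: Intro to Elasticsearch and Kibana

Content source: Part-1-Intro-to-Elasticsearch-and-Kibana

Edit the elastic/config/elasticsearch.yml file and name the cluster and the node as follows

Check cluster status

GET _API/parameter

It looks roughly like this. For example, the following request checks the health of the cluster.

GET _cluster/health

The response looks like this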

{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 11,
  "active_shards" : 11,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
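
If you would rather check this from a script than read the JSON by eye, here is a minimal Python sketch that summarizes a _cluster/health body. The helper name summarize_health is made up for illustration, and SAMPLE_HEALTH is just a trimmed copy of the response above, so nothing here talks to a real cluster.

```python
import json

# Trimmed copy of the _cluster/health response shown above.
SAMPLE_HEALTH = """
{
  "cluster_name": "elasticsearch",
  "status": "green",
  "number_of_nodes": 1,
  "unassigned_shards": 0,
  "active_shards_percent_as_number": 100.0
}
"""


def summarize_health(body: str) -> str:
    """Turn a _cluster/health JSON body into a one-line summary (hypothetical helper)."""
    health = json.loads(body)
    return (
        f"{health['cluster_name']}: {health['status']} "
        f"({health['number_of_nodes']} node(s), "
        f"{health['unassigned_shards']} unassigned shard(s))"
    )


print(summarize_health(SAMPLE_HEALTH))
```

In a real script the body would come from an HTTP call to the cluster; that part is left out so the sketch runs anywhere.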

Check node status

Send one more request to check the node status and confirm that renaming the node took effect.

GET _nodes/stats

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "yz_GM-vITti54G1nS6GiEg" : {
      "timestamp" : 1698688524532,
      "name" : "curd_node",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "roles" : [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ... (rest omitted)
}

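From this response we can also confirm that our changes to elasticsearch.yml (renaming the cluster and the node) did take effect.

Create an index

PUT {NameOfTheIndex}

PUT favorite_candy

Store a Document in an index (Create)

There are two ways, one with PUT and one with POST. They differ as follows:

With POST, Elasticsearch automatically generates an id for your document
With PUT, you specify the document id yourself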
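
To make the POST versus PUT rule concrete, here is a small Python sketch that picks the method and path depending on whether you pass an id. index_request is a hypothetical helper of my own, not part of any Elasticsearch client; it only builds the request line.

```python
from typing import Optional, Tuple


def index_request(index: str, doc_id: Optional[str] = None) -> Tuple[str, str]:
    """Return (HTTP method, path) for storing one document (hypothetical helper)."""
    if doc_id is None:
        # No id supplied: POST lets Elasticsearch generate one.
        return ("POST", f"{index}/_doc")
    # Id supplied by the caller: PUT stores the document under that id.
    return ("PUT", f"{index}/_doc/{doc_id}")


print(index_request("favorite_candy"))
print(index_request("favorite_candy", "1"))
```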

Using POST

Request

POST favorite_candy/_doc
{"first_name":"Lisa",
"candy":"Sour Skittles"
}

Response

{
  "_index" : "favorite_candy",
  "_id" : "RXvLgYsBNuMe8RFD4lGF",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Using PUT

The 1 is the id I want to assign

PUT favorite_candy/_doc/1
{"first_name":"Lisa",
"candy":"Sour Skittles"
}

{
  "_index" : "favorite_candy",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}

Let's insert some more data, because we are going to retrieve it in a moment

PUT favorite_candy/_doc/2
{"first_name":"Rachel",
"candy":"Rolos"
}

PUT favorite_candy/_doc/3
{"first_name":"Tom",
"candy":"Sweet Tarts"
}

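Pay special attention here. Right now id 1 already holds a document, so if you PUT to id 1 again, _version in the response becomes 2 and result becomes updated.

There is also another way, similar to a create-if-not-exists in SQL, which looks like this: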

PUT favorite_candy/_create/1
{"first_name":"Rachel",
"candy":"Rolos"
}

{
  "error" : {
    "root_cause" : [
      {
        "type" : "version_conflict_engine_exception",
        "reason" : "[1]: version conflict, document already exists (current version [2])",
        "index_uuid" : "PscoViAvQF6UvIEvN35Cpg",
        "shard" : "0",
        "index" : "favorite_candy"
      }
    ],
    "type" : "version_conflict_engine_exception",
    "reason" : "[1]: version conflict, document already exists (current version [2])",
    "index_uuid" : "PscoViAvQF6UvIEvN35Cpg",
    "shard" : "0",
    "index" : "favorite_candy"
  },
  "status" : 409
}

This way, if the id already exists, the document is not overwritten
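
The difference between _doc and _create is easier to see in a toy model. The Python sketch below is only an in-memory imitation (TinyIndex is a made-up class, and real Elasticsearch versioning has more rules than this), but it reproduces the responses above: version 1 created, version 2 updated, then a 409 on _create.

```python
class TinyIndex:
    """In-memory stand-in for one index; not how Elasticsearch stores data."""

    def __init__(self):
        self.docs = {}  # id -> (version, source)

    def put_doc(self, doc_id, source):
        """Mimic PUT index/_doc/id: create, or overwrite and bump the version."""
        version = self.docs[doc_id][0] + 1 if doc_id in self.docs else 1
        self.docs[doc_id] = (version, source)
        result = "created" if version == 1 else "updated"
        return {"_id": doc_id, "_version": version, "result": result}

    def put_create(self, doc_id, source):
        """Mimic PUT index/_create/id: refuse to overwrite an existing id."""
        if doc_id in self.docs:
            return {"status": 409, "error": "version_conflict_engine_exception"}
        return self.put_doc(doc_id, source)


index = TinyIndex()
print(index.put_doc("1", {"first_name": "Lisa", "candy": "Sour Skittles"}))
print(index.put_doc("1", {"first_name": "Lisa", "candy": "Sour Skittles"}))
print(index.put_create("1", {"first_name": "Rachel", "candy": "Rolos"}))
```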

Read a Document (Retrieve)

GET {index name}/_doc/{id}

GET favorite_candy/_doc/1

Update a Document (Update)

POST Name-of-the-Index/_update/id-of-the-document-you-want-to-update
{
  "doc": {
    "field1": "value",
    "field2": "value"
  }
}

POST favorite_candy/_update/1
{
  "doc": {
    "candy": "M&M's"
  }
}
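
An _update with "doc" only touches the fields you send; the other fields stay as they were. Here is a rough Python sketch of that merge. apply_partial_update is a hypothetical helper, and I am assuming nested objects are merged field by field while other values are simply replaced.

```python
def apply_partial_update(source: dict, doc: dict) -> dict:
    """Return a copy of source with the fields from doc merged in (hypothetical helper)."""
    merged = dict(source)
    for key, value in doc.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field rather than replaced.
            merged[key] = apply_partial_update(merged[key], value)
        else:
            merged[key] = value
    return merged


stored = {"first_name": "Lisa", "candy": "Sour Skittles"}
print(apply_partial_update(stored, {"candy": "M&M's"}))
```

Notice that first_name survives even though the update body never mentioned it.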

Delete a Document (Delete)

DELETE Name-of-the-Index/_doc/id-of-the-document-you-want-to-delete

DELETE favorite_candy/_doc/1

Precision And Recall

Practicing Elasticsearch with real data

Part-2-Understanding-the-relevance-of-your-search-with-Elasticsearch-and-Kibana

Data download link: Kaggle

The data looks roughly like this: JSON data from a news website.

[
  {
    "link": "https://www.huffpost.com/entry/covid-boosters-uptake-us_n_632d719ee4b087fae6feaac9",
    "headline": "Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters",
    "category": "U.S. NEWS",
    "short_description": "Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.",
    "authors": "Carla K. Johnson, AP",
    "date": "2022-09-23"
  },
  {
    "link": "https://www.huffpost.com/entry/american-airlines-passenger-banned-flight-attendant-punch-justice-department_n_632e25d3e4b0e247890329fe",
    "headline": "American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video",
    "category": "U.S. NEWS",
    "short_description": "He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.",
    "authors": "Mary Papenfuss",
    "date": "2022-09-23"
  },
  {
    "link": "https://www.huffpost.com/entry/funniest-tweets-cats-dogs-september-17-23_n_632de332e4b0695c1d81dc02",
    "headline": "23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)",
    "category": "COMEDY",
    "short_description": "\"Until you have a dog you don't understand what could be eaten.\"",
    "authors": "Elyse Wanshel",
    "date": "2022-09-23"
  }
]

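Import the data

index name : news_headlines

List the existing indices

GET /_stats/indexing

List the Documents in an index

GET {index}/_search

If there are many hits, only the default 10 are returned, and relation shows gte, which means greater than or equal to.

To keep responses fast on large datasets, Elasticsearch by default tracks the total hit count only up to 10,000. If you want the exact total number of hits, use the following query.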

GET news_headlines/_search
{
  "track_total_hits": true
}
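
When reading the response in code, the thing to look at is hits.total.relation. A minimal Python sketch follows; describe_total is a hypothetical helper, and the two sample bodies are hand-written with made-up counts, not captured from a real cluster.

```python
def describe_total(response: dict) -> str:
    """Describe hits.total from a _search response body (hypothetical helper)."""
    total = response["hits"]["total"]
    if total["relation"] == "eq":
        return f"exactly {total['value']} hits"
    # "gte" means the real count is at least this value.
    return f"at least {total['value']} hits"


# Hand-written sample bodies with made-up counts.
capped = {"hits": {"total": {"value": 10000, "relation": "gte"}}}
exact = {"hits": {"total": {"value": 12345, "relation": "eq"}}}
print(describe_total(capped))
print(describe_total(exact))
```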

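Two kinds of search: Query & Aggregation

A Query retrieves the Documents that match specific criteria

An Aggregation summarizes data into metrics, statistics, and other analytics, so it is closer to analyzing the data

Searching by keyword with a Query

Put the field name inside match to run the search: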

GET user/_search
{
  "query": {
    "match": {
      "username": "abc123"
    }
  }
}

GET news_headlines/_search
{
  "query": {"match": {
    "headline": "Khloe Kardashian Kendall Jenner"
  }}
}

Note that this is not like the SQL

select * from news_headlines where headline = 'Khloe Kardashian Kendall Jenner'

It is more like the following, because a match query is a full-text query:

select * from news_headlines where headline like '%Khloe%' or headline like '%Kardashian%' or headline like '%Kendall%' or headline like '%Jenner%'

Using operator to improve Query precision

If we want higher precision and only want documents that contain all of Khloe, Kardashian, Kendall, and Jenner, we can add operator to the search.

GET news_headlines/_search
{
  "query": {"match": {
    "headline": {
      "query": "Khloe Kardashian Kendall Jenner",
      "operator": "AND"
    }
  }}
}

Using minimum_should_match

GET news_headlines/_search
{
  "query": {"match": {
    "headline": {
      "query": "Khloe Kardashian Kendall Jenner",
      "minimum_should_match": 3
    }
  }}
}
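
To get a feel for how the default OR, operator AND, and minimum_should_match differ, here is a rough Python imitation. It is a sketch under loud assumptions: tokenize just lowercases and splits on whitespace (the real analyzer does more), match is my own function rather than anything from Elasticsearch, and the two headlines are made up.

```python
def tokenize(text: str) -> list:
    """Lowercase and split on whitespace (a stand-in for the real analyzer)."""
    return text.lower().split()


def match(field_text: str, query: str, operator: str = "or",
          minimum_should_match: int = 1) -> bool:
    """Approximate whether a match query would hit this field value."""
    field_terms = set(tokenize(field_text))
    query_terms = tokenize(query)
    found = sum(1 for term in query_terms if term in field_terms)
    if operator.lower() == "and":
        # Every query term has to be present.
        return found == len(query_terms)
    # OR: enough of the query terms have to be present.
    return found >= minimum_should_match


# Made-up headlines for illustration.
headline = "Kendall Jenner Shares Photo With Khloe Kardashian"
other = "Kendall Jenner Walks Runway"
query = "Khloe Kardashian Kendall Jenner"

print(match(other, query))                          # default OR
print(match(other, query, operator="and"))          # every term required
print(match(other, query, minimum_should_match=3))  # at least 3 terms
print(match(headline, query, operator="and"))
```

The second headline only contains two of the four terms, so it passes the default OR but is dropped by both AND and minimum_should_match 3.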

Searching by time with a Query

Example:

gte : greater than or equal to

lte : less than or equal to

GET enter_name_of_the_index_here/_search
{
  "query": {
    "Specify the type of query here": {
      "Enter name of the field here": {
        "gte": "Enter lowest value of the range here",
        "lte": "Enter highest value of the range here"
      }
    }
  }
}

GET news_headlines/_search
{
  "query": {
    "range": {
      "date": {
        "gte": "2015-06-20",
        "lte": "2015-09-22"
      }
    }
  }
}

Searching by time and sorting with a Query

GET news_headlines/_search
{
  "query": {
    "range": {
      "date": {
        "gte": "2015-06-20",
        "lte": "2015-09-22"
      }
    }
  },
  "sort": [
    {
      "date": {
        "order": "asc"  
      }
    }
  ]
}
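
Both ends of the range are inclusive, which is easy to get wrong. Below is a small Python sketch of what range plus sort does, run on a handful of made-up documents. range_search is a hypothetical helper, and it assumes the dates are ISO strings like the ones in the dataset.

```python
from datetime import date


def range_search(docs: list, field: str, gte: str, lte: str, order: str = "asc") -> list:
    """Keep docs whose date field lies in [gte, lte] and sort them by that field."""
    low, high = date.fromisoformat(gte), date.fromisoformat(lte)
    hits = [d for d in docs if low <= date.fromisoformat(d[field]) <= high]
    return sorted(hits, key=lambda d: date.fromisoformat(d[field]),
                  reverse=(order == "desc"))


# Made-up documents that only carry the fields this sketch needs.
docs = [
    {"headline": "C", "date": "2015-09-22"},
    {"headline": "A", "date": "2015-06-19"},
    {"headline": "B", "date": "2015-06-20"},
    {"headline": "D", "date": "2015-09-23"},
]
for hit in range_search(docs, "date", "2015-06-20", "2015-09-22"):
    print(hit["date"], hit["headline"])
```

Documents B and C sit exactly on the boundaries and are kept, while A and D fall one day outside and are dropped.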

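Different kinds of Match syntax

Match_Phrase

Suppose we want to find the song "Shape of You". With only the following match search, a lot of irrelevant results show up, because a match query treats Shape, of, and You as three separate terms. To avoid this, use match_phrase.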

GET news_headlines/_search
{
  "query": {
    "match": {
      "headline": "shape of you"
    }
  }
}

GET news_headlines/_search
{
  "query": {
    "match_phrase": {
      "headline": "shape of you"
    }
  }
}
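
Here is a rough Python imitation of why match_phrase is stricter than match. As before, this is only a sketch: the whitespace tokenizer is a stand-in for the real analyzer, both functions are my own, and the two headlines are made up.

```python
def tokenize(text: str) -> list:
    """Lowercase and split on whitespace (a stand-in for the real analyzer)."""
    return text.lower().split()


def match_any_term(field_text: str, query: str) -> bool:
    """Approximate match: any query term may appear anywhere."""
    terms = set(tokenize(field_text))
    return any(term in terms for term in tokenize(query))


def match_phrase(field_text: str, query: str) -> bool:
    """Approximate match_phrase: the terms must appear adjacent and in order."""
    terms, phrase = tokenize(field_text), tokenize(query)
    width = len(phrase)
    return any(terms[i:i + width] == phrase for i in range(len(terms) - width + 1))


# Made-up headlines for illustration.
song = "Ed Sheeran Performs Shape Of You Live"
noise = "The Shape Of Things To Come For You"

print(match_any_term(noise, "shape of you"))
print(match_phrase(noise, "shape of you"))
print(match_phrase(song, "shape of you"))
```

The second headline contains all three words, but not next to each other, so only the plain match lets it through.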

Multi_Match

Search across several Fields. The effect is somewhat like the SQL

select * from news_headlines where headline like '%Michelle Obama%' or short_description like '%Michelle Obama%' or authors like '%Michelle Obama%'

GET news_headlines/_search
{
  "query": {
    "multi_match": {
      "query": "Michelle Obama",
      "fields": [
        "headline",
        "short_description",
        "authors"
      ]
    }
  }
}

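This has a problem, though. We mainly want articles about Michelle Obama, but any document that merely mentions Michelle Obama in short_description is included too. We would rather put more weight on headline.

Weighted Multi_Match

Append a caret and a boost factor to a Field, such as headline^2, and results with Michelle Obama in headline tend to be listed first (per-field boosting).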

GET news_headlines/_search
{
  "query": {
    "multi_match": {
      "query": "Michelle Obama",
      "fields": [
        "headline^2",
        "short_description",
        "authors"
      ]
    }
  }
}

Combining Multi_Match with Match_Phrase

A search that takes the best of both

GET news_headlines/_search
{
  "query": {
    "multi_match": {
      "query": "Michelle Obama",
      "fields": [
        "headline^2",
        "short_description",
        "authors"
      ],
      "type":"phrase"
    }
  }
}
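
If you end up generating these bodies from code, a tiny builder keeps the field^boost strings tidy. This is a minimal Python sketch; multi_match here is a hypothetical helper of my own, not a client library function, and it only produces the JSON body shown above.

```python
import json
from typing import Dict, Optional


def multi_match(query: str, boosts: Dict[str, float],
                match_type: Optional[str] = None) -> dict:
    """Build a multi_match request body (hypothetical helper).

    boosts maps each field name to its boost; a boost of 1 adds no suffix.
    """
    fields = [name if boost == 1 else f"{name}^{boost}" for name, boost in boosts.items()]
    clause = {"query": query, "fields": fields}
    if match_type is not None:
        clause["type"] = match_type
    return {"query": {"multi_match": clause}}


body = multi_match(
    "Michelle Obama",
    {"headline": 2, "short_description": 1, "authors": 1},
    match_type="phrase",
)
print(json.dumps(body, indent=2))
```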

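Searching with a Bool Query

Timecodes are marked: Beginner’s Crash Course to Elastic Stack - Part 3: Full text queries

Content source: https://github.com/LisaHJung/Part-3-Running-full-text-queries-and-combined-queries-with-Elasticsearch-and-Kibana

A bool query is a search that combines several conditions. You know how SQL has where ... and ... or clauses such as where id = 1 and status = 1? A bool query covers that part.

The common bool query clauses are:

must: equivalent to AND; a document has to match
filter: a filtering condition, similar to where
should: does not change which documents are returned as long as a must or filter clause is present, but it affects ranking; matching documents rank higher
must_not: excludes documents that match; equivalent to NOT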

GET name_of_index/_search
{
  "query": {
    "bool": {
      "must": [
        {One or more queries can be specified here. A document MUST match all of these queries to be considered as a hit.}
      ],
      "must_not": [
        {A document must NOT match any of the queries specified here. If it does, it is excluded from the search results.}
      ],
      "should": [
        {A document does not have to match any queries specified here. However, if it does match, this document is given a higher score.}
      ],
      "filter": [
        {These filters(queries) place documents in either yes or no category. Ones that fall into the yes category are included in the hits. }
      ]
    }
  }
}

These bool clauses are used together with the basic queries

Term
Match
Range
Terms

For example:

{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Search" }},
        { "match": { "content": "Elasticsearch" }}
      ],
      "filter": [
        { "term":  { "status": "published" }},
        { "range": { "publish_date": { "gte": "2019-01-01" }}} 
      ],
      "should": [
        { "match": { "author": "John" }},
        { "match": { "author": "Doe" }}
      ],
      "must_not": [
        { "match": { "category": "Marketing" }}
      ]
    }
  }
}
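
The way the four clauses interact can be imitated in a few lines of Python. Treat this as a sketch under assumptions: bool_query is my own function, each clause is a plain Python predicate instead of a real query, and the "score" is just a count of matching must and should clauses, nothing like the real relevance score. The documents are made up.

```python
def bool_query(docs, must=(), filters=(), should=(), must_not=()):
    """Approximate bool semantics; each clause is a function doc -> bool.

    Returns (doc, score) pairs, best score first. The score is simply the
    number of matching must and should clauses, not a real relevance score.
    """
    hits = []
    for doc in docs:
        if not all(clause(doc) for clause in must):
            continue
        if not all(clause(doc) for clause in filters):
            continue
        if any(clause(doc) for clause in must_not):
            continue
        should_hits = sum(1 for clause in should if clause(doc))
        # With no must or filter clause, at least one should clause has to match.
        if should and not must and not filters and should_hits == 0:
            continue
        hits.append((doc, len(must) + should_hits))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Made-up documents for illustration.
docs = [
    {"title": "search basics", "status": "published", "author": "john"},
    {"title": "search tips", "status": "published", "author": "amy"},
    {"title": "search drafts", "status": "draft", "author": "john"},
    {"title": "search ads", "status": "published", "author": "john", "category": "marketing"},
]
results = bool_query(
    docs,
    must=[lambda d: "search" in d["title"]],
    filters=[lambda d: d["status"] == "published"],
    should=[lambda d: d["author"] == "john"],
    must_not=[lambda d: d.get("category") == "marketing"],
)
for doc, score in results:
    print(score, doc["title"])
```

The draft is removed by filter, the marketing article by must_not, and the should clause only decides which of the two remaining documents comes first.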

A plain Query, by contrast, looks like this, so you can get a feel for the difference between a Query and a Bool Query

{
  "query": {
    "range": {
      "date": {
        "gte": "2015-06-20",
        "lte": "2015-09-22"
      }
    }
  }
}

Below are some common ways to write a Bool Query

Bool Must query

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "username": "Hoxton"
          }
        }
      ]
    }
  }
}

Bool Filter

{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "status": "1"
          }
        },{
          "range":{
            "loginTime":{"gte":"2023-01-01","lte":"2024-12-31"}
          }
        }
      ]
    }
  }
}

Bool Should

GET /my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "Elasticsearch" }},
        { "match": { "content": "search" }}
      ]
    }
  }
}

Bool MustNot

{
 "query": {
   "bool": {
     "must_not": [
       { "match": { "title": "Elasticsearch" }},
       { "match": { "content": "search" }}
     ]
   }
 }
}

Bool with multiple conditions (Match)

{
   "query":{
      "bool":{
         "must":[
            {
               "match":{
                  "userName":"Hoxton"
               }
            },
            {
               "match":{
                  "gender":1
               }
            }
         ]
      }
   }
}

Searching with an Aggregation

Kaggle URL: Kaggle E-Commerce Data

GET enter_name_of_the_index_here/_search
{
  "aggs": {
    "name your aggregation here": {
      "specify aggregation type here": {
        "field": "name the field you want to aggregate here",
        "size": state how many buckets you want returned here
      }
    }
  }
}

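aggs: signals that you are sending an Aggregation

by_category: the name you give to the Aggregation result

terms: aggregates by the values of a field

field: the specific field to aggregate on

size: the maximum number of buckets returned. Say the field actually has 1,000 distinct values in your raw data but you only want to see 100; you can cap it here. I attached images below so you can see the difference.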

GET news_headlines/_search
{
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category",
        "size": 100
      }
    }
  }
}
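
Conceptually, a terms aggregation is a group-by-and-count that keeps the biggest buckets. Here is a minimal Python sketch of that idea on made-up documents. terms_aggregation is a hypothetical helper, and the real aggregation runs across shards, so treat this only as a mental model.

```python
from collections import Counter


def terms_aggregation(docs: list, field: str, size: int = 10) -> list:
    """Approximate a terms aggregation: count docs per field value, top `size` buckets."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common(size)]


# Made-up documents for illustration.
docs = [
    {"category": "COMEDY"},
    {"category": "U.S. NEWS"},
    {"category": "U.S. NEWS"},
    {"category": "SPORTS"},
    {"category": "U.S. NEWS"},
    {"category": "COMEDY"},
]
print(terms_aggregation(docs, "category", size=2))
```

With size 2 the SPORTS bucket is cut off even though it exists in the data, which is the effect of the size parameter.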

Raw data

Aggregated result

Effect of the size parameter

Combining a Query and an Aggregation in one search

Finding the most significant terms in a category

GET enter_name_of_the_index_here/_search
{
  "query": {
    "match": {
      "Enter the name of the field": "Enter the value you are looking for"
    }
  },
  "aggregations": {
    "Name your aggregation here": {
      "significant_text": {
        "field": "Enter the name of the field you are searching for"
      }
    }
  }
}

The following search can be read like this:

"I want to find out which terms in headline are the most significant within the ENTERTAINMENT category."

GET news_headlines/_search
{
  "query": {
    "match": {
      "category": "ENTERTAINMENT"
    }
  },
  "aggregations": {
    "popular_in_entertainment": {
      "significant_text": {
        "field": "headline"
      }
    }
  }
}

Aggregations in practice

Kaggle URL: Kaggle E-Commerce Data

YouTube URL: Beginner’s Crash Course to Elastic Stack - Part 4: Aggregations

What the data looks like

[
      {
        "_index": "ecommerce_data",
        "_id": "U3hHL48B6YD2b_RB4N-A",
        "_score": 1,
        "_source": {
          "UnitPrice": 0.55,
          "Description": "AGED GLASS SILVER T-LIGHT HOLDER",
          "Quantity": 144,
          "Country": "United Kingdom",
          "InvoiceNo": "543456",
          "InvoiceDate": "2/8/2011 12:41",
          "CustomerID": 15753,
          "StockCode": "21326"
        }
      },
      {
        "_index": "ecommerce_data",
        "_id": "VHhHL48B6YD2b_RB4N-A",
        "_score": 1,
        "_source": {
          "UnitPrice": 1.25,
          "Description": "SET OF SALT AND PEPPER TOADSTOOLS",
          "Quantity": 12,
          "Country": "United Kingdom",
          "InvoiceNo": "543457",
          "InvoiceDate": "2/8/2011 12:47",
          "CustomerID": 17428,
          "StockCode": "22892"
        }
      }
  ]