[Elasticsearch] Vector Search (kNN)


Related post
https://blog.naver.com/sssang97/223790220320
Elasticsearch is one of the best-known databases that support vector search.
Its original main purpose is text search, but along the way it has also made quite a name for itself as a vector search database.

Vector search is supported from Elasticsearch 8.0 onward.

Going by benchmarks, its throughput and baseline performance fall short of dedicated vector databases, but the general perception seems to be that it is the strongest among databases that were not built specifically for vector search.




kNN (k-nearest neighbor) and HNSW

Elasticsearch supports vector search based on an algorithm called k-nearest neighbor.
The algorithm itself is very old. It was first devised in 1951.

https://www.ibm.com/kr-ko/think/topics/knn
The basic concept is simple.
If you need the 3 nearest neighbors of a given position, you look them up by Euclidean distance.
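The idea can be sketched in a few lines of Python. This is only a brute-force illustration with made-up 2D points; the function names are hypothetical and it is not how Elasticsearch implements the lookup.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, points, k):
    """Return the k points closest to query by scanning every point."""
    return sorted(points, key=lambda p: euclidean(query, p))[:k]

# Toy data, chosen only for illustration.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (-1.0, 0.5)]
neighbors = nearest_neighbors((0.9, 0.9), points, k=3)
print(neighbors)
```

Scanning every point is exact but costs time proportional to the size of the data set, which is why real indexes rely on approximate structures instead.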

Of course, Elasticsearch's knn supports cosine and dot product in addition to Euclidean distance, so it may be a stretch to call it pure kNN.

In any case, kNN is only the conceptual term. The actual index is implemented with the HNSW approach.

So let's try it out briefly and go over how it is structured.




Creating an index

Creating an index for vector search is not particularly difficult.

### 
PUT http://{{HOST}}:{{PORT}}/vector_index
Content-Type: application/json

{
    "settings": {},
    "mappings": {
        "properties": {
            "vector": {
                "type": "dense_vector",
                "dims": 256,
                "index": true,
                "similarity": "dot_product"
            }
        }
    }
}
###

Set the type to "dense_vector", then specify the vector length ("dims") and the similarity method.
That gives you a vector field that scores 256-dimensional vectors by dot product similarity.



Similarity algorithms

The example above used dot product, but three basic similarity algorithms are supported.

  1. l2_norm - Euclidean distance
  2. cosine - cosine similarity
  3. dot_product - dot product

If you are not familiar with them, I recommend reading the separate post on each.
https://blog.naver.com/sssang97/223790220320
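As a rough illustration, the three measures can be written in plain Python as below. The function names and sample vectors are made up for this sketch, and it computes only the raw quantities; how Elasticsearch turns them into a final _score is not reproduced here.

```python
import math

def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    """Dot product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

# Two unit-length sample vectors.
a = [1.0, 0.0]
b = [0.6, 0.8]
print(l2_distance(a, b), cosine(a, b), dot_product(a, b))
```

For unit-length vectors such as these, cosine similarity and dot product give the same value, which is why dot product is a cheaper choice when the vectors are already normalized.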



HNSW index options

HNSW is the only vector indexing approach in Elasticsearch, but there is still room for tuning depending on performance requirements and use case.

The index_options setting of the field lets you fine-tune things such as the index type and the number of connected nodes.

The available values for "type" are as follows.

  1. hnsw
  2. int8_hnsw - the default. Trades accuracy for up to 4x lower memory usage
  3. int4_hnsw - trades accuracy for up to 8x lower memory usage
  4. bbq_hnsw - trades accuracy for up to 32x lower memory usage
  5. flat
  6. int8_flat
  7. int4_flat
  8. bbq_flat
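The 4x, 8x, and 32x figures follow from the number of bits kept per dimension. The sketch below works through the arithmetic for the 256-dimension field used earlier. It assumes the unquantized vectors are 32-bit floats and that bbq keeps 1 bit per dimension (inferred from the 32x figure), and it ignores graph overhead and any per-vector correction values.

```python
DIMS = 256

# Assumed bits stored per dimension for each index type.
BITS_PER_DIM = {"hnsw": 32, "int8_hnsw": 8, "int4_hnsw": 4, "bbq_hnsw": 1}

def bytes_per_vector(index_type, dims=DIMS):
    """Approximate size of one vector's values in bytes."""
    return dims * BITS_PER_DIM[index_type] // 8

baseline = bytes_per_vector("hnsw")
for index_type in BITS_PER_DIM:
    size = bytes_per_vector(index_type)
    print(f"{index_type}: {size} bytes per vector, {baseline // size}x smaller than float32")
```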

The flat variants simply perform an exact, brute-force kNN search, so they have no error but can be very slow or consume excessive resources.
The hnsw variants are ANN (approximate nearest neighbor) index settings that accept some error in exchange for predictable performance, so in most cases they are the ones to consider first.

"m" defines how many neighbors each node is connected to in the HNSW graph.
Increasing it raises memory usage but improves accuracy. The default is 16.

"ef_construction" is the number of candidates examined when a node is added to the graph. A larger value lengthens index build time but improves accuracy. The default is 100.

When optimizing, people usually leave type and "ef_construction" alone and focus on adjusting "m" and on choosing a suitable number of candidates at query time.
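As a sketch of how these options fit together, the snippet below builds a mapping body that sets them under index_options. The values (int8_hnsw, m of 32, ef_construction of 100) are arbitrary choices for illustration, not recommendations, and the body is only printed here rather than sent; it would go in the same kind of PUT request used to create the index.

```python
import json

# Hypothetical tuning values, chosen only to show where each option goes.
body = {
    "mappings": {
        "properties": {
            "vector": {
                "type": "dense_vector",
                "dims": 256,
                "index": True,
                "similarity": "dot_product",
                "index_options": {
                    "type": "int8_hnsw",     # quantized HNSW index
                    "m": 32,                 # neighbor links per node
                    "ef_construction": 100,  # candidates examined at build time
                },
            }
        }
    }
}

print(json.dumps(body, indent=4))
```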




Inserting data and searching

Inserting data is not much different from a regular index.
Since the field uses dot product, you just need to supply the vectors as normalized (unit-length) arrays.
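A minimal sketch of preparing such a document in Python follows. The normalize helper and the sample values are made up for illustration, and the resulting body is only printed; it is the kind of JSON you would send to the index's _doc endpoint.

```python
import json
import math

def normalize(vec):
    """Scale a vector to unit length so it is valid for dot_product similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

# Placeholder 256-dimension vector; real values would come from an embedding model.
raw = [float(i % 7 + 1) for i in range(256)]
doc = {"vector": normalize(raw)}

print(len(doc["vector"]))
print(json.dumps(doc)[:80] + "...")
```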

Searching is a little different from a regular index.
The request is defined through the knn option, where you specify the number of results to return (k) and the number of candidates to consider (num_candidates).
Of the two, num_candidates is really the key setting here.
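The shape of such a request can be sketched as follows. The helper name and the query vector are hypothetical, and the body is only built and printed, not sent; it corresponds to the JSON you would post to the index's _search endpoint.

```python
import json

def knn_search_body(query_vector, k, num_candidates):
    """Build a kNN search body for the 'vector' field of the example index."""
    if num_candidates < k:
        raise ValueError("num_candidates must not be smaller than k")
    return {
        "knn": {
            "field": "vector",
            "query_vector": query_vector,
            "k": k,                            # number of hits to return
            "num_candidates": num_candidates,  # candidates gathered per shard
        }
    }

# A unit-length placeholder query with 256 dimensions.
query_vector = [1.0] + [0.0] * 255
search_body = knn_search_body(query_vector, k=2, num_candidates=100)
print(json.dumps(search_body)[:100] + "...")
```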

As you know, Elasticsearch is a distributed database that can be made up of multiple shards.
Suppose there are 2 shards, each holding roughly 1,000 documents.

In that case, how can we fetch only the 2 most similar documents?

Roughly speaking, you could take the 2 most similar documents from each shard, gather them, and then pick 2 again out of those 4.
But before each shard selects its 2, there is a preprocessing step that roughly narrows down a pool of candidates. This is where num_candidates comes in.

https://www.elastic.co/search-labs/blog/vector-search-set-up-elasticsearch
If num_candidates is 100, each shard first narrows its data down to 100 candidates based on an approximate ranking, then picks the 2 most similar from those. The coordinator then selects the final 2 and returns them.
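The flow can be simulated with a toy model. In the sketch below the per-shard candidate step is an exact top-N scan standing in for the approximate HNSW traversal, so it shows only the data flow between shards and coordinator, not the approximation error; all names and data are made up.

```python
import heapq
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_unit_vector(rng, dims):
    v = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def shard_search(shard, query, k, num_candidates):
    """Gather num_candidates candidates on one shard, then keep its best k."""
    candidates = heapq.nlargest(num_candidates, shard, key=lambda d: dot(d["vector"], query))
    return heapq.nlargest(k, candidates, key=lambda d: dot(d["vector"], query))

def coordinator_search(shards, query, k, num_candidates):
    """Merge the per-shard top k lists and keep the global best k."""
    hits = [hit for shard in shards for hit in shard_search(shard, query, k, num_candidates)]
    return heapq.nlargest(k, hits, key=lambda d: dot(d["vector"], query))

rng = random.Random(0)
# 2 shards with 1,000 documents each, as in the scenario above.
shards = [
    [{"id": f"s{s}-{i}", "vector": random_unit_vector(rng, 8)} for i in range(1000)]
    for s in range(2)
]
query = random_unit_vector(rng, 8)
top = coordinator_search(shards, query, k=2, num_candidates=100)
print([doc["id"] for doc in top])
```

Because each shard returns its own best 2, the coordinator only ever compares 4 documents, no matter how large the shards are.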

The higher num_candidates is, the wider the candidate pool becomes, which lowers the chance of error. The trade-off, naturally, is higher memory usage and resource consumption.
However, according to typical benchmarks, the accuracy gained from raising num_candidates is not especially noticeable relative to the extra resources it costs. Setting it equal to k or to about twice k seems to be the sensible convention.

In addition, the k and num_candidates options became optional as of Elasticsearch 8.12, a relatively recent version.
The documentation does not spell it out clearly, but when num_candidates is not set it is treated as k * 1.5.



References
https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
https://www.elastic.co/search-labs/blog/vector-similarity-techniques-and-scoring
https://danawalab.github.io/elastic/2022/07/08/ES-Similarity-Search.html
https://www.elastic.co/docs/solutions/search/vector/knn
https://www.ibm.com/think/topics/knn
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector
https://www.elastic.co/search-labs/blog/elasticsearch-knn-and-num-candidates-strategies

https://www.elastic.co/search-labs/blog/vector-search-set-up-elasticsearch
https://www.elastic.co/search-labs/blog/simplifying-knn-search
https://github.com/elastic/elasticsearch/pull/101209/files