[Qdrant] Full Text Filtering

Qdrant also offers fairly capable text search.
Its feature set is narrower than Elasticsearch's, but it is good enough for lightweight use cases.

Qdrant's Full Text Filtering is provided as one variant of the Payload Index.




Word Splitting (Tokenizer)

Like Elasticsearch, Qdrant splits text into tokens (down to morphemes, depending on the tokenizer) and stores them in an inverted index, which makes fast, good-quality search possible.

The supported tokenizers are as follows.

whitespace - splits on whitespace
word - splits on whitespace, punctuation, and special characters
prefix - splits the same way as word, then generates prefixes for prefix search (red => r, re, red)
multilingual - identifies the components of natural-language sentences and performs morphological analysis (supports multiple languages)

Going down the list, tokenization becomes more refined, but the overhead grows accordingly.
For simple use cases word is probably enough, but if you need natural-language analysis for something like Korean, you have to use multilingual.
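
To get a feel for how the first three tokenizers differ, here is a minimal Python sketch. It is only an approximation I wrote for illustration, not Qdrant's actual implementation; the function names are made up, and the real tokenizers may treat some characters differently.

```python
import re


def whitespace_tokens(text):
    # Split on runs of whitespace only; punctuation stays attached to the token.
    return text.split()


def word_tokens(text):
    # Split on anything that is not a letter, digit, or underscore.
    return [t for t in re.split(r"\W+", text) if t]


def prefix_tokens(text):
    # Word-split first, then emit every prefix of each token.
    return [tok[:i] for tok in word_tokens(text) for i in range(1, len(tok) + 1)]


print(whitespace_tokens("red-hot ramen!"))  # ['red-hot', 'ramen!']
print(word_tokens("red-hot ramen!"))        # ['red', 'hot', 'ramen']
print(prefix_tokens("red"))                 # ['r', 're', 'red']
```

The prefix variant stores several entries per word, which is one way to picture why the overhead grows as you move down the list.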

Internally, multilingual uses a morphological analyzer built by the meilisearch team.
It currently supports Latin scripts, Chinese, Korean, Thai, Japanese, Arabic, and others.
See the pages below for details on the supported languages.
https://github.com/meilisearch/charabia (everything except Japanese)
https://github.com/daac-tools/vaporetto (Japanese only)




Test Data Setup

Let's load some data and fire off a few queries to see how it behaves.
I loaded the data as follows.

PUT /collections/testdb
{
  "vectors": {
    "size": 4,
    "distance": "Euclid"
  }
}

PUT /collections/testdb/points
{
  "points": [
    {
      "id": 1, 
      "vector": [1,1,1,1], 
      "payload": {
        "name": "red banana"
      }
    },
    {
      "id": 2, 
      "vector": [1,2,1,1], 
      "payload": {
        "name": "red ramen"
      }
    },
    {
      "id": 3, 
      "vector": [1,2,1,3], 
      "payload": {
        "name": "blue banana"
      }
    },
    {
      "id": 4, 
      "vector": [1,2,2,1], 
      "payload": {
        "name": "blue wine"
      }
    },
    {
      "id": 5, 
      "vector": [1,4,2,1], 
      "payload": {
        "name": "red ramen"
      }
    },
    {
      "id": 6, 
      "vector": [3,2,1,1], 
      "payload": {
        "name": "red chip bag"
      }
    },
    {
      "id": 7, 
      "vector": [2,2,1,1], 
      "payload": {
        "name": "golden record"
      }
    },
    {
      "id": 8, 
      "vector": [2,2,2,2], 
      "payload": {
        "name": "gucci bag"
      }
    },
    {
      "id": 9, 
      "vector": [2,1,1,1], 
      "payload": {
        "name": "awesome chicken noodle"
      }
    },
    {
      "id": 10, 
      "vector": [1,2,3,4], 
      "payload": {
        "name": "red noodle"
      }
    }
  ]
}



Basic Usage

To use text search, you first have to create an index.

Other basic filters still work without an index, just slowly, but token-based text search depends on a dedicated text index (according to the Qdrant filtering docs, a text match on a field without one falls back to exact substring matching). Creating it works much like any other payload index; the difference is that it has many more options.

PUT /collections/{collection_name}/index
{
    "field_name": "{field_name}",
    "field_schema": {
        "type": "text",
        "tokenizer": "word"
    }
}



text search (must contain every word of the query)

Once the index exists, you can run text searches through filter and the like.
Usage is simple: just use a text clause.

POST /collections/testdb/points/query
{
  "with_payload": true,
  "filter": {
    "must": [
      {
        "key": "name",
        "match": {
          "text": "red"
        }
      }
    ]
  }
}

The query above returns only the points whose name contains the token (red).


Note, however, that a text filter does not fetch everything that shares a word with the query. It only returns items that contain every word in the query, so loose matching is not possible.



text_any search (must contain at least one word of the query)

To OR the words together, use text_any instead of text.

It splits the query into tokens and returns every item that contains at least one of them.
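
The difference between the two is easy to see by simulating it locally on the test data. This is a rough sketch of the semantics as I understand them, not a call into Qdrant; the helper names are made up, and whitespace splitting stands in for the real tokenizer.

```python
NAMES = {
    1: "red banana", 2: "red ramen", 3: "blue banana", 4: "blue wine",
    5: "red ramen", 6: "red chip bag", 7: "golden record", 8: "gucci bag",
    9: "awesome chicken noodle", 10: "red noodle",
}


def tokens(text):
    # Stand-in tokenizer: lowercase, then split on whitespace.
    return text.lower().split()


def match_text(query):
    # AND semantics: every query token must be present in the field.
    wanted = set(tokens(query))
    return [pid for pid, name in NAMES.items() if wanted <= set(tokens(name))]


def match_text_any(query):
    # OR semantics: at least one query token must be present in the field.
    wanted = set(tokens(query))
    return [pid for pid, name in NAMES.items() if wanted & set(tokens(name))]


print(match_text("red noodle"))      # [10]
print(match_text_any("red noodle"))  # [1, 2, 5, 6, 9, 10]
```

In an actual request, to my understanding, only the key inside match changes, for example "match": {"text_any": "red noodle"} in place of "match": {"text": "red noodle"}.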



phrase search (must contain the whole phrase)

If you need stricter matching, use phrase. It only matches items that contain the query as-is, without modification.
To use this feature, you first have to enable the corresponding option on the index.

The other text matches only check whether each individual word matches, regardless of order, whereas phrase does not accept the words in a shuffled order. The query phrase has to be contained exactly.
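
A small sketch of the ordering difference, plus what the index and filter bodies might look like. The matching functions are a local simulation I wrote for illustration. The phrase_matching index flag and the phrase match key are how I remember them from the Qdrant docs, so treat them as assumptions and check the references before relying on them.

```python
import json


def tokens(text):
    # Stand-in tokenizer: lowercase, then split on whitespace.
    return text.lower().split()


def contains_all(field, query):
    # text-like check: all words present, order ignored.
    return set(tokens(query)) <= set(tokens(field))


def contains_phrase(field, query):
    # phrase-like check: query tokens must appear contiguously, in order.
    f, q = tokens(field), tokens(query)
    return any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))


print(contains_all("red chip bag", "bag red"))      # True
print(contains_phrase("red chip bag", "bag red"))   # False
print(contains_phrase("red chip bag", "chip bag"))  # True

# Assumed request bodies (verify the field names against the Qdrant docs).
index_body = {
    "field_name": "name",
    "field_schema": {"type": "text", "tokenizer": "word", "phrase_matching": True},
}
phrase_filter = {"must": [{"key": "name", "match": {"phrase": "chip bag"}}]}
print(json.dumps(index_body))
print(json.dumps(phrase_filter))
```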




Text Index Options

Fine-grained tuning of text search has to be configured when the index is created.

The options covered here are as follows.

lowercase controls case sensitivity.
It defaults to true, meaning that by default every token is stored in lowercase and matching is case-insensitive.

ascii_folding controls whether Latin character variants are folded together.
It defaults to false. When enabled, umlauts, accents, and similar marks are stripped from Latin characters before the tokens are stored.

stemmer chooses whether tokens are reduced to their stems during tokenization.
For example, when the token shirts comes up, it is stored as shirt.

Stemming is off in the default configuration, so searching for bags does not match bag and returns nothing.

With stemming turned on, however, the search runs on the base form of each word, which makes matching more flexible.
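
The three options can be pictured as steps in a token normalization pipeline. Below is a hedged Python sketch of that idea. The function names are made up, and toy_stem is a deliberately naive stand-in that only strips a trailing "s"; it is not the stemming algorithm Qdrant uses.

```python
import unicodedata


def fold_ascii(token):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token):
    # Toy stand-in for a real stemmer: strips a single trailing "s".
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(token, lowercase=True, ascii_folding=False, stem=False):
    # Defaults mirror the text above: lowercase on, folding and stemming off.
    if lowercase:
        token = token.lower()
    if ascii_folding:
        token = fold_ascii(token)
    if stem:
        token = toy_stem(token)
    return token


print(normalize("Bags"))                      # bags
print(normalize("Bags", stem=True))           # bag
print(normalize("Café", ascii_folding=True))  # cafe
```

With stem=True both the stored token and the query token end up as bag, which is why bags starts matching. On the index side, as far as I know, the stemmer is configured with something like "stemmer": {"type": "snowball", "language": "english"} inside field_schema; that shape is from memory, so check the indexing page in the references.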




Hybrid Search (Scoring)

Text filtering can also be included in scoring, but it does not work very smoothly.

Adding a text-search-based score on top of a vector search does work.
In that case, every item that passes the text filter gets a flat score of 1 added.

The problem is that text, text_any, and the like do not return a score proportional to how well the words match. If the condition is satisfied at all it returns 1, and otherwise it returns 0.

So whether you pass fully matching text or only partially matching text, you always get the fixed score of 1, which makes it hard to tune the similarity properly.
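
To make the problem concrete, here is a tiny local simulation of that behavior. The numbers and the helper are hypothetical; the point is only that a text_any-style condition contributes the same amount no matter how many words matched.

```python
def condition_score(field, query):
    # text_any-like condition used as a score term: 1.0 on any overlap, else 0.0.
    overlap = set(query.lower().split()) & set(field.lower().split())
    return 1.0 if overlap else 0.0


vector_score = 0.5  # hypothetical vector similarity for this point

for query in ("red", "red chip", "red chip bag", "blue"):
    total = vector_score + condition_score("red chip bag", query)
    print(query, total)
```

The first three queries all print 1.5, even though they match one, two, and three words respectively; only the non-matching query stays at 0.5.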

This is a limitation of the current system. I asked a maintainer, who said that controlling the score this way is not possible at the moment.
There is a workaround: tokenize the query yourself, then combine one expression per token into a single formula.

It is not ideal, but for now it is the only option.
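
The original expression did not survive in this post, so here is my own reconstruction of the idea as a sketch. It splits the query client-side and builds one weighted condition per token, so that an item matching more tokens gets a higher sum. The formula / sum / mult / $score shape follows my understanding of Qdrant's score-boosting queries, and the helper name and weight are made up; verify against the docs before using it.

```python
import json


def per_token_formula(query_text, field, weight=1.0):
    # One condition per token; each satisfied condition adds `weight` to the score.
    terms = ["$score"]
    for tok in query_text.lower().split():
        terms.append({"mult": [weight, {"key": field, "match": {"text": tok}}]})
    return {"formula": {"sum": terms}}


# Assumed request body for POST /collections/testdb/points/query
body = {
    "prefetch": {"query": [1, 1, 1, 1], "limit": 10},
    "query": per_token_formula("red chip bag", "name"),
    "with_payload": True,
}
print(json.dumps(body, indent=2))
```

With this, "red chip bag" would pick up three increments for point 6 but only one for "red noodle", which gets closer to a graded text score. Note that the client-side split has to agree with the tokenizer configured on the index, otherwise the per-token conditions will not line up with what was stored.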



References
https://qdrant.tech/documentation/concepts/indexing/#full-text-index
https://qdrant.tech/documentation/guides/text-search/