https://python.langchain.com/docs/integrations/text_embedding/sentence_transformers/?utm_source=chatgpt.com

https://huggingface.co/sentence-transformers

Create the project and run it in a virtual environment

uv init
uv run main.py

๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์„ค์น˜

uv add sentence-transformers langchain-huggingface langchain-chroma fiftyone

sentence-transformers : library of text embedding and reranker models, maintained by Hugging Face

langchain-huggingface : integration library that connects LangChain and Hugging Face

chromadb : database that stores the embedded vectors (installed as a dependency of langchain-chroma)

fiftyone : toolkit for visualizing and analyzing image and video datasets, with Hugging Face integration

Running SBERT with Hugging Face only

#1. Load the model
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from sentence_transformers import SentenceTransformer
import fiftyone as fo
import numpy as np

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

#2. Prepare the sentences to embed
sentences = [
    "반갑습니다. 제 이름은 조현진입니다.",  # "Nice to meet you. My name is Jo Hyeonjin."
    "안녕하세요. 저는 조현진이에요.",  # "Hello. I am Jo Hyeonjin."
    "오늘 날씨는 맑습니다."  # "The weather is clear today."
]

#3. Embed the sentences with the model
embeddings = model.encode(sentences)
print(embeddings.shape)

#4. Check the similarity between the embeddings
similarity_matrix = model.similarity(embeddings, embeddings)
print(similarity_matrix)

#5. Visualize the similarity results with FiftyOne - not performed
# dataset = fo.Dataset("sentence_similarity")

# Add the samples
# for i, sentence in enumerate(sentences):
#     sample = fo.Sample(
#         filepath=f"sample_{i}.txt",
#         text=sentence,
#         embedding=embeddings[i].tolist()
#     )
#     dataset.add_sample(sample)

# Run the visualization (fo.launch_app is the documented way to open the App)
# session = fo.launch_app(dataset)
# session.wait()

Result

image.png

The printed shape, (3, 384), means that 3 sentences were embedded and that each embedding vector has 384 dimensions.
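To make the shape concrete, here is a minimal sketch that uses plain Python lists in place of the real array. The toy 4-dimensional vectors and the helper `shape_of` are hypothetical and exist only for illustration. The real call returns a NumPy array with 384 columns.

```python
# Toy stand-in for model.encode(sentences): one row per sentence,
# one column per embedding dimension (4 here instead of 384).
toy_embeddings = [
    [0.1, 0.3, -0.2, 0.5],
    [0.2, 0.1, -0.1, 0.4],
    [-0.4, 0.0, 0.6, 0.1],
]


def shape_of(matrix):
    """Return (rows, columns) of a rectangular list of lists (hypothetical helper)."""
    return (len(matrix), len(matrix[0]) if matrix else 0)


n_sentences, n_dims = shape_of(toy_embeddings)
print(n_sentences, n_dims)
```

The first number counts the sentences and the second counts the dimensions, so row `i` is the vector for `sentences[i]`.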

The printed tensor is the cosine similarity matrix between the sentence embedding vectors.
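As a rough illustration of what such a matrix contains, the sketch below computes pairwise cosine similarity in pure Python. It is not the library's implementation: the function names and the 2-dimensional toy vectors are assumptions made for this example, and returning 0.0 for a zero vector is a convention chosen only for the sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch only
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Entry [i][j] is the cosine similarity between vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy 2-dimensional vectors standing in for real sentence embeddings.
toy_vectors = [
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
]

matrix = similarity_matrix(toy_vectors)
for row in matrix:
    print([round(x, 3) for x in row])
```

The diagonal is 1 because every vector is compared with itself, and the matrix is symmetric, so entry `[i][j]` equals entry `[j][i]`. In the real output, the two greeting sentences would be expected to score higher with each other than with the weather sentence.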