- MTEB를 custom하여 오픈소스 임베딩 모델을 평가 (데이터셋 일부 수정)
- 평가 코드 업로드 예정
- IR / Clustering 평가 Dataset 추가 예정
- HuggingFace에 기재된 대로 진행하되, SentenceTransformer > Transformers 의 우선순위로 모델 load. (단, Flagembedding으로만 기재된 경우, SentenceTransformer와 Transformers 중 높은 성능의 것으로 기입 // Flagembedding로 Load 필요시, 추후 진행)
- pair sentence로 존재하는 Dataset 중, 중복 pair는 제거 ( (A, B) = (B, A) )
- LLM Based 임베딩 모델은 fp16/bf16으로 평가
- HuggingFace에 Query_fix가 기재된 경우, 추가 (IR에 대해서만 Query_fix가 명시되어 있을 경우에는 MTEB Instruction 기준으로 적용해보고, 만일 더 성능이 높다면 그대로 인정)
- 문의 사항이나, 평가가 필요한 모델은 issue에 남겨주세요.
- 잘못된 부분에 대한 조언/멘트는 감사히 받겠습니다.
평가 방식 살펴보기 : MTEB 코드 살펴보기 (2)
- STS : mean of {pearson, spearman, cosine_pearson, cosine_spearman, ..., euclidean_spearman}
- NLI : average precision
- Clustering : v-measure
- Retrieval : mean of NDCG @ 5, 10
- Weighted_Average => Retrieval : 33.3%, Clustering : 33.3%, [NLI, STS] : 33.3%
| STS_Average | NLI_Average | Clustering_Average | Retrieval_Average | Weighted_Average | Rank | |
|---|---|---|---|---|---|---|
| Qwen/Qwen3-Embedding-8B-bf16 | 89.36 | 77.71 | 65.79 | 78.73 | 66.74 | 1 |
| Alibaba-NLP/gte-Qwen2-7B-instruct-fp16 | 85.55 | 79.48 | 67.34 | 75.73 | 66.03 | 2 |
| intfloat/multilingual-e5-large-instruct | 82.24 | 65.69 | 70.4 | 73.4 | 64.37 | 3 |
| telepix/PIXIE-Rune-Preview | 82.04 | 60.54 | 65.03 | 79.81 | 64.12 | 4 |
| telepix/PIXIE-Spell-Preview-1.7B-fp16 | 78.11 | 58.75 | 66.42 | 79.61 | 63.88 | 5 |
| Qwen/Qwen3-Embedding-4B-bf16 | 89.2 | 76.4 | 58.74 | 77.68 | 63.87 | 6 |
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | 81.9 | 60.21 | 63.82 | 79.44 | 63.54 | 7 |
| nlpai-lab/KURE-v1 | 83.37 | 64.79 | 61.6 | 77.24 | 62.74 | 8 |
| kakaocorp/kanana-nano-2.1b-embedding-fp16 | 83.26 | 66.86 | 59.68 | 76.51 | 62.08 | 9 |
| McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised-bf16 | 79.97 | 67.21 | 65.2 | 71.88 | 62.05 | 10 |
| BAAI/bge-multilingual-gemma2-fp16 | 83.78 | 76.44 | 58.76 | 72.41 | 61.53 | 11 |
| nlpai-lab/KoE5 | 81.36 | 60.27 | 60.39 | 76.68 | 61.43 | 12 |
| Qwen/Qwen3-Embedding-0.6B-bf16 | 84.68 | 69.58 | 58.59 | 73.3 | 61.1 | 13 |
| BAAI/bge-m3 | 83.46 | 65.32 | 58.27 | 75.32 | 61.06 | 14 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 76.89 | 58.95 | 58.94 | 77.21 | 60.48 | 15 |
| dragonkue/BGE-m3-ko | 84.1 | 62.01 | 55.47 | 77.04 | 60.4 | 16 |
| FronyAI/frony-embed-medium-arctic-ko-v2.5 | 83.26 | 60.47 | 53.45 | 77.86 | 59.74 | 17 |
| facebook/drama-1b-fp16 | 80.76 | 61.09 | 51.1 | 77.29 | 58.56 | 18 |
| FronyAI/frony-embed-medium-ko-v2 | 80.23 | 61.14 | 53.75 | 74.02 | 58.3 | 19 |
| upskyy/bge-m3-korean | 84.67 | 70.82 | 42.74 | 69.98 | 54.85 | 20 |
| KLUE-STS | Kor-STS | STS17 | STS_Average | Rank | |
|---|---|---|---|---|---|
| Qwen/Qwen3-Embedding-8B-bf16 | 88.74 | 89.68 | 89.65 | 89.36 | 1 |
| Qwen/Qwen3-Embedding-4B-bf16 | 88.63 | 89.21 | 89.77 | 89.2 | 2 |
| Alibaba-NLP/gte-Qwen2-7B-instruct-fp16 | 89.65 | 83.3 | 83.69 | 85.55 | 3 |
| Qwen/Qwen3-Embedding-0.6B-bf16 | 84.11 | 85.56 | 84.36 | 84.68 | 4 |
| upskyy/bge-m3-korean | 86.73 | 82.82 | 84.45 | 84.67 | 5 |
| dragonkue/BGE-m3-ko | 87.35 | 81.76 | 83.19 | 84.1 | 6 |
| BAAI/bge-multilingual-gemma2-fp16 | 88.94 | 81.29 | 81.1 | 83.78 | 7 |
| BAAI/bge-m3 | 86.8 | 80.98 | 82.59 | 83.46 | 8 |
| nlpai-lab/KURE-v1 | 87.48 | 80.97 | 81.67 | 83.37 | 9 |
| FronyAI/frony-embed-medium-arctic-ko-v2.5 | 83.86 | 82.01 | 83.92 | 83.26 | 10 |
| kakaocorp/kanana-nano-2.1b-embedding-fp16 | 85.94 | 81.22 | 82.63 | 83.26 | 10 |
| intfloat/multilingual-e5-large-instruct | 85.65 | 79.43 | 81.65 | 82.24 | 12 |
| telepix/PIXIE-Rune-Preview | 86.28 | 79.28 | 80.57 | 82.04 | 13 |
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | 86.33 | 78.74 | 80.64 | 81.9 | 14 |
| nlpai-lab/KoE5 | 85.1 | 79.01 | 79.96 | 81.36 | 15 |
| facebook/drama-1b-fp16 | 84.65 | 78 | 79.63 | 80.76 | 16 |
| FronyAI/frony-embed-medium-ko-v2 | 78.01 | 80.47 | 82.21 | 80.23 | 17 |
| McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised-bf16 | 80.81 | 78.79 | 80.3 | 79.97 | 18 |
| telepix/PIXIE-Spell-Preview-1.7B-fp16 | 84.31 | 73.86 | 76.17 | 78.11 | 19 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 82.51 | 73.61 | 74.55 | 76.89 | 20 |
| KLUE-NLI | Kor-NLI | PawsX-PairClassification | NLI_Average | Rank | |
|---|---|---|---|---|---|
| Alibaba-NLP/gte-Qwen2-7B-instruct-fp16 | 80.19 | 89.18 | 69.07 | 79.48 | 1 |
| Qwen/Qwen3-Embedding-8B-bf16 | 79.45 | 88.86 | 64.83 | 77.71 | 2 |
| BAAI/bge-multilingual-gemma2-fp16 | 81.09 | 89.56 | 58.66 | 76.44 | 3 |
| Qwen/Qwen3-Embedding-4B-bf16 | 79.78 | 88.83 | 60.59 | 76.4 | 4 |
| upskyy/bge-m3-korean | 75.72 | 84.5 | 52.23 | 70.82 | 5 |
| Qwen/Qwen3-Embedding-0.6B-bf16 | 71.81 | 82.14 | 54.79 | 69.58 | 6 |
| McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised-bf16 | 68.29 | 80.78 | 52.56 | 67.21 | 7 |
| kakaocorp/kanana-nano-2.1b-embedding-fp16 | 69.16 | 78.84 | 52.58 | 66.86 | 8 |
| intfloat/multilingual-e5-large-instruct | 69.87 | 75.51 | 51.69 | 65.69 | 9 |
| BAAI/bge-m3 | 68.34 | 74.88 | 52.75 | 65.32 | 10 |
| nlpai-lab/KURE-v1 | 67.34 | 75.12 | 51.92 | 64.79 | 11 |
| dragonkue/BGE-m3-ko | 65.41 | 68.71 | 51.92 | 62.01 | 12 |
| FronyAI/frony-embed-medium-ko-v2 | 63.08 | 68.57 | 51.78 | 61.14 | 13 |
| facebook/drama-1b-fp16 | 64.6 | 68.14 | 50.52 | 61.09 | 14 |
| telepix/PIXIE-Rune-Preview | 63.47 | 65.87 | 52.29 | 60.54 | 15 |
| FronyAI/frony-embed-medium-arctic-ko-v2.5 | 61.39 | 68.1 | 51.93 | 60.47 | 16 |
| nlpai-lab/KoE5 | 61.81 | 66.22 | 52.79 | 60.27 | 17 |
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | 63.27 | 65.04 | 52.32 | 60.21 | 18 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 59.86 | 63.78 | 53.2 | 58.95 | 19 |
| telepix/PIXIE-Spell-Preview-1.7B-fp16 | 58.33 | 65.29 | 52.62 | 58.75 | 20 |
| sib200 | clustering_klue_mrc_context_domain | clustering_klue_mrc_ynat_title | Clustering_Average | Rank | |
|---|---|---|---|---|---|
| intfloat/multilingual-e5-large-instruct | 56.67 | 81.85 | 72.67 | 70.4 | 1 |
| Alibaba-NLP/gte-Qwen2-7B-instruct-fp16 | 40.31 | 80.1 | 81.6 | 67.34 | 2 |
| telepix/PIXIE-Spell-Preview-1.7B-fp16 | 52.35 | 69.53 | 77.37 | 66.42 | 3 |
| Qwen/Qwen3-Embedding-8B-bf16 | 48.26 | 68.97 | 80.15 | 65.79 | 4 |
| McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised-bf16 | 45.56 | 81.8 | 68.24 | 65.2 | 5 |
| telepix/PIXIE-Rune-Preview | 44.02 | 71.6 | 79.48 | 65.03 | 6 |
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | 34.46 | 75.62 | 81.38 | 63.82 | 7 |
| nlpai-lab/KURE-v1 | 39.21 | 75.08 | 70.52 | 61.6 | 8 |
| nlpai-lab/KoE5 | 36.22 | 73.29 | 71.67 | 60.39 | 9 |
| kakaocorp/kanana-nano-2.1b-embedding-fp16 | 33.53 | 80.24 | 65.26 | 59.68 | 10 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 47.8 | 71.71 | 57.31 | 58.94 | 11 |
| BAAI/bge-multilingual-gemma2-fp16 | 50.92 | 75.39 | 49.96 | 58.76 | 12 |
| Qwen/Qwen3-Embedding-4B-bf16 | 43.43 | 63.31 | 69.49 | 58.74 | 13 |
| Qwen/Qwen3-Embedding-0.6B-bf16 | 37.8 | 73.41 | 64.55 | 58.59 | 14 |
| BAAI/bge-m3 | 33.01 | 71.85 | 69.96 | 58.27 | 15 |
| dragonkue/BGE-m3-ko | 29.49 | 70.81 | 66.1 | 55.47 | 16 |
| FronyAI/frony-embed-medium-ko-v2 | 37.17 | 66.49 | 57.58 | 53.75 | 17 |
| FronyAI/frony-embed-medium-arctic-ko-v2.5 | 31.31 | 71.04 | 58.01 | 53.45 | 18 |
| facebook/drama-1b-fp16 | 36.06 | 71.07 | 46.18 | 51.1 | 19 |
| upskyy/bge-m3-korean | 22.08 | 73.49 | 32.66 | 42.74 | 20 |
| Ko-StrategyQA | AutoRAGRetrieval | PublicHealthQA | XPQARetrieval | facebook/belebele | webfaq-retrieval | Retrieval_Average | Rank | |
|---|---|---|---|---|---|---|---|---|
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | 80.01 | 90.33 | 82.56 | 42.69 | 95.07 | 85.97 | 79.44 | 1 |
| Qwen/Qwen3-Embedding-8B-bf16 | 83.08 | 81.1 | 86.9 | 38.24 | 98.05 | 85.02 | 78.73 | 2 |
| FronyAI/frony-embed-medium-arctic-ko-v2.5 | 79.46 | 90.42 | 81.61 | 37.77 | 92.9 | 85.03 | 77.86 | 3 |
| Qwen/Qwen3-Embedding-4B-bf16 | 81.98 | 82.38 | 85.88 | 37.9 | 94.12 | 83.83 | 77.68 | 4 |
| facebook/drama-1b-fp16 | 79.57 | 86.77 | 80.06 | 36.97 | 95.73 | 84.62 | 77.29 | 5 |
| nlpai-lab/KURE-v1 | 79.2 | 86.66 | 81.1 | 36.49 | 94.91 | 85.06 | 77.24 | 6 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 79.57 | 83.61 | 80.78 | 41.33 | 92.51 | 85.44 | 77.21 | 7 |
| dragonkue/BGE-m3-ko | 78.75 | 86.65 | 80.49 | 36.42 | 94.88 | 85.04 | 77.04 | 8 |
| nlpai-lab/KoE5 | 78.85 | 84.46 | 83.68 | 34.73 | 94.2 | 84.17 | 76.68 | 9 |
| kakaocorp/kanana-nano-2.1b-embedding-fp16 | 79.98 | 79.71 | 87.45 | 37.35 | 92.09 | 82.49 | 76.51 | 10 |
| Alibaba-NLP/gte-Qwen2-7B-instruct-fp16 | 80.29 | 75.12 | 80.93 | 39.86 | 94.56 | 83.6 | 75.73 | 11 |
| BAAI/bge-m3 | 78.58 | 82.32 | 79.36 | 34.64 | 92.87 | 84.15 | 75.32 | 12 |
| FronyAI/frony-embed-medium-ko-v2 | 77.53 | 80.19 | 79.28 | 33.51 | 92.04 | 81.58 | 74.02 | 13 |
| intfloat/multilingual-e5-large-instruct | 79.59 | 74.56 | 81.35 | 31.46 | 91.76 | 81.69 | 73.4 | 14 |
| Qwen/Qwen3-Embedding-0.6B-bf16 | 75.85 | 81.83 | 78.59 | 31.92 | 91.66 | 79.96 | 73.3 | 15 |
| BAAI/bge-multilingual-gemma2-fp16 | 78.2 | 75.64 | 65.6 | 37.08 | 95.79 | 82.14 | 72.41 | 16 |
| McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised-bf16 | 70.17 | 75.34 | 79.11 | 36.72 | 86.8 | 83.15 | 71.88 | 17 |
| upskyy/bge-m3-korean | 74.2 | 71.66 | 76.53 | 30.27 | 86.91 | 80.28 | 69.98 | 18 |
| STS | NLI | Clustering | IR | |
|---|---|---|---|---|
| BAAI/bge-m3 | ||||
| dragonkue/BGE-m3-ko | ||||
| upskyy/bge-m3-korean | ||||
| nlpai-lab/KoE5 | ||||
| nlpai-lab/KURE-v1 | ||||
| FronyAI/frony-embed-medium-ko-v2 | github issue 요청사항 적용 | |||
| McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised-bf16 | ||||
| FronyAI/frony-embed-medium-arctic-ko-v2.5 | github issue 요청사항 적용 | |||
| Snowflake/snowflake-arctic-embed-l-v2.0 | query: | query: | query: | 허깅페이스 기준 적용 |
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | query: | query: | query: | 허깅페이스 기준 적용 |
| kakaocorp/kanana-nano-2.1b-embedding-fp16 | (큰 차이는 없었음)None or Instruct: retrieve semantically similar text. Query: |
Instruct: Identify the topic or theme of the given texts Query: |
허깅페이스 기준 적용 | |
| Alibaba-NLP/gte-Qwen2-7B-instruct-fp16 | Instruct: retrieve semantically similar text. Query: |
Instruct: retrieve semantically similar text. Query: |
Instruct: Given a web search query, retrieve relevant passages that answer the query Query: |
허깅페이스 기준 적용 |
| BAAI/bge-multilingual-gemma2-fp16 | retrieve semantically similar text. |
retrieve semantically similar text. |
Identify the topic or theme of the given texts |
허깅페이스 기준 적용 |
| intfloat/multilingual-e5-large-instruct | Instruct: retrieve semantically similar text. Query: |
Instruct: Determine whether the two given sentences express the same meaning or not. Query: |
Instruct: Identify the topic or theme of the given texts Query: |
허깅페이스 기준 적용 |
| facebook/drama-1b-fp16 | Query: | Query: | Query: | 허깅페이스 기준 적용 |
| Qwen3-Embedding | Instruct: retrieve semantically similar text. Query: |
Instruct: retrieve semantically similar text. Query: |
Instruct: Given a web search query, retrieve relevant passages that answer the query Query: |
허깅페이스 기준 적용 |
| telepix/PIXIE-Rune-Preview | query: | query: | query: | 허깅페이스 기준 적용 |
| telepix/PIXIE-Spell-Preview-1.7B | (큰 차이는 없었음)None or Instruct: Given a web search query, retrieve relevant passages that answer the query Query: |
Instruct: Given a web search query, retrieve relevant passages that answer the query Query: |
Instruct: Given a web search query, retrieve relevant passages that answer the query Query: |
허깅페이스 기준 적용 |