31B · Gemma4-31B-IT

ALLM.H

Korean Medical Foundation Model — KorMedMCQA SOTA 96.78%
KorMedMCQA Doctor Test 96.78%: #1 overall, including commercial models. An open-source medical AI SOTA that surpasses Claude Opus 4 (96.55%), o1-preview (92.72%), and HARI (89.2%). Based on Gemma4-31B-IT with a SimPO + Self-Consistency inference pipeline.
Model Card
Base Model
Gemma4-31B-IT
Parameters
31B
License
Gemma License + Acryl Research License
Languages
Korean, English
Modalities
Text
Hardware
8× NVIDIA B200 192GB (SimPO training ~13min)
Training Pipeline
Base Model
Gemma4-31B-IT (Google DeepMind)
SimPO Preference Optimization
Curated preference pairs + Style-Preserving Knowledge Distillation
Self-Consistency Inference
SC k=3 majority voting + enhanced answer extraction (pred_X=0)
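SimPO (Simple Preference Optimization) drops DPO's reference model and scores each response by its length-normalized log-likelihood, requiring the chosen response to beat the rejected one by a target margin. A framework-free sketch of the per-pair loss, assuming summed response log-probabilities are already computed; the `beta` and `gamma` values are illustrative defaults from the SimPO paper, not necessarily those used to train this model:

```python
import math

def simpo_loss(chosen_logp, rejected_logp, chosen_len, rejected_len,
               beta=2.0, gamma=0.5):
    """Per-pair SimPO loss: length-normalized log-likelihood margin,
    with no reference model (unlike DPO)."""
    # Implicit rewards: average per-token log-prob, scaled by beta
    r_chosen = beta * chosen_logp / chosen_len
    r_rejected = beta * rejected_logp / rejected_len
    # Require the chosen response to win by at least the target margin gamma
    x = r_chosen - r_rejected - gamma
    # Numerically stable -log(sigmoid(x))
    return math.log1p(math.exp(-x)) if x >= 0 else -x + math.log1p(math.exp(x))

# Toy pair: the chosen response has the higher average log-probability
loss = simpo_loss(chosen_logp=-10.0, rejected_logp=-30.0,
                  chosen_len=20.0, rejected_len=25.0)
```

The length normalization is what lets a short, correct answer outscore a longer rejected one without a reference model in memory, which is why the SimPO pass fits in ~13 min on the 8× B200 node listed above.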
What Makes This Different
| Prior Work | Our Differentiator |
|---|---|
| Claude Opus 4 (96.55%) | 96.78%: a 31B open-source model exceeding a commercial model |
| HARI (SNUH, 73B, 89.2%) | 31B model leads the 73B model by 7.58%p |
| K-Med.ai (Naver+SNUH, 96.4%) | Comparable score, achieved with open source |
| Qwen2.5-72B (78.86%) | 17.92%p gap: technique, not model size |
Paper Contributions
  • KorMedMCQA Doctor Test SOTA 96.78% (#1 overall, including commercial models)
  • A 31B open-source model surpassing 73B HARI and Claude Opus 4
  • SimPO + Self-Consistency inference pipeline
  • First medical model in Korea fine-tuned from Gemma4
Benchmarks
| Benchmark | Score | Baseline / SOTA | Metric |
|---|---|---|---|
| KorMedMCQA Doctor Test (435) | 96.78% | Claude Opus 4: 96.55% | Accuracy |
| vs HARI (SNUH, previous open-source best) | +7.58%p | HARI-Q2.5-Thinking: 89.2% | Accuracy |
| vs o1-preview (OpenAI) | +4.06%p | o1-preview: 92.72% | Accuracy |
Training Data
Quick Start
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in the checkpoint's dtype and shard across available GPUs (31B needs substantial memory)
model = AutoModelForCausalLM.from_pretrained(
    "Acryl-aLLM/ALLM.H-Bv4-Gemma4-31B-BF16",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Acryl-aLLM/ALLM.H-Bv4-Gemma4-31B-BF16")

messages = [
    # "65-year-old male, sudden chest pain and dyspnea. What is the differential diagnosis?"
    {"role": "user", "content": "65세 남성, 갑작스런 흉통과 호흡곤란. 감별 진단은?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0], skip_special_tokens=True))
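The Self-Consistency step from the pipeline above can be sketched as sampling k generations and majority-voting the extracted answer. A minimal sketch: `extract_answer` is an illustrative heuristic (the pipeline's enhanced extraction, `pred_X=0`, is not documented here), and `generate_fn` stands in for a sampling call on the model.

```python
import re
from collections import Counter

def extract_answer(text):
    """Pull the final answer choice (A-E) from a generated rationale.
    Illustrative heuristic: take the last standalone capital letter A-E."""
    matches = re.findall(r"\b([A-E])\b", text)
    return matches[-1] if matches else None

def self_consistency(generate_fn, question, k=3):
    """Sample k reasoning paths and majority-vote the extracted answers (SC k=3)."""
    votes = [extract_answer(generate_fn(question)) for _ in range(k)]
    votes = [v for v in votes if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None

# Mocked generations standing in for three sampled model outputs
outs = iter(["The answer is B.", "I would choose C.", "Final answer: B."])
final = self_consistency(lambda q: next(outs), "dummy MCQA question", k=3)
print(final)  # → B
```

With the Quick Start model, `generate_fn` would tokenize the question, call `model.generate` with sampling enabled (e.g. `do_sample=True, temperature=0.7`), and decode the output, so the three reasoning paths actually differ.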