31B · Gemma4-31B-IT
ALLM.H
Korean Medical Foundation Model — KorMedMCQA SOTA 96.78%
KorMedMCQA Doctor Test 96.78%: ranked #1 overall, commercial models included. Surpasses Claude Opus 4 (96.55%), o1-preview (92.72%), and HARI (89.2%), setting the open-source medical AI SOTA. Built on Gemma4-31B-IT with a SimPO + Self-Consistency inference pipeline.
Model Card
Base Model
Gemma4-31B-IT
Parameters
31B
License
Gemma License + Acryl Research License
Languages
Korean, English
Modalities
Text
Hardware
8× NVIDIA B200 192GB (SimPO training ~13min)
Training Pipeline
Base Model
Gemma4-31B-IT (Google DeepMind)
SimPO Preference Optimization
Curated preference pairs + Style-Preserving Knowledge Distillation
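The SimPO step above optimizes directly on length-normalized sequence log-probabilities, with no reference model. A minimal sketch of the per-pair loss (hyperparameter values `beta=2.0`, `gamma=1.0` are illustrative defaults, not the values used in this training run):

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair.

    The implicit reward is the length-normalized sequence log-probability
    scaled by beta; gamma is the target reward margin. No reference model
    is needed, unlike DPO.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # negative log-sigmoid of the reward margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In training, `logp_chosen` / `logp_rejected` come from summing the policy's token log-probabilities over each response; the loss is averaged over the 848 curated pairs.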
Self-Consistency Inference
SC k=3 majority voting + enhanced answer extraction (pred_X=0)
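The Self-Consistency step can be sketched as a majority vote over k=3 sampled completions. The extraction regex below is a hypothetical stand-in; the paper's "enhanced answer extraction" rules are not spelled out here:

```python
import re
from collections import Counter

def extract_choice(text):
    """Pull the final standalone answer letter (A-E) from a completion.

    Hypothetical pattern: takes the last isolated capital A-E token,
    or None if no answer is found.
    """
    matches = re.findall(r"\b([A-E])\b", text)
    return matches[-1] if matches else None

def self_consistency(completions):
    """SC majority vote; ties break toward the first answer seen."""
    votes = [c for c in (extract_choice(t) for t in completions) if c]
    return Counter(votes).most_common(1)[0][0] if votes else None
```

With k=3, two agreeing samples are enough to fix the answer, which is what makes the scheme cheap relative to larger-k voting.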
What Makes This Different
| Prior Work | Our Differentiator |
|---|---|
| Claude Opus 4 (96.55%) | 96.78%: a 31B open-source model surpassing commercial models |
| HARI (SNUH, 73B, 89.2%) | Leads the 73B model by 7.58%p at 31B |
| K-Med.ai (Naver+SNUH, 96.4%) | Comparable score, achieved as open source |
| Qwen2.5-72B (78.86%) | 17.92%p gap: technique, not model size |
Paper Contributions
- KorMedMCQA Doctor Test SOTA 96.78% (ranked #1 overall, commercial models included)
- A 31B open-source model surpassing 73B HARI and Claude Opus 4
- SimPO + Self-Consistency inference pipeline
- First Gemma4 fine-tuned medical model in Korea
Benchmarks
| Benchmark | Score | Baseline / SOTA | Metric |
|---|---|---|---|
| KorMedMCQA Doctor Test (435) | 96.78% | Claude Opus 4: 96.55% | Accuracy |
| vs HARI (SNUH, previous open-source best) | +7.58%p | HARI-Q2.5-Thinking: 89.2% | Accuracy |
| vs o1-preview (OpenAI) | +4.06%p | o1-preview: 92.72% | Accuracy |
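As a quick sanity check on the headline number: on a 435-question test, 96.78% accuracy corresponds to 421 correct answers (the exact correct count is an inference, not stated in the card):

```python
correct = 421   # inferred: 96.78% of 435, rounded
total = 435     # KorMedMCQA Doctor Test size
acc = correct / total
print(f"{acc:.2%}")  # 96.78%
```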
Training Data
- Preference: 848 curated preference pairs (SimPO)
- Method: Style-Preserving KD + Quality Gate
Quick Start
```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Acryl-aLLM/ALLM.H-Bv4-Gemma4-31B-BF16")
tokenizer = AutoTokenizer.from_pretrained("Acryl-aLLM/ALLM.H-Bv4-Gemma4-31B-BF16")

messages = [
    # "65-year-old male, sudden chest pain and dyspnea. Differential diagnosis?"
    {"role": "user", "content": "65세 남성, 갑작스런 흉통과 호흡곤란. 감별 진단은?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```