31B · Gemma4-31B-IT
ALLM.H
Korean Medical Foundation Model — KorMedMCQA SOTA 96.78%
KorMedMCQA Doctor Test 96.78%: ranked #1 overall, commercial models included. Surpasses Claude Opus 4 (96.55%), o1-preview (92.72%), and HARI (89.2%), setting the open-source medical AI SOTA. Built on Gemma4-31B-IT with a SimPO + Self-Consistency inference pipeline.
Model Card
Base Model
Gemma4-31B-IT
Parameters
31B
License
Gemma License + Acryl Research License
Languages
Korean, English
Modalities
Text
Hardware
8× NVIDIA B200 192GB (SimPO training ~13min)
Training Pipeline
Base Model
Gemma4-31B-IT (Google DeepMind)
SimPO Preference Optimization
Curated preference pairs + Style-Preserving Knowledge Distillation
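The SimPO step above optimizes directly on length-normalized sequence log-probabilities, with no reference model. A minimal sketch of the per-pair loss (hyperparameter values `beta=2.0`, `gamma=1.0` are illustrative defaults, not the values used in this training run):

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair.

    The implicit reward is the length-normalized sequence log-probability
    scaled by beta; gamma is the target reward margin. No reference model
    is needed, unlike DPO.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # negative log-sigmoid of the reward margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In training, `logp_chosen` / `logp_rejected` come from summing the policy's token log-probabilities over each response; the loss is averaged over the 848 curated pairs.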
Self-Consistency Inference
SC k=3 majority voting + enhanced answer extraction (pred_X=0)
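The Self-Consistency step can be sketched as a majority vote over k=3 sampled completions. The extraction regex below is a hypothetical stand-in; the paper's "enhanced answer extraction" rules are not spelled out here:

```python
import re
from collections import Counter

def extract_choice(text):
    """Pull the final standalone answer letter (A-E) from a completion.

    Hypothetical pattern: takes the last isolated capital A-E token,
    or None if no answer is found.
    """
    matches = re.findall(r"\b([A-E])\b", text)
    return matches[-1] if matches else None

def self_consistency(completions):
    """SC majority vote; ties break toward the first answer seen."""
    votes = [c for c in (extract_choice(t) for t in completions) if c]
    return Counter(votes).most_common(1)[0][0] if votes else None
```

With k=3, two agreeing samples are enough to fix the answer, which is what makes the scheme cheap relative to larger-k voting.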
What Makes This Different
| Prior Work | Our Differentiator |
|---|---|
| Claude Opus 4 (96.55%) | 96.78%: a 31B open-source model surpassing commercial models |
| HARI (SNUH, 73B, 89.2%) | Leads the 73B model by 7.58%p at 31B |
| K-Med.ai (Naver+SNUH, 96.4%) | Comparable score, achieved as open source |
| Qwen2.5-72B (78.86%) | 17.92%p gap: technique, not model size |
Paper Contributions
- KorMedMCQA Doctor Test SOTA 96.78% (ranked #1 overall, commercial models included)
- A 31B open-source model surpassing 73B HARI and Claude Opus 4
- SimPO + Self-Consistency inference pipeline
- First Gemma4 fine-tuned medical model in Korea
Benchmarks
| Benchmark | Score | Baseline / SOTA | Metric |
|---|---|---|---|
| KorMedMCQA Doctor Test (435) | 96.78% | Claude Opus 4: 96.55% | Accuracy |
| vs HARI (SNUH, previous open-source best) | +7.58%p | HARI-Q2.5-Thinking: 89.2% | Accuracy |
| vs o1-preview (OpenAI) | +4.06%p | o1-preview: 92.72% | Accuracy |
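As a quick sanity check on the headline number: on a 435-question test, 96.78% accuracy corresponds to 421 correct answers (the exact correct count is an inference, not stated in the card):

```python
correct = 421   # inferred: 96.78% of 435, rounded
total = 435     # KorMedMCQA Doctor Test size
acc = correct / total
print(f"{acc:.2%}")  # 96.78%
```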
Training Data
- Preference: 848 curated preference pairs (SimPO)
- Method: Style-Preserving KD + Quality Gate
Quick Start
```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Acryl-aLLM/ALLM.H-Bv4-Gemma4-31B-BF16")
tokenizer = AutoTokenizer.from_pretrained("Acryl-aLLM/ALLM.H-Bv4-Gemma4-31B-BF16")

messages = [
    # "65-year-old male, sudden chest pain and dyspnea. Differential diagnosis?"
    {"role": "user", "content": "65세 남성, 갑작스런 흉통과 호흡곤란. 감별 진단은?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```