Merging Is Not Training

devinfo.dev — June 12, 2026

devinfo.dev:2026.0030

#model-merging #mergekit #fine-tuning #inference

You have two fine-tuned models. One is good at code. One is good at reasoning. You want both.

The obvious answer is to train a new model on a combined dataset. That costs data, time, and a GPU cluster. Model merging takes a different path: operate directly on the weights.

What Merging Actually Does

Every fine-tuned model is a pretrained base plus a delta — the weight changes introduced by fine-tuning. Merging algorithms work on those deltas. The base model is the shared coordinate system. The deltas are what you are combining.
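The delta view can be sketched in a few lines. This is a toy illustration, assuming each model is flattened into a plain list of floats; the helper names `task_vector` and `apply_deltas` are hypothetical and not part of any library.

```python
def task_vector(finetuned, base):
    """Delta introduced by fine-tuning: finetuned minus base, element-wise."""
    return [f - b for f, b in zip(finetuned, base)]


def apply_deltas(base, deltas, weights):
    """Add a weighted sum of deltas back onto the shared base."""
    merged = list(base)
    for delta, w in zip(deltas, weights):
        merged = [m + w * d for m, d in zip(merged, delta)]
    return merged


base = [1.0, 2.0, 3.0]
code_model = [1.5, 2.0, 3.0]       # fine-tuning moved parameter 0
reasoning_model = [1.0, 2.0, 2.0]  # fine-tuning moved parameter 2

deltas = [task_vector(code_model, base), task_vector(reasoning_model, base)]
merged = apply_deltas(base, deltas, weights=[1.0, 1.0])
print(merged)
```

Because the two deltas here touch different parameters, both changes survive in the merged result. Real fine-tunes overlap far more, which is where the algorithms below come in.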

The simplest approach, linear interpolation (model soup), just averages the weights. It works surprisingly well when the source models are similar. It fails when they are not — deltas interfere, capabilities cancel.
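A small sketch of why plain averaging can fail, again on toy lists of floats. The function name `linear_merge` is an assumption made for illustration.

```python
def linear_merge(models, weights):
    """Weighted average of parameter lists; weights are normalized to sum to 1."""
    total = sum(weights)
    return [
        sum(w * m[i] for m, w in zip(models, weights)) / total
        for i in range(len(models[0]))
    ]


# Both models started from a base of [0.0, 0.0].
model_a = [0.5, 0.25]   # moved parameter 0 up
model_b = [-0.5, 0.25]  # moved parameter 0 down by the same amount

soup = linear_merge([model_a, model_b], weights=[1.0, 1.0])
print(soup)
```

Parameter 1, where the models agree, is preserved. Parameter 0, where they disagree, averages back to the base value, so whatever either model learned there is lost.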

The Three Algorithms Worth Knowing

SLERP (Spherical Linear Interpolation) interpolates along the surface of a hypersphere rather than along a straight line through Euclidean space. A straight-line average of two vectors that point in different directions is shorter than either of them; SLERP avoids that shrinkage, which matters because weight norms carry information. SLERP is the right tool for blending two models. It is defined for exactly two endpoints, so it does not generalize directly to three or more.
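The following is a minimal sketch of the textbook SLERP formula on flat vectors, not mergekit's implementation. It assumes the two inputs are same-length lists of floats and falls back to linear interpolation when the angle between them is undefined or numerically unstable.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    lerp = [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    if norm0 < eps or norm1 < eps:
        return lerp  # the angle is undefined for a zero vector
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    if math.sin(omega) < eps:
        return lerp  # nearly parallel or antiparallel: fall back to linear
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


v0 = [1.0, 0.0]
v1 = [0.0, 1.0]
mid = slerp(0.5, v0, v1)
lerp_mid = [0.5 * a + 0.5 * b for a, b in zip(v0, v1)]
print(math.hypot(*mid), math.hypot(*lerp_mid))
```

For these two unit vectors the SLERP midpoint keeps unit length, while the straight-line midpoint shrinks to roughly 0.71 of it.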

TIES (TrIm, Elect Sign and Merge) addresses interference directly. When you merge task vectors from multiple models, parameters that moved in opposite directions cancel each other. TIES trims the smallest-magnitude delta values (they are likely noise), resolves sign conflicts by electing, for each parameter, the sign with the larger total magnitude across models, and then averages only the deltas that agree with the elected sign. It was introduced by Yadav et al. at NeurIPS 2023 and remains a widely used baseline for multi-model merges.
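The three TIES steps can be sketched on toy deltas as follows. This is a simplified reading of the procedure, not the reference implementation: the function names and the `density` and `scale` parameters are assumptions, and the sign for each parameter is elected as the sign of the summed trimmed deltas.

```python
def trim(delta, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = max(1, round(len(delta) * density))
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(delta)]


def ties_merge(base, deltas, density=0.5, scale=1.0):
    """Trim each delta, elect a sign per parameter, average the agreeing values."""
    trimmed = [trim(d, density) for d in deltas]
    merged = []
    for i, b in enumerate(base):
        column = [t[i] for t in trimmed]
        # The sign of the sum is the sign with the larger total magnitude.
        sign = 1.0 if sum(column) >= 0 else -1.0
        agreeing = [v for v in column if v * sign > 0]
        mean = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + scale * mean)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
delta_a = [0.8, 0.01, -0.6, 0.3]
delta_b = [-0.2, 0.02, -0.4, 0.5]

merged = ties_merge(base, [delta_a, delta_b], density=0.75)
print(merged)
```

In this toy run the tiny values at parameter 1 are trimmed away, and parameter 0 keeps model A's 0.8 instead of being pulled toward 0.3 by model B's conflicting sign.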

DARE (Drop And REscale) takes a different approach to interference: randomly drop a fraction p of delta parameters and rescale the remainder by 1/(1 - p) to compensate, so the expected value of each delta is unchanged. This sparsifies the task vectors before merging, reducing interference without requiring sign resolution. DARE combines cleanly with TIES (dare_ties in mergekit) and is a common starting point when merging three or more models.
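A minimal sketch of drop-and-rescale on a toy delta. It assumes `density` means the fraction of entries kept (so the drop rate is `1 - density`); the function name `dare` and the seeded `random.Random` instance are choices made for this illustration.

```python
import random


def dare(delta, density, rng):
    """Keep each entry with probability `density`; rescale survivors by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


rng = random.Random(0)
delta = [0.1] * 10_000
sparse = dare(delta, density=0.6, rng=rng)

kept_fraction = sum(1 for d in sparse if d != 0.0) / len(delta)
sum_ratio = sum(sparse) / sum(delta)
print(kept_fraction)  # close to the requested density
print(sum_ratio)      # close to 1.0: rescaling preserves the expected sum
```

Roughly 40% of the entries are zeroed, yet the total contribution of the delta stays about the same, which is why the sparsified task vector can stand in for the original during a merge.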

mergekit Is the Practical Interface

mergekit, developed by Charles Goddard at Arcee AI and published as open source in 2023, reduces model merging to a YAML configuration file. You specify which models to merge, which algorithm to use, and the weight assigned to each source. mergekit handles memory-efficient loading layer by layer — merges run on CPU or with as little as 8 GB of VRAM.

```yaml
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: model-a
    parameters:
      weight: 0.5
      density: 0.6
  - model: model-b
    parameters:
      weight: 0.5
      density: 0.6
parameters:
  normalize: true
dtype: bfloat16
```

The result is a standard Hugging Face-compatible checkpoint. No special runtime is required.

The Real Constraint

Merging only works between models that share the same architecture and were fine-tuned from the same base. You cannot merge a Llama 3 fine-tune with a Mistral fine-tune. The weight tensors must be structurally identical. The delta must be meaningful in the same coordinate space.
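The structural half of this constraint is easy to check before attempting a merge. The sketch below compares tensor names and shapes only. The tensor names and shapes are hypothetical, and the helper `incompatibilities` is not part of any library.

```python
def incompatibilities(shapes_a, shapes_b):
    """List the reasons two checkpoints are structurally incompatible."""
    problems = []
    for name in sorted(set(shapes_a) | set(shapes_b)):
        if name not in shapes_a or name not in shapes_b:
            problems.append(f"{name}: present in only one model")
        elif shapes_a[name] != shapes_b[name]:
            problems.append(f"{name}: shape {shapes_a[name]} vs {shapes_b[name]}")
    return problems


# Hypothetical tensor names and shapes, for illustration only.
model_a = {"embed.weight": (32000, 4096), "layers.0.q_proj.weight": (4096, 4096)}
model_b = {"embed.weight": (32000, 4096), "layers.0.q_proj.weight": (4096, 4096)}
model_c = {"embed.weight": (128256, 4096), "layers.0.q_proj.weight": (4096, 4096)}

print(incompatibilities(model_a, model_b))
print(incompatibilities(model_a, model_c))
```

Passing this check is necessary but not sufficient: two models with identical shapes that were trained from different bases will still merge into noise, because their deltas are not expressed relative to the same starting point.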

This is not a limitation of any particular algorithm. It is a constraint from the mathematics. Models fine-tuned from the same base start from the same point in weight space and usually stay close to it. Merging exploits linear mode connectivity: the property that fine-tuned models often lie in the same loss basin as their base, so points between them also tend to have low loss. Different base models have no such guarantee.

Why This Matters for Self-Hosted Practitioners

You do not need to choose between a coding model and a reasoning model. Merge them. The merged model will not always be better than either source on its strongest task — but it will be competent at both, and you run one model instead of two.

The Open LLM Leaderboard in early 2024 was briefly dominated by merged models, not fine-tuned ones. That is a signal worth taking seriously.

Merging is not a shortcut around training. It is a different operation entirely — one that is fast, cheap, and underused by most practitioners who have not looked past the defaults in their inference stack.

References

1. Yadav, P., Tam, D., Choshen, L., Raffel, C., & Bansal, M. (2023). TIES-Merging: Resolving Interference When Merging Models. Advances in Neural Information Processing Systems (NeurIPS 2023). https://papers.nips.cc/paper_files/paper/2023/file/1644c9af28ab7916874f6fd6228a9bcf-Paper-Conference.pdf

2. Yu, L., Yu, B., Yu, H., Huang, F., & Li, Y. (2023). Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (DARE). arXiv:2311.03099. https://arxiv.org/abs/2311.03099

3. Goddard, C., et al. (2024). Arcee's MergeKit: A Toolkit for Merging Large Language Models. Proceedings of EMNLP 2024 (Industry Track). https://aclanthology.org/2024.emnlp-industry.36/

4. Arcee AI. (2024). mergekit — Tools for merging pretrained large language models. GitHub. https://github.com/arcee-ai/mergekit

5. Labonne, M. (2024). Merge Large Language Models with mergekit. Towards Data Science. https://towardsdatascience.com/merge-large-language-models-with-mergekit-2118fb392b54/

Cite as

devinfo.dev. (2026). "Merging Is Not Training." devinfo.dev:2026.0030. https://devinfo.dev/d/2026.0030

devinfo.dev | https://devinfo.dev/d/2026.0030
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.
https://devinfo.dev
