Meet ‘North Mini Code’: Cohere’s 30B Open-Weight Mixture-of-Experts Model with 3B Active Parameters for Agentic Coding

This week, the Cohere AI team released its first developer-facing code model, ‘North Mini Code’. North Mini Code is open-weight and aimed at application developers. It is a mixture-of-experts (MoE) model with 30B total parameters. Only 3B of those parameters are active per token.
The release is positioned around “autonomous” AI. The idea is simple: use capable models on your own terms. Small, efficient code models let teams run independently, without large GPU clusters. North Mini Code directly addresses that gap.
North Mini Code
North Mini Code is a 30B-A3B parameter model. A3B denotes three billion active parameters per forward pass. Cohere has optimized it for three tasks: code generation, agentic software engineering, and terminal tasks. The model is text-in, text-out. It accepts no image or video input.
The context window is 256K tokens. The maximum output length is 64K tokens. Cohere lists a minimum hardware bar of a single H100 at FP8. Weights are released under Apache 2.0 on Hugging Face. You can also access the model through the Cohere API, Model Vault, and OpenRouter.
| Field | North-Mini-Code-1.0 |
|---|---|
| License | Apache 2.0 |
| Model size | 30B total; 3B active |
| Context length | 256K total; 64K maximum generation |
| Optimized for | Code generation, agentic software engineering, terminal tasks |
| Availability | Hugging Face, Cohere API, Cohere Model Vault, OpenRouter |
| Hardware (minimum) | 1× H100 @ FP8 |
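The single-H100 figure can be sanity-checked with rough arithmetic. The sketch below counts weight storage only and assumes the 80 GB H100 variant; it ignores the KV cache, activations, and runtime overhead, so treat it as a lower bound rather than a sizing guide.

```python
# Back-of-the-envelope weight memory. Weights only: no KV cache, activations, or overhead.
TOTAL_PARAMS = 30e9      # 30B total parameters (from the spec table)
ACTIVE_PARAMS = 3e9      # 3B active parameters per token (from the spec table)
H100_MEMORY_GB = 80      # assumption: the 80 GB H100 variant


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return weight storage in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


fp8_gb = weight_memory_gb(TOTAL_PARAMS, 1)    # FP8 stores 1 byte per parameter
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2)   # BF16 stores 2 bytes per parameter
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"Room left on one H100 at FP8: ~{H100_MEMORY_GB - fp8_gb:.0f} GB")
print(f"Active share of parameters per token: {active_fraction:.0%}")
```

Under these assumptions, FP8 weights leave room on one card for the KV cache, while BF16 weights alone take most of the card's memory.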
Architecture
North Mini Code is a decoder-only Transformer with sparse MoE layers. Its attention combines two types in a 3:1 ratio. Sliding-window attention uses RoPE for positions. Global attention uses no positional embedding at all. The feed-forward block holds 128 experts. Eight experts activate per token. Each expert is an FFN using SwiGLU.
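To make the 3:1 attention mix concrete, here is a minimal sketch of one way such a layer schedule could be laid out. It assumes three sliding-window layers for every global layer, with the global layer last in each group; the article does not state the direction of the ratio, the ordering, or the layer count, so all three are illustrative.

```python
def attention_pattern(num_layers: int, local_per_global: int = 3) -> list[str]:
    """Label each layer 'sliding' or 'global'; every (local_per_global + 1)-th layer is global."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]


# 8 is an illustrative layer count, not the model's actual depth.
layers = attention_pattern(8)
print(layers)
```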
The router applies a sigmoid before top-k selection. One dense layer sits before the sparse layers. That mix keeps per-token compute small while expanding overall capacity. Cohere released the weights in BF16.
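The routing step can be sketched in a few lines of plain Python. This is an illustration, not Cohere's implementation: the random logits stand in for a learned router, and renormalizing the selected scores so they sum to 1 is an assumption the article does not confirm.

```python
import math
import random

NUM_EXPERTS = 128   # experts per MoE block (from the text)
TOP_K = 8           # experts activated per token (from the text)


def route(logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Apply a sigmoid to each router logit, keep the k highest-scoring experts,
    and return {expert_index: weight}. Renormalizing the weights is an assumption."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}


random.seed(0)
# Stand-in for the learned router's output for a single token.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = route(logits)

print("selected experts:", sorted(weights))
print(f"{len(weights)}/{NUM_EXPERTS} experts active = {len(weights) / NUM_EXPERTS:.2%}")
```

Only the selected experts run their FFN for that token, which is why the active parameter count stays far below the total.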
Post-training was conducted in two phases. Two stages of cascaded supervised fine-tuning (SFT) came first. Then came reinforcement learning with verifiable rewards (RLVR). Post-training focuses on agentic coding. The model also supports interleaved reasoning and native tool use.
Benchmarks
Cohere reports a score of 33.4 on the Artificial Analysis Coding Index. It describes this as a competitive position among models of similar size. The company evaluated on SWE-Bench Verified, SWE-Bench Pro, and Terminal-Bench v2. It also used Terminal-Bench Hard, SciCode, and LiveCodeBench v6.
The method is straightforward. SWE-Bench used the SWE-agent v1.1.0 harness. Terminal-Bench v2 used a simple ReAct harness with one terminal tool. Terminal-Bench Hard used the Terminus-2 harness. Each benchmark was run with three seeds and the results were averaged. Sampling used temperature 1.0 and top_p 0.95.
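The seed-averaging protocol is simple enough to sketch. In the snippet below, `run_benchmark` is a hypothetical stand-in for a harness run, the seed values are invented (the article gives only the count), and the scores are made-up placeholders, not reported results.

```python
from statistics import mean

SAMPLING = {"temperature": 1.0, "top_p": 0.95}   # settings stated in the article
SEEDS = [0, 1, 2]                                 # hypothetical seed values; only the count is given


def run_benchmark(seed: int) -> float:
    """Hypothetical stand-in for one harness run; returns a placeholder score."""
    placeholder_scores = {0: 0.40, 1: 0.50, 2: 0.60}   # made-up numbers, not reported results
    return placeholder_scores[seed]


def averaged_score(seeds: list[int]) -> float:
    """Run the benchmark once per seed and report the mean."""
    return mean(run_benchmark(s) for s in seeds)


print(f"sampling: {SAMPLING}")
print(f"mean over {len(SEEDS)} seeds: {averaged_score(SEEDS):.3f}")
```

Averaging matters here because sampling at temperature 1.0 is stochastic, so a single run can land noticeably above or below the typical score.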
Speed
In Cohere’s internal tests, North Mini Code achieved up to 2.8x higher output throughput than Devstral Small 2. That holds at the same concurrency and on the same hardware. It also showed a 30% edge in inter-token latency. Time to first token (TTFT) was close between the two. Devstral Small 2 retained a small TTFT lead.
| Metric | North Mini Code vs Devstral Small 2 |
|---|---|
| Output throughput | Up to 2.8x higher (same concurrency and hardware) |
| Inter-token latency | 30% better for North Mini Code |
| Time to first token | Slightly behind Devstral Small 2 |
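A simple latency model shows why a small TTFT deficit matters little for long generations. The sketch below treats response time as TTFT plus one inter-token gap per remaining token. Every number in it is a hypothetical placeholder, and it assumes "30% better" means 30% lower inter-token latency.

```python
def generation_time(ttft_s: float, itl_s: float, num_tokens: int) -> float:
    """Total time for one response: first token, then one inter-token gap per remaining token."""
    return ttft_s + itl_s * max(num_tokens - 1, 0)


# All numbers below are hypothetical placeholders, not measurements.
baseline = {"ttft_s": 0.50, "itl_s": 0.020}
# Assumption: a 30% edge means 30% lower inter-token latency, with a slightly worse TTFT.
candidate = {"ttft_s": 0.55, "itl_s": baseline["itl_s"] * 0.70}

for n in (2, 1024):
    b = generation_time(num_tokens=n, **baseline)
    c = generation_time(num_tokens=n, **candidate)
    print(f"{n:>5} tokens: baseline {b:.2f}s, candidate {c:.2f}s")
```

With these placeholder values the model with the slower first token loses on a two-token reply but wins clearly on a long one, which is the regime agentic coding usually operates in.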
Use Cases with Examples
Cohere built North Mini Code for agentic workflows.
Three patterns stand out in its positioning:
- Sub-agent orchestration: A lead agent delegates subtasks to helper agents. For example: one agent writes unit tests while another fixes failing code.
- System architecture mapping: The model reads a codebase and maps its structure. For example: tracing how services call each other before a major refactor.
- Code review: The model reviews a diff for problems. For example: flagging an unhandled null dereference before merge.
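The orchestration pattern in the first bullet can be sketched as a small dispatch loop. The helper functions below are hypothetical stubs that return tagged strings; in a real system each would prompt the model with its own instructions and tools.

```python
from typing import Callable

# A helper is any callable that takes a task description and returns text.
Helper = Callable[[str], str]


def write_tests(task: str) -> str:
    """Hypothetical helper; a real one would prompt the model to write unit tests."""
    return f"[tests] {task}"


def fix_code(task: str) -> str:
    """Hypothetical helper; a real one would prompt the model to patch failing code."""
    return f"[patch] {task}"


def orchestrate(subtasks: list[tuple[str, str]], helpers: dict[str, Helper]) -> list[str]:
    """Lead-agent loop: send each (role, task) pair to the matching helper and collect results."""
    results = []
    for role, task in subtasks:
        if role not in helpers:
            raise KeyError(f"no helper registered for role {role!r}")
        results.append(helpers[role](task))
    return results


helpers = {"tester": write_tests, "fixer": fix_code}
plan = [("tester", "cover parse_config()"), ("fixer", "repair failing test_parse_config")]
for line in orchestrate(plan, helpers):
    print(line)
```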
Terminal tasks fit the model as well. For example: listing files, running a build, and parsing the output to find errors.
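A terminal tool of the kind described here usually boils down to "run a command, then scan the output". The sketch below shows that loop with the standard library; the build step is simulated by a child Python process that prints a fake log, and the keyword match on "error" is a deliberately crude assumption.

```python
import subprocess
import sys


def run_command(args: list[str]) -> str:
    """Run a command and return its stdout and stderr combined as text."""
    proc = subprocess.run(args, capture_output=True, text=True)
    return proc.stdout + proc.stderr


def find_error_lines(output: str) -> list[str]:
    """Return the lines that mention 'error' (case-insensitive)."""
    return [line for line in output.splitlines() if "error" in line.lower()]


# Stand-in for a build step: a child Python process that prints a fake build log.
fake_build = [
    sys.executable,
    "-c",
    "print('compiling main.c'); print('main.c:3: error: missing semicolon')",
]
for line in find_error_lines(run_command(fake_build)):
    print(line)
```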
Getting started
The fastest way is Hugging Face Transformers. Install Transformers from source for this model. Recommended sampling is temperature 1.0 and top_p 0.95.
# Install Transformers from source (required for this model):
# pip install "git+
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "CohereLabs/North-Mini-Code-1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
prompt = "Write a python program to check if a string is a palindrome or not."
messages = [{"role": "user", "content": prompt}]
# return_dict=True yields a dict (input_ids + attention_mask) so **inputs unpacks cleanly
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
gen_tokens = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)
# Decode only the newly generated tokens, not the prompt
output = tokenizer.decode(gen_tokens[0][inputs["input_ids"].shape[-1]:])
print(output)
For serving, vLLM is supported. You need vLLM's main branch plus Cohere’s melody library (cohere_melody). Correct response parsing depends on it.
uv pip install "git+
uv pip install "cohere_melody>=0.9.0"
vllm serve CohereLabs/North-Mini-Code-1.0 \
  -tp 2 \
  --max-model-len 320000 \
  --tool-call-parser cohere_command4 \
  --reasoning-parser cohere_command4 \
  --enable-auto-tool-choice
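Once the server is up, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the standard library and assumes the server is listening on vLLM's default address, `http://localhost:8000`; adjust `BASE_URL` if you passed a different host or port. The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumption: vLLM's default host and port
MODEL_ID = "CohereLabs/North-Mini-Code-1.0"


def build_payload(prompt: str, max_tokens: int = 1024) -> dict:
    """Build a chat-completions request body with the recommended sampling settings."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def chat(prompt: str) -> str:
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Write a Python function that checks whether a string is a palindrome."))
```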
Quantized builds are available for Ollama, LM Studio, and llama.cpp. You can also try the model before downloading. Cohere offers free access through OpenCode and a hosted Hugging Face Space.
Key Takeaways
- Cohere’s first code model, North Mini Code, is a 30B mixture-of-experts model that activates 3B parameters per token.
- It runs on a single H100 in FP8, with a 256K context window and 64K max output.
- Weights are shipped under Apache 2.0, although the Hugging Face card adds a non-commercial note.
- Cohere reports a score of 33.4 on the Artificial Analysis Coding Index and up to 2.8x the output throughput of Devstral Small 2.
- It is designed for agentic coding: sub-agent orchestration, architecture mapping, and code review, with native tool use.
North Mini Code's open weights were released on June 9, 2026.
License note: Cohere’s blog says Apache 2.0. The Hugging Face card adds an acceptable use supplement and a non-commercial note. Check both before you use the model.
The performance figures above are reported by Cohere; independent evaluation is still important.


