From Free Tier to Sovereignty: Running Inference on Cloud ARM Instances

devinfo.dev — May 27, 2026

devinfo.dev:2026.0009

#cloud #arm #oci #sovereignty #inference #self-hosted

The Promise

Oracle Cloud offers Always Free ARM instances: 4 OCPUs, 24GB RAM. Enough to run a quantized 7B or 13B model via Ollama or llama.cpp. No credit card charge. No expiration.

Google Cloud offers a free e2-micro. AWS offers a t2.micro. Neither has enough RAM for inference.

For self-hosted LLM on free tier, Oracle Cloud is the only viable option. This is simultaneously the opportunity and the trap.

The Reality

Oracle's ARM capacity is allocated per-region. Popular regions (US, EU, Canada) are full most of the time. Instance creation fails because there is no capacity, and the API keeps returning `Out of host capacity`.

This is not a temporary condition. It is structural. Free tier users are lowest priority. Paid workloads consume ARM capacity first. What remains — if anything — goes to free tier requests.

The result: you have an account, you have credits, you have a quota that says 4 OCPUs and 24GB RAM. You cannot use any of it.

What Actually Works

Region Selection

Not all regions are equally constrained. Capacity availability (as of 2026):

| Region | ARM Availability | Notes |
|--------|------------------|-------|
| US-Ashburn (IAD) | Rare | Highest demand |
| US-Phoenix (PHX) | Occasional | Slightly better |
| Canada-Montreal (YUL) | Rare | Persistent unavailability reported |
| Canada-Toronto (YYZ) | Occasional | Better than Montreal |
| Germany-Frankfurt (FRA) | Rare | High EU demand |
| UK-London (LHR) | Occasional | |
| India-Mumbai (BOM) | Better | Lower demand |
| Japan-Tokyo (NRT) | Occasional | |
| Brazil-Sao Paulo (GRU) | Better | Lower demand |
| Australia-Sydney (SYD) | Better | Lower demand |

You are locked to your home region on free tier. Choose wisely at account creation. This cannot be changed later without creating a new account.

The Capacity Script

Since capacity appears and disappears, the community approach is to run a script that repeatedly attempts to create an instance:


```shell
#!/bin/bash
# Attempt ARM instance creation every 60 seconds until a launch succeeds.
# AVAILABILITY_DOMAIN must hold the full AD name as printed by
# `oci iam availability-domain list`; a bare "AD-1" is not accepted.
while true; do
  if oci compute instance launch \
    --availability-domain "$AVAILABILITY_DOMAIN" \
    --compartment-id "$COMPARTMENT_ID" \
    --shape "VM.Standard.A1.Flex" \
    --shape-config '{"ocpus": 2, "memoryInGBs": 12}' \
    --image-id "$IMAGE_ID" \
    --subnet-id "$SUBNET_ID" \
    --ssh-authorized-keys-file ~/.ssh/id_rsa.pub \
    --assign-public-ip true; then
    echo "Instance created!"
    break
  fi
  echo "Launch failed (most likely no capacity). Retrying in 60s..."
  sleep 60
done
```
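
The loop above retries on every failure, including ones that will never fix themselves, such as bad credentials or a wrong image ID. A minimal sketch of a stricter variant follows. It assumes the CLI's error output contains the `Out of host capacity` text quoted earlier; the function names and the `RETRY_DELAY` variable are hypothetical helpers, not part of the OCI CLI.

```shell
# Returns 0 if the error text looks like a capacity failure worth retrying.
is_capacity_error() {
  case "$1" in
    *"Out of host capacity"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Runs the given command until it succeeds.
# Stops early when the failure is not a capacity error.
# RETRY_DELAY (seconds between attempts) is an assumed knob, default 60.
retry_on_capacity() {
  while true; do
    if output=$("$@" 2>&1); then
      printf '%s\n' "$output"
      return 0
    fi
    if ! is_capacity_error "$output"; then
      printf 'Giving up, not a capacity error: %s\n' "$output" >&2
      return 1
    fi
    echo "No capacity. Retrying in ${RETRY_DELAY:-60}s..." >&2
    sleep "${RETRY_DELAY:-60}"
  done
}
```

To use it, prefix the launch command with the helper: `retry_on_capacity oci compute instance launch ...` with the same flags as in the script. A typo in a flag then fails once and loudly instead of looping forever.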


Key insight: request fewer OCPUs. A 2 OCPU / 12GB instance is more likely to succeed than the full 4 OCPU / 24GB allocation. You can create multiple smaller instances and split workloads.

Start Small

Instead of requesting the maximum (4 OCPU, 24GB), consider a smaller configuration:

| Configuration | Success Rate | Usable For |
|--------------|-------------|------------|
| 1 OCPU, 6GB | Highest | Small models (3B-7B Q4) |
| 2 OCPU, 12GB | Moderate | 7B Q4-Q5 models comfortably |
| 4 OCPU, 24GB | Lowest | 13B Q4, multiple 7B models |

A 7B Q4_K_M model requires ~4.5GB RAM. A 2 OCPU / 12GB instance runs it with room for the OS and Ollama overhead.
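
The sizing argument is simple subtraction, sketched below. The ~4.5GB model figure comes from the text; the 2GB reserve for the OS and the Ollama runtime is an assumption for illustration, and `headroom_gb` is a hypothetical helper name.

```shell
# Prints the RAM (in GB) left after loading a model and reserving system space.
# $1: instance RAM, $2: model RAM, $3: reserve for OS and runtime (assumed 2)
headroom_gb() {
  awk -v ram="$1" -v model="$2" -v reserve="${3:-2}" \
    'BEGIN { printf "%.1f\n", ram - model - reserve }'
}

headroom_gb 12 4.5   # 2 OCPU / 12GB with a 7B Q4_K_M model: prints 5.5
headroom_gb 1 4.5    # a 1GB micro instance with the same model: prints -5.5
```

A negative result means the model does not fit, which is why the 1GB micro instances on other free tiers are ruled out.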

Setting Up Inference on ARM

Once you have an instance:

Install Ollama

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

Ollama supports ARM64 natively. No special configuration needed.

Pull a Model

```shell
ollama pull llama3.2:3b
# Or the larger 8B model if you have 12GB+ RAM:
ollama pull llama3.1:8b-instruct-q4_K_M
```


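Expose the API

By default Ollama binds to localhost. To serve external clients:

```shell
export OLLAMA_HOST=0.0.0.0:11434
ollama serve
```

Secure it with a firewall rule that only allows your IP or VPN. Ollama has no built-in authentication, so anything that can reach port 11434 can use the model.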
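
On Linux the install script typically registers Ollama as a systemd service, in which case a second `ollama serve` started from a shell competes for the same port. A minimal sketch of the alternative is to set `OLLAMA_HOST` on the service through a drop-in file. The helper name `write_ollama_override` is hypothetical, and the drop-in path in the comment assumes the service is called `ollama`.

```shell
# Writes a systemd drop-in that sets OLLAMA_HOST for the ollama service.
# $1: drop-in directory (normally /etc/systemd/system/ollama.service.d)
# $2: bind address, defaulting to the value used above
write_ollama_override() {
  dropin_dir=$1
  bind_addr=${2:-0.0.0.0:11434}
  mkdir -p "$dropin_dir" || return 1
  printf '[Service]\nEnvironment="OLLAMA_HOST=%s"\n' "$bind_addr" \
    > "$dropin_dir/override.conf"
}

# Typical use on the instance, as root:
#   write_ollama_override /etc/systemd/system/ollama.service.d
#   systemctl daemon-reload
#   systemctl restart ollama
```

Keeping the bind address in a drop-in means it survives reboots and stays separate from the unit file the installer manages.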

ARM Performance

ARM inference is CPU-only (no GPU on free tier). Expect:

| Model | Instance | Throughput |
|-------|----------|------------|
| Llama 3.2 3B Q4 | 2 OCPU / 12GB | ~12 tok/s |
| Llama 3.1 8B Q4 | 2 OCPU / 12GB | ~5 tok/s |
| Llama 3.1 8B Q4 | 4 OCPU / 24GB | ~9 tok/s |
| Qwen 2.5 7B Q4 | 4 OCPU / 24GB | ~8 tok/s |

This is not fast. It is functional. For async tasks (code review, document summarization, batch processing), 5-9 tok/s is usable. For interactive chat, it feels slow.
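
To turn these throughput figures into wait times, divide the expected response length by the rate. The sketch below does only that: it ignores prompt processing time, and the 500-token response length is an assumed example, not a measurement.

```shell
# Prints the approximate seconds needed to generate a response.
# $1: number of output tokens, $2: throughput in tokens per second
gen_seconds() {
  awk -v tokens="$1" -v tps="$2" 'BEGIN { printf "%.0f\n", tokens / tps }'
}

gen_seconds 500 5   # 8B model on 2 OCPU / 12GB: prints 100
gen_seconds 500 9   # 8B model on 4 OCPU / 24GB: prints 56
```

Waiting a minute or two is fine for a batch job and uncomfortable in a chat window, which matches the async versus interactive split above.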

The Sovereignty Architecture

Free tier is not sovereign if a policy change can remove it. Build assuming the instance will disappear:

1. Stateless Inference

Do not store state on the instance. Models are re-pullable. Configuration is in version control. The instance is disposable.
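
One way to keep the instance disposable is to store the list of models in version control and re-pull them on a fresh machine. The sketch below assumes a plain-text file with one Ollama tag per line; the file name `models.txt` and the function name are hypothetical.

```shell
# Re-pulls every model listed in a plain-text file (one Ollama tag per line).
# Blank lines and lines starting with '#' are skipped.
restore_models() {
  while IFS= read -r model || [ -n "$model" ]; do
    case "$model" in
      ''|'#'*) continue ;;
    esac
    # Detach stdin so the pull cannot consume the rest of the list.
    ollama pull "$model" </dev/null || return 1
  done < "$1"
}
```

After installing Ollama on a new instance, `restore_models models.txt` rebuilds the model set without anything having been backed up from the old one.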

2. Multi-Cloud Readiness

Your localllm-engine should have fallback paths:

Primary: OCI ARM instance (Ollama)
Secondary: Local machine (when you are home)
Tertiary: FreeLLMAPI (cloud, for non-sensitive requests)

If OCI disappears, your tools still work. They just route differently.
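
The fallback order can be sketched as "use the first backend that answers". This is an illustration of the idea, not localllm-engine's actual routing logic. The function names and the `PROBE` variable are hypothetical, and the default probe assumes each backend returns a successful HTTP response on its base URL.

```shell
# Default probe: succeeds if the base URL answers an HTTP request within 3 seconds.
probe_http() {
  curl -fsS --max-time 3 -o /dev/null "$1"
}

# Prints the first reachable base URL from the arguments, in priority order.
# PROBE names the check to run and can be swapped out, for example in tests.
pick_backend() {
  for url in "$@"; do
    [ -n "$url" ] || continue
    if "${PROBE:-probe_http}" "$url" 2>/dev/null; then
      printf '%s\n' "$url"
      return 0
    fi
  done
  echo "no backend reachable" >&2
  return 1
}
```

Called as `pick_backend "$OCI_URL" "$LOCAL_URL" "$CLOUD_URL"`, with the three variables standing in for your own base URLs, it prints the OCI instance while that is up and moves down the list once it is gone.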

3. Configuration as Code

```shell
# .env for localllm-engine
OLLAMA_BASE_URL=http://your-oci-instance:11434
LLAMACPP_BASE_URL=http://localhost:8080
FREELLMAPI_BASE_URL=http://freellmapi-instance
```

One-line change to reroute. No tool reconfiguration.

4. Data Sovereignty

Use `X-Routing-Privacy: local_only` for sensitive prompts. Cloud fallback exists but is opt-in per request, not default.
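
The opt-in rule can be sketched as a filter on the backend list. Only the `local_only` value comes from the text above. The `cloud_ok` opt-in value, the backend labels, and the function name are hypothetical and do not describe localllm-engine's real interface.

```shell
# Prints the backends a request may use, in priority order.
# $1: value of the X-Routing-Privacy header (may be empty).
# Anything other than the assumed opt-in value stays on self-hosted backends.
allowed_backends() {
  case "${1:-local_only}" in
    cloud_ok) echo "oci local freellmapi" ;;
    *)        echo "oci local" ;;
  esac
}

allowed_backends              # prints: oci local
allowed_backends cloud_ok     # prints: oci local freellmapi
```

Treating a missing or unrecognized value as `local_only` makes the filter fail closed, so a typo in the header cannot send a sensitive prompt to the cloud backend.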

Alternatives When Free Tier Fails

The honest recommendation: if you need reliable always-on inference and OCI capacity is unavailable, a $4/month Hetzner ARM box or a home server with 16GB+ RAM is more sovereign than a free tier you cannot provision.

Lessons Learned

1. Free tier is a loan, not a right. The provider can change terms, reduce capacity, or sunset the offering. Build accordingly.

2. Region lock is permanent on free tier. Research capacity before creating your account. Once locked, you cannot move.

3. Start small. 1-2 OCPU instances provision more often than 4 OCPU. Run multiple small instances if needed.

4. Automate provisioning. The capacity script is not elegant. It is necessary.

5. Never depend on a single backend. The localllm-engine architecture exists because single points of failure are unacceptable for sovereignty.

6. Measure what you actually need. 5 tok/s on a free ARM instance may be enough for your workload. Do not over-provision for imagined requirements.

References

Oracle Cloud. (2024). "Always Free Resources." https://docs.oracle.com/en-us/iaas/Content/FreeTier/freetier_topic-Always_Free_Resources.htm
Oracle Cloud. (2024). "VM.Standard.A1.Flex (ARM) Shape." https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm
Ollama. (2024). "Supported Platforms: ARM64." https://github.com/ollama/ollama
llama.cpp. (2024). "ARM NEON optimizations." https://github.com/ggerganov/llama.cpp
Hetzner. (2024). "Cloud ARM64 Servers (CAX)." https://www.hetzner.com/cloud
localllm-engine. (2026). Source and architecture. https://github.com/nrupala/localllm-engine

Cite as

devinfo.dev. (2026). "From Free Tier to Sovereignty: Running Inference on Cloud ARM Instances." devinfo.dev:2026.0009. https://devinfo.dev/d/2026.0009

devinfo.dev | https://devinfo.dev/d/2026.0009
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.
https://devinfo.dev
