booklet

From Free Tier to Sovereignty: Running Inference on Cloud ARM Instances

devinfo.dev — May 27, 2026

devinfo.dev:2026.0009

The Promise

Oracle Cloud offers Always Free ARM instances: 4 OCPUs, 24GB RAM. Enough to run a quantized 7B or 13B model via Ollama or llama.cpp. No credit card charge. No expiration.

Google Cloud offers a free e2-micro. AWS offers a t2.micro. Each has about 1GB of RAM, which is not enough for inference.

For self-hosted LLM on free tier, Oracle Cloud is the only viable option. This is simultaneously the opportunity and the trap.

The Reality

Oracle's ARM capacity is allocated per-region. Popular regions (US, EU, Canada) are perpetually full. When there is no capacity, instance creation fails, and the API can keep returning `Out of host capacity` indefinitely.

This is not a temporary condition. It is structural. Free tier users are lowest priority. Paid workloads consume ARM capacity first. What remains — if anything — goes to free tier requests.

The result: you have an account, you have credits, you have a quota that says 4 OCPUs and 24GB RAM. You cannot use any of it.

What Actually Works

Region Selection

Not all regions are equally constrained. Capacity availability (as of 2026):

| Region | ARM Availability | Notes |
|--------|------------------|-------|
| US-Ashburn (IAD) | Rare | Highest demand |
| US-Phoenix (PHX) | Occasional | Slightly better |
| Canada-Montreal (YUL) | Rare | Persistent unavailability reported |
| Canada-Toronto (YYZ) | Occasional | Better than Montreal |
| Germany-Frankfurt (FRA) | Rare | High EU demand |
| UK-London (LHR) | Occasional | |
| India-Mumbai (BOM) | Better | Lower demand |
| Japan-Tokyo (NRT) | Occasional | |
| Brazil-Sao Paulo (GRU) | Better | Lower demand |
| Australia-Sydney (SYD) | Better | Lower demand |

You are locked to your home region on free tier. Choose wisely at account creation. This cannot be changed later without creating a new account.

The Capacity Script

Since capacity appears and disappears, the community approach is to run a script that repeatedly attempts to create an instance:

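```
#!/bin/bash
# Attempt ARM instance creation every 60 seconds.
# AVAILABILITY_DOMAIN, COMPARTMENT_ID, IMAGE_ID and SUBNET_ID must be set in
# the environment. List the AD names for your tenancy with:
#   oci iam availability-domain list
while true; do
  if oci compute instance launch \
      --availability-domain "$AVAILABILITY_DOMAIN" \
      --compartment-id "$COMPARTMENT_ID" \
      --shape "VM.Standard.A1.Flex" \
      --shape-config '{"ocpus": 2, "memoryInGBs": 12}' \
      --image-id "$IMAGE_ID" \
      --subnet-id "$SUBNET_ID" \
      --ssh-authorized-keys-file ~/.ssh/id_rsa.pub \
      --assign-public-ip true; then
    echo "Instance created!"
    break
  fi
  # Any failure lands here, not only capacity errors; read the CLI output.
  echo "Launch failed (usually no capacity). Retrying in 60s..."
  sleep 60
done
```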

Key insight: request fewer OCPUs. A 2 OCPU / 12GB instance is more likely to succeed than the full 4 OCPU / 24GB allocation. You can create multiple smaller instances and split workloads.
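One way to act on this is to try several sizes per attempt and to stop retrying when the failure is not about capacity. The sketch below assumes the 6GB-per-OCPU ratio used by the configurations in this article; `shape_config`, `is_capacity_error` and `try_sizes` are illustrative helper names, not part of the OCI CLI, and the error text is matched as a plain substring.

```shell
# shape_config OCPUS: build the --shape-config JSON, assuming 6 GB per OCPU.
shape_config() {
  printf '{"ocpus": %d, "memoryInGBs": %d}' "$1" "$(( $1 * 6 ))"
}

# is_capacity_error TEXT: succeed only if TEXT contains the capacity error.
is_capacity_error() {
  printf '%s' "$1" | grep -qi "out of host capacity"
}

# try_sizes OCPUS...: attempt one launch per size, in the order given.
# Returns 0 on success, 1 if every size hit a capacity error,
# 2 on any other error (bad credentials, wrong OCID, quota, ...).
try_sizes() {
  local ocpus output
  for ocpus in "$@"; do
    if output=$(oci compute instance launch \
        --availability-domain "$AVAILABILITY_DOMAIN" \
        --compartment-id "$COMPARTMENT_ID" \
        --shape "VM.Standard.A1.Flex" \
        --shape-config "$(shape_config "$ocpus")" \
        --image-id "$IMAGE_ID" \
        --subnet-id "$SUBNET_ID" \
        --ssh-authorized-keys-file ~/.ssh/id_rsa.pub \
        --assign-public-ip true 2>&1); then
      echo "Launched with $ocpus OCPU(s)"
      return 0
    fi
    if ! is_capacity_error "$output"; then
      echo "Non-capacity error, giving up: $output" >&2
      return 2
    fi
  done
  return 1
}

# Usage: keep retrying only while the problem is capacity.
# while true; do try_sizes 4 2 1; rc=$?; [ "$rc" -eq 1 ] || break; sleep 60; done
```

Separating capacity errors from everything else keeps the loop from retrying forever on a typo in an OCID.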

Start Small

Instead of requesting the maximum (4 OCPU, 24GB):

| Configuration | Success Rate | Usable For |
|---------------|--------------|------------|
| 1 OCPU, 6GB | Highest | Small models (3B-7B Q4) |
| 2 OCPU, 12GB | Moderate | 7B Q4-Q5 models comfortably |
| 4 OCPU, 24GB | Lowest | 13B Q4, multiple 7B models |

A 7B Q4_K_M model requires ~4.5GB RAM. A 2 OCPU / 12GB instance runs it with room for the OS and Ollama overhead.
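The arithmetic behind that claim, as a rough sketch. The ~4.5GB figure comes from the text above; the 1024MB default headroom for the OS, the Ollama process and the context cache is an assumption, and `spare_mb` and `fits` are illustrative names.

```shell
# spare_mb RAM_GB MODEL_MB: memory left after loading the model weights, in MB.
spare_mb() {
  echo $(( $1 * 1024 - $2 ))
}

# fits RAM_GB MODEL_MB [HEADROOM_MB]: succeed if the spare memory covers the
# headroom. The 1024 MB default is an assumed figure, not a measured one.
fits() {
  [ "$(spare_mb "$1" "$2")" -ge "${3:-1024}" ]
}

# A ~4.5 GB (4608 MB) model on each instance size from the table.
for ram_gb in 6 12 24; do
  echo "${ram_gb}GB instance: $(spare_mb "$ram_gb" 4608) MB spare"
done
```

Longer contexts need more headroom, so treat the threshold as something to tune against your own workload.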

Setting Up Inference on ARM

Once you have an instance:

Install Ollama

```
curl -fsSL https://ollama.com/install.sh | sh
```

Ollama supports ARM64 natively. No special configuration needed.

Pull a Model

```
ollama pull llama3.2:3b
# Or the 8B model if you have 12GB+ RAM:
ollama pull llama3.1:8b-instruct-q4_K_M
```

Expose the API

By default Ollama binds to localhost. To serve external clients:

```
export OLLAMA_HOST=0.0.0.0:11434
ollama serve
```

Secure it with a firewall rule that allows only your IP or VPN. Ollama's API has no built-in authentication, so an open port is an open model.
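On Linux the install script normally registers Ollama as a systemd service, so a variable exported in your shell does not reach it. A minimal sketch of the drop-in approach follows. `write_ollama_override` and `allow_from` are illustrative helpers, the iptables rule is printed rather than applied, and on OCI you would still need a matching ingress rule in the subnet's security list.

```shell
# write_ollama_override [DIR]
# Write a systemd drop-in that sets OLLAMA_HOST for ollama.service.
# DIR defaults to the usual drop-in location; pass another directory to preview.
write_ollama_override() {
  local dir="${1:-/etc/systemd/system/ollama.service.d}"
  mkdir -p "$dir" || return 1
  cat > "$dir/override.conf" <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
}

# allow_from IP: print (not run) an iptables rule admitting one client address.
allow_from() {
  printf 'iptables -I INPUT -p tcp -s %s --dport 11434 -j ACCEPT\n' "$1"
}

# Usage, as root on the instance:
#   write_ollama_override
#   systemctl daemon-reload && systemctl restart ollama
#   allow_from 203.0.113.7    # review the printed rule, then run it
```

Printing the rule first is deliberate: a wrong source address here either locks you out or exposes the API.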

ARM Performance

ARM inference is CPU-only (no GPU on free tier). Expect:

| Model | Instance | Throughput |
|-------|----------|------------|
| Llama 3.2 3B Q4 | 2 OCPU / 12GB | ~12 tok/s |
| Llama 3.1 8B Q4 | 2 OCPU / 12GB | ~5 tok/s |
| Llama 3.1 8B Q4 | 4 OCPU / 24GB | ~9 tok/s |
| Qwen 2.5 7B Q4 | 4 OCPU / 24GB | ~8 tok/s |

This is not fast. It is functional. For async tasks (code review, document summarization, batch processing), 5-9 tok/s is usable. For interactive chat, it feels slow.
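To check these figures on your own instance, Ollama's `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The helpers below are a small sketch for turning those into tok/s and into a wait-time estimate; the function names are illustrative.

```shell
# tok_per_sec EVAL_COUNT EVAL_DURATION_NS: generation throughput in tokens/second.
tok_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# seconds_for TOKENS RATE: estimated seconds to generate TOKENS at RATE tok/s.
seconds_for() {
  awk -v t="$1" -v r="$2" 'BEGIN { printf "%.0f\n", t / r }'
}

# Get the two fields from a non-streaming request, for example:
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model": "llama3.2:3b", "prompt": "Hello", "stream": false}'
# then:
#   tok_per_sec <eval_count> <eval_duration>
#   seconds_for 500 5    # a 500-token summary at 5 tok/s
```

At 5 tok/s a 500-token answer takes about 100 seconds, which is why the async framing matters.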

The Sovereignty Architecture

Free tier is not sovereign if a policy change can remove it. Build assuming the instance will disappear:

1. Stateless Inference

Do not store state on the instance. Models are re-pullable. Configuration is in version control. The instance is disposable.
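A disposable instance implies a bootstrap script that can rebuild it from nothing. A minimal sketch, assuming the model list lives in a `MODELS` variable kept in version control; `run`, `bootstrap` and the `DRY_RUN` switch are illustrative, not part of Ollama.

```shell
# Models to restore on a fresh instance; this list is an example.
MODELS="${MODELS:-llama3.2:3b}"

# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# bootstrap: install Ollama if it is missing, then re-pull every model.
bootstrap() {
  local model
  if ! command -v ollama >/dev/null 2>&1; then
    run sh -c 'curl -fsSL https://ollama.com/install.sh | sh'
  fi
  for model in $MODELS; do
    run ollama pull "$model"
  done
}

# Usage: DRY_RUN=1 shows what would happen; unset it to apply.
#   DRY_RUN=1; bootstrap
```

If this script and the configuration are all you need to recreate the box, losing the instance costs you minutes, not data.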

2. Multi-Cloud Readiness

Your localllm-engine should have fallback paths:

  • Primary: OCI ARM instance (Ollama)
  • Secondary: Local machine (when you are home)
  • Tertiary: FreeLLMAPI (cloud, for non-sensitive requests)

If OCI disappears, your tools still work. They just route differently.
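The fallback order above can be sketched as a first-reachable-wins probe. How localllm-engine actually routes is not shown here; `probe` and `pick_backend` are illustrative, and probing the base URL assumes each backend answers a plain GET there.

```shell
# probe URL: succeed if the backend answers an HTTP GET within 3 seconds.
probe() {
  curl -fs --max-time 3 -o /dev/null "$1"
}

# pick_backend URL...: print the first URL that passes the probe.
# Set PROBE to another command name to swap in a different health check.
pick_backend() {
  local url
  for url in "$@"; do
    if "${PROBE:-probe}" "$url"; then
      echo "$url"
      return 0
    fi
  done
  return 1
}

# Usage, in priority order (primary, secondary, tertiary):
#   pick_backend "http://your-oci-instance:11434" \
#                "http://localhost:11434" \
#                "http://freellmapi-instance"
```

Order encodes preference: the list is walked top to bottom, so the cloud option is reached only when both self-hosted backends fail the probe.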

3. Configuration as Code

```
# .env for localllm-engine
OLLAMA_BASE_URL=http://your-oci-instance:11434
LLAMACPP_BASE_URL=http://localhost:8080
FREELLMAPI_BASE_URL=http://freellmapi-instance
```

One line change to reroute. No tool reconfiguration.
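For shell scripts that sit beside the engine, the same file can be loaded directly. This is a sketch of one common pattern, not a description of how localllm-engine reads its configuration; `load_env` is an illustrative name, and it assumes the file holds only `KEY=VALUE` lines and `#` comments.

```shell
# load_env FILE: export every KEY=VALUE assignment in FILE to the environment.
load_env() {
  [ -f "$1" ] || return 1
  set -a        # auto-export variables assigned while sourcing
  . "$1"
  set +a
}

# Usage:
#   load_env ./.env
#   echo "$OLLAMA_BASE_URL"
# Rerouting is then a one-line edit to .env followed by another load_env.
```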

4. Data Sovereignty

Use `X-Routing-Privacy: local_only` for sensitive prompts. Cloud fallback exists but is opt-in per request, not default.

Alternatives When Free Tier Fails

| Provider | Free/Cheap Option | RAM | GPU | Notes |
|----------|-------------------|-----|-----|-------|
| Oracle Cloud | 4 OCPU / 24GB ARM, free | 24GB | No | Capacity lottery |
| Hetzner | CAX11 (2 ARM vCPU / 4GB), ~$4/mo | 4GB | No | Reliable, no lottery |
| Scaleway | STARDUST, ~$2/mo | 1GB | No | Too small for LLM |
| Vast.ai | Spot GPU, ~$0.20/hr | Varies | Yes | Not persistent |
| RunPod | Spot GPU, ~$0.25/hr | Varies | Yes | Not persistent |
| Home server | One-time cost | Full | Maybe | Always available |

The honest recommendation: if you need reliable always-on inference and OCI capacity is unavailable, a $4/month Hetzner ARM box or a home server with 16GB+ RAM is more sovereign than a free tier you cannot provision.

Lessons Learned

1. Free tier is a loan, not a right. The provider can change terms, reduce capacity, or sunset the offering. Build accordingly.

2. Region lock is permanent on free tier. Research capacity before creating your account. Once locked, you cannot move.

3. Start small. 1-2 OCPU instances provision more often than 4 OCPU. Run multiple small instances if needed.

4. Automate provisioning. The capacity script is not elegant. It is necessary.

5. Never depend on a single backend. The localllm-engine architecture exists because single points of failure are unacceptable for sovereignty.

6. Measure what you actually need. 5 tok/s on a free ARM instance may be enough for your workload. Do not over-provision for imagined requirements.

References

  • Oracle Cloud. (2024). "Always Free Resources." https://docs.oracle.com/en-us/iaas/Content/FreeTier/freetier_topic-Always_Free_Resources.htm
  • Oracle Cloud. (2024). "VM.Standard.A1.Flex (ARM) Shape." https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm
  • Ollama. (2024). "Supported Platforms: ARM64." https://github.com/ollama/ollama
  • llama.cpp. (2024). "ARM NEON optimizations." https://github.com/ggerganov/llama.cpp
  • Hetzner. (2024). "Cloud ARM64 Servers (CAX)." https://www.hetzner.com/cloud
  • localllm-engine. (2026). Source and architecture. https://github.com/nrupala/localllm-engine

Cite as

devinfo.dev. (2026). "From Free Tier to Sovereignty: Running Inference on Cloud ARM Instances." devinfo.dev:2026.0009. https://devinfo.dev/d/2026.0009

devinfo.dev | https://devinfo.dev/d/2026.0009
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.