Why open-source AI has to win

18 min read
open-source-ailocal-llmprivacyself-hostinganthropicopenrouter
View as Markdown

The Friday it went dark

On Friday, 12 June 2026, at 5:21pm ET, the US government sent Anthropic an export-control directive. Suspend all access to Fable 5 and Mythos 5 for any foreign national, anywhere, inside or outside the US, including Anthropic’s own foreign-national employees.

Anthropic could not block only foreign nationals. So it had to cut access for everyone, worldwide, at short notice. Two of the most-hyped models on the planet, gone in an afternoon (and only Fable 5 had even reached the general public). (Anthropic’s own write-up is here.)

The reporting fills in the rest. Bloomberg and Fortune point to a letter from Commerce Secretary Howard Lutnick to Dario Amodei. The trigger, per Axios, was a kind of jailbreak. Someone demoed Fable 5 reading a codebase and finding and fixing software vulnerabilities, and a competitor reportedly claimed to have jailbroken Mythos. That spooked the administration about cyber risk.

Here is the irony, and it is the whole point of this post. The model got pulled because it was good at exactly the kind of private security work you would want to run yourself.

I want to be fair about what this was. It was a temporary suspension of two specific models, not some permanent geopolitical cutoff. Opus 4.8 and the rest stayed up. Anthropic called it “a misunderstanding” and says it is working to restore access. I believe them.

But that is not the part that matters to me.

The part that matters is the mechanism. Access to core infrastructure can vanish overnight, by government order, through no fault of yours, with zero notice. That is not a hypothetical anymore. It happened. On a Friday.

I was testing the thing that got switched off

Funny timing. I had spent this week testing Fable 5 on real projects.

My honest take? Mixed. It is genuinely good at very long autonomous tasks and it is capable. But as a general business-analysis model, the thing I actually use AI for most days, I felt it lagged Opus. I did not feel it warranted the hype.

So I was not heartbroken when it went away. But I sat there and realized the problem is not whether Fable 5 is good. The problem is that I had started building real workflows on top of a model that a government can switch off while I am asleep.

I run a few businesses and an agency. AI is no longer a toy in any of them - it is becoming core infrastructure and a real competitive advantage. And you do not build core infrastructure on something that can be turned off, repriced, or trained on your data. I would never do that with my hosting. Why am I doing it with my AI?

So this week I decided. I am moving my businesses and my client projects toward open-source AI. Let me walk you through why, and then exactly how I am getting started.

(Quick note on words. When I say “open source” I mostly mean open-weight - models like DeepSeek, Kimi by Moonshot, and Qwen by Alibaba release the weights you can run and fine-tune yourself, usually without the full training data. I am using “open source” loosely. Pedants, I see you.)

Privacy

Let me start with privacy, because most people have not read the fine print.

Since 28 August 2025, on Anthropic’s consumer plans (Free, Pro, Max), if you leave training on - and it is on by default - your data can be retained for up to five years and used to train models. You can opt out, which keeps the old 30-day, no-training behavior. But you have to know to do it. (Here are the terms.)

The API and commercial plans are better. Around 30-day default, never trained on without permission, and you can get zero data retention by contract. (Details here.)

Now here is the twist that ties it back to the top of this post. Fable 5 and Mythos 5 are “Covered Models” with mandatory 30-day retention and no zero data retention option. So on the very model this whole story is about, you could not opt out of retention even if you wanted to. (Anthropic’s docs spell this out.)

With an open-weight model running on hardware you control, this conversation does not happen. Nothing leaves the building. There is no retention policy to read because there is nobody to retain it.

And the part the opt-out box does not fix: people. You can write the policy, run the training, send the memo - and someone will still paste the entire client deck into a chatbox at 11pm because it is faster. They are not careless, they are busy, and no amount of training changes that. With a local model you do not have to win that fight, because there is no wire for the data to leave on. The safe thing and the lazy thing finally become the same thing.

Cost

People worry the AI companies will hike prices. The real trap is the opposite of a price hike.

Per-token prices have actually fallen hard - roughly 10$ down to 2,50$ per million tokens in about a year. Sounds great. Except your total spend is exploding anyway, because agentic workflows burn 50 000 to 500 000 tokens per task now, against maybe 2 000 a year ago. The unit got cheaper and you started buying a million times more units.

And that cheap unit price is propped up by a subsidy. In early 2026 OpenAI reportedly lost about 1,22$ for every 1$ of revenue - an operating margin of roughly minus 122% on the quarter, per The Information. Read that again. The pricing you see today is below cost. (The token cost illusion, and the margin math.)

You can measure the size of that gift. SemiAnalysis bought every tier and ran them flat out: a 200$/month Claude Max plan handed back about 8 000$ of API-equivalent tokens, ChatGPT Pro about 14 000$. (TechSpot’s write-up of the study.) But a gift that big only lands for the heaviest users, the ones running agents all day - which, if you are reading this, is probably you. The subsidy bleeds worst on exactly the people who lean on it hardest, and those are the people it cannot keep carrying.

You do not have to guess how this ends, because it is already happening in the open. On 1 June 2026, GitHub Copilot moved to usage-based, per-token billing. GitHub explained that a quick chat question and a multi-hour autonomous agent run had been costing it the same flat fee, and that it had been absorbing the difference. (GitHub’s own announcement.) Google did the same, moving its heaviest AI usage onto paid metered tiers. The flat-rate, all-you-can-eat era is ending one vendor at a time.

So the bill comes due, the subsidy ends, and the thing ballooning is your own usage. I do not love building a business on top of that.

Reliability

Reliability is the boring one, but it bites.

Anthropic has had repeated outages. Multiple incidents in August 2025, a roughly 10-hour outage on 2 March 2026, and a 2 June 2026 outage that took down the web app, the console, and Claude Code all at once. (Status history is public.)

When that last one hit, my honest reaction was that a lot of developers just got handed a mandatory break. That is funny for an afternoon. It is not funny when a client workflow depends on it.

We have seen this before

This is why I do not think the Friday shutdown is a one-off.

Look at chips. The US banned advanced AI chips (A100, H100) to China on 7 October 2022. Nvidia made cut-down parts. The BIS closed the loophole. Nvidia made weaker parts again. The Trump administration halted the H20 in April 2025, then reversed in July 2025. China routed around all of it anyway. (CSIS has the full timeline.) The lesson is not who won. The lesson is that compute is a political lever and it gets pulled.

If your AI lives on someone else’s cloud, your AI lives downstream of geopolitics you do not control.

Local models are getting better

The usual objection: open models are worse. Today, for most complex tasks, that is fair. But argue the trend line, not the snapshot.

And the trend line is brutal for the closed labs. The lead that used to be measured in years is now measured in months - and it is the open side doing the catching up.

The averages show it. Stanford’s AI Index 2025 put the best open model next to the best closed one on the Chatbot Arena leaderboard: in January 2024 the gap was 8,04%; a year later, 1,70%. Epoch AI measures it as a delay instead - a lag of about a year in 2024, down to three or four months by late 2025.

Head to head shows it too. DeepSeek R1 shipped in January 2025 and beat OpenAI’s o1 on the AIME 2024 math exam, 79,8% to 79,2%, then again on MATH-500, 97,3% to 96,4%, with the weights free to download. That summer Qwen3 outscored OpenAI’s o3 and Google’s Gemini 2.5 Pro on AIME 2025, 92,3% to their 88,9% and 88,0%, and Kimi K2 beat GPT-4.1 on the SWE-bench Verified coding test, 65,8% to 54,6%. The closed labs still own the very top - but I would not build a company on a head start that small, shrinking that fast.

And the line keeps moving. As I write this in June 2026, GLM 5.2 from Zhipu just became the leading open-weight model on Artificial Analysis’s coding index, competitive with older Opus releases and, by the people actually running it, only three or four months behind the frontier. The catch comes from the same crowd: it can be a token glutton, with one developer watching it burn 45 000 tokens to chew through a small task. Leaderboard-topping and not yet effortless - which is the open-source story in a single line.

This is the Linux story again. Linux started as one guy in Finland (Linus Torvalds), and now a global community of several thousand developers - around 5 000 active in any given year - maintains most of it. Open source moves faster because it is not one small team in one building. (Honest aside: Linux also won partly because big companies funded it. The same may be true here, with Meta, Alibaba and DeepSeek backing open weights. Fine by me - I care that it ships.)

There is a famous internal Google memo from 2023, “We Have No Moat”, leaked and republished by SemiAnalysis and later attributed to engineer Luke Sernau. It was an internal discussion doc, not Google’s official position, so do not over-read it. But the line stuck with me: “Open-source models are faster, more customizable, more private, and pound-for-pound more capable.” Three years on, it reads less like a worry and more like a forecast.

I think AI should end up like the internet. If one company owned the internet, it would not be where it is today. Everyone got access, and that is what made it matter. AI should go the same way.

The freedom and fine-tuning part

Two more things you only really get with open weights.

First, fine-tuning. I want to be precise here, because the loose version of this claim is wrong. You can fine-tune some closed frontier models - the big labs have offered hosted fine-tuning, though the support is patchy, uneven, and keeps changing. The real difference is that with open weights you can fine-tune locally and privately, with full control of the weights, and nothing leaves your infrastructure. For a business, your private fine-tuned variant is the edge. Picture an insurer tuning a model on a decade of its own claims to catch the fraud patterns specific to its market - or a Finnish company tuning one on its own contracts and support history, in a language the frontier models still handle poorly. That variant is yours, it is private, and nobody else can buy it. You do not want to ship the thing that makes you special to someone else’s training pipeline.

Second, worldview. Every model embeds one. DeepSeek will refuse some historical and political topics, for example. That is a feature for its makers and a problem for you if your work touches those topics. (One team tested 1 360 China-sensitive prompts and hit canned refusals about 85% of the time.)

The clean way to see how shallow the guardrails really are is abliteration, a technique popularized by Maxime Labonne (building on work by Arditi and others) that strips refusals straight out of open weights. It is a neat proof: the refusal is a thin layer you can edit out, not something woven deep into the intelligence underneath. (Side opinion, not a proven fact: I suspect over-tuned guardrails sometimes get in the way of legitimate security work, where you need the model to look at nasty inputs.)

What am I doing

Enough theory. Here is my real plan, and the way I think about it is a ladder of control. At the bottom you have the least control and the most convenience. At the top you have the most control and the most responsibility. Frontier API -> open model via a router -> rented GPU -> owned hardware.

Step 1: default to open, not frontier. Before I reach for a frontier model, I now try an open-source one on OpenRouter first. OpenRouter makes switching trivial, so the cost of trying open first is basically zero. (Yes, OpenRouter is still a third party, so the full privacy win only lands at the owned-hardware rung. That is the point of a ladder - this is a compromise rung, not the destination.)

Step 2: pay for local-first tools. I want the people building on-device AI to keep building. So I buy and sponsor them. My favorite right now is Cotypist, fully on-device AI autocomplete for the Mac - nothing leaves your machine. LM Studio is the other one I lean on for running models locally. Vote with money.

Step 3: rent a GPU instead of buying one. Before you spend real capital, you can just rent. Real rental prices, mid-2026: an RTX 4090 around 0,34$ to 0,55$ an hour, an A100 around 1,07$ to 1,79$, an H100 around 2,50$ to 3,30$ on the big clouds and under 1,50$ on spot and community marketplaces (a live vendor price board). So my old “about a euro an hour” mental model still holds for the smaller cards. You can run serious open models for the price of a coffee.

Step 4: eventually own the hardware. The top rung is your own machines in colocation. The case study I keep coming back to is DHH and 37signals, who left the cloud. They were spending about 3,2M$ a year on cloud, spent roughly 500k$ on Dell hardware, and project around 7M$ saved over five years just from the compute exit - past 10M$ once they also leave cloud storage. I have run my own servers since high school, so I will admit there is some nostalgia in this for me. But the numbers are not nostalgic.

I already bought a Mac Studio that can run large models locally. As a proof point: an M3 Ultra Mac Studio with 512GB of unified memory can run DeepSeek R1 671B (at Q4) entirely in memory, at under 200W. (A reviewer did exactly that.) A frontier-class model, on a desk, sipping less power than a gaming PC. That was not possible a year ago.

And if you do not write code yourself, your first step is not technical at all. Ask the vendors you already pay three questions: where does our data go, how long do you keep it, and could this same job run on a model we host? You may not move a thing this year. But the answers tell you exactly how exposed you are - which is the whole point.

What to actually run

The shopping list, sorted by the hardware you already own.

One caveat first: open models move fast. New versions land almost every month. So treat the exact model names below as “good as of June 2026” and check for a newer release before you download. The apps stay the same. The model names churn. That is the open-source treadmill, and it is a good problem to have.

These picks blend the public benchmarks with what the r/LocalLLaMA crowd actually runs day to day. Their rule of thumb sums it up nicely: Qwen for code, Gemma for words, and the big mixture-of-experts models for the hosted or heavy-Mac tier.

On your phone. Yes, your phone can run a real model, fully offline, on a plane with no wifi. It will not be Opus, but it is genuinely useful.

  • iPhone or iPad: start with PocketPal AI (free, open source, runs most GGUF models from Hugging Face). If you want Siri and Shortcuts integration, Private LLM is the better paid pick (one-time payment, no subscription). MLC Chat is among the fastest on Apple Silicon.
  • Android: PocketPal AI again is the easy default. On a flagship phone, Google AI Edge Gallery runs Gemma 4’s small E-series with much lighter memory use. Power users can run full Ollama through Termux.
  • Best pick: a small Qwen 3.x (the 4B, at Q4) for quality, or Gemma 4 (the small E-series) if you want speed and battery life. The people doing this daily lean Qwen for accuracy and Gemma for lightness. Two things they keep warning about: skip the heavily censored small models (they throw needless refusals), and skip dense models much bigger than 4B (they crawl). A flagship phone tops out around 24GB of RAM, and it is memory bandwidth, not compute, that actually limits you.

On a normal computer. This is where it gets really useful, and almost any Windows, Linux or Mac laptop from the last few years can do it.

  • The app: install LM Studio. It is the friendliest - search a model, it tells you if it fits your RAM, you click download, you are chatting in five minutes. If you live in the terminal or want to script against it, use Ollama. If you care about zero telemetry and MCP tools, look at Jan.
  • Best pick: Qwen 3.6 35B-A3B at Q4. It looks like a 35B, but only 3B activate at a time (it is a mixture-of-experts), so you can keep the bulk of the expert layers in normal system RAM, fetched as needed, and run it on a normal box. This is the community’s universal daily driver. For chat, creative writing and non-English, the co-pick is Gemma 4 26B-A4B. On a 16GB machine, drop to a smaller Qwen 3.x; for the MoEs you want 32GB. The rule of thumb on quantization: keep small models at Q4 or above, but the big mixture-of-experts models hold up fine at Q3 or even Q2 with modern dynamic quants. (This matches the research - larger models are far more resilient to low-bit quantization than small ones, which fall apart fast.)

If you have a beefy machine. For developers and power users with prosumer hardware - not datacenter gear, just the expensive end of consumer - local stops being a toy.

  • A PC with an RTX 5090 (32GB VRAM): the agreed best all-rounder is Qwen 3.6 27B dense (Q5 or Q6, with multi-token prediction on), which fits in 32GB and, by the crowd’s own numbers, now out-codes the older dedicated 80B “coder” models. People are actually canceling Claude subscriptions over it. A fun sleeper: gpt-oss-20b is mediocre at writing code but excellent as the orchestrator that routes an agent’s tool calls. Run Linux headless, since Windows steals 2-3GB of VRAM.
  • A Mac Studio with big unified memory: this is the quiet monster, the only consumer box that loads a near-300B open model whole. Best pick: DeepSeek V4 Flash (a roughly 280B mixture-of-experts) at a low quant, around 20-25 tokens a second while sipping about 50W. But the catch the spec sheet hides: the Mac’s prompt-processing is slow, so it is great for chat and long single answers and frustrating for agentic coding loops (the context rebuilds between tool calls and you sit there waiting). The people who own both a Mac Studio and an Nvidia box reach for the Nvidia box when an agent is involved. A 512GB config starts around 10 000$, which sounds like a lot until you compare it to a year of frontier API bills.

If you don’t want to own anything. You do not need any of the above to start today. Open models are one API call away on OpenRouter, at a fraction of frontier prices. The models I reach for right now:

  • General use: DeepSeek V4 for cheap bulk work (people report it doing Sonnet-level jobs at roughly a third of the cost) and Kimi K2.6 as the most-cited “best general open model”. On the public SWE-bench leaderboard, DeepSeek V4 Pro is the top open-weight model right now, around 80%, tied with a closed frontier model.
  • Coding and agentic: GLM 5.1 is the repeated “open agentic king”, and Kimi K2.7 Code is the cheap honest pick (people rate it above Opus at a fraction of the price). For cheap parallel agents, Qwen 3.6 drops the bill by an order of magnitude.
  • Two warnings from people who actually pay these bills: MiniMax M3 tops benchmarks but disappoints in real use (the classic benchmark-versus-reality gap), and OpenRouter providers quietly serve different quantizations of the same model, so whitelist good providers (the :exacto tag helps) when output quality matters.

The beautiful part is that switching is one line of config. So you default to an open model, and only reach for a frontier one when a task genuinely needs it.

What the people running this actually say

I leaned hard on the r/LocalLLaMA crowd for the picks above, the people who run these models every single day. Their hard-won consensus contradicts the leaderboards in ways worth knowing, and honestly this is the best argument for open AI I can give you.

  • Parameter count is a lie. A model that says “35B” but activates only 3B (a mixture-of-experts) runs circles around a true 14B dense, and a fresh 27B makes last quarter’s shiny 80B “coder” look obsolete. The leaderboards that rank by raw size miss both of these completely.
  • Benchmarks get gamed. MiniMax M3 is this season’s cautionary tale: great charts, “blabbers for minutes” in real use. The crowd’s blunt verdict is that most current benchmarks mean very little.
  • Gemma is a chat star and bad at tool calling. It tops chatbot evals and is near-useless as an agent, so the eval that crowns it does not test the thing you actually need.

None of this shows up in a press release. It falls out of thousands of people running the things for real and comparing notes in the open. Which is, when you think about it, the whole argument for open in the first place.

It’s not all roses though

I am not going to sell you a free lunch, because there isn’t one.

Self-hosting does not delete risk, it moves it. The moment you own the hardware, you own uptime, patching, and physical security too. When Anthropic goes down, that is their problem. When my box goes down, that is mine. That is a deliberate trade, and I am making it with eyes open, not because it is effortless.

And owning the GPU only wins on cost if you keep it busy. A card billed by the hour, rented or owned, is cheap per token only at high, steady utilization. Leave it idle and the math flips: a barely-used H100 can cost more per token than just calling DeepSeek’s own API at 0,14$ per million (their published price). So self-hosting pays off for sustained, high-volume work, and for bursty or occasional jobs the cheap open API is not a fallback rung, it is the right answer. The ladder is volume-shaped, not only privacy-shaped.

And the middle rungs are honest compromises. OpenRouter and rented GPUs put a third party back in the loop. That is fine for now. It just means the ladder is the strategy, not any single step.

There is also a real line where I would stay on a frontier model: a capability gap too wide for a specific task. Open weights still trail most on very long-context retrieval and on the hardest multi-step agentic coding. If Opus is the only thing that can do a particular job well enough for a client, I will use Opus for that job. The goal is not purity. The goal is to stop being dependent - to make frontier models a choice I make, not the only option I have.

That is really what the Friday shutdown taught me. It is not that closed models are bad. Some of them are excellent. It is that “excellent and switchable-off-without-notice” is a bad foundation to build a business on.

So I am building on ground I control. Slower at first, more work, more boring. But mine. And after watching the “best model” on Earth disappear in an afternoon, I will take mine every time.