The Dark Factory Test: Where Would Your Software Delivery Break First?

The phrase "dark factory" sounds like something out of an automation sales pitch. A factory runs with the lights off because the machines do not need people walking the floor. Work enters the system, the system does the work, and finished goods come out the other side.

Apply that idea to software and the conversation gets strange in a hurry. Some people hear it and picture AI building software while everyone sleeps, with product ideas going in one side and production releases coming out the other while the whole thing quietly improves itself in the background. Other people hear it and go straight to replacement, imagining engineers, testers, product owners, managers, architects, release managers, and incident responders all automated out of the process. Neither reaction is very useful.

The value of the dark factory idea is not that it is practical, because for most teams it is not, at least not in the clean fantasy version where humans disappear and delivery becomes a fully autonomous machine. It is a thought experiment, not a delivery strategy, and I am not suggesting anyone actually try to build it. The value is that it works as a forcing function. Pretend the system had to run without constant human intervention, and ask where it would fail first. That is the question that makes the idea worth your time, not because anyone is trying to remove people from software delivery, but because the thought experiment exposes how much of delivery still runs on undocumented judgment, tribal knowledge, manual recovery, weak automation, unclear ownership, and people quietly compensating for broken systems.

A dark factory workflow is not really about AI. It is a test of the delivery system around the AI.

This is not the same as using an AI coding assistant

A developer who uses an AI tool to write a function, explain a stack trace, generate a test, or refactor a class is still working in a human-orchestrated model. The human is driving. The human decides what to ask for, what matters next, and when the answer is good enough, then runs the tests, reads the failures, adjusts the prompt, opens the pull request, explains the change, and shepherds it through review.

That can be valuable, because it makes a good engineer faster. It also makes a sloppy engineer faster, which is not always a gift. Either way, it is not a dark factory workflow.

A dark factory workflow asks a different question. Could the delivery system itself take work from intent to plan to implementation to validation to release, with humans involved only at defined control points? The real topic is not whether AI can write code. It is whether your engineering system is explicit enough for work to move through it without a person babysitting every step. For most teams the answer is no, and that is not an insult. That is the point.

What the initial workflow would have to look like

If you diagram the idealized AI delivery workflow, it looks simple enough. The trouble is that every arrow hides a hard problem.

Who decides whether the request is clear enough? Who knows which repo owns the behavior, whether the test evidence is strong enough, or whether the change is safe to deploy during business hours? Who knows whether the rollback path actually works, whether a healthy-looking production signal is telling the truth or the dashboards are lying, and who owns the customer impact when the system gets it wrong?

This is where the thought experiment starts doing real work, because it does not let you hide behind the happy path. It asks whether the delivery system has enough structure to move from one step to the next without a person quietly filling in the blanks. Most software organizations do not fail because nobody can write code. They fail because the system around the code is vague: vague intake, vague ownership, vague test strategy, vague release process, vague rollback plan, vague production signals. That kind of vagueness is survivable when experienced people are constantly watching and correcting for it, and it is not survivable in an automated workflow. AI does not remove the need for structure. It raises the penalty for not having it.

Most teams are not ready, and that is useful to know

The reason most teams are not ready is not only that AI is imperfect. It is imperfect, but that is the obvious problem. The larger issue is that most delivery systems are not explicit enough, and you can see it in the gap between how the system is described and how it actually behaves. The build works, but only if you know the right local setup steps. The tests pass, except for the ones everyone knows to ignore. The pipeline works, except for the service that needs one manual step afterward. The rollback plan exists, but nobody has practiced it in months. The ownership model is clear until a defect crosses three repos and two teams. The acceptance criteria are written down, but the real requirement lives in someone's head. The logs exist, but they do not explain what the customer actually experienced. The alert fired, but only one person knows whether it matters.

That is the actual system. Not the architecture diagram, not the onboarding document, not the process described during planning. The system is what happens after the code ships, and when you imagine a dark factory workflow, all of that hidden mess becomes visible.

Walk through the workflow as a thought experiment

Take a simple, hypothetical change. A customer reports a defect, and the workflow has to understand the issue, find the owning system, reproduce the problem, write a failing test, make the fix, validate it, open a pull request, pass review, deploy safely, watch production, and then roll back or continue.

Now ask the dark factory question at each step. Could the system understand the issue from the customer report, or would someone need to translate customer language into engineering language? Could it identify customer impact, or is that still inferred from who is yelling the loudest? Could it find the owning service, or does ownership depend on someone remembering that this behavior moved last year? Could it reproduce the problem, or do lower environments lack the right data, configuration, integrations, and permissions? Could it write a failing test, or is the suite too brittle, too slow, and too disconnected from real behavior to trust? Could it make the fix safely, or is the codebase full of clever abstractions that only one person understands? Could it validate the fix, or do people still fall back on manual testing because nobody trusts the automation? Could it open a pull request reviewers can actually trust, or would the review be a pile of generated confidence resting on weak evidence? Could it deploy safely, or is deployment still a ritual held together by Slack messages and memory? Could it watch production and know whether the change helped, or are the signals too noisy to read? Could it roll back, or is rollback one of those things everyone says exists but nobody wants to use?

This is where the article gets practical, not because the dark factory is practical, but because the questions are. Wherever the workflow breaks, you have found a real engineering problem.
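One way to make the walk-through concrete is to write the steps down and answer each question honestly. The sketch below is a minimal Python illustration, not a tool. The `Step` type, the `first_break` function, and the yes/no answers for an imaginary team are all assumptions invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One stage of the hypothetical defect-to-production workflow."""
    name: str
    runs_unattended: bool  # honest answer: could this run without a person filling in blanks?
    gap: str = ""          # what a person quietly supplies today, if anything


def first_break(steps: list[Step]) -> Optional[Step]:
    """Return the earliest step that still depends on a person, or None if none does."""
    for step in steps:
        if not step.runs_unattended:
            return step
    return None


# Illustrative answers for an imaginary team; these values are assumptions, not data.
WORKFLOW = [
    Step("understand the issue", True),
    Step("find the owning system", True),
    Step("reproduce the problem", False, "lower environments lack the right data"),
    Step("write a failing test", True),
    Step("make the fix", True),
    Step("validate the fix", False, "people fall back on manual testing"),
    Step("open a pull request", True),
    Step("pass review", True),
    Step("deploy safely", False, "deployment depends on Slack messages and memory"),
    Step("watch production", True),
    Step("roll back or continue", False, "rollback has not been practiced"),
]

broken = first_break(WORKFLOW)
if broken is not None:
    print(f"First break: {broken.name} ({broken.gap})")
```

The code is trivial on purpose. The hard part is filling in the booleans without flattering yourself, and the `gap` field is where the hidden human labor gets written down.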

The hidden human labor is the system

A lot of software organizations run on invisible human patching. Someone knows which tests are flaky, which deploys are risky, and which service owns the weird edge case. Someone knows which dashboard to check and which customer complaint is actually urgent. Someone knows which team to pull into the incident, that the rollback plan works for code but not for the database migration, and that the feature flag exists but not every path actually honors it.

That knowledge is valuable, and it is also a risk. If the system only works because experienced people are constantly compensating for missing structure, then the system is not automated. It is manually stabilized. AI does not fix that, and in many cases it exposes it. An AI workflow can only move confidently through a system that has enough explicit structure to support it: clear ownership, understandable code, reliable tests, usable environments, deployment discipline, observability, rollback paths, and product intent written down well enough to reason about. Without those things, the AI is not operating a dark factory. It is wandering through a dark warehouse with a flashlight.

Review cannot be based on vibes

One of the more uncomfortable parts of this thought experiment is code review. A lot of teams talk about review as if it were a clear quality gate, but in practice it often depends on who is available, how tired they are, how much they know about that part of the system, and whether the pull request looks familiar enough to feel safe. That is not a process. That is a social habit.

So if an AI agent opens a pull request, what should happen? The weak answer is that an engineer looks at it, and that is not enough. A useful pull request should arrive with evidence: what changed and why, what customer behavior it affects, what tests were added or updated, what validation passed, what risks remain, what rollout mechanism is being used, what rollback path exists, and what production signals will be watched. That is useful whether the author is a human or a machine.
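That evidence list can be treated as a checklist that a pull request either satisfies or does not. The following is a minimal sketch in Python, assuming free-text fields. The `PullRequestEvidence` type, its field names, and the `missing_evidence` helper are hypothetical names chosen for illustration, not part of any real review tool.

```python
from dataclasses import dataclass, fields


@dataclass
class PullRequestEvidence:
    """Evidence a pull request should carry, whoever or whatever wrote it."""
    change_summary: str = ""       # what changed and why
    customer_behavior: str = ""    # what customer behavior it affects
    tests_changed: str = ""        # what tests were added or updated
    validation_passed: str = ""    # what validation passed
    remaining_risks: str = ""      # what risks remain
    rollout_mechanism: str = ""    # how the change will be rolled out
    rollback_path: str = ""        # how the change can be undone
    production_signals: str = ""   # what signals will be watched after release


def missing_evidence(evidence: PullRequestEvidence) -> list[str]:
    """Return the names of evidence fields that are empty or only whitespace."""
    return [
        field.name
        for field in fields(evidence)
        if not getattr(evidence, field.name).strip()
    ]


# Hypothetical pull request with only part of the evidence filled in.
pr = PullRequestEvidence(
    change_summary="Fix rounding in invoice totals",
    tests_changed="Added a regression test for half-cent amounts",
)
gaps = missing_evidence(pr)
if gaps:
    print("Not ready for review, missing:", ", ".join(gaps))
```

A check like this only proves the fields are filled in, not that what they say is true. It is a floor for the review conversation, not a substitute for it.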

The thought experiment forces the team to define what good review actually means, turning it from a feeling into a contract. That matters because review is not only about code style. It is about whether the change should exist, whether it is safe to release, whether it fits the system, and whether the team can operate it after it ships. Most architecture problems are ownership problems, and a lot of review problems are ownership problems wearing a different shirt. If nobody owns the consequences, review drifts toward the surface, where naming, formatting, and small refactors get attention because they are easy to see while operational risk gets missed because it requires thinking beyond the diff. A dark factory workflow cannot survive that kind of review, and honestly, neither can a human one.

Validation is where the fantasy usually breaks

The easiest part of the AI story is generating code. The hardest part is knowing whether the generated change is safe, which is why validation sits at the center of the workflow. If the tests are slow, flaky, shallow, or ignored, the workflow stops being a factory and becomes a risk amplifier. If integration environments are unreliable, it cannot build confidence. If contract tests do not exist, it cannot reason across service boundaries. If observability is weak, it cannot tell whether production is healthy. If rollback is painful, it cannot safely recover from its own mistakes.

This is why dark factory thinking leads straight back to boring engineering fundamentals. Trunk-based development, short-lived branches, merge queues, feature flags, automated tests, fast builds, good logs, rollback discipline, and clear ownership all matter, and none of it is trendy. It is just the price of building a delivery system that does not require heroics. Boring production is a feature, not because nothing ever goes wrong, but because when something does, the customer does not have to sit there while engineering discovers how the system actually works.

The real design is the exception queue

Run the thought experiment one more time, this time from the other direction. If you were designing a serious AI delivery workflow, the interesting question is not how to automate everything. It is what should force the system to stop, because that is where the real design lives. The workflow should stop when the requirement is ambiguous, when the blast radius is too large, when it cannot reproduce the defect, or when the tests do not actually prove the fix. It should stop when the change crosses an ownership boundary, when the rollback path is weak, when the production signal is unclear, and when the customer impact is high enough that a person needs to make the call.
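Those stop conditions can be written as explicit checks instead of being left to instinct. Here is a minimal sketch in Python under stated assumptions: the `ChangeAssessment` fields, the `StopReason` names, and the `MAX_SERVICES_TOUCHED` threshold are hypothetical, and a real team would define its own inputs and limits.

```python
from dataclasses import dataclass
from enum import Enum


class StopReason(Enum):
    AMBIGUOUS_REQUIREMENT = "requirement is ambiguous"
    BLAST_RADIUS_TOO_LARGE = "blast radius is too large"
    DEFECT_NOT_REPRODUCED = "defect could not be reproduced"
    TESTS_DO_NOT_PROVE_FIX = "tests do not prove the fix"
    CROSSES_OWNERSHIP_BOUNDARY = "change crosses an ownership boundary"
    WEAK_ROLLBACK_PATH = "rollback path is weak"
    UNCLEAR_PRODUCTION_SIGNAL = "production signal is unclear"
    HIGH_CUSTOMER_IMPACT = "customer impact needs a human decision"


@dataclass(frozen=True)
class ChangeAssessment:
    """Hypothetical facts the workflow has gathered about one proposed change."""
    requirement_is_clear: bool
    services_touched: int
    defect_reproduced: bool
    tests_prove_fix: bool
    owning_teams: int
    rollback_rehearsed: bool
    production_signal_is_clear: bool
    customer_impact_is_high: bool


MAX_SERVICES_TOUCHED = 2  # assumed threshold for illustration only


def stop_reasons(change: ChangeAssessment) -> list[StopReason]:
    """Return every reason the workflow should stop and hand off to a person."""
    checks = [
        (not change.requirement_is_clear, StopReason.AMBIGUOUS_REQUIREMENT),
        (change.services_touched > MAX_SERVICES_TOUCHED, StopReason.BLAST_RADIUS_TOO_LARGE),
        (not change.defect_reproduced, StopReason.DEFECT_NOT_REPRODUCED),
        (not change.tests_prove_fix, StopReason.TESTS_DO_NOT_PROVE_FIX),
        (change.owning_teams > 1, StopReason.CROSSES_OWNERSHIP_BOUNDARY),
        (not change.rollback_rehearsed, StopReason.WEAK_ROLLBACK_PATH),
        (not change.production_signal_is_clear, StopReason.UNCLEAR_PRODUCTION_SIGNAL),
        (change.customer_impact_is_high, StopReason.HIGH_CUSTOMER_IMPACT),
    ]
    return [reason for triggered, reason in checks if triggered]
```

The function returns every reason rather than only the first, on the assumption that the person picking up the work needs the full picture. An empty list is the only result that lets the change proceed.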

A mature AI workflow is not one that never stops. It is one that stops for the right reasons, and that is the difference between automation and recklessness. The exception queue is not a failure of the system. It is one of the most important parts of it, because it tells you where human judgment is required, where your automation lacks evidence, and where your engineering model is still too vague to trust. That queue is also an improvement backlog. If too much work stops because ownership is unclear, fix ownership. If it stops because the tests are weak, fix the tests. If it stops because rollback is risky, fix rollback. If it stops because requirements are vague, fix intake. Fix once, not many times.
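Reading the queue as a backlog can be as simple as counting why work stopped. The sketch below is a minimal Python illustration. The `improvement_backlog` function, the `min_count` cutoff, and the sample reasons are assumptions invented for the example, not measurements from a real team.

```python
from collections import Counter


def improvement_backlog(stopped_work: list[str], min_count: int = 2) -> list[tuple[str, int]]:
    """Rank stop reasons by how often they occur, keeping only the recurring ones.

    `stopped_work` holds one reason string per item that landed in the exception
    queue. Reasons seen fewer than `min_count` times are treated as one-offs.
    """
    counts = Counter(stopped_work)
    return [(reason, n) for reason, n in counts.most_common() if n >= min_count]


# Hypothetical log of why work items stopped over some period.
stopped = [
    "unclear ownership",
    "weak tests",
    "unclear ownership",
    "risky rollback",
    "unclear ownership",
    "weak tests",
]

for reason, count in improvement_backlog(stopped):
    print(f"{count} items stopped for: {reason}")
```

The recurring reason at the top of the list is the one to fix once, so that the same stop does not have to be handled by hand many times.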

Humans do not disappear

The useful version of a dark factory workflow does not remove humans from software engineering. It moves them to higher-leverage control points. Humans still own intent, priority, and product judgment. They still own architecture decisions, risk tolerance, and policy. They still own customer consequences and final accountability. The point is not to put a human on every button. It is to decide where human judgment is actually required and then make every other step explicit, observable, testable, and repeatable.

That is a much healthier model than pretending every manual step is valuable just because a person is involved. Some manual steps are judgment. Some are bureaucracy. Some are scar tissue. And some are just unpaid automation work being performed by tired engineers. The dark factory thought experiment helps you tell them apart.

The dark factory is a mirror

I like the dark factory idea, though not because I think most teams are close to it. Most are not. I like it because it makes the hidden system visible. It forces you to look at the actual path from customer problem to production fix, and to see where that path depends on memory, heroics, manual testing, unclear ownership, fragile deployments, and rollback plans nobody has practiced. That is the value. The goal is not to build a fully autonomous software factory tomorrow. The goal is to ask the uncomfortable question and be honest about the answer. Pretend the system had to run without constant human intervention, and find where it would fail first. Wherever it fails is probably where the real engineering work is.

Frequently asked questions

What is a "dark factory" in software delivery?

It is a thought experiment borrowed from manufacturing, where a plant runs with the lights off because the machines do not need people walking the floor. Applied to software, it imagines a delivery system that carries work from intent to plan to implementation to validation to release with humans involved only at defined control points. The point is not to actually build that, because most teams cannot. The point is to use it as a forcing function: pretend the system had to run without constant human intervention, and find where it would break first.

How is a dark factory workflow different from using an AI coding assistant?

An AI coding assistant works inside a human-orchestrated model. The human decides what to ask for and when the answer is good enough, then runs the tests, reads the failures, opens the pull request, and shepherds it through review. That can make a good engineer faster, but a person is still driving every step. A dark factory workflow asks whether the delivery system itself could move work through all of those steps without a person babysitting each one. The real question is not whether AI can write code. It is whether your engineering system is explicit enough for work to flow through it on its own.

Why are most teams not ready for an autonomous AI delivery workflow?

Not mainly because AI is imperfect, though it is. The larger issue is that most delivery systems are not explicit enough. Intake is vague, ownership is unclear, tests are weak or flaky, lower environments are unreliable, deploys are held together by Slack messages and memory, production signals are noisy, and rollback plans exist on paper but nobody has practiced them. Those gaps normally stay hidden because experienced people quietly compensate for them. That means the system is not automated. It is manually stabilized, and AI tends to expose that rather than fix it.

What makes an AI-generated pull request safe to review?

Evidence, not vibes. A useful pull request should arrive with what changed and why, what customer behavior it affects, what tests were added or updated, what validation passed, what risks remain, what rollout mechanism is being used, what rollback path exists, and what production signals will be watched. That is just as useful when the author is a human. The thought experiment forces a team to turn review from a feeling into a contract, so review judges whether the change should exist and whether the team can operate it after it ships, not just naming and formatting.

What is the exception queue and why does it matter?

The exception queue is the set of conditions that should force the workflow to stop and hand control back to a person: an ambiguous requirement, a blast radius that is too large, a defect that cannot be reproduced, tests that do not prove the fix, a change crossing an ownership boundary, a weak rollback path, an unclear production signal, or customer impact high enough that a human should make the call. It is not a failure of the system. It is one of the most important parts of its design, and it doubles as an improvement backlog. If too much work stops for the same reason, fix that reason once instead of many times.

Does a dark factory workflow mean replacing engineers?

No. The useful version does not remove humans. It moves them to higher-leverage control points. Humans still own intent, priority, and product judgment, along with architecture decisions, risk tolerance, policy, customer consequences, and final accountability. The goal is not to put a person on every button. It is to decide where human judgment is actually required and make every other step explicit, observable, testable, and repeatable. Some manual steps are judgment, some are bureaucracy, and some are just unpaid automation work performed by tired engineers. The exercise helps you tell them apart.

    © 2026 ABWaters. Made quietly.