
AI agents are accelerators, not developer replacements

The central challenge with integrating AI into application development isn't its capacity to assist, but rather the extent to which we can confidently delegate control.

While AI agents can flawlessly execute tasks previously thought exclusive to humans, they can also commit hair-raising errors in the very next piece of code.

These mistakes serve as a stark reminder that even the most advanced AI copilots still lack any understanding of how the world works. This fundamental distinction separates current generative AI from the vision of artificial general intelligence (AGI). With that in mind, let's look at how AI agents make great development accelerators but cannot replace human developers.

LLM reasoning is not logical

Even sophisticated agentic AI -- built on large language models (LLMs) with their increasingly vast context windows and complex workflows -- relies on semantic pattern matching. It offers no genuine insight into underlying causal relationships and interdependencies.

What makes this problematic for humans to grasp is the convincing way LLMs can articulate their decision-making processes, often mimicking a logical progression that suggests an understanding of cause and effect, which they do not actually possess. They achieve this by stitching together statistically likely fragments of how humans reason in text. While this might seem like logical reasoning, it is based on probabilistic calculations derived from training data, rather than a direct understanding of why one step leads to the next.

Figure: LLMs mimic logical reasoning but cannot grasp causality.

Compare this to an actor starring in a medical television series, who, over the years, has memorized thousands of hours of dialogue, documentaries and real-life consultations. They can flawlessly deliver a differential diagnosis, rattling off symptoms, test results and treatment protocols with the confidence and vocabulary of a seasoned physician. They know that "chest pain radiating to the left arm" usually appears in scenes about heart attacks, that "CBC and metabolic panel" follows "let's run some tests," and that concerned looks accompany discussions about tumors.

Figure: An actor on a medical TV show with a surface understanding of medical topics is a metaphor for AI's inability to grasp causality.

Their performance is so convincing that anyone watching would believe they understand medicine. But they have no idea why aspirin thins blood, what happens during a heart attack or why one treatment works while another kills. They're simply reciting variations of medical conversations they've memorized, assembling fragments that statistically co-occur, without comprehending that these patterns represent actual biological processes where sequence and causation literally mean life or death. Translated to application development, this often means great results directly followed by catastrophic failure and vice versa.

Statistical patterns instead of causal truths

LLMs are incredibly good at finding and connecting patterns in unimaginably large quantities of text. While much of this text might describe how the world works, the LLM does not comprehend the actual meaning of these descriptions. Instead, it translates text into numbers -- vectors -- that capture statistical relationships, not causal truths. The model then translates these numbers back into human language, while underneath it all, it never stops tracking and shuffling numbers rather than meaning. For example, the words "charge," "payment" and "credit card" might sit close together in vector space because they often co-occur in text, while "profile," "lookup" and "fetch" form a different cluster -- but the model doesn't actually know that one group involves money and the other doesn't.

Figure: LLMs only process the statistical relationships between groups of words.
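To make the idea concrete, here is a minimal sketch in Python using made-up toy vectors and a simple cosine similarity function. The vectors and their values are purely illustrative, not real model embeddings; the point is only that proximity in vector space reflects co-occurrence in text, not an understanding of what the words mean.

```python
import numpy as np

# Toy, invented embedding vectors -- real models use hundreds or thousands of
# dimensions, but the principle is the same: proximity reflects co-occurrence.
vectors = {
    "charge":      np.array([0.90, 0.80, 0.10]),
    "payment":     np.array([0.85, 0.75, 0.15]),
    "credit card": np.array([0.88, 0.70, 0.20]),
    "profile":     np.array([0.10, 0.20, 0.90]),
    "lookup":      np.array([0.15, 0.25, 0.85]),
    "fetch":       np.array([0.20, 0.15, 0.88]),
}

def cosine(a, b):
    """Cosine similarity: high when two vectors point in a similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that co-occur in training text end up close together...
print(cosine(vectors["charge"], vectors["payment"]))   # high similarity
# ...while words from a different cluster sit further apart.
print(cosine(vectors["charge"], vectors["profile"]))   # low similarity
```

Nothing in these numbers encodes that one cluster involves money and the other does not; the grouping is purely statistical.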

Things are not what they seem

Because programming languages are highly structured, this numerical shuffling can produce great code. While the AI model does not 'understand' the code the way a developer would, it can reliably map patterns of inputs to outputs, frameworks to boilerplate and syntax to semantics in ways that often look indistinguishable from human code. For example, when asked to "build a REST API in Python with Flask," the model cannot reason about HTTP or databases -- it simply recalls that @app.route usually precedes function definitions, that GET requests often map to return jsonify, and that error handling frequently involves try/except blocks. The result often is well-structured Flask code, even though it originated from pattern recall rather than genuine understanding.
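For illustration, here is the kind of boilerplate such a prompt typically yields -- a minimal, hypothetical Flask endpoint showing the @app.route, jsonify and try/except patterns the model has seen countless times. The route name and data are invented for this sketch, not taken from any real system.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory data, for illustration only.
PROFILES = {"42": {"name": "Ada", "plan": "pro"}}

@app.route("/profiles/<profile_id>", methods=["GET"])
def get_profile(profile_id):
    # GET request mapped to a jsonify response, wrapped in try/except --
    # exactly the statistical pattern described above.
    try:
        return jsonify(PROFILES[profile_id])
    except KeyError:
        return jsonify({"error": "profile not found"}), 404

if __name__ == "__main__":
    app.run()
```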

Figure: Humans need to stay in the loop to deal with AI's missing context and reasoning capabilities.

For example, adding retry logic to harden a microservice sounds simple -- until it isn't. Ask an AI assistant to "add retries on failures," and you might get code that retries everything on any error. That's fine for idempotent reads -- calls that can safely be repeated -- such as "fetch profile," where repeating the call simply returns the same data.

Apply the same logic to non-idempotent actions -- charge a card, create an order, send an email, write to a database -- and you've invited disaster: double charges, duplicate orders, notification storms, duplicate records in the database. The fix isn't magic; it's judgment. Humans classify operations first -- idempotent vs. not -- retry only on transient errors, and require idempotency keys and server-side deduplication for anything with side effects. AI still saves human developers a lot of time here, but they must contribute their own skill and expertise; otherwise, disaster can and will strike at random.
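As a rough sketch of that judgment, the Python below separates the two cases: a retry helper for idempotent reads that only retries transient errors, and a wrapper that attaches an idempotency key to anything with side effects. The function names, error classes and commented-out usage are illustrative assumptions, not a prescribed implementation.

```python
import time
import uuid

# Assumption: the human developer decides which errors count as transient --
# the model cannot reliably infer this from patterns alone.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)

def retry_idempotent(operation, attempts=3, backoff=0.5):
    """Retry a side-effect-free call, and only on transient errors."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff

def call_with_idempotency_key(operation):
    """For non-idempotent actions (charge a card, create an order), attach a
    key so the server can deduplicate if the request is ever repeated."""
    key = str(uuid.uuid4())
    return operation(idempotency_key=key)

# Usage sketch with hypothetical client functions:
# profile = retry_idempotent(lambda: fetch_profile("42"))
# receipt = call_with_idempotency_key(
#     lambda idempotency_key: charge_card("42", amount=100,
#                                         idempotency_key=idempotency_key))
```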

Understanding the limits of pattern matching is tricky

In principle, couldn't pattern matching recognize that retrying a credit card charge requires a different approach than retrying a call that retrieves a customer profile or product information? Yes, it could -- but humans cannot know in advance whether it will, because that depends on whether the training data for that specific model included retry functions wrapped around standard POST or GET requests.

The model fails to establish a connection between the type of operation and its real-world consequences; it merely recalls statistical associations. For the model to avoid this mistake, the training data would need to contain clear, consistent and repeated pairings that link the type of operation with the retry strategy and its consequence.

Ideally, the data would distinctly contrast code that is safe to retry against code where retries must be avoided. Perhaps it includes post-mortems or warnings that describe what happened when retries were misapplied. However, whether the model has ingested enough training data to make this distinction is impossible for us humans to determine. To make things trickier, due to its probabilistic nature, the model might make the distinction once but fail to make it in the following three attempts.

This example illustrates why simply adding more training data is often not the answer, as the necessary data might not exist in writing. Or worse, the training data could include content that strengthens the wrong generalization. Either way, the human user can't know whether this is the case and therefore needs a comprehensive understanding of how the specific problem should be approached.

The value of AI is real, and development teams can benefit

As long as their limitations are clearly understood, AI agents can significantly increase the productivity of human developers throughout the development lifecycle. From gathering requirements and turning them into user stories, all the way to instrumenting and deploying the application, AI agents can provide humans with suggestions, automated validations and rapid prototyping to significantly shorten iteration cycles.

AI agents should be seen as force multipliers that can handle mechanical aspects of development, such as generating boilerplate code based on existing examples and documentation, writing test cases and documenting APIs. Humans, on the other hand, are there to truly understand business implications, decide on architectural tradeoffs and solve complex problems that require the ability to apply abstract logic.

Productivity impact of AI on the SDLC

Below is a breakdown of AI's productivity impact for different activities in the SDLC, along with AI's current capabilities, the level of human involvement required and the level of risk for each activity.

| SDLC activity | Productivity impact of AI agents | Current AI agent capabilities | Human involvement | AI usage risk |
| --- | --- | --- | --- | --- |
| Requirement Gathering | Low - Medium | Generate user stories from notes, meeting transcripts, emails and other materials. | High - Ensure stories are aligned with current business priorities in terms of cost, risk and reward. | High - Misunderstood requirements will cascade through the entire project. |
| Architecture and Design | Low | Suggest patterns, identify bottlenecks and generate initial diagrams as a solid starting point for humans to build on. | Critical - Consider system-wide implications, make strategic trade-offs and monitor technology trends. | High - Poor architectural decisions are difficult and expensive to reverse. |
| Code Generation | High | Build out well-defined boilerplate code, solve concisely defined problems and keep documentation up to date. | Moderate - Stay on top of business logic and edge cases. | Medium - Often challenging to stay on top of code written by AI. |
| Code Review | Medium | Catch syntax errors, find security vulnerabilities, find performance issues and suggest optimizations. | High - AI misses context-dependent issues and architectural problems. | Medium - Humans need to take overall responsibility for the review. |
| Testing | High | Create unit tests, integration tests and automated regression tests; find edge cases. | Low for test generation; high for test strategy. | Medium - Humans must take responsibility for completeness and relevance of tests. |
| Debugging | High | Analyze stack traces and suggest fixes to known errors. | Medium - Guide the debugging process. | Low - Wrong fixes are typically easy to spot. |
| Documentation | High | Generate API docs, readme files, inline comments, user guides and change logs. | Low - Little involvement needed for user-facing documents. | Low - Incorrect documentation can typically be corrected without significant impact. |
| Deployment & CI/CD | Medium | Create deployment manifests, build IaC templates and generate pipeline configurations. | High - Production deployments need to be carefully checked. | High - Any issues have a direct impact on production. |
| Monitoring | Medium | Add instrumentation, analyze logs and generate alert rules. | Medium - AI struggles to prioritize without context. | Medium - False positives waste time. |

Conclusion

Technology leaders announcing that AI agents are taking over developer jobs have created unrealistic expectations about AI's current capabilities. This has led many business executives to believe that developer hours are no longer the limiting factor for what they can build: the finance analyst could create their own portfolio rebalancing tool, the healthcare administrator could build a patient scheduling system, the supply chain manager could develop inventory optimization dashboards, or the marketing director could construct personalized campaign automation platforms without needing to write a single line of code. While such users can achieve proofs of concept for many of these business tasks, architecting, developing and shipping enterprise-grade software still relies heavily on the skill and experience of human developers.

However, AI agents can significantly speed up the SDLC by completing a lot of legwork for human developers. Creating test cases, automatically instrumenting complex software with monitoring agents, documenting tens of thousands of lines of mainframe code and accurately defining complex infrastructure manifests are only a few examples of how AI agents can help human developers.

Collaboration between humans and AI agents across the SDLC must be iterative and subject to continuous oversight. Determining how to optimally adjust processes, development tools and corporate culture to meet these requirements is the next frontier in agent-assisted application development. The payoff for figuring out how to provide human coders with optimal AI support promises significant productivity increases, enabling human development teams to ship more features faster and at higher quality.

Torsten Volk is principal analyst at Enterprise Strategy Group, now part of Omdia, covering application modernization, cloud-native applications, DevOps, hybrid cloud and observability.

Enterprise Strategy Group is part of Omdia. Its analysts have business relationships with technology providers.
