AI Builds It: Easy Coding Tools

Inside Devin’s Workflow: Tool Use, Planning, and Autonomy






Excerpt:

Introduction

Devin (from Cognition AI) is a new autonomous AI software engineer that can plan software development tasks and carry them out largely on its own. It works end-to-end on code projects, using tools like a code editor, a command-line shell, and a web browser to research, write, test, and deploy code. In demos and press, Devin has been shown scanning a codebase, generating a plan, editing files, running tests, and making pull requests with surprisingly little human input (medium.com) (www.linkedin.com). Cognition claims Devin can handle “complex engineering tasks requiring thousands of decisions,” recalling context at each step and even learning from mistakes (medium.com) (www.linkedin.com). We therefore explore the public details of Devin’s design and workflow. This includes how Devin breaks down tasks (its planning process), how it literally works in a developer environment (editor, terminal, browser), how it keeps memory or context across a coding session, how it self-corrects and iterates, and what guardrails or safety measures it uses. We also note what is not revealed – for example the exact model internals are undisclosed, so some community discussion relies on educated guesswork.

Task Planning and Decomposition

When a developer gives Devin a new assignment, the first step is planning what files to change and in what order. Cognition’s notes explain that Devin uses a “planning mode” sub-agent whose job is to figure out which files in the repository are relevant to the task (medium.com) (docs.devin.ai). In practice, Devin “investigates” the repo and proposes a plan before writing any code (docs.devin.ai). For complex tasks, developers see this plan and can approve or adjust it; if Agency mode is enabled, Devin will automatically proceed with its plan without waiting for approval (docs.devin.ai).


SPEAKER_00

Introduction. Devin (from Cognition AI) is a new autonomous AI software engineer that can plan software development tasks and carry them out largely on its own. It works end-to-end on code projects, using tools like a code editor, a command-line shell, and a web browser to research, write, test, and deploy code. In demos and press, Devin has been shown scanning a codebase, generating a plan, editing files, running tests, and making pull requests with surprisingly little human input. Cognition claims Devin can handle complex engineering tasks requiring thousands of decisions, recalling context at each step and even learning from mistakes. We therefore explore the public details of Devin's design and workflow. This includes how Devin breaks down tasks (its planning process), how it literally works in a developer environment (editor, terminal, browser), how it keeps memory or context across a coding session, how it self-corrects and iterates, and what guardrails or safety measures it uses. We also note what is not revealed. For example, the exact model internals are undisclosed, so some community discussion relies on educated guesswork.

Task planning and decomposition. When a developer gives Devin a new assignment, the first step is planning what files to change and in what order. Cognition's notes explain that Devin uses a "planning mode" subagent whose job is to figure out which files in the repository are relevant to the task. In practice, Devin investigates the repo and proposes a plan before writing any code. For complex tasks, developers see this plan and can approve or adjust it. If Agency mode is enabled, Devin will automatically proceed with its plan without waiting for approval. Behind the scenes, Cognition trained this planning agent with reinforcement learning. In one analysis, the team describes giving the planner only read-only tools, like ls, grep, or a read-file tool, and rewarding it when it correctly predicts the set of files a human would edit.
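The reward just described (score the planner on how well its predicted file set matches the files a human actually edited, with a cost per tool call to encourage efficient search) can be sketched as follows. Everything here is an illustrative assumption: the real training objective, its metric, and its penalty weights are not public.

```python
def planner_reward(predicted_files, human_edited_files, steps_taken,
                   step_penalty=0.01):
    """Hypothetical reward for a file-localization planner.

    Rewards overlap (F1 score) between the files the planner predicts
    and the set a human actually edited, minus a small per-step cost so
    the agent learns to search efficiently rather than brute-force the
    whole repo. Illustrative only; not Cognition's actual objective.
    """
    predicted, actual = set(predicted_files), set(human_edited_files)
    if not predicted or not actual:
        return -step_penalty * steps_taken
    tp = len(predicted & actual)          # correctly predicted files
    precision = tp / len(predicted)
    recall = tp / len(actual)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return f1 - step_penalty * steps_taken
```

Under a shaping like this, a perfect prediction reached in fewer tool calls earns strictly more reward, which is one way the "avoid endless grepping" behavior could emerge.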
The result: Devin's planner learns to issue parallel file-system queries, e.g., running ls and grep on different directories at once, and then narrow down promising leads. The training penalty encourages efficiency, so the agent avoids brute force (e.g., grepping the entire repo endlessly) and instead commits promptly once it finds a target. This means Devin's planning is data-driven: it has learned generic codebase-navigation strategies. As Cognition notes, the model was trained on many repos and user queries. At the user level, you see the result as an outline of steps. For example, given a new feature request, Devin will suggest something like: modify file A to implement X, add tests in file B, then update configuration C. In demos, if a user forgot to specify some details, Devin's planning step often catches it and prompts for clarification. In one demo, the assistant automatically added configuring a GitHub account to the plan, even though the user did not mention it explicitly. These planning steps (asking questions, listing tasks, mapping files) are all done within Devin's dialog interface before any code is written. If the user agrees, or auto-approval is on, Devin moves on to execution.

Working in a dev environment: editor, terminal, and browser. Devin operates within a sandboxed developer environment. Cognition's materials describe it as having a familiar developer toolkit: a shell terminal, a code editor, and a web browser, all at its disposal. In practice, when Devin runs, everything it does is logged and visible in the web UI. A "follow Devin" view highlights each action, such as a file edit or shell command, and even lets a human click an icon to jump directly into either the code editor or the terminal where that action occurred. For example, if Devin edits a JavaScript file, a user can click to see the VS Code editor view with the changes; if Devin runs a shell command, click to see the terminal output.
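The parallel file-system queries described above, where the planner searches several directories at once instead of sequentially, can be sketched with a thread pool. This is a generic illustration of the technique, not Devin's actual tooling; the directory layout and search string are made up.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def grep_dir(root, needle):
    """Return paths under `root` whose text contains `needle` (a pure-Python grep)."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as f:
                    if needle in f.read():
                        hits.append(path)
            except OSError:
                pass  # skip unreadable files
    return hits

def parallel_search(roots, needle):
    """Issue one search per directory concurrently, as a planner might."""
    workers = min(8, max(1, len(roots)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda r: grep_dir(r, needle), roots)
    return sorted(p for batch in results for p in batch)
```

Fanning out queries like this and then narrowing to the few files that actually match is the behavior the RL penalty is said to reinforce.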
You can also manually drop into Devin's workspace if you like. A recent update added a "Use Devin's machine" button that opens Devin's environment in VS Code over the web. This means a developer can peek at Devin's files, run commands, or even hand-edit code in its workspace. For long-running tasks, this is convenient if you want to inspect something mid-flight. In one example, a user activated this to watch Devin create UI elements: the user literally opened Devin's VS Code, saw the new files Devin wrote, and could explore the UI live. The browser tool lets Devin research or test things on the internet. In demos, Devin is seen using web search to look up documentation or libraries, and even running a local web server to check that its code isn't broken; e.g., it will point a browser at localhost to verify the UI works. All told, Devin's interface is multimodal. It can take inputs like text prompts, attached design images or docs, and even code snippets, and it interacts through both chat and these developer tools. The result is an experience much closer to a colleague writing code than a static chat with an AI.

Memory, knowledge, and session context. Devin keeps track of information across a session using a built-in knowledge system. Think of knowledge like a workspace notebook: Devin can store tips, project-specific instructions, or important context there and recall it later. For example, the docs describe workflows to pin certain knowledge so Devin never forgets it, such as important architectural constraints or coding-style guides. Users can edit or add to this knowledge bank. Devin will also auto-generate helpful notes. It scans your repository to learn about the code structure, components, and your documentation, and builds a repo knowledge summary automatically. In practice, after you've run a few tasks, Devin might say, "I noticed you often use React and Redux; I suggest adding that to knowledge," and if you approve, that info is saved.
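A knowledge bank with pinned entries, as described above, can be sketched as a small store where pinned items are always recalled and other items surface only when relevant to the task at hand. The class, its tag-matching rule, and all entry text are illustrative assumptions, not Devin's real data model.

```python
class KnowledgeBank:
    """Minimal sketch of a per-project knowledge store (illustrative only).

    Pinned entries are always recalled; other entries are recalled only
    when their tags overlap words in the current task description.
    """
    def __init__(self):
        self.entries = []  # list of (text, tags, pinned)

    def add(self, text, tags=(), pinned=False):
        self.entries.append((text, set(tags), pinned))

    def recall(self, task_description):
        words = set(task_description.lower().split())
        recalled = []
        for text, tags, pinned in self.entries:
            if pinned or (tags & words):
                recalled.append(text)
        return recalled
```

For example, a pinned style constraint would appear in every prompt, while a note tagged "ui" would be recalled only for UI-related tasks.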
During a session, Devin keeps relevant knowledge in working memory. Cognition claims it recalls relevant context at every step. For example, if it earlier learned that you prefer Python 3.11 or that your web app uses OAuth, it will bring that info into prompts as needed. The session is inherently long and stateful: you might talk to Devin for dozens of turns over many minutes while it edits many files, and it retains the chat history. If Devin ever breaks, you can scroll the log or turn on progress mode to see every action it took. If your session ends, for example if you stop the task or wrap up, Devin forgets the running state of that machine, and its virtual machine resets to a base snapshot next time. By default, this base state includes the repositories you've preloaded in your workspace, so Devin doesn't have to clone from scratch every time. Without workspace setup, each session would start with an empty machine, so Cognition emphasizes pre-configuring your repo for speed. But beyond code, Devin does carry knowledge forward via its knowledge bank. It will prompt you to add lessons or definitions that seem useful for future tasks. Over multiple sessions, this means Devin gradually builds up a memory of your project's conventions and architecture. In addition to knowledge, Cognition has released DeepWiki, a related tool that indexes entire codebases and provides a chat interface on top of them. While DeepWiki is a separate product, it suggests the broader architecture: Devin can query its own or an external wiki of the code to answer questions. In practice, if you ask Devin something about the code, it may internally use the same retrieval systems as DeepWiki to ground its replies.

Autonomy, iteration, and self-correction. Devin is designed to be autonomous, but with feedback loops when needed. After planning, it executes steps one by one, constantly checking for errors. In demos, the agent frequently follows this pattern.
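That pattern, a loop of acting, observing any error, and correcting before retrying, can be sketched generically. The function names and the bounded-retry policy are assumptions for illustration; how Devin actually structures its loop is not public.

```python
def run_with_self_correction(action, correct, max_attempts=3):
    """Generic act/observe/correct loop (illustrative sketch).

    `action()` attempts the step and raises on failure; `correct(error)`
    is the agent's chance to adjust (e.g., patch the code) before the
    next retry. Gives up after `max_attempts` failures.
    """
    last_error = None
    for _attempt in range(max_attempts):
        try:
            return action()          # act
        except Exception as err:     # observe the error message
            last_error = err
            correct(err)             # course-correct before retrying
    raise RuntimeError(f"gave up after {max_attempts} attempts: {last_error}")
```

In the UI, each iteration of a loop like this would show up as another edit-and-run sequence.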
It uses the browser or docs to understand a problem, writes some code, runs it, sees an error, and then looks up how to fix it, mimicking a human debug cycle. For example, one presenter shows Devin adding a login form, then running the front-end tests, finding a bug, and going back to research how to fix that error. Each of Devin's turns is a loop of think, act, observe, correct. Multiple sources note that Devin has self-correction built in. Indeed, Cognition's blog post about GPT-5 mentions that GPT-5 is good at understanding errors and course-correcting itself, which they highlight as great for long tasks. In other words, if Devin's code doesn't compile or fails a test, the model (often GPT-5 or similar) will see the error message and figure out a fix on the fly. It's even capable of retry loops: if an action partially succeeds, Devin may do a second pass. These loops are visible in the UI as repeated edit-and-run sequences. To systematically handle failures, Devin uses a mixture of automation and human oversight. For example, if Devin opens a pull request and receives a CI failure or a code review comment, Cognition's system will automatically wake Devin from sleep and have it address the issue. By default, Devin responds to lint errors or comments, although users can disable this. The UI also highlights its status and actions in real time, so a developer can intervene at any time. Developers are encouraged to watch the first few runs in live mode, where each step is shown, to build trust, then let Devin run fully headless once confident.

Safety, guardrails, and customization. Operators can give Devin explicit instructions on what not to do. One powerful feature is forbidden actions: you can list things Devin is not allowed to touch, for example, "do not push directly to main" or "don't edit file X". The system ensures Devin respects these commands when they appear in the prompt or in a playbook.
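Checking a proposed action against a forbidden-action list can be sketched as a simple rule matcher. The (verb, glob-pattern) rule format below is an assumption made for illustration; Devin's real rule representation is not documented.

```python
import fnmatch

def is_forbidden(action, rules):
    """Check a proposed action against user-defined forbidden-action rules.

    `action` is a dict like {"verb": "push", "target": "main"}; each rule
    is a (verb, target_pattern) pair, with shell-style glob patterns for
    targets. Purely illustrative rule format.
    """
    for verb, pattern in rules:
        if action["verb"] == verb and fnmatch.fnmatch(action["target"], pattern):
            return True
    return False
```

A gate like this would sit between the agent's proposed step and the tool that executes it, rejecting anything that matches a rule such as ("push", "main").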
According to release notes, Devin now handles forbidden-action lists reliably, meaning it checks its actions against those rules. This helps prevent common mistakes like modifying the wrong branch or file. Devin also provides various controls. In Slack or the web UI, you can tell Devin to sleep, pause work, or archive a session. You can choose whether Devin requires your approval before executing a plan (via the Agency setting) or runs fully autonomously. Its compute usage is metered in Agent Compute Units (ACUs), and the UI shows warnings if Devin is about to hit limits, so you can intervene or grant more resources. If something does go wrong behind the scenes, Cognition has monitoring in place. In earlier releases, some users reported Devin's sessions getting stuck or crashing. The team notes that those issues have been fixed and offers ACU refunds if Devin hangs. In other words, the company is actively instrumenting the system for reliability. Outside analysts caution that, like any chat-based AI, Devin can produce mistakes or hallucinate code on occasion. The recommended practice is to review its output as you would a junior developer's work. For safety, many teams use code reviews on Devin's commits and constrain Devin's permissions, e.g., no direct access to secrets by default. So far, the publicly described guardrails are mostly user-defined (forbidden actions, required plan approval, etc.) plus system health checks, rather than built-in ethical filters.

What we don't yet know. Cognition has intentionally kept some details internal, so parts of Devin are opaque. For example, the exact large language model it uses was not initially public. Rumors and later posts suggest Cognition now integrates GPT-5 into Devin for its planning and reasoning core, and there is a preview agent based on Claude Sonnet 4.5. But the full architecture is unclear.
Devin likely orchestrates multiple models and has custom fine-tuning, as hinted by the RFT planning subagent, but those layers aren't open-sourced. We also do not fully know the limits of its memory. Devin claims to learn over time, but how it merges new knowledge into its existing network, versus just storing it in the knowledge bank, is unspecified. The maximum length of conversation history it effectively uses is not documented. When a session is very long, it is possible earlier parts of the chat or code context get pruned behind the scenes. Practically, most users keep prompts and code concise to avoid context overload. On the safety side, some unknowns remain. For instance, while forbidden actions cover user-specified rules, it is not clear if Devin has any implicit safety layers, like detecting misuse of data, bias checks, or sandbox-escape protections. Since it runs in a VM, one hopes it cannot damage host systems, but details on that sandboxing are not public. The community infers that Devin's machine likely uses container snapshots, as mentioned for the RL training, to isolate runs. Finally, many in the community are watching to see how Devin deals with ambiguous or open-ended tasks. The sales pitch calls it fully autonomous, but analysts note it still often needs precise instructions. For example, if the user's prompt is vague, Devin might generate a plan that seems reasonable but misses important edge cases. It may ask clarifying questions in follow-up, but developers sometimes wonder how well it understands intent versus just pattern-matching on code. These aspects of Devin's cognition rely on the underlying LLM's capabilities, which we only observe indirectly. In short, users should judge Devin more as a highly skilled junior engineer than a product manager: it plans well, but might not always grasp your intent perfectly.

Getting started with Devin. Devin is mainly aimed at engineering teams that do a lot of coding work.
It shines on clearly defined tasks: building features from specs, refactoring, writing tests, and fixing bugs. It is less proven at high-level design or very ill-defined problems. For a software team, Devin can help knock out routine work, so humans can focus on creative architecture and oversight. For non-coders or newcomers, Devin can still be useful, but it requires some setup. The first step is to give Devin access to your code repository (via GitHub, GitLab, etc.) and perhaps connect it in Slack or Teams. Then try a simple task. For example, ask Devin: "Add a new page to list all products from our database in the web UI, including test coverage." Watch the planning-phase dialog: Devin will outline which files to change (e.g., HTML template, backend API code) and ask any needed questions. Approve the plan, or let it auto-run, and watch it execute. Use the follow panel to see each step. You'll see file edits, shell commands (like running test suites), and browser snapshots of the UI. If Devin makes a mistake or you want a change, simply interact as you would in chat: "Actually, use this CSS theme," or "The product title should be uppercase," and Devin will start another edit loop. The key actionable step is to iterate and review. Always check the code Devin produces and test it locally. Over time you can enrich the knowledge bank with notes like "our database uses PostgreSQL 13" or "we follow PSR-12 style in PHP," and Devin will begin to incorporate these in future sessions. Also explore the settings: turn Agency off if you want to always vet proposals, or on if you trust it more. Link Devin to your CI for automatic pull request review, but start with notifications so you can watch how it handles feedback. Ultimately, Devin's workflow is dense and powerful, but it still relies on you for guidance. By understanding how it plans, uses tools, and learns from feedback, as detailed above, you can get the most out of this new class of agentic coding assistant.
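The Agency setting described above, run the plan immediately versus wait for explicit approval, amounts to a simple gate in front of execution. The function shape, plan format, and callback are hypothetical, chosen only to make the control flow concrete.

```python
def execute_plan(plan, agency_enabled, ask_user):
    """Gate plan execution on an Agency-style setting (illustrative sketch).

    With agency on, the plan runs immediately; with it off, `ask_user(plan)`
    must return True before any step executes.
    """
    if agency_enabled or ask_user(plan):
        return [f"done: {step}" for step in plan]  # execute each step
    return []  # plan rejected; nothing executed
```

Teams that want to vet every proposal would run with the flag off and a human approval callback; teams that trust the agent flip it on and review afterward.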
The best next step for a team interested in Devin is to sign up and run a small pilot. Add one web repo, ask Devin to implement a feature, and let it run in progress mode. Observe the full thinking trace. That hands-on experience will clarify exactly how Devin weaves planning, editing, and self-correction together. From there, you can scale up to more tasks and fine-tune its use, for example with custom playbooks for your domain. Though still evolving, Devin represents a major leap in AI tooling. By learning its workflow today, teams can prepare for an era where coding tasks can truly be shared with an AI teammate. All links to sources are available in the text version of this article. You can find the full article at easycoding.tools. Thanks for listening. For more practical AI coding guides, tool comparisons, and resources for builders, visit easycoding.tools. And if it's easier to remember, just go to aibuilds.it; it redirects to easycoding.tools.