AppGen-X v2: Natural Language to Production App in 90 Seconds

AppGen-X v1 generated technically valid code that no experienced developer would write. SQLAlchemy models with no relationship loading strategy: every query defaulting to lazy loading, N+1 problems waiting to detonate in production. REST endpoints with no pagination: fine for a demo dataset of 12 rows, embarrassing at 50,000. Forms with no CSRF protection. It compiled. It ran. A senior developer reviewing the output would have questions, and those questions would be reasonable ones.

We shipped it anyway because "generates a working skeleton" was still genuinely valuable. The alternative was nothing, and nothing helps no one. But we knew what we were shipping, and we knew v2 had to be different in ways that mattered, not just in features.

This post covers what changed, the decisions behind those changes, and what v2 still does not handle well, because any release post that only lists what improved is a marketing document, not an engineering document.

What v1 Got Wrong, Specifically

Before the feature list: the honest accounting. Three categories of problems drove the v2 design.

No intermediate representation (IR). In v1, each output target (Flask, Django, FastAPI) had its own parser that read the DSL directly and generated code. This meant every bug fix had to be applied three times. Every new input format (we wanted DBML and SQL DDL support) required three new parsers. The codebase was a maintenance multiplier, and adding a new output target was proportionally expensive.

Generated code with no production awareness. The ORM code did not specify lazy vs joined vs subquery loading on relationships, meaning SQLAlchemy chose for you, and SQLAlchemy's default (lazy "select", which issues a separate query the first time each parent's relationship is accessed) is almost never what you want in a list endpoint that fetches related objects. Endpoints had no limit/offset parameters. Generated forms used Flask-WTF but omitted the CSRF token in the template. None of these were bugs in the sense that the code would not run. They were bugs in the sense that the code should not be in production without correction.

No natural language input. v1 required users to write the AppGen-X DSL. The DSL is straightforward, but it is still a schema language, and schema languages have a learning curve. We were asking non-technical product owners to either skip the tool or learn the DSL before using it. Neither outcome was acceptable.

1. Natural Language Input

The headline feature. You describe your application in plain English; AppGen-X produces a DSL spec. From that spec, the code generators run as normal.

We evaluated GPT-4o first. Most of our users run AppGen-X in their own infrastructure, not through our cloud, and for those enterprise deployments the API cost was prohibitive; self-hosting a closed model was never an option. We then evaluated several open models before settling on Qwen2.5-Coder-32B. In our evaluation it handled code-structured prompts better than larger general-purpose models, and we run it on our own infrastructure. Model inference accounts for most of the end-to-end generation time; the benchmark section below has the measured numbers.

The NL-to-spec pipeline works in two stages. First, the model produces a structured JSON representation of entities, fields, relationships, and API surface. Second, a deterministic compiler validates that JSON against the AppGen-X IR schema and rejects or repairs anything that does not conform. The model can hallucinate; the IR compiler cannot be convinced to accept a malformed relationship definition.
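The second stage can be pictured with a minimal sketch. This is not the AppGen-X compiler: the JSON layout, the `Field` class, `SpecError`, and `compile_spec` are all hypothetical names chosen for illustration, and the only rule shown is "a foreign key must point at an entity that exists".

```python
import json
from dataclasses import dataclass
from typing import Optional


class SpecError(ValueError):
    """Raised when model output breaks a rule of the (hypothetical) IR."""


@dataclass(frozen=True)
class Field:
    name: str
    type: str
    target: Optional[str] = None  # entity name, only used by "fk" fields


def compile_spec(raw_json: str) -> dict[str, list[Field]]:
    """Turn the model's JSON into validated entities, or raise SpecError."""
    data = json.loads(raw_json)
    entities: dict[str, list[Field]] = {}
    for ent in data["entities"]:
        entities[ent["name"]] = [
            Field(f["name"], f["type"], f.get("target")) for f in ent["fields"]
        ]
    # Deterministic check: every fk must reference a declared entity.
    for name, fields in entities.items():
        for f in fields:
            if f.type == "fk" and f.target not in entities:
                raise SpecError(f"{name}.{f.name}: unknown fk target {f.target!r}")
    return entities
```

However plausible the model's output looks, a foreign key to an entity it never declared fails here, before any generator runs.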

A concrete example. Given this prompt:

# Natural language input
"Build a project management tool. Users belong to organizations.
Projects belong to organizations and have multiple tasks. Tasks
can be assigned to users. Each task has a status: todo, in_progress,
or done."

AppGen-X produces this DSL fragment:

# Generated AppGen-X DSL (excerpt)
entity Organization:
  id: uuid pk
  name: str required
  slug: str unique

entity User:
  id: uuid pk
  email: str unique required
  organization: fk(Organization) required
  display_name: str

entity Project:
  id: uuid pk
  name: str required
  organization: fk(Organization) required
  created_at: datetime auto

entity Task:
  id: uuid pk
  title: str required
  status: enum(todo, in_progress, done) default(todo)
  project: fk(Project) required
  assignee: fk(User) nullable
  due_date: date nullable

That spec goes into the IR, validation runs, and the code generators take over. The model does not touch code generation: that boundary is intentional and enforced.

2. The Intermediate Representation

The IR is the most important architectural decision in the v2 codebase. We almost shipped without it. The temptation was to take the v1 approach, per-target parsers, and just add the NL input as a new front-end. We would have regretted it within six months.

The IR as architectural load-bearing wall. The intermediate representation is a normalized, validated data model that sits between all inputs (natural language, DSL, DBML, SQL DDL) and all outputs (Flask, FastAPI, GraphQL, and Flutter in v2, with Django planned for v2.2). Validation happens exactly once. Code generators are input-agnostic. Adding a new output target is roughly 300 lines of generator code, not 300 lines per input format. We spent three weeks on the IR schema. It was the best three weeks in the v2 development cycle.

The IR schema is defined in Python using Pydantic v2. Every entity, field, relationship, index, and constraint that AppGen-X understands has a typed representation in the IR. The DSL compiler, the NL pipeline, the DBML parser, and the SQL DDL reader all produce IR objects: not code, not strings, not dictionaries. When a parser produces an invalid IR object, Pydantic raises before anything reaches a code generator.

This also meant we could fix the v1 relationship loading problem systematically. The IR now carries an explicit loading_strategy field on every foreign key. The DSL compiler infers a sensible default based on relationship cardinality and endpoint context: 1:N relationships used in list endpoints default to subquery; relationships used in detail endpoints default to joined. Users can override explicitly. Code generators read the field and emit the correct SQLAlchemy argument. The problem is solved at the IR level, not patched in three separate generators.
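As an illustration of how that default could be resolved, here is a sketch under stated assumptions. The real IR is defined with Pydantic v2; this version uses plain dataclasses to stay dependency-free, and `Relationship`, `resolve_strategy`, and `emit_relationship` are hypothetical names. The fallback for cases the rules above do not cover is also an assumption.

```python
from dataclasses import dataclass
from typing import Literal, Optional

Strategy = Literal["subquery", "joined", "select"]


@dataclass
class Relationship:
    name: str
    cardinality: Literal["1:N", "N:1"]
    endpoint: Literal["list", "detail"]
    loading_strategy: Optional[Strategy] = None  # explicit user override


def resolve_strategy(rel: Relationship) -> Strategy:
    """An explicit override wins; otherwise apply the defaults described above."""
    if rel.loading_strategy is not None:
        return rel.loading_strategy
    if rel.endpoint == "detail":
        return "joined"
    if rel.cardinality == "1:N":  # 1:N used in a list endpoint
        return "subquery"
    return "select"  # assumption: the post does not specify this case


def emit_relationship(rel: Relationship, target: str) -> str:
    """Render the SQLAlchemy relationship() line a generator might emit."""
    return f'{rel.name} = relationship("{target}", lazy="{resolve_strategy(rel)}")'


print(emit_relationship(Relationship("tasks", "1:N", "list"), "Task"))
# tasks = relationship("Task", lazy="subquery")
```

Because the decision lives in one function over IR data, every generator that reads the field emits the same strategy.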

3. GraphQL with Strawberry-Python

v2 adds a GraphQL output target. We chose Strawberry over Graphene because Strawberry is type-safe at the Python level: types are Python dataclasses decorated with strawberry.type, not string-based schema definitions. The resolver stubs AppGen-X generates pass pyright with zero errors. Graphene requires you to trust string-based field definitions at runtime; Strawberry lets your type checker catch schema/resolver mismatches before you run anything.

The part we are most proud of is dataloader generation. N+1 queries are the default failure mode of naive GraphQL implementations: a query for 20 projects that each have tasks triggers 21 database queries instead of 2. Fixing this requires dataloaders, which batch and cache child queries. Writing correct dataloader code for an arbitrary relationship graph by hand is tedious. Generating it is harder: the generator has to understand the relationship graph from the IR, identify which relationships will be traversed in GraphQL resolvers, and emit a correctly-keyed dataloader for each one.

# Generated Strawberry type with dataloader resolver
# AppGen-X emits this automatically for every 1:N relationship
import strawberry
from aiodataloader import DataLoader  # assumed source: subclasses override batch_load_fn
# Task is the generated ORM model; its import depends on the project layout.

@strawberry.type
class ProjectType:
    id: strawberry.ID
    name: str

    @strawberry.field
    async def tasks(self, info: strawberry.types.Info) -> list["TaskType"]:
        # Loaders are looked up from the resolver context
        return await info.context["loaders"].tasks_by_project.load(self.id)


# Generated DataLoader: one per FK relationship in the resolver graph
class TasksByProjectLoader(DataLoader):
    async def batch_load_fn(self, project_ids: list[str]) -> list[list[Task]]:
        # One query for the whole batch instead of one query per project
        tasks = await Task.filter(project_id__in=project_ids).all()
        by_project: dict[str, list[Task]] = {pid: [] for pid in project_ids}
        for task in tasks:
            # Keys arrive as strings (strawberry.ID); normalize in case project_id is a UUID
            by_project[str(task.project_id)].append(task)
        # Output order must match input key order; a common mistake in hand-written loaders
        return [by_project[pid] for pid in project_ids]

The generator walks the IR relationship graph, identifies every foreign key that could be traversed from a GraphQL query root, and emits a loader for each. The output list must be ordered to match the input key list: a subtlety that trips up hand-written dataloaders regularly, and one the generator handles correctly by construction.
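The ordering rule is easy to check in isolation. The sketch below is a dependency-free stand-in, not generated output: `TASKS` is an in-memory substitute for the tasks table, and `fetch_tasks` and `batch_load_tasks` are hypothetical names. It shows one batched fetch serving three keys, including a key with no tasks.

```python
import asyncio

# In-memory stand-in for the tasks table: (task_id, project_id) rows.
TASKS = [("t1", "p1"), ("t2", "p2"), ("t3", "p1")]
query_count = 0


async def fetch_tasks(project_ids: list[str]) -> list[tuple[str, str]]:
    """Stand-in for a single `WHERE project_id IN (...)` query."""
    global query_count
    query_count += 1
    return [row for row in TASKS if row[1] in project_ids]


async def batch_load_tasks(project_ids: list[str]) -> list[list[str]]:
    """Return one list of task ids per key, in the same order as the keys."""
    rows = await fetch_tasks(project_ids)
    # Pre-seed every key so projects without tasks still get an empty list.
    by_project: dict[str, list[str]] = {pid: [] for pid in project_ids}
    for task_id, project_id in rows:
        by_project[project_id].append(task_id)
    return [by_project[pid] for pid in project_ids]


result = asyncio.run(batch_load_tasks(["p2", "p3", "p1"]))
print(result)       # [['t2'], [], ['t1', 't3']]
print(query_count)  # 1
```

Returning the grouped rows in database order rather than key order would silently attach tasks to the wrong project, which is why the final list is rebuilt from the key list.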

4. Flutter Target

The war story version: we spent three weeks on state management before a single screen rendered correctly.

We evaluated Provider, Riverpod, Bloc, and GetX. The evaluation was not close. Provider is fine for simple cases and becomes architectural spaghetti at scale. Bloc's event/state model produces verbose but predictable code, predictable enough to generate, which matters, but the boilerplate-to-logic ratio is high and generated boilerplate is generated boilerplate. GetX's magic makes it difficult to generate code that a human can subsequently understand and maintain. Riverpod won: it is code-generation friendly via the @riverpod annotation pattern, its compile-time safety aligns with AppGen-X's approach of catching errors before runtime, and its async notifier pattern maps cleanly onto list/detail state derived from API calls.

If you disagree, our DMs are not open for this debate.

The Flutter generator reads the IR entity definitions and API surface, then emits Riverpod providers for each entity's list and detail state, repository classes that call the generated API client, model classes with fromJson/toJson, and basic list/form/detail screens in Material 3 with correct navigation and error handling. The screens are not beautiful. Styling is intentionally left to the developer. AppGen-X generates structure, not design systems.

The 90-Second Benchmark

The headline says 90 seconds. We should be precise about what that means.

Tested on a 2023 MacBook Pro M2 Pro, local inference (Qwen2.5-Coder-32B at 4-bit quantization), a 5-table CRUD schema with standard relationships, Flask + GraphQL + Flutter output targets selected. From natural language prompt to written output files: 87 seconds. On our CI server, an AMD EPYC instance with an A100, it is 110 seconds. We are not hiding that number. The M2 Pro is faster for this workload because the quantized model fits efficiently in unified memory; the A100 instance is not provisioned for low-latency inference.

DSL input, skipping the NL step, runs in 12–18 seconds regardless of hardware. The bulk of the generation time is model inference for NL-to-spec; the code generators themselves are fast. If your workflow is DSL-first, the 90-second headline is not your number.
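A rough way to see the split is to subtract the DSL-only range from each end-to-end figure. This is back-of-the-envelope arithmetic on the numbers quoted above, and it assumes the DSL-only range applies to the same schema and output targets as the NL runs; `nl_stage_range` is a hypothetical helper.

```python
# Figures quoted in the text, in seconds.
NL_TOTAL = {"M2 Pro": 87, "CI (A100)": 110}
DSL_ONLY = (12, 18)  # range reported for DSL input


def nl_stage_range(total: int, dsl_only: tuple[int, int] = DSL_ONLY) -> tuple[int, int]:
    """Estimate the NL-to-spec share as total time minus the DSL-only range."""
    low, high = dsl_only
    return (total - high, total - low)


for machine, total in NL_TOTAL.items():
    print(machine, nl_stage_range(total))
# M2 Pro (69, 75)
# CI (A100) (92, 98)
```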

What v2 Still Does Not Handle Well

The part of this post that the marketing team would prefer we omit.

NL input hallucinates on ambiguous many-to-many schemas. "Users can have multiple projects and projects can have multiple users" should produce a proper M2M join table (user_projects, with foreign keys to both sides). Depending on phrasing, Qwen2.5-Coder-32B sometimes produces a 1:N instead: projects get a single owner_id FK to users, which is structurally valid but semantically wrong for this description. The IR compiler accepts it because a 1:N is a legal relationship; it just is not what you asked for. If your schema has complex many-to-many relationships, describe them explicitly in the DSL or be explicit in your NL prompt ("users have a many-to-many relationship with projects through a membership table"). We are working on a relationship disambiguation pass that surfaces a clarification prompt when the model's cardinality confidence is below a threshold.
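The planned disambiguation pass does not exist yet, so the following is only a sketch of the idea: turn a low-confidence cardinality guess into a question instead of a schema. The threshold value, the confidence score, and every name here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.8  # hypothetical value, not a shipped default


@dataclass
class CardinalityGuess:
    left: str
    right: str
    cardinality: str   # for example "1:N" or "M:N"
    confidence: float  # hypothetical score in [0, 1]


def clarification_for(guess: CardinalityGuess) -> Optional[str]:
    """Return a question for the user when the guess is not confident enough."""
    if guess.confidence >= CONFIDENCE_THRESHOLD:
        return None
    return (
        f"Can one {guess.right} belong to more than one {guess.left}? "
        f"(current guess: {guess.cardinality})"
    )


print(clarification_for(CardinalityGuess("User", "Project", "1:N", 0.55)))
# Can one Project belong to more than one User? (current guess: 1:N)
```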

Beyond the M2M issue:

No authentication scaffolding. AppGen-X generates CRUD endpoints and screens. It does not generate authentication: no login flow, no JWT handling, no session management. This is a deliberate scope decision, not an oversight. Auth requirements vary too much across deployment contexts to generate well generically. The generated code has no awareness of auth: no @login_required decorators, no current-user context injected into endpoints. That is on your list from day one.

Flutter screens assume REST. The Flutter generator calls the REST API target. If you select GraphQL but not REST, the Flutter output will not work as-is. Mixed-target generation where Flutter consumes GraphQL is on the roadmap for v2.1.

Complex business logic is out of scope by design. AppGen-X generates CRUD scaffolding. If your application has non-trivial business rules (multi-step workflows, computed fields, event-driven side effects), those go in extension hooks that the generator leaves empty and documented. The generator cannot model your business logic. Its job is to eliminate the scaffolding work so you can focus on the business logic, not to replace it with generated code.

Generated migrations are additive only. The Alembic migration generator creates initial migrations and additive changes: new tables, new columns, new indices. It does not generate safe destructive migrations: dropping columns, renaming fields, changing types. Destructive schema changes require human review. This is correct behavior; destructive migrations on production databases should always have a human in the loop.
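The additive-only rule amounts to an allowlist. In the sketch below the operation names match Alembic's `op.create_table`, `op.add_column`, and `op.create_index`, but `split_operations` and the idea of diffing by operation name are assumptions made for illustration, not the generator's actual logic.

```python
# Operations treated as safe to generate automatically.
ADDITIVE_OPS = {"create_table", "add_column", "create_index"}


def split_operations(ops: list[str]) -> tuple[list[str], list[str]]:
    """Split operation names into (auto-generated, needs human review)."""
    generate: list[str] = []
    review: list[str] = []
    for op in ops:
        # Anything not explicitly additive (drops, renames, type changes,
        # or an unknown operation) is routed to a human.
        (generate if op in ADDITIVE_OPS else review).append(op)
    return generate, review


print(split_operations(["add_column", "drop_column", "create_index"]))
# (['add_column', 'create_index'], ['drop_column'])
```

Defaulting unknown operations to review keeps the failure mode conservative: a new operation type is held back until someone classifies it.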

What Is Next

The immediate roadmap is constrained to things we are confident we can do correctly, not things that sound good in a release post.

v2.1 targets the M2M disambiguation problem: a structured clarification prompt when the model's relationship confidence is below threshold, so ambiguous NL input produces a question rather than a wrong schema. It also targets Flutter-over-GraphQL so the Flutter generator can consume a GraphQL API instead of requiring a REST layer in parallel.

v2.2 targets Django as a first-class output target. v2 ships Flask and FastAPI; Django's ORM and migration system require a distinct generator with different loading strategy semantics. v2.2 also targets an async FastAPI mode. v2's FastAPI output uses synchronous SQLAlchemy, and the async SQLAlchemy + asyncpg path is a separate generator that warrants its own release rather than a flag.

The CSRF problem from v1 is fixed in v2. The lazy loading problem is fixed. The pagination problem is fixed: every generated list endpoint takes limit and offset parameters, defaults to limit 50, and returns a total count alongside the results. These were embarrassing omissions in v1. They are not in v2.
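The pagination contract can be sketched without any web framework. This is not the generated endpoint code: `paginate` and the response keys are hypothetical, and slicing an in-memory sequence stands in for a LIMIT/OFFSET query plus a COUNT.

```python
from typing import Any, Sequence

DEFAULT_LIMIT = 50  # the default limit described above


def paginate(rows: Sequence[Any], limit: int = DEFAULT_LIMIT, offset: int = 0) -> dict[str, Any]:
    """Return one page of rows plus the total count of all rows."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    return {
        "results": list(rows[offset:offset + limit]),
        "total": len(rows),
        "limit": limit,
        "offset": offset,
    }


page = paginate(list(range(120)), offset=100)
print(len(page["results"]), page["total"])  # 20 120
```

Returning the total alongside the page lets a client compute how many pages remain without issuing a second request.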

AppGen-X v2 is available now on GitHub. The v1 DSL is fully compatible: the v2 parser accepts a strict superset of it. The changelog documents every new IR field and every behavior change in the generators. If you were generating Flask apps with v1, the output differs in the loading strategy and pagination details. Both changes are improvements, and both may require updating any tests that assert on the exact generated output.