BYBOWU > Blog > AI

AWS Nova Forge: The Build‑Your‑Own Model Playbook

blog hero image
AWS Nova Forge just arrived with a promise most execs and CTOs have been waiting for: build your own frontier‑grade model without starting from a blank slate. This piece breaks down what actually shipped, why “open training” matters, how it intersects with Nova 2 and Trainium3, and a concrete 30‑day pilot plan you can run with a lean team. We’ll cover costs, risks, architecture patterns that work in production, and the tradeoffs leaders should weigh before committing budget.
📅
Published
Dec 07, 2025
🏷️
Category
AI
⏱️
Read Time
11 min

AWS Nova Forge is now generally available, and it changes how enterprises think about the build‑versus‑buy decision for AI. Instead of choosing between a generic hosted model and a multi‑year pretraining effort, teams can start from Amazon’s Nova checkpoints and inject proprietary data at pre‑, mid‑, or post‑training stages—what AWS calls “open training.” Announced on December 2, 2025, Nova Forge also brings early access to the newest Nova models. (aws.amazon.com)

Architecture: Nova Forge training with Bedrock inference

What shipped this week—and why it matters

Three announcements are worth your roadmap meeting:

First, Nova Forge GA. You can begin model development from early Nova checkpoints in SageMaker AI, mix in proprietary datasets with Nova‑curated data, run reinforcement fine‑tuning (RFT) in your own environments, and apply built‑in safety tooling. It’s live today in us‑east‑1 with more regions to follow. (aws.amazon.com)

Second, Nova 2 models landed in Bedrock. Nova 2 Lite is available now; Nova 2 Pro is in preview with one‑million‑token context windows, three reasoning intensity levels, and built‑in tools like code interpreter and web grounding. These are designed for agentic workflows and long, multi‑document reasoning. (aws.amazon.com)

Third, Nova 2 Omni entered preview. It’s a multimodal reasoning model that accepts text, image, video, and speech inputs and outputs text and images—useful when your agents must listen, watch, and act. Early access is via Nova Forge. (aws.amazon.com)

All of this ships alongside Trainium3 UltraServers: 144‑chip, 3nm systems with up to 4.4× performance and 4× better performance per watt versus the prior generation—important if you plan to train or heavily fine‑tune at scale. (aws.amazon.com)

Where AWS Nova Forge fits in your AI strategy

Here’s the thing: most enterprises don’t need to pretrain a general‑purpose foundation model from scratch. But they do need deep domain alignment—procedures, catalog specifics, regulatory language, product semantics—and reliable reasoning. Nova Forge is a middle path: you inherit Nova’s broad capabilities and teach it your company’s private know‑how through staged training and alignment, producing a custom variant (“Novella” in AWS parlance) that you control. (aboutamazon.com)

That “open training” access to pre‑, mid‑, and post‑training checkpoints is the real unlock. You can mitigate catastrophic forgetting by mixing Nova‑curated data with your own, then enforce safety with a first‑party toolkit. And because Nova 2 Lite is available today—with Nova 2 Pro and Nova 2 Omni early to Forge customers—you can start narrow and grow into more capable variants without replatforming. (aws.amazon.com)

On pricing, AWS states that Nova Forge uses an annual subscription model, with details in the console and through account teams. For inference and evaluation, Nova models run via Bedrock with on‑demand or provisioned throughput, and on‑demand for custom Nova is priced the same as base Nova inference. That makes cost modeling straightforward once you know your token budgets. (aws.amazon.com)

How to pilot AWS Nova Forge in 30 days

Here’s a tight, realistic plan my teams would run with three engineers and a product lead. Adjust the scope, not the discipline.

Day 1–3: Frame the problem and success metrics

Pick a use case where “generic LLM + RAG” is close but not reliable enough: complex policy summarization, product compatibility reasoning, code refactoring for your in‑house frameworks, or claims adjudication with dense business rules. Define three measurable outcomes (e.g., task success rate, median latency, and cost per completed task) and choose a rubric for human evals. Lock a weekly demo rhythm.

Day 4–7: Assemble data and safety boundaries

Curate a high‑signal corpus: authoritative manuals, golden tickets, verified decision trees, and negative examples that illustrate what to avoid. Draft guardrails: forbidden tool calls, escalation triggers, PII handling. Set up a private reward gym that mirrors your production environment for RFT—think sandboxed APIs and synthetic edge cases that punish bad behavior and reward chain‑of‑thought that leads to safe actions. (aws.amazon.com)

Day 8–12: Baseline on Nova 2 Lite

Stand up a baseline using Nova 2 Lite on Bedrock with shallow supervised fine‑tuning (SFT) where needed. Exploit Bedrock’s built‑in tools (code interpreter, web grounding) for tasks that blend reasoning with external actions. Gather token counts and latency, and run your first human eval pass to establish a floor. (aws.amazon.com)

Day 13–18: Forge your first Novella

Spin up Nova Forge in us‑east‑1. Start from the relevant checkpoint, mix Nova‑curated with proprietary data, and run short RFT cycles in your reward gym. Your goal isn’t SOTA—it's eliminating failure modes the baseline couldn’t handle. Keep iterations small; don’t aim for big monolithic runs on week one. (aws.amazon.com)

Day 19–23: Safety hardening and red teaming

Use the responsible AI toolkit to enforce policy constraints (blocked intents, safe tool invocation, and content filters). Run adversarial prompts from prior incidents and compliance test suites. Document the model card and residual risks. (aws.amazon.com)

Day 24–30: Side‑by‑side bake‑off

Run a blinded evaluation: generic Nova 2 Lite, your SFT‑only variant, and your first Novella. Use the same tasks, tool access, and budgets. Compare task success, latency, and cost per completed task. If your Novella wins on success rate by ≥10% with neutral or better cost, you’ve earned a greenlight for a quarter‑long program.

Trainium-class AI servers in a data center

Architecture patterns that actually work

Most teams will run a hybrid: Forge for training/alignment, Bedrock for inference, and selective use of SageMaker AI or EC2 UltraServers when they need more control or throughput. With Trainium3 UltraServers now available, you can stage larger or denser training cycles while keeping inference on Bedrock for elasticity and cost control. (aws.amazon.com)

For knowledge grounding and long‑context retrieval, consider pairing your Novella with vector search that scales. We’ve written about the new Amazon S3 Vectors capability and how it changes RAG economics at massive scale; if you’re pushing beyond tens of millions of chunks, it’s worth a look. RAG at billion‑vector scale on S3 Vectors can simplify your data plane and spend.

If your estate spans clouds, design for the pipes. Keep training data and reward gyms close to where you’ll run Forge. If inference needs to be closer to users or other clouds, plan your network and IAM boundaries up front. We’ve covered multicloud connectivity tradeoffs and playbooks you can adapt. See our multicloud playbook for AWS Interconnect and the follow‑ups for Google Cloud pairings.

And don’t ignore CPU backends. Much of your orchestration, pre/post‑processing, and evaluation runs fine (and cheaper) on the latest Graviton instances. For many teams, moving non‑GPU workloads to Graviton5 pays for part of the AI bill. If you’re still on x86 for these services, our guide can jump‑start your plan. A 90‑day Graviton5 migration plan outlines the milestones.

How does Nova Forge relate to Bedrock fine‑tuning?

Think of it as depth versus convenience. Bedrock’s SFT is the fastest path to tailor a model for many tasks. Forge is what you reach for when SFT tops out—when you need domain knowledge embedded earlier in the model’s learning, custom reward functions, and tighter control over safety. Bedrock also supports on‑demand deployment for custom Nova models, so you can keep ops simple after you train or fine‑tune. (aws.amazon.com)

Do I need Trainium3 to use Nova Forge?

No. You can train via SageMaker AI managed options and use Bedrock for inference, then scale up to Trainium3 UltraServers if and when your training needs justify it. Forge’s value—the checkpoints, data mixing, and RFT hooks—doesn’t require you to own a data center or an UltraCluster; it just ensures you have a path if you need it. (aws.amazon.com)

Cost control: the levers that matter

Leaders care about two lines: accuracy and cost per completed task. Nova helps with both, but only if you instrument correctly.

Token economics: AWS has published example Nova price points that, as of mid‑2025, illustrate substantial differences between Nova 2 Lite and Nova 2 Pro. Your model choice and prompt design matter more than you think. Do calorie counting on tokens up front and set guardrails in your app layer. (aws.amazon.com)

Provisioning: On Bedrock, custom Nova models support on‑demand inference priced the same as base Nova; this keeps your pilot costs variable while you find product‑market fit for your agent or workflow. When traffic stabilizes, consider Provisioned Throughput for predictability. (aws.amazon.com)

Training budgets: Start small. Use compact corpora with high signal density, then graduate to longer RFT runs. If you outgrow managed options, that’s when Trainium3 UltraServers begin to make sense—especially for large MoE or long‑context variants. (aws.amazon.com)

Risks and limits (read this before you scale)

Overfitting to proprietary lore: It’s easy to teach the model narrow truths that don’t generalize across teams, regions, or product lines. Keep a strong holdout set and run cross‑domain evals each sprint.

Compliance drift: Guardrails reduce risk, but governance still lives in process. Run policy tests as code. Treat your reward gym like regulated infra: versioned, reviewed, and auditable. (aws.amazon.com)

Latency surprises: Multimodal inputs and long contexts can spike latency. Batch where you can, and prune prompts aggressively. Consider hybrid patterns—summarize to structured facts, then reason over the skinny artifact.

Data movement: Ground truth often sits across clouds and regions. Be intentional about where you train and serve. If you haven’t mapped your interconnect and egress patterns, do it now to avoid surprise bills and throttling. Our practical AWS + Google multicloud plan walks through the gotchas.

A simple scoring framework for go/no‑go

Use this five‑factor score after your 30‑day bake‑off. Rate each 1–5; a 20+ total merits expansion.

1) Task success: Does the Novella beat the generic baseline by ≥10% on real tasks? 2) Safety: Did you eliminate critical failure modes in red‑team tests? 3) Latency: Are P50 and P95 inside your UX envelope? 4) Unit cost: Is cost per completed task stable within budget across traffic spikes? 5) Ops fit: Can your team run and iterate without heroics?

FAQ‑style quick hits

Can we start on Lite and graduate to Pro or Omni without rework?

Yes. Forge customers get early access to Nova 2 Pro and Omni. Keep your app thin and model‑agnostic—prompt schemas, tool contracts, and eval harness—to make upgrades incremental. (aws.amazon.com)

What about vendor lock‑in?

You’re opting into the Nova family and AWS training/inference stack. If portability is a hard requirement, isolate your data pipelines and evaluation harnesses, and keep RAG/store layers cloud‑neutral. But the trade for many enterprises—faster time to reliability—will be worth it.

How do we estimate token budgets?

Instrument from day one. Log input/output token counts, tool calls, and retries. Run weekly reports on cost per resolved ticket, per document, or per test case. Use price references from AWS posts as a sanity check and update when your account team shares current rates. (aws.amazon.com)

What to do next

• Pick a “nearly there” use case and run the 30‑day pilot. • Start on Nova 2 Lite in Bedrock, then move to Forge when SFT can’t eliminate critical misses. • Build a reward gym that mirrors production and wire it into CI. • Set token budgets and safety guardrails in code. • If your evals prove out, plan Q1 headcount and capacity for a three‑month scale‑up.

If you want a second set of eyes, our team can help—from use‑case selection and data triage to reward gym design and pilot execution. See what we do for AI and cloud engagements, browse a few relevant builds in our portfolio, and subscribe to our blog for deeper dives.

Planning cost and reliability for an AI pilot
Written by Viktoria Sulzhyk · BYBOWU
3,485 views

Work with a Phoenix-based web & app team

If this article resonated with your goals, our Phoenix, AZ team can help turn it into a real project for your business.

Explore Phoenix Web & App Services Get a Free Phoenix Web Development Quote

Get in Touch

Ready to start your next project? Let's discuss how we can help bring your vision to life

Email Us

[email protected]

We typically respond within 5 minutes – 4 hours (America/Phoenix time), wherever you are

Call Us

+1 (602) 748-9530

Available Mon–Fri, 9AM–6PM (America/Phoenix)

Live Chat

Start a conversation

Get instant answers

Visit Us

Phoenix, AZ / Spain / Ukraine

Digital Innovation Hub

Send us a message

Tell us about your project and we'll get back to you from Phoenix HQ within a few business hours. You can also ask for a free website/app audit.

💻
🎯
🚀
💎
🔥