AWS Nova Forge is here, and it changes how serious teams approach custom AI. Announced on December 2, 2025 during re:Invent, Nova Forge lets you start from Amazon’s early Nova checkpoints and blend your own data through multiple training stages—beyond the usual fine‑tuning. Pair that with the Nova 2 family (Lite, Pro, Sonic, Omni), new Trainium3 UltraServers, and Lambda durable functions, and you’ve got a fresh, very practical path to domain‑specific intelligence that’s cheaper and faster to ship than last year’s playbooks.

Why Nova Forge matters right now

Most enterprises hit a wall trying to shoehorn generic models into specialized workflows—claims adjudication, KYC reviews, CAD change orders, clinical coding, tax notices, the list goes on. Nova Forge opens the hood: you can start from pre‑training or mid‑training checkpoints of Amazon Nova models, inject proprietary corpora, and still retain general reasoning capabilities. You also get reinforcement‑style fine‑tuning with your own reward functions and built‑in safety tooling, plus early access to advanced models like Nova 2 Pro and Nova 2 Omni.

Here’s the thing: if your organization sits on high‑signal data (decades of SOPs, annotated tickets, decision trees, procedure notes), AWS just offered you a way to turn that pile into a defensible capability without staffing a research lab. That’s why AWS Nova Forge is the headline worth acting on this week, not next quarter.

What actually changed at re:Invent 2025

Concrete, ship‑ready updates developers can use now:

• Nova 2 models: Nova 2 Lite is available now for everyday reasoning with controllable “thinking” depth; Nova 2 Pro is in preview for complex, multi‑step work. Nova 2 Sonic upgrades real‑time voice with polyglot voices and a larger context window, while Nova 2 Omni (preview) handles multimodal reasoning across text, image, video, and speech.

• Nova Forge GA: Build domain models starting from early Nova checkpoints, bring your own data during multiple phases, run reinforcement fine‑tuning with custom rewards, and define safety guardrails—all within the AWS stack.

• Trainium3 UltraServers: Up to 144 Trainium3 chips per system, with materially better performance and efficiency than the prior generation. That matters for training and inference cost curves, especially for long‑context, tool‑using agents.

• Lambda durable functions: A first‑class way to run reliable multi‑step flows—think AI approvals, RAG pipelines, and back‑office automations—that can pause for human input and resume, preserving progress for up to a year.

• Data plumbing upgrades: S3 Vectors scales massively for native vector search; S3’s max object size jumps to 50 TB; CloudWatch and S3 Tables improvements reduce duct‑tape ETL.

Build vs. Buy vs. Forge: a decision framework

Use this quick rubric before you spin cycles on custom training:

1) Buy (managed model) if…

• Your tasks are generic (FAQ chat, simple classification) and latency/cost targets are modest.

• Compliance risk is low and you don’t need to embed deep domain rules or esoteric vocab.

• You want to ship a baseline in days using Bedrock‑hosted models and keep options open.

2) Fine‑tune (classic) if…

• You have labeled data for narrow tasks, but the base model’s general knowledge is good enough.

• You need small accuracy gains at controlled cost, and you can live with occasional hallucinations mitigated by retrieval.

3) Forge (Nova Forge) if…

• You have proprietary corpora with dense, decision‑grade signals (e.g., adjudication outcomes, detailed SOPs, adjudicator notes, exception flows).

• You need reasoning aligned to domain rules (compliance, safety, regulated orders) and want control over reward signals and guardrails.

• You expect sustained savings from fewer model calls, fewer agents, or faster resolution times—enough to justify training runs on Trainium.

Rule of thumb: if your weekly human expert time on the workflow exceeds 500 hours and error costs are real dollars, Nova Forge is worth a pilot. If not, start with Nova 2 Lite plus retrieval and re‑evaluate in 60 days.
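
If you want the rubric in a form you can argue about in a planning doc, here is a minimal sketch. The `Workflow` fields and the `recommend` function are hypothetical names, and the only number taken from the text is the 500‑hour rule of thumb; everything else is an assumption you should adjust.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """Hypothetical summary of one candidate workflow."""
    weekly_expert_hours: float           # human expert time spent per week
    errors_cost_real_money: bool         # do mistakes carry real dollar costs?
    has_dense_proprietary_corpus: bool   # outcomes, SOPs, adjudicator notes
    has_labeled_narrow_task_data: bool   # labels for one narrow task


def recommend(w: Workflow, hours_threshold: float = 500.0) -> str:
    """Return 'forge', 'fine-tune', or 'buy' following the rubric above."""
    worth_a_pilot = (
        w.weekly_expert_hours > hours_threshold and w.errors_cost_real_money
    )
    if worth_a_pilot and w.has_dense_proprietary_corpus:
        return "forge"
    if w.has_labeled_narrow_task_data:
        return "fine-tune"
    # Managed model plus retrieval; re-evaluate in 60 days.
    return "buy"
```

Treat the output as a conversation starter, not a verdict: the point is to make the inputs explicit so nobody funds a training run on vibes.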

How to pilot AWS Nova Forge in 90 days

This track is sequenced so the pilot produces a go/no‑go answer without torpedoing delivery velocity.

Weeks 0–2: Pick the narrowest high‑value slice

• Choose one decision loop: “Approve/deny with reasons,” “Classify and cite policy,” “Triage to queue + next action.” Define one success metric (e.g., F1 on decisions, average handle time, first‑contact resolution).

• Data audit: identify 50k–500k high‑signal examples. Prioritize documents with outcomes, rationales, and timestamps. Scrub PII beyond what’s needed for the task. Document consent and retention policies.

• Guardrails: write down prohibited actions, sources of truth, and escalation rules you’ll enforce in Forge’s responsible AI features.

• Baseline: wire up Nova 2 Lite with retrieval against a curated knowledge base. Capture latency, cost per task, and accuracy as the line to beat.

Weeks 3–4: Build your evaluation harness

• Gold sets: 1k–5k examples with adjudicator agreement. Freeze them.

• Scoring: implement automatic metrics (exact match, citation coverage, tool‑use success), plus human review of edge cases weekly.

• Tooling: stand up dashboards for cost, latency, and error categories. Decide your rollback criteria.
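
A scoring harness for the two simplest automatic metrics might look like the sketch below. `Gold`, `Prediction`, and `score` are illustrative names, and "citation coverage" is defined here as the share of required policy citations that the model actually cited, which is one reasonable reading rather than a standard definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gold:
    decision: str          # adjudicator-agreed outcome
    citations: frozenset   # policy IDs a good answer should cite


@dataclass(frozen=True)
class Prediction:
    decision: str
    citations: frozenset


def score(gold: dict, preds: dict) -> dict:
    """Score predictions keyed by example ID against a frozen gold set."""
    if not gold:
        raise ValueError("gold set is empty")
    exact, coverage = 0, 0.0
    for example_id, g in gold.items():
        p = preds.get(example_id)
        if p is None:
            continue  # a missing prediction scores zero on both metrics
        exact += p.decision == g.decision
        if g.citations:
            coverage += len(g.citations & p.citations) / len(g.citations)
        else:
            coverage += 1.0  # nothing to cite, so nothing was missed
    n = len(gold)
    return {"exact_match": exact / n, "citation_coverage": coverage / n}
```

Run the same `score` call against the baseline and every later checkpoint so the numbers stay comparable.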

Weeks 5–6: Enter Nova Forge

• Start from an early Nova checkpoint offered by Forge. Train with your proprietary data; keep a small portion for reinforcement fine‑tuning with your reward function (e.g., correctness + citation + policy adherence).

• Configure safety guardrails: banned sources, escalation triggers, and hard constraints. Log all violations to your SIEM.

• Compare to baseline: same gold sets, same cost and latency targets.
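
To make "correctness + citation + policy adherence" concrete, here is one way a reward could be shaped. This is a plain Python sketch, not Nova Forge’s reward interface; the function name, the weights, and the split between soft and hard violations are all assumptions for illustration.

```python
def reward(
    correct: bool,
    citation_coverage: float,
    soft_policy_violations: int,
    hard_violation: bool,
    weights: tuple = (0.6, 0.25, 0.15),
) -> float:
    """Scalar reward in [0, 1]; the weights are illustrative, not tuned."""
    if hard_violation:
        return 0.0  # hard constraints gate the whole reward
    w_correct, w_cite, w_policy = weights
    coverage = min(1.0, max(0.0, citation_coverage))
    # Each soft violation shrinks the policy term: 1, 1/2, 1/3, ...
    policy_score = 1.0 / (1 + max(0, soft_policy_violations))
    return w_correct * float(correct) + w_cite * coverage + w_policy * policy_score
```

Speed is deliberately absent from this reward. Keep the function in version control next to the gold sets so every checkpoint can be traced to the reward that produced it.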

Weeks 7–8: Stage to production

• Ship behind a feature flag to 5–10% of traffic. Monitor guardrail violations, mean time to reason, and task success. Keep Nova 2 Lite + retrieval as fallback.

• Add human‑in‑the‑loop for borderline cases; capture feedback to improve rewards.
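
The flag itself can be as simple as deterministic bucketing on a request ID, with the baseline as the fallback path. The sketch below assumes `forge` and `baseline` are callables you supply (hypothetical stand-ins for your two model paths) and that request IDs are stable strings.

```python
import hashlib


def in_rollout(request_id: str, percent: float, salt: str = "forge-pilot") -> bool:
    """Deterministically place a request in the first `percent` of buckets."""
    digest = hashlib.sha256(f"{salt}:{request_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def route(request_id, payload, percent, forge, baseline):
    """Send a slice of traffic to the Forge model, everything else to baseline."""
    if in_rollout(request_id, percent):
        try:
            return forge(payload)
        except Exception:
            # Any failure on the Forge path falls back to the baseline.
            # In production you would also log the exception here.
            return baseline(payload)
    return baseline(payload)
```

Hashing instead of random sampling means the same request ID always lands on the same side of the flag, which makes borderline cases reproducible when a reviewer asks what happened.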

Weeks 9–12: Expand and harden

• Scale to 25–50% if error budgets hold. Snapshot checkpoints and version your reward functions.

• Run a chaos week: deliberately degrade retrieval, inject noisy inputs, and verify safety responses.

• Decide: ship, iterate, or cut. If you cut, you still keep your evaluation harness and guardrails for future models.

How do I estimate Trainium3 capacity?

Think in tokens and throughput. For pretraining or heavy reinforcement phases, your cost is dominated by tokens processed per second and the time to reach target loss/accuracy. AWS’s Trainium3 UltraServers package up to 144 chips with major gains over the previous generation, so you’ll see shorter wall‑clock times and better performance per watt. To budget: pick your token budget (say, 100–500B tokens for a mid‑sized domain model), estimate effective tokens/second per chip from your pilot, then scale by desired time‑to‑train. When in doubt, buy time—not just chips—by tuning your data mix and curriculum first.
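
The budgeting arithmetic above fits in a few lines. In this sketch the function names are made up, `efficiency` is an assumed fudge factor for restarts and stalls, and any throughput you plug in should come from your own pilot, not from a spec sheet.

```python
import math

SECONDS_PER_DAY = 86_400


def chips_needed(total_tokens, tokens_per_sec_per_chip, target_days, efficiency=1.0):
    """Chips required to process `total_tokens` within `target_days`."""
    if min(total_tokens, tokens_per_sec_per_chip, target_days, efficiency) <= 0:
        raise ValueError("all inputs must be positive")
    per_chip = tokens_per_sec_per_chip * efficiency * target_days * SECONDS_PER_DAY
    return math.ceil(total_tokens / per_chip)


def training_days(total_tokens, tokens_per_sec_per_chip, chips, efficiency=1.0):
    """Wall-clock days to process `total_tokens` on a fixed number of chips."""
    if min(total_tokens, tokens_per_sec_per_chip, chips, efficiency) <= 0:
        raise ValueError("all inputs must be positive")
    per_second = tokens_per_sec_per_chip * efficiency * chips
    return total_tokens / per_second / SECONDS_PER_DAY
```

As a worked example with a placeholder throughput of 5,000 tokens per second per chip (an invented number, not a Trainium3 figure): a 100B‑token run in 14 days needs `chips_needed(100e9, 5_000, 14)`, which is 17 chips, while the same run on a full 144‑chip system takes about 1.6 days.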

For inference, long‑context agents (Nova 2 models can go up to million‑token contexts) push memory hard; test batch sizes and latency envelopes with realistic prompts and tool calls. A small architectural change—like trimming retrieval context or streaming partial results—often saves more than throwing hardware at the problem.

What about data security and lock‑in?

Nova Forge keeps your proprietary data in your AWS environment during training. The controls around it are still yours to design: encryption in transit and at rest, tight IAM boundaries, and explicit data retention policies. For portability, require your teams to document prompts, tools, and reward functions independently of any single model. Keep a dual‑track path: a Bedrock‑hosted Nova 2 baseline and your Forge‑trained model. If a roadmap pivots, you have an escape hatch without a rewrite.

Risks, gotchas, and edge cases

• Reward hacking: if your reward mixes correctness and speed, models may learn to cut corners. Keep a held‑out adversarial set and rotate it.

• Catastrophic forgetting: when your domain data overwhelms the mix, general reasoning can suffer. Monitor open‑domain tasks alongside your target metric.

• Evaluation variance: don’t celebrate a +2% gain off a single run. Require confidence intervals or repeated trials before a rollout decision.

• Cost whiplash: long context + tool calls can double effective inference cost. Cap context lengths, stream results, and audit unused references.

• Quotas and regions: Nova Forge started in a limited set of regions. Plan for staggered rollouts and keep region‑specific runbooks.
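
For the evaluation‑variance point, a paired bootstrap over the gold set is one cheap way to get a confidence interval on a gain. This is a sketch under simple assumptions: both models are scored on the same examples, scores are per‑example numbers such as 0/1 correctness, and the percentile method is good enough for a rollout decision. The function name is hypothetical.

```python
import random


def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-example gain of candidate over baseline."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    # Resample examples with replacement and record the mean gain each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the interval includes zero, the +2% is not yet a result. Rerun with more gold examples or more training seeds before anyone books the rollout.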

Where Lambda durable functions fit

Once your model’s behavior stabilizes, you need reliable orchestration. Lambda durable functions add progress checkpoints, error recovery, and the ability to pause for approvals or external events for up to a year without burning compute. It’s ideal for agentic workflows that mix calls to Nova 2, retrieval, and downstream systems. If you already run serverless at scale, this drops in cleanly and replaces a bunch of custom state machines your team probably maintains today. Our Lambda operations playbook breaks down patterns for productionizing serverless workloads and is worth a skim before you wire agents into customer‑facing flows.
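
To see why checkpoint‑and‑resume matters, here is the pattern stripped down to plain Python. This is not the Lambda durable functions API; `Paused`, `run_workflow`, and the in‑memory `checkpoints` dict are illustrative stand‑ins for what the managed service handles for you.

```python
class Paused(Exception):
    """Raised by a step that is waiting on a human or an external event."""


def run_workflow(steps, initial_state, checkpoints):
    """Run named steps in order, skipping any that already checkpointed.

    `checkpoints` is any dict-like store that survives between invocations
    (an in-memory dict here; durable storage in practice).
    """
    state = initial_state
    for name, step in steps:
        if name in checkpoints:
            state = checkpoints[name]  # replay: reuse the saved result
            continue
        state = step(state)            # may raise Paused
        checkpoints[name] = state      # record progress before moving on
    return state


approvals = {}  # hypothetical approval inbox, keyed by case ID


def draft_decision(state):
    return {**state, "draft": "deny, citing policy 4.2"}


def await_approval(state):
    if state["case_id"] not in approvals:
        raise Paused(state["case_id"])
    return {**state, "approved_by": approvals[state["case_id"]]}


STEPS = [("draft", draft_decision), ("approve", await_approval)]
```

The first call to `run_workflow(STEPS, {"case_id": "c1"}, store)` raises `Paused` after saving the draft. Once an approver is recorded in `approvals`, calling it again with the same `store` skips the draft step and finishes, so the expensive model call is not repeated.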

Multimodal now: where Sonic and Omni help

Nova 2 Sonic brings real‑time voice to customer support and field ops, with better speech understanding under noise and polyglot voices that keep the same persona across languages. Nova 2 Omni unifies modalities so you don’t glue together multiple models for transcription, analysis, and image generation. That reduction in complexity shows up as fewer moving parts to monitor and fewer handoffs that can fail.

A realistic cost and throughput checklist

Run this before any procurement meeting:

• Tokens first: define total tokens to process for training and an average tokens‑per‑second target per chip.

• Context policy: set maximum context window per use case and reject prompts that exceed it unless explicitly approved.

• Batch and stream: measure latency with batch sizes you’ll use in prod; prefer streaming for UX and cost smoothing.

• Guardrail budget: cap the number of retries or escalations per request; log violations with structured events.

• Golden set cadence: re‑score weekly; require regression protection before new checkpoints roll to prod.

• Rollback muscle: keep a known‑good Nova 2 Lite + retrieval path live and test it in rotation.
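
Two items on this checklist, the context policy and the guardrail budget, are easy to enforce in code. The sketch below uses hypothetical helper names; `generate` and `violates` are callables you would supply, and the JSON event shape is an assumption, not a logging standard.

```python
import json


def admit_prompt(prompt_tokens: int, max_context_tokens: int, approved: bool = False) -> bool:
    """Reject prompts over the per-use-case cap unless explicitly approved."""
    return approved or prompt_tokens <= max_context_tokens


def call_with_guardrail_budget(generate, violates, max_retries, emit=print):
    """Retry a generation up to `max_retries` times, logging each violation.

    `violates(output)` returns a reason string, or None when the output is clean.
    Returns the first clean output, or None so the caller can escalate.
    """
    for attempt in range(max_retries + 1):
        output = generate()
        reason = violates(output)
        if reason is None:
            return output
        # One structured event per violation, ready to ship to a log pipeline.
        emit(json.dumps({"event": "guardrail_violation",
                         "attempt": attempt, "reason": reason}))
    return None
```

Returning `None` instead of raising keeps the escalation decision with the caller, which is usually where the human‑in‑the‑loop queue lives.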

Agentic development: process, not magic

Agents stop being toys when they have memory, policy, evaluation, and observability wired from day one. If you’re new to structured agent builds, borrow a plan and stick to it. Our Amazon Bedrock AgentCore 90‑day plan is a practical track that pairs nicely with Forge; use it to codify policies, evaluation, and memory strategies while your model training progresses.

When multicloud still makes sense

Not every team will keep all training and serving in one cloud. Some will co‑locate inference near data gravity or comply with regional rules. That’s fine—just invest in clean network links and a consistent identity story. We’ve covered the practicalities of AWS–Google connectivity and what to do immediately when networking changes ripple through your estate. For a hands‑on runbook, see our AWS–Google multicloud networking guide.

What to do next

• This week: pick one workflow and one metric; stand up a Nova 2 Lite + retrieval baseline; draft your safety guardrails and publish them internally.

• Weeks 3–4: build the evaluation harness and gold sets; define a first reward function that balances correctness, citation, and escalation.

• Weeks 5–6: start your AWS Nova Forge run from an early checkpoint. Compare cost, latency, and accuracy to your baseline.

• Weeks 7–8: ship behind a flag with Lambda durable functions orchestrating tool calls and approvals.

• Quarter’s end: decide go/no‑go based on error budgets and ROI; if go, fund Trainium time for the next scope and lock in operational SLOs.

If you want a second opinion on scoping, guardrails, or rollout math, our team does this work for product and platform orgs every week. Start with what we do, browse a few examples in our portfolio, or just reach out. We’ll help you ship something real, not just a keynote demo.
