YC S26 · Pre-tape-out · San Francisco

Self-adapting silicon for AI.

Modern accelerators run agent workloads at <30% utilization. Silmir builds a flexible matrix of compute, memory, and interconnect blocks that reorganizes itself around the workload at runtime.

Domain: Adaptive Silicon · Inference
Scope: Block · Chip · System
Status: FPGA prototype in development
FIG.01 Adaptive compute / memory / interconnect matrix · live power reallocation
<30%
Accelerator utilization on agent workloads.
10–100×
Per-device work spread inside one inference step.
1000+
Node-scale deployments studied to surface the bottleneck.
// 01 · Thesis

The bottleneck isn't compute. It's the loop.

Agent loops swing between memory-bound calls, I/O-bound tool use, and compute-bound orchestration. They branch and backtrack dozens of times per task.

Accelerators built for batch training leave most of the silicon idle. At cluster scale the imbalance compounds. The network between nodes becomes the dominant bottleneck.

Any architecture that does not treat network and memory as first-class, dynamically allocated resources will hit a hard ceiling.

A ~10–100× spread in per-device work inside a single inference step: the busiest device does one to two orders of magnitude more than the idlest.

FIG.02 Per-GPU utilization · single inference step · 16-device node
// 02 · Architecture

A learning matrix, not a fixed pipeline.

01

Fine-grained allocation

Power moves at the block level. Idle blocks yield budget to bottlenecked ones in real time.

02

Global visibility

Every block sees every other block's runtime state. The allocator decides with full information.

03

Online-learned policy

Allocation is learned, not hand-coded. The policy adapts as workload patterns shift.

04

Anomaly substrate

A safety layer catches bad policy calls before they propagate, making learning at the silicon level safe to ship.
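Taken together, 01–04 describe a gated control loop: telemetry in, a proposed power map out, a commit only if the safety layer clears it. Below is a minimal Python sketch of that loop; every name here (BlockTelemetry, propose_budgets, anomaly_gate) is an illustrative stand-in, not a Silmir API.

```python
from dataclasses import dataclass

@dataclass
class BlockTelemetry:
    block_id: str        # e.g. "ALU0", "MEM2", "NET1"
    utilization: float   # 0.0-1.0 over the last sampling window
    stall_cycles: int    # cycles spent waiting on memory or network
    power_mw: float      # measured draw

def propose_budgets(telemetry, total_mw):
    """Stand-in for the learned policy: shift power toward busy or
    stalled blocks. A real policy would be a trained model over the
    same inputs; this heuristic only illustrates the interface."""
    demand = {t.block_id: t.utilization + (0.5 if t.stall_cycles else 0.0)
              for t in telemetry}
    total = sum(demand.values()) or 1.0
    return {b: total_mw * d / total for b, d in demand.items()}

def anomaly_gate(current, proposed, max_step=0.25):
    """Safety layer: veto any proposal that moves a block's budget by
    more than max_step of its current value in a single tick."""
    for block, new_mw in proposed.items():
        old_mw = current.get(block, new_mw)
        if old_mw and abs(new_mw - old_mw) / old_mw > max_step:
            return current             # reject: keep the last safe map
    return proposed                    # commit

blocks = [BlockTelemetry("ALU0", 0.95, 1200, 900.0),
          BlockTelemetry("MEM0", 0.40, 0, 600.0),
          BlockTelemetry("NET0", 0.10, 0, 300.0)]
current = {"ALU0": 600.0, "MEM0": 600.0, "NET0": 600.0}
committed = anomaly_gate(current, propose_budgets(blocks, 1800.0))
# The large jump proposed for ALU0 trips the gate, so the last safe
# allocation is kept for this tick.
```

The heuristic in propose_budgets stands in for the learned policy; swapping in a trained model changes the proposal, not the gate.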

FIG.03 Stack: workload signals → learned allocator → adaptive matrix → physical substrate
// 03 · Differentiator

Adaptation, end to end.

Block level

ALU, CPU, and memory blocks adjust voltage, clock, and routing per cycle. Adaptation begins below the core boundary.

Chip level

A learned controller redistributes power across the full block matrix. Compute, memory, and interconnect compete in one budget.

System level

Heterogeneous big-little arrays reorganize execution across nodes per inference loop. The whole system adapts as one.

Fixed-function accelerators win at one pattern. General-purpose chips spread thin across all of them. Neither adapts at runtime. Silmir treats adaptation as the architecture — from the block to the cluster.
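One way to picture the three scopes is as nested control loops with widening time constants. The tick values and callback names below are assumptions for illustration, not measured Silmir parameters.

```python
BLOCK_TICK = 1             # cycles: voltage / clock / routing nudges
CHIP_TICK = 10_000         # cycles: power redistribution across the matrix
SYSTEM_TICK = 1_000_000    # cycles: re-mapping work across big-little nodes

def adapt(cycle, block_ctl, chip_ctl, system_ctl):
    """Run each controller at its own cadence within one cycle count."""
    if cycle % BLOCK_TICK == 0:
        block_ctl(cycle)       # below the core boundary, every cycle
    if cycle % CHIP_TICK == 0:
        chip_ctl(cycle)        # learned reallocation across the matrix
    if cycle % SYSTEM_TICK == 0:
        system_ctl(cycle)      # cross-node reorganization per inference loop
```

The point is the nesting: the inner loop never waits on the outer ones.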

// 04 · Stack

Open substrate. Custom silicon next.

Hardware

  • Blocks: ALU · CPU · MEM · NET
  • HDL: RTL · HLS
  • Substrate: Open-source RISC-V
  • Sim: Cycle-accurate · FPGA
  • Path: Shuttle → full-mask ASIC

Runtime

  • Core: Rust · Python
  • Targets: CPU · GPU · NPU clusters
  • Compiler: MLIR-based IR
  • Phase 1: Scheduling on GPU clusters
  • Phase 2: FPGA prototype

Policy

  • Models: GBT · light transformers
  • Inputs: Per-block runtime telemetry
  • Outputs: Power · clock · route maps
  • Safety: Anomaly-gated commit
  • Loop: Online learning, on-die
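As a hedged sketch of how these rows could compose in the Phase 1 runtime: train gradient-boosted trees (scikit-learn's GradientBoostingRegressor here) to predict step latency from (telemetry, candidate allocation) pairs, then commit the candidate with the best prediction. The data below is synthetic and the feature layout is an assumption; none of this is Silmir's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic log: 8 telemetry features + 4 allocation knobs -> latency.
# A real log would come from per-block counters on a GPU cluster.
X = rng.random((2048, 12))
y = (1.0 + X[:, :8].sum(axis=1)           # load raises latency
     - 0.5 * X[:, 8:].sum(axis=1)         # good allocations lower it
     + 0.05 * rng.standard_normal(2048))  # measurement noise

latency_model = GradientBoostingRegressor().fit(X, y)

def choose_allocation(telemetry, candidates):
    """Score candidate power/clock/route encodings against the current
    telemetry snapshot and return the one with the lowest predicted
    step latency."""
    rows = np.array([np.concatenate([telemetry, c]) for c in candidates])
    return candidates[int(np.argmin(latency_model.predict(rows)))]

snapshot = rng.random(8)                          # current per-block telemetry
candidates = [rng.random(4) for _ in range(16)]   # candidate knob maps
best = choose_allocation(snapshot, candidates)    # commit after the anomaly gate
```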
// 05 · Roadmap

Software first. Silicon when the policy is proven.

  1. T+0

    Runtime layer on GPU clusters

Show that learned allocation beats static partitioning on real agent workloads.

  2. T+6

    FPGA prototype

    Adaptation layer in silicon. Live demo under shifting workload.

  3. T+18

    Shuttle tape-out

    Adaptive subsystem on a shuttle run. Validate physical-level primitives.

  4. T+36

    Full-mask tape-out

    First adaptive ASIC for inference. Hyperscaler design wins.

// 06 · Contact

Building, hiring, talking.

Inference at scale. Adaptive systems. RTL. Agent workloads. Get in touch.