
I Tried to Put an AI Inside Trading Hardware (Part 1: The Problem)

What This Is

For my Financial & Risk Engineering capstone at NYU Tandon, I built a prototype framework for classifying market behavior in real time using an LSTM neural network deployed on an FPGA.

The short version: markets move fast, bad actors exploit that speed, and if you can detect their patterns early enough—in microseconds—you might be able to do something about it. FPGAs are one of the few hardware platforms where “microseconds” is a realistic target.

This is Part 1. It covers the problem, the finance context, and why this is hard. Part 2 covers what I actually built, the results, and where it broke down.


Why Speed Is Everything in HFT

In high-frequency trading, microseconds determine who wins.

You’re not competing on insight. You’re competing on reaction time. If your system is 10 microseconds slower than a competitor’s, you will consistently trade at worse prices, fill orders after the market has moved, and systematically lose money to faster participants. This is called adverse selection.

Traditional CPU- and GPU-based architectures have a fundamental problem here: they’re unpredictable. Operating system overhead, context switching, cache misses, network jitter—all of these introduce variability. You can optimize software down to low milliseconds. Getting consistently into the single-digit microsecond range, with deterministic timing, is a different problem entirely.

FPGAs solve this differently. Instead of running code on a general-purpose processor, you program the hardware itself. There’s no OS. No scheduler. The logic is the circuit. The result is deterministic, pipelined processing that can operate at wire speed.

Prior work has demonstrated LSTM inference on FPGAs exceeding 13 GOP/s and latency reductions of over 250x compared to equivalent CPU implementations. That’s the gap we’re working in.


The Limit Order Book

Before getting into the hardware, it helps to understand what we’re actually analyzing.

At the center of every electronic market is the limit order book (LOB)—a real-time ledger of all outstanding buy and sell orders, organized by price and time priority. It updates continuously, on millisecond to microsecond timescales.

There are three standard levels of detail:

  • Level 1 (L1): Best bid and best ask only. One-dimensional.
  • Level 2 (L2): Multiple price levels on each side. You can see the “depth” of the market.
  • Level 3 (L3): Individual orders at each price level, in queue order. Full visibility into who is waiting for what.

A snapshot of the L2 book looks something like this:

| Bid Level | Bid Price | Bid Volume | Ask Price | Ask Volume | Ask Level |
|-----------|-----------|------------|-----------|------------|-----------|
| 1         | 100.05    | 500        | 100.10    | 350        | 1         |
| 2         | 100.00    | 750        | 100.15    | 600        | 2         |
| 3         | 99.95     | 1200       | 100.20    | 450        | 3         |
| 4         | 99.90     | 850        | 100.25    | 300        | 4         |
| 5         | 99.85     | 600        | 100.30    | 550        | 5         |

The question my model tries to answer: given the last several snapshots of this book, which direction does the mid-price move in the next five steps?
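To make the prediction target concrete, here is a minimal sketch in Python (illustrative only, not the capstone code): compute the mid-price from the top of the book and label its direction five snapshots ahead. The five-step horizon comes from the post; the function names and sample numbers are my own.

```python
HORIZON = 5  # predict direction over the next five snapshots (per the post)

def mid_price(best_bid: float, best_ask: float) -> float:
    """Midpoint of the best bid and best ask."""
    return (best_bid + best_ask) / 2.0

def label_direction(mids: list[float], t: int, horizon: int = HORIZON) -> int:
    """+1 if the mid-price is higher `horizon` steps ahead, -1 if lower, 0 if flat."""
    future, now = mids[t + horizon], mids[t]
    if future > now:
        return 1
    if future < now:
        return -1
    return 0

# Top-of-book values start from the sample snapshot above (100.05 / 100.10);
# the later snapshots are made up to show a rising mid.
snapshots = [(100.05, 100.10), (100.06, 100.11), (100.07, 100.12),
             (100.07, 100.11), (100.08, 100.13), (100.09, 100.14)]
mids = [mid_price(b, a) for b, a in snapshots]
print(round(mids[0], 3))          # 100.075
print(label_direction(mids, 0))   # 1: mid-price rises over the next 5 steps
```

The real model sees the full L2 depth (prices and volumes at every level), not just the top of book, but the label it learns is exactly this kind of direction flag.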


The Adversarial Problem

Standard indicators—moving averages, volume spikes, bid-ask spread—struggle in modern markets because many microstructure patterns are adversarial by design.

Three examples:

Spoofing: A large order appears on one side of the book, inducing other participants to react. The spoofer then cancels before execution. The apparent supply or demand was never real.

Liquidity sweeps: Aggressive orders rapidly consume multiple price levels on one side of the book. Often precedes a directional move.

Momentum ignition: Coordinated directional pressure that triggers other participants’ momentum strategies, then reverses once enough people have piled in.

These patterns are subtle, transient, and designed to be hard to detect. Rule-based systems with static thresholds can’t keep up. A model that learns from sequential order book data directly—and can run in hardware—is theoretically better suited.
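To illustrate why static thresholds fall short, here is a deliberately naive rule-based spoofing flag (a hypothetical sketch, not anything from the project; every threshold is an arbitrary placeholder): flag a resting order that is unusually large, short-lived, and cancelled with almost nothing filled.

```python
def looks_like_spoof(order_size: int, avg_level_size: float,
                     lifetime_ms: float, executed_qty: int) -> bool:
    """Static-threshold heuristic: large, fleeting, and essentially unfilled."""
    oversized = order_size > 5 * avg_level_size   # arbitrary multiplier
    fleeting  = lifetime_ms < 50.0                # arbitrary cutoff
    unfilled  = executed_qty < 0.05 * order_size  # <5% executed
    return oversized and fleeting and unfilled

# A 10,000-share order resting 20 ms and cancelled with zero fills is caught:
print(looks_like_spoof(10_000, 600.0, 20.0, 0))   # True
# A spoofer who sizes just under the multiplier slips through untouched:
print(looks_like_spoof(2_500, 600.0, 20.0, 0))    # False
```

The second call is the whole problem: once the thresholds are known, evading them is trivial. A sequence model that learns the temporal signature of the behavior, rather than fixed cutoffs, is harder to game.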


Adverse Selection Has a Price Tag

This isn’t abstract. The costs are measurable.

Slower participants who trade against faster, better-informed counterparties typically incur 1–3 basis points of slippage per trade. For a desk executing $100M daily, that’s $10,000–$30,000 per day—roughly $2.5–$7.5M annually.
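The arithmetic behind those figures, spelled out (1 basis point = 0.01% of notional; ~250 U.S. trading days per year is my assumption):

```python
daily_notional = 100_000_000  # $100M executed per day
bps = 1e-4                    # one basis point as a fraction
trading_days = 250            # assumed trading days per year

for slippage_bps in (1, 3):
    daily_cost = daily_notional * slippage_bps * bps
    print(f"{slippage_bps} bp: ${daily_cost:,.0f}/day, "
          f"${daily_cost * trading_days / 1e6:.1f}M/year")
# 1 bp: $10,000/day, $2.5M/year
# 3 bp: $30,000/day, $7.5M/year
```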

Budish et al. estimated that latency arbitrage generates around $1 billion annually in U.S. equities alone. Tabb Group pegs each microsecond of latency reduction in HFT at approximately $100,000/year in trading advantage.

The financial incentive for hardware-accelerated pattern recognition is enormous.


Why Not Just Use a GPU?

GPUs are great for training models. They’re not great for this specific deployment problem.

| Platform | Benefits                                      | Drawbacks                                                     |
|----------|-----------------------------------------------|---------------------------------------------------------------|
| FPGA     | Low latency, deterministic, reconfigurable    | Complex dev cycle, resource constraints                       |
| ASIC     | Lowest latency, power-efficient               | Very high cost, zero reconfigurability                        |
| CPU      | Easy to program, widely available             | Higher latency, limited parallelism for matrix multiplication |
| GPU      | Good for parallel/matrix operations           | High power, high inference latency                            |

GPUs are optimized for batch processing—running many operations in parallel across large datasets. What we need is single-event, microsecond-level inference on a continuous stream of order book snapshots. That’s not what GPUs are designed for.

FPGAs let you build a custom pipeline that matches the exact computation you need. Less flexibility, more speed.


What I Set Out to Build

The goal: embed a trained LSTM directly into FPGA hardware so it can classify order book behavior in real time—acting as a “reflex in silicon” that flags microstructure events before they materially impact price.

The specific targets:

  • Inference latency under 3 microseconds
  • Throughput above 1 million frames per second
  • Run on a Xilinx Zynq-7020 FPGA using the FINN compiler framework
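A back-of-envelope check on how those two targets interact (the 100 MHz fabric clock is an assumption for illustration, not a measured figure from the project):

```python
clock_hz = 100e6          # assumed Zynq-7020 fabric clock, for illustration
target_fps = 1e6          # throughput target: >1M frames/second
target_latency_s = 3e-6   # latency target: <3 microseconds

# Throughput of 1M frames/s means accepting a new frame every 1 us --
# at 100 MHz, an initiation interval of at most 100 clock cycles.
max_ii_cycles = clock_hz / target_fps
print(max_ii_cycles)      # 100.0

# 3 us latency with a new frame every 1 us implies roughly 3 frames
# overlapped in the pipeline at any moment -- hence "pipelined" matters.
frames_in_flight = target_latency_s * target_fps
print(round(frames_in_flight))   # 3
```

In other words, the targets are only compatible if the design is deeply pipelined: a non-pipelined block that took the full 3 µs per frame would cap throughput at ~333k frames/second.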

Part 2 covers the simulator I built to generate training data, the model architecture, the FPGA compilation pipeline, the actual results, and—honestly—where the implementation hit walls.


This is Part 1 of 2. Read Part 2 here.

This post is licensed under CC BY 4.0 by the author.