HPC & Custom Hardware

Engineered for the most demanding workloads in AI training, inference, and high-frequency trading. Modulus provides custom silicon, accelerator-class compute, and HPC software tuned to the architecture.

Modulus builds high-performance systems for capital markets, AI, and scientific computing. Our work has powered platforms for JPMorgan Chase, Bank of America, UBS, Charles Schwab, and Nasdaq, alongside organizations in space exploration, healthcare, national security, and defense.

Most problems can be solved at the software layer, in C, C++, or accelerated GPU code targeting CUDA, ROCm, or oneAPI. When those approaches are not fast enough, our hardware and HPC engineers design custom silicon. From high-frequency trading to deep-learning AI, they deliver high-performance systems built on the latest hardware.

High-performance computing and custom hardware

FPGA & ASIC engineering

Our flagship machine learning platform pairs server-class CPUs with FPGA accelerators on AMD Versal and Intel Agilex devices, attached over a coherent fabric using OpenCAPI and CXL 3.x. We design accelerator kernels in VHDL and with modern high-level synthesis flows such as Vitis HLS and Catapult HLS, validate them on the bench, and harden them for production. The platform carries terabytes of DDR5 memory, hosts in-house deep-learning algorithms, and completes inference in under 30 nanoseconds, up to 250 times faster than highly optimized C. Multi-port optical interconnects deliver 140 Gbps full-duplex throughput.

When FPGA performance is not enough, we engineer custom ASIC solutions at 5nm, 3nm, and 2nm process nodes. An ASIC requires a larger up-front investment of time and capital, but it can deliver up to ten times the throughput of an FPGA design. We also build training and inference systems on the latest commercial accelerator silicon, including NVIDIA Blackwell (B200, GB200, and GB300 NVL72), AMD Instinct MI300X, MI325X, and MI350, Intel Gaudi 3, Cerebras WSE-3, and Groq LPU.

  • FPGA accelerators on AMD Versal and Intel Agilex
  • VHDL alongside Vitis HLS and Catapult HLS flows
  • Custom ASIC design at 5nm, 3nm, and 2nm
  • Coherent accelerator attach via OpenCAPI and CXL 3.x
  • HBM3e on accelerator silicon, DDR5 on host
  • 140 Gbps full-duplex optical interconnect
  • Sub-30-nanosecond inference latency
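To put the figures above in proportion, the arithmetic can be sketched as follows. The constants come from the text; the function names are illustrative, the 140 Gbps figure is assumed to apply per direction, and the results are simple derived bounds, not measurements.

```python
# Back-of-the-envelope arithmetic on the figures quoted above.
# Function names are illustrative; nothing here is a measurement.

LATENCY_NS = 30       # "under 30 nanoseconds"
SPEEDUP_VS_C = 250    # "up to 250 times faster than highly optimized C"
LINK_GBPS = 140       # "140 Gbps full-duplex" (assumed to mean per direction)


def implied_c_baseline_us(latency_ns: float, speedup: float) -> float:
    """Software latency implied by the quoted speedup, in microseconds."""
    return latency_ns * speedup / 1000.0


def bytes_per_window(link_gbps: float, window_ns: float) -> float:
    """Bytes one direction of the link can carry during the window."""
    bits = link_gbps * window_ns  # Gbit/s * ns = bits
    return bits / 8.0


if __name__ == "__main__":
    print(implied_c_baseline_us(LATENCY_NS, SPEEDUP_VS_C))  # 7.5
    print(bytes_per_window(LINK_GBPS, LATENCY_NS))          # 525.0
```

Under these assumptions, the quoted speedup corresponds to a software baseline of at most 7.5 microseconds, and one direction of the link moves about 525 bytes within a 30-nanosecond window.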

Supercomputing software

Supercomputers tackle the hardest simulations, analytics, and AI workloads, from protein-folding prediction and climate modeling to trillion-parameter language model training. To deliver real performance, the software running on them must be tuned to the architecture in play.

Our engineers ship code for HPE Cray EX with Slingshot 11, IBM, Fujitsu, Dell, Eviden, and Lenovo systems, on the same architectures that power Frontier, El Capitan, and Aurora. We optimize across InfiniBand NDR and XDR, Ultra Ethernet, and RoCE v2 fabrics, integrate parallel filesystems including Lustre, DAOS, WekaFS, and VAST Data, and build pipelines on PyTorch, JAX, Megatron-LM, DeepSpeed, vLLM, and Ray.

Proven in mission operations: NASA

NASA Mission Operations needed to port high-performance desktop software, originally written in C, onto tablet devices capable of displaying real-time telemetry and health data streamed from the International Space Station. The target was demanding: half a billion data points per second of ISS health and telemetry, processed and rendered on hardware with limited compute. NASA evaluated numerous solutions before selecting Modulus to build the system.

Modulus designed and patented a method for compressing time-series data into a custom display format, optimized for the CPU's cache and instruction pipeline, and integrated it into our charting library. The tablet solution matched the throughput of the original desktop software while running on a fraction of the compute.

Patented time-series compression rendering hundreds of millions of data points
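The patented method itself is not described here, so the following is only a generic illustration of the underlying idea: a screen a few hundred pixels wide cannot show more detail than one minimum and one maximum per pixel column. This min/max decimation sketch is a common charting technique, not the Modulus algorithm, and the function name is hypothetical.

```python
# Min/max decimation: a generic way to shrink a long time series to the
# width of a plot. This is NOT the patented Modulus method, which the
# text does not describe; the function name is hypothetical.
from typing import List, Sequence, Tuple


def minmax_decimate(samples: Sequence[float], columns: int) -> List[Tuple[float, float]]:
    """Reduce samples to at most `columns` (min, max) pairs, one per pixel column."""
    if columns <= 0:
        raise ValueError("columns must be positive")
    n = len(samples)
    if n == 0:
        return []
    columns = min(columns, n)  # never create empty buckets
    out = []
    for c in range(columns):
        # Integer bucket boundaries cover every sample exactly once.
        start = c * n // columns
        stop = (c + 1) * n // columns
        bucket = samples[start:stop]
        out.append((min(bucket), max(bucket)))
    return out


if __name__ == "__main__":
    data = [0, 5, -3, 2, 9, 1, 4, 4]
    print(minmax_decimate(data, 4))  # [(0, 5), (-3, 2), (1, 9), (4, 4)]
```

Each column keeps its extremes, so spikes stay visible on screen however many raw samples fall into one pixel.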

Deep learning at the core

Modern AI runs on transformer architectures trained at trillion-parameter scale and served at production latency. Modulus engineers build, fine-tune, and deploy these systems across the full stack: GPU clusters connected by NVLink 5 and InfiniBand, distributed training with Megatron-LM and DeepSpeed, and high-throughput inference through vLLM, SGLang, and Triton Inference Server.

Modulus has been immersed in machine learning and high-performance computing for more than three decades. We have contributed to the deep-learning foundations in use today by industry and research pioneers, and we know how to put them to work in practical, high-value ways.

HFT, DMA & quant systems

We build trading systems that combine multidimensional nonlinear modeling, deep neural networks, kernel regression, dynamic programming, genetic algorithms, and other compute-intensive methods to produce dynamic strategies for equities, futures, options, forex, bonds, and digital assets.
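As one concrete instance of the methods named above, kernel regression can be sketched in a few lines. This is a textbook Nadaraya-Watson estimator with a Gaussian kernel, shown for illustration only; the function name and parameters are assumptions and do not represent any Modulus trading model.

```python
# Textbook Nadaraya-Watson kernel regression with a Gaussian kernel.
# Illustrative only; names and parameters are hypothetical.
import math
from typing import Sequence


def kernel_regress(xs: Sequence[float], ys: Sequence[float],
                   x0: float, bandwidth: float) -> float:
    """Estimate y at x0 as a kernel-weighted average of the observed ys."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and the same length")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # Observations near x0 get weights close to 1; distant ones decay to 0.
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all weights underflowed; increase the bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total


if __name__ == "__main__":
    # Midway between two observations, both weights are equal.
    print(kernel_regress([0.0, 2.0], [0.0, 10.0], x0=1.0, bandwidth=1.0))
```

The bandwidth controls the trade-off between smoothness and responsiveness: a small value tracks nearby observations closely, while a large value averages over more of the history.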

Across the buy side and sell side, our engineers have discreetly built systems for some of the largest hedge funds in the industry, covering high-frequency data acquisition, algorithmic execution, smart order routing, and market making. We also design ultra-low-latency networks using DPUs such as NVIDIA BlueField-3 and AMD Pensando, FPGA NICs from the AMD Alveo X3 series, and kernel-bypass transports including DPDK and RDMA over Converged Ethernet. End-to-end network latency reaches as low as 20 nanoseconds.

  • Turnkey broker-dealer platforms with source code
  • Matching engines for equities, futures, and forex
  • Direct connections to major exchanges including CME and Nasdaq
  • WebSocket data broadcasting for web and mobile apps
  • In-memory (RAM) databases and memory-mapped file systems
  • FIX and FAST protocol development
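For readers unfamiliar with FIX, the sketch below shows how a classic tag=value message is framed: BodyLength (tag 9) counts the bytes between its own delimiter and the CheckSum field, and CheckSum (tag 10) is the byte sum modulo 256, written as three digits. The helper names and the sample field values are hypothetical, and a production engine would also handle sessions, sequence numbers, and validation that are omitted here.

```python
# Minimal FIX tag=value framing: BodyLength (9) and CheckSum (10).
# Helper names and sample values are hypothetical; session handling,
# sequence numbers, and field validation are deliberately omitted.
from typing import List, Tuple

SOH = "\x01"  # FIX field delimiter


def fix_checksum(data: bytes) -> str:
    """Sum of all bytes modulo 256, formatted as three digits."""
    return f"{sum(data) % 256:03d}"


def build_fix_message(begin_string: str, fields: List[Tuple[int, str]]) -> bytes:
    """Frame body fields (starting with tag 35) into a complete message."""
    body = "".join(f"{tag}={value}{SOH}" for tag, value in fields).encode("ascii")
    header = f"8={begin_string}{SOH}9={len(body)}{SOH}".encode("ascii")
    msg = header + body
    return msg + f"10={fix_checksum(msg)}{SOH}".encode("ascii")


def has_valid_checksum(msg: bytes) -> bool:
    """Check the trailer, which is always the 7 bytes '10=NNN<SOH>'."""
    if len(msg) < 7 or not msg.endswith(b"\x01"):
        return False
    trailer = msg[-7:]
    if not trailer.startswith(b"10="):
        return False
    return trailer[3:6] == fix_checksum(msg[:-7]).encode("ascii")


if __name__ == "__main__":
    heartbeat = build_fix_message(
        "FIX.4.4", [(35, "0"), (49, "SENDER"), (56, "TARGET"), (34, "1")]
    )
    print(heartbeat.replace(b"\x01", b"|").decode("ascii"))
    print(has_valid_checksum(heartbeat))
```

Building the body first and deriving the header from its length keeps BodyLength and CheckSum consistent by construction.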

What we engineer

From custom silicon to supercomputing software, our hardware and HPC engineers cover the full stack of ultra-high-performance computing.

FPGA acceleration

Deep-learning accelerators on AMD Versal and Intel Agilex, designed in VHDL and Vitis HLS, delivering sub-30-nanosecond inference for the most latency-sensitive workloads.

Custom ASIC design

Application-specific silicon at 5nm, 3nm, and 2nm process nodes for cases where FPGA performance is not enough, delivering up to ten times the throughput of an equivalent FPGA design.

Accelerator-class systems

Training and inference platforms built on NVIDIA Blackwell, AMD Instinct MI300X and MI350, Intel Gaudi 3, Cerebras WSE-3, and Groq LPU, with NVLink, NVSwitch, and InfiniBand fabric.

Supercomputing software

Architecture-tuned software for HPE Cray EX, IBM, Fujitsu, Dell, Eviden, and Lenovo systems, with parallel filesystems including Lustre, DAOS, WekaFS, and VAST Data.

Time-series compression

Patented compression that renders hundreds of millions of data points at extreme speed, proven on NASA tablet telemetry streamed from the ISS.

Ultra-low-latency networks

Twenty-nanosecond fabric built with DPUs, FPGA NICs, and RDMA over Converged Ethernet, engineered for high-frequency trading and tightly coupled AI training.

Platforms & tooling

The hardware platforms, frameworks, and protocols our engineers work in every day.

Compute & silicon

  • NVIDIA Blackwell B200, GB200, GB300 NVL72
  • NVIDIA Hopper H100 and H200
  • AMD Instinct MI300X, MI325X, MI350
  • Intel Gaudi 3, Cerebras WSE-3, Groq LPU
  • AMD Versal and Intel Agilex FPGAs
  • Custom ASIC at 5nm, 3nm, and 2nm

Fabric & I/O

  • NVLink 5 and NVSwitch for GPU-to-GPU
  • InfiniBand NDR (400G) and XDR (800G)
  • Ultra Ethernet and HPE Slingshot 11
  • OpenCAPI and CXL 3.x coherent attach
  • BlueField-3 DPUs and AMD Pensando
  • GPUDirect RDMA, RoCE v2, and DPDK

Software & ML stack

  • PyTorch, JAX, and TensorFlow
  • Megatron-LM and DeepSpeed for distributed training
  • vLLM, SGLang, and Triton Inference Server
  • Ray, MLflow, and Kubeflow
  • Lustre, DAOS, WekaFS, and VAST Data
  • Apache Kafka and Spark for data pipelines

The engine for real-time AI truth

Modulus real-time AI grounds large language models in live, verified HPC data, delivering ultra-low-latency responses with up-to-date accuracy while virtually eliminating hallucinations. The same engineering behind our custom hardware powers natural-language queries against high-velocity data streams for finance, defense, healthcare, cybersecurity, and logistics.

Explore Modulus Real-Time AI

Technology we use

FPGA, ASIC, VHDL, Vitis HLS, CUDA, ROCm, oneAPI, C, C++, Rust, PyTorch, JAX, TensorFlow, vLLM, Ray, InfiniBand, NVLink, CXL, RDMA, BlueField-3

Let's build.

Request an instant meeting or schedule a call with our hardware team to discuss your custom hardware project.