# Fast ML for Science

Nhan Tran, Fermilab

AI4EIC 2025

### https://indico.cern.ch/e/fastml2025



# Fast ML for Science at its core

"Scientific discoveries come from groundbreaking ideas and the capability to validate those ideas by testing nature at new scales - finer and more precise temporal and spatial resolution. This is leading to an explosion of data that must be interpreted, and ML is proving a powerful approach. The more efficiently we can test our hypotheses, the faster we can achieve discovery. To fully unleash the power of ML and accelerate discoveries, it is necessary to embed it into our scientific process, into our instruments and detectors."

Applications and Techniques for Fast Machine Learning in Science

https://doi.org/10.3389/fdata.2022.787421

# Reflections - a retrospective

- Essence of community:
  - Sharing open-source tools and techniques, and enabling collaboration that encourages cross-disciplinary research
  - User- and application-driven focus to enable science and technology research
- It can be jarring, but it can build new and powerful synapses!


### The Power of the Open Ecosystem: Recent Examples






## Edge of Tomorrow today:

AI at the speed of nature

AI-powered embedded devices that deliver ultra-efficient inference in extreme environments for real-time intelligent sensing and control.



The dream: self-driving experiments that are ultra-fast, precise, and robust

# What are the limitations of AI / DL <del>(in robotics)</del>?

- One task at a time
- Need a much more multi-modal approach (not just images and text)
  - Different sampling rates
  - Different dimensionalities
  - Different reliability
  - Different (changing?) geometrical arrangement
- Orchestrating different tasks is a challenge
- Data hungry
  - Physical world is fundamentally complex and dynamic
- High latency
  - Control loop might need <10 ms latency and confidence
- Reliability: accuracy and robustness











## Fast control

- Immediate response to dynamics of the experiment and data readout
- Event timing, triggering, etc.

## Slow control

- Detector stability over minutes, days, weeks, months,...
- Monitoring and controlling operational parameters: electronics gains, pedestals, calibrations, etc.







### What is real-time???



# Towards the dream






Decisions






# One example with hls4ml

# Efficient tools for efficient, embedded AI





PRESS RELEASE

Siemens simplifies development of AI accelerators for advanced system-on-chip designs with Catapult AI NN

May 21, 2024, Plano, Texas



Catapult AI NN brings together hls4ml, an open-source package for machine learning hardware acceleration, and Siemens' Catapult™ HLS software for High-Level Synthesis. Developed in close collaboration with Fermilab, a U.S. Department of Energy Laboratory, and other leading contributors to hls4ml, Catapult AI NN addresses the unique requirements of machine learning accelerator design for power, performance, and area on custom silicon.

- Catapult AI NN offers software engineers a comprehensive solution to synthesize AI Neural Nets
- Enables software development teams to seamlessly translate AI models designed in Python into silicon-based implementations, facilitating faster and more power-efficient execution compared to standard processors















Diagram labels:

- Physics requirements
- Data representation → ML architecture
- Neural architecture search / hyperparameter optimization
- Latency? Pipeline interval?
- Resources? Area/power? Radiation? Cryo?
- Quantize network
- Intermediate (quantized) representations
- Pruning/sparsity?
- Microarchitecture
- Parallelization
- Synthesize, validate design, satisfy design rules/timing

Prompt: can you help me design a 4-layer MLP using QKeras that can be translated to AMD FPGAs using hls4ml?

### 2. Design Your Quantized MLP Using QKeras

Here's how you can define a 4-layer MLP with quantized weights and activations:

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from qkeras import QDense, QActivation
from qkeras.quantizers import quantized_bits, quantized_relu

# Define input and output sizes
input_size = 32   # Replace with your actual input feature size
output_size = 10  # Replace with your actual number of classes

# Define quantizers
weight_quantizer = quantized_bits(8, 0, alpha=1)  # 8-bit quantization for weights
activation_quantizer = quantized_relu(8, 0)       # 8-bit quantization for activations

# Build the model (the slide is cut off after the first layer)
model = Sequential([
    QDense(64, input_shape=(input_size,),
           kernel_quantizer=weight_quantizer,
           bias_quantizer=weight_quantizer,
           name='fc1'),
    QActivation(activation_quantizer, name='act1'),
    # ... remaining layers are not shown in the source
])
```
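To make the 8-bit quantizers more concrete, here is a minimal pure-Python sketch of signed fixed-point rounding with 8 total bits and 0 integer bits. It only illustrates the idea and is not the QKeras implementation; the function name `fixed_point_quantize` and its rounding and saturation choices are assumptions made for this example.

```python
def fixed_point_quantize(x, bits=8, integer=0):
    """Round x onto a signed fixed-point grid (illustrative, not QKeras itself).

    One bit is the sign, `integer` bits sit left of the binary point,
    and the remaining bits are fractional.
    """
    frac_bits = bits - 1 - integer
    step = 2.0 ** -frac_bits      # spacing between representable values
    lo = -(2.0 ** integer)        # most negative representable value
    hi = 2.0 ** integer - step    # most positive representable value
    q = round(x / step) * step    # snap to the nearest grid point
    return min(max(q, lo), hi)    # saturate to the representable range


weights = [0.3, -0.7512, 1.5, -2.0]
quantized = [fixed_point_quantize(w) for w in weights]
print(quantized)
```

Values inside the range move to the nearest multiple of 2^-7, while values outside it saturate at the range limits. That saturation is one reason the choice of integer bits matters when porting a network to fixed-point hardware.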

### 3. Train Your Model

### 4. Prepare for Conversion with hls4ml

### 5. Convert the Model Using hls4ml

Specify the FPGA part number corresponding to your AMD FPGA. Since AMD acquired Xilinx, you can use Xilinx part numbers.

```python
import hls4ml

# `config` is the hls4ml configuration prepared in step 4 (not shown here)
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='hls4ml_prj',
    part='xcvu9p-flgb2104-2-i'  # Replace with your specific FPGA part number
)
```

### 6. Compile and Build the HLS Model

### 7. Validate the HLS Model

Compare the outputs of the original Keras model and the HLS model to ensure consistency.

```python
import numpy as np

# Generate some test data
X_test = np.random.rand(100, input_size)
```
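The comparison described in step 7 could look like the sketch below. To stay runnable without TensorFlow or hls4ml installed, it takes two plain lists of output vectors; in practice these would be the Keras and HLS model predictions on `X_test`. The helper name `compare_outputs`, the tolerance, and the toy numbers are all hypothetical.

```python
def compare_outputs(reference, candidate, atol=0.05):
    """Compare two lists of output vectors row by row.

    Returns (largest absolute difference,
             fraction of rows whose argmax agrees,
             whether the largest difference is within atol).
    """
    assert len(reference) == len(candidate), "row counts must match"
    max_abs_diff = 0.0
    agree = 0
    for ref_row, cand_row in zip(reference, candidate):
        diffs = [abs(r - c) for r, c in zip(ref_row, cand_row)]
        max_abs_diff = max(max_abs_diff, max(diffs))
        ref_class = max(range(len(ref_row)), key=ref_row.__getitem__)
        cand_class = max(range(len(cand_row)), key=cand_row.__getitem__)
        agree += ref_class == cand_class
    return max_abs_diff, agree / len(reference), max_abs_diff <= atol


# Toy stand-ins for floating-point and fixed-point model outputs
y_keras = [[0.10, 0.70, 0.20], [0.55, 0.25, 0.20]]
y_hls = [[0.125, 0.6875, 0.1875], [0.5625, 0.25, 0.1875]]
max_diff, class_agreement, within_tol = compare_outputs(y_keras, y_hls)
print(max_diff, class_agreement, within_tol)
```

Reporting both the raw difference and the class agreement is a deliberate choice: small numerical deviations from quantization are expected, so agreement on the predicted class is often the more useful check.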

### 8. Synthesize the Design Using Vivado HLS


# wa-hls4ml — an hls4ml surrogate model

### Traditional Codesign Workflow

### Proposed Codesign Workflow

Diagram labels:

- Physics requirements
- Data representation → ML architecture
- Neural architecture search / hyperparameter optimization
- Quantize network
- Latency? Pipeline interval?
- Resources? Area/power? Radiation? Cryo?
- Intermediate (quantized) representations
- Pruning/sparsity?
- Microarchitecture
- Parallelization
- Synthesize, validate design, satisfy design rules/timing

Surrogate Neural Architecture Co-design Package (SNAC-Pack) powered with wa-hls4ml



# Backends

Figure: hls4ml architecture (diagram labels). User API (input: model, initial config.; output: HLS project, report, firmware); frontends (keras\_to\_hls, pytorch\_to\_hls, onnx\_to\_hls); internal representation (IR): ModelGraph, layers, variables, types, attributes, optimizer and configuration flows; backends (Vivado/Vitis (+Acc), Quartus/oneAPI, MicroChip/Libero, Catapult) with HLS templates; Extension API (custom layer, custom template, custom opt., custom parser).

#### Evolution of the oneAPI backend for hls4ml

Jovan Mitrevski<sup>1</sup>, Harry Wang<sup>2</sup>, Suleyman Demirsoy<sup>2</sup>, Lauri Laatu<sup>3</sup>, Vladimir Lončar<sup>4</sup>, Enrico Lupi<sup>4</sup>, Paul White<sup>2</sup>

<sup>2</sup> Altera Corporation, USA; <sup>3</sup> Imperial College, UK; <sup>4</sup> European Organization for Nuclear Research (CERN), CH

- Motivation: In order to deploy neural networks (NNs) on Intel/Altera FPGAs, ...
- DMA Data Transfers: A key feature of the accelerator flow is the addition of DMA data transfer kernels to the NN kernel for efficient data processing.

#### Integrating Support for Google XLS in hls4ml

Andrei Gîrjoabă<sup>1</sup>, Benjamin Ramhorst<sup>1</sup>, Dimitrios Danopoulos<sup>2</sup>, and Vladimir Lončar<sup>2</sup>

<sup>1</sup> ETH Zürich; <sup>2</sup> European Organization for Nuclear Research (CERN)

- XLS (Accelerated Hardware Synthesis)
- Backend built in 3 months by 1 developer

#### Libero backend

https://github.com/vloncar/hls4ml/tree/libero backend

- 5. [V] Use the new expansion of hls4ml to convert the LHCb ML models.
- ... cover this new backend.

#### chisel4ml

- chisel4ml is a tool for generating highly parallel hardware ...
- The circuits generated by chisel4ml are implemented as hardware ...

# da4ml





Higher latency than other pure LUT-based methods, but similar LUT usage and can achieve higher accuracy.

| Implementation        | Accuracy | Latency [ns] | LUT    | DSP | FF     | $F_{\mathrm{max}}$ [MHz] | II [cc] |
|-----------------------|----------|--------------|--------|-----|--------|--------------------------|---------|
| HGQ+hls4ml+DA         | 76.9%    | 44.1         | 12,682 | 0   | 19,056 | 702                      | 1       |
| HGQ+da4ml (RTL)       | 76.5%    | 23.1         | 6,165  | 0   | 7,207  | 736                      | 1       |
| HGQ+hls4ml            | 76.9%    | 57.6         | 16,081 | 57  | 26,484 | 729                      | 1       |
| HGQ+hls4ml            | 76.5%    | 67.2         | 8,548  | 30  | 14,418 | 521                      | 1       |
| QKeras+hls4ml [1]     | 76.3%    | 105          | 5,504  | 175 | 3,036  | 143                      | 2       |
| DWN [2]               | 76.3%    | 14.4         | 6,302  | 0   | 4,128  | 695                      | 1       |
| MetaML-Pro [3]        | 76.1%    | 50           | 13,042 | 70  | N/A    | 200                      | 1       |
| NeuraLUT-Assemble [4] | 76.0%    | 2.1          | 1,780  | 0   | 540    | 940                      | 1       |
| TreeLUT [5]           | 75.6%    | 2.7          | 2,234  | 0   | 347    | 735                      | 1       |
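The II column (initiation interval, in clock cycles) sets how often a design can accept a new input. As a rough illustration, and assuming the design is actually clocked at its reported F_max (an assumption; the table does not state the operating clock), an upper bound on throughput is F_max / II. The helper name below is hypothetical and the numbers are copied from three rows of the table.

```python
def max_throughput_millions_per_s(fmax_mhz, ii_cycles):
    """Upper bound on throughput in millions of inferences per second.

    A pipelined design accepts one new input every `ii_cycles` clock cycles,
    so at a clock of `fmax_mhz` MHz it can start fmax_mhz / ii_cycles
    million inferences per second.
    """
    return fmax_mhz / ii_cycles


# (implementation, F_max [MHz], II [clock cycles]) taken from the table
rows = [
    ("HGQ+da4ml (RTL)", 736.0, 1),
    ("QKeras+hls4ml [1]", 143.0, 2),
    ("MetaML-Pro [3]", 200.0, 1),
]

for name, fmax, ii in rows:
    rate = max_throughput_millions_per_s(fmax, ii)
    print(f"{name}: up to {rate:.1f} M inferences/s")
```

Latency and throughput are separate figures of merit: latency is the time for one input to traverse the pipeline, while II and clock frequency determine how many inputs can be in flight.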

# KANs



Any multivariate continuous function on a bounded domain can be represented as a finite composition of continuous univariate functions and addition (Kolmogorov-Arnold representation theorem).

$$f(\mathbf{x}) = f(x_1, \dots, x_n)$$

$$= \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^n \phi_{q,p}(x_p) \right)$$

Where  $\phi_{q,p}:[0,1] \to \mathbb{R}$  and  $\Phi_q:\mathbb{R} \to \mathbb{R}$ 
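As a small worked instance of this form (not the construction used in the theorem's proof, and not a trained KAN), the product of two inputs can be written with two outer functions applied to sums of univariate inner functions, because x1 * x2 = ((x1 + x2)^2 - (x1 - x2)^2) / 4. The general statement allows up to 2n+1 outer terms; here two suffice and the rest can be taken as zero. All names in the sketch are illustrative.

```python
def ka_sum(x, inner, outer):
    """Evaluate sum_q outer[q]( sum_p inner[q][p](x[p]) )."""
    total = 0.0
    for Phi_q, phis_q in zip(outer, inner):
        total += Phi_q(sum(phi(x_p) for phi, x_p in zip(phis_q, x)))
    return total


# Inner univariate functions phi_{q,p}, one row per outer term q
inner = [
    [lambda t: t, lambda t: t],    # q = 1: x1 + x2
    [lambda t: t, lambda t: -t],   # q = 2: x1 - x2
]
# Outer univariate functions Phi_q
outer = [
    lambda u: u * u / 4.0,         # +(x1 + x2)^2 / 4
    lambda u: -u * u / 4.0,        # -(x1 - x2)^2 / 4
]

for x in [(0.25, 0.5), (1.0, 0.0), (0.5, 0.5)]:
    # The second and third printed values should match: f(x) and x1 * x2
    print(x, ka_sum(x, inner, outer), x[0] * x[1])
```

Every building block is a function of a single variable, which is what makes this structure attractive for lookup-table implementations on FPGAs.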







| Dataset     | Model                         | Accuracy (%) | LUT    | FF     | DSP | BRAM | F <sub>max</sub> (MHz) | Latency (ns) | Area×Delay (LUT×ns)  |
|-------------|-------------------------------|--------------|--------|--------|-----|------|------------------------|--------------|----------------------|
| MNIST       | KAN-Quantized-Pruned          | 96.1         | 1323   | 546    | 0   | 0    | 316                    | 9.6          | $1.27 \times 10^4$   |
|             | NeuraLUT-Assemble [4]         | 97.9         | 5070   | 725    | 0   | 0    | 863                    | 2.1          | $1.06 \times 10^4$   |
|             | TreeLUT [16]                  | 96.6         | 4478   | 597    | 0   | 0    | 791                    | 2.5          | $1.12 \times 10^{4}$ |
|             | DWN [5]                       | 97.8         | 2092   | 1757   | 0   | 0    | 873                    | 9.2          | $1.92 \times 10^{4}$ |
|             | PolyLUT-Add [20]              | 96.0         | 14810  | 2609   | 0   | 0    | 625                    | 10           | $1.48 \times 10^{5}$ |
|             | AmigoLUT-NeuraLUT [34]        | 95.5         | 16081  | 13292  | 0   | 0    | 925                    | 7.6          | $1.22 \times 10^{5}$ |
|             | NeuraLUT [3]                  | 96.0         | 54798  | 3757   | 0   | 0    | 431                    | 12           | $6.58 \times 10^{5}$ |
|             | PolyLUT [2]                   | 97.5         | 75131  | 4668   | 0   | 0    | 353                    | 17           | $1.38 \times 10^{6}$ |
|             | FINN [30]                     | 96.0         | 91131  | _      | 0   | 5    | 200                    | 310          | $2.82 \times 10^{7}$ |
|             | hls4ml (Ngadiuba et al.) [23] | 95.0         | 260092 | 165513 | 0   | 345  | 200                    | 190          | $4.94 \times 10^{7}$ |
| JSC CERNBox | KAN-Quantized-Pruned          | 73.3         | 1302   | 612    | 0   | 0    | 338                    | 8.9          | $1.15 \times 10^4$   |
|             | NeuraLUT-Assemble [4]         | 75.0         | 8539   | 1332   | 0   | 0    | 352                    | 5.7          | $4.87 \times 10^{4}$ |
|             | AmigoLUT-NeuraLUT [34]        | 74.4         | 42742  | 4717   | 0   | 0    | 520                    | 9.6          | $4.10 \times 10^{5}$ |
|             | PolyLUT-Add [20]              | 75.0         | 36484  | 1209   | 0   | 0    | 315                    | 16           | $5.84 \times 10^{5}$ |
|             | NeuraLUT [3]                  | 75.0         | 92357  | 4885   | 0   | 0    | 368                    | 14           | $1.29 \times 10^{6}$ |
|             | PolyLUT [2]                   | 75.1         | 246071 | 12384  | 0   | 0    | 203                    | 25           | $6.15 \times 10^{6}$ |
|             | LogicNets [29]                | 72.0         | 37931  | 810    | 0   | 0    | 427                    | 13           | $4.93 \times 10^{5}$ |
| JSC OpenML  | KAN-Quantized-Pruned          | 74.8         | 1235   | 623    | 0   | 0    | 315                    | 9.5          | $1.17 \times 10^4$   |
|             | NeuraLUT-Assemble [4]         | 76.0         | 1780   | 540    | 0   | 0    | 941                    | 2.1          | $3.92 \times 10^3$   |
|             | TreeLUT [16]                  | 75.6         | 2234   | 347    | 0   | 0    | 735                    | 2.7          | $6.03 \times 10^{3}$ |
|             | DWN [5]                       | 76.3         | 6302   | 4128   | 0   | 0    | 695                    | 14.4         | $9.07 \times 10^{4}$ |
|             | hls4ml (Fahim et al.) [9]     | 76.2         | 63251  | 4394   | 38  | 0    | 200                    | 45           | $2.85 \times 10^{6}$ |
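The Area×Delay column is the product of LUT count and latency. A quick check on a few MNIST rows, with values copied from the table (the helper name `area_delay` is hypothetical):

```python
def area_delay(luts, latency_ns):
    """Area-delay product in LUT x ns."""
    return luts * latency_ns


# (model, LUT, latency [ns]) for four MNIST rows of the table
mnist_rows = [
    ("KAN-Quantized-Pruned", 1323, 9.6),
    ("NeuraLUT-Assemble [4]", 5070, 2.1),
    ("TreeLUT [16]", 4478, 2.5),
    ("DWN [5]", 2092, 9.2),
]

# Sort from the smallest to the largest area-delay product
ranked = sorted(mnist_rows, key=lambda row: area_delay(row[1], row[2]))
for model, luts, latency in ranked:
    print(f"{model}: {area_delay(luts, latency):.3g} LUT x ns")
```

Because the table lists rounded latencies, recomputed products can differ slightly from the printed column in the last digit.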

# The edge of tomorrow

- EoT and hardware codesign
  - enable fundamentally new capabilities & robust adaptive systems
  - data challenge: connecting data across full-stack codesign from algorithms to devices to materials
- Fast and Slow together
  - Powerful autonomous instruments enabled through physics-inspired surrogate models and robust, real-time controllers

Feel free to join our Slack or mailing list; just reach out to me at ntran@fnal.gov.