

# Development of ML FPGA filter for particle identification and tracking in real time

Sergey Furletov (Jefferson Lab)

<u> Team :</u>

F. Barbosa, L. Belfore, N. Branson, N. Brei, C. Dickover, C. Fanelli, D. Furletov, L. Jokhovets, D. Lawrence, C. Mei, D. Romanov, K. Shivu

Workshop on Streaming readout XI

2 Dec 2023

#### EIC streaming readout as motivation





- The correct location for the ML on the FPGA filter is called "FEP" in this figure.
- This gives us a chance to reduce traffic earlier.
- Allows us to touch physics: ML brings intelligence to L1.
- However, it is now unclear how far we can go with physics at the FPGA.
- Initially, we can start in pass-through mode.
- Then we can add background rejection.
- Later we can add filtering processes with the largest cross section.
- In case of problems with output traffic, we can add a selector for low cross section processes.
- The ML-on-FPGA solution complements the purely computer-based solution and mitigates DAQ performance risks.

### Generic EIC R&D project RD15, ML-(on)-FPGA



Usually, several PID detectors are used in an experiment.

12/02/23

- □ For example, the GEM-TRD and e/m-calorimeter, both provide separation of electrons and hadrons.
- Summation and processing of joint data from both detectors at the early stages will increase the identification power of these detectors compared to independent identification.
- □ To test the "global PID" performance we work on developing the ML-FPGA setup for real-time data pre-processing.
- □ The setup consists of several PID and tracking detectors: emCAL, GEMTRD, GEM tracker.
- Preprocessed data from both detectors including decision on the particle type will be transferred to another ML-FPGA board with neural network for global PID decision.
- □ The global filter transfers data to off-line computer farm, running JANA2 software.



### GEM-TRD prototype for EIC R&D

- To demonstrate the operating principle of the ML FPGA, we use the existing setup
- from the EIC detector R&D project
- A test module was built at the University of Virginia
- The prototype of GEMTRD/T module has a size of 10 cm × 10 cm with a corresponding to a total of 512 channels for X/Y coordinates.
- The readout is based on flash ADC system developed at JLAB (fADC125) @125 MHz sampling.
- GEM-TRD provides e/hadron separation and tracking









12/02/23

#### Sergey Furletov

### **GEM-TRD** principle



- □ The e/pion separation in the GEM-TRD detector is based on counting the ionization along the particle track.
- □ For electrons, the ionization is higher due to the absorption of transition radiation photons
- So, particle identification with TRD consists of several steps:
  - The first step is to cluster the incoming signals and create "hits".
  - The next is "pattern recognition" sorting hits by track.
  - Finding a track
  - Ionization measurement along a track
  - As a bonus, TRD will provide a track segment for the global tracking system.

#### GEM-TRD can work as micro TPC, providing 3D track segments



### GEMTRD tracks

- □ In a real experiment, GEMTRD will have multiple tracks.
- □ So we also need a fast algorithm for pattern recognition
- □ As well as for track fitting.
- □ The decision was made to try the Graph Neural Network (GNN) for pattern recognition.
- □ And a recurrent neural network LSTM, for track fitting.



Javier Duarte arXiv:2012.01249v2 [hep-ph] 7 Dec 2020

□ HEP advanced tracking algorithms at the exascale (Project Exa.TrkX)

<u>https://exatrkx.github.io/</u>







### **GNN** for pattern recognition



- Graph Neural Networks (GNNs) designed for the tasks of hit classification and segment classification.
  - > These models read a graph of connected hits and compute features on the nodes and edges.
- □ The input and output of GNN is a graph with a number of features for nodes and edges.
  - In our case we use the edge classification
- $\Box$  A complete graph on N vertices contains N(N 1)/2 edges.
  - > This will require a lot of resources which are limited in FPGA.

□ To keep resources under control, we can construct the graph for a specific geometry and limit the minimum particle momentum.

- □ In our case we have a straight track segments, with a quite narrow angular distribution ~15 degree.
- Thus, for the input hits (left), we connect only those edges that satisfy our geometry and the momentum of most tracks (middle)
- □ The trained GNN processes the input graph and sets the probability for each edge as output.

□ The right plot shows edges with a probability greater than 0.7



12/02/23

Sergey Furletov

### **GNN performance**



- □ This type of graph neural network is not yet supported in HLS4ML.
- So we did a manual conversion first to C++ and then to Verilog using Vitis\_HLS.
- □ This neural network has not been optimized, so it consumes a lot of resources 70% of DSPs, (4651 of 6840).
  - At the moment it can serve up to 21 hits and 42 edges, or , in our case (GEM-TRD), it will be 3-5 tracks.
- However, it performs all calculations in ~3 μs (left plot) (thanks to Ben Raydo), providing good purity and efficiency (right plot).



| Modules & Loops                | Issue Type Slac | c Latency(cycles) | Latency(ns) | Iteration Latency | Interval | Trip Count | Pipelined | BRAM | DSP  | FF     | LUT     | URAM |
|--------------------------------|-----------------|-------------------|-------------|-------------------|----------|------------|-----------|------|------|--------|---------|------|
| ▼                              |                 | - 589             | 2.945E3     | -                 | 590      | -          | no        | 42   | 4424 | 394036 | 2519454 | 0    |
| ▼ 10Graph                      |                 | - 499             | 2.495E3     | -                 | 497      |            | dataflow  | 42   | 4424 | 391308 | 2515320 | 0    |
| ⊜ fromGraph                    |                 | - 331             | 1.655E3     |                   | 1        |            | yes       | 0    | 0    | 197686 | 1673583 | 0    |
| ▶                              |                 | - 496             | 2.480E3     |                   | 496      |            | no        | 42   | 4422 | 172620 | 785082  | 0    |
| toGraph_Block_split100_proc205 |                 | - 480             | 2.400E3     |                   | 480      |            | no        | 0    | 2    | 7226   | 49627   | 0    |
| C VITIS_LOOP_1365_1            |                 | - 63              | 315.000     | 3                 |          | 21         | no        |      |      |        |         | -    |
| C VITIS_LOOP_1400_3            |                 | - 22              | 110.000     | 3                 | 1        | 21         | yes       | -    | -    | -      | -       |      |

12/02/23

#### FPGA test bench (vcu118 board)





Several version of IPs were synthesized and tested on FPGAs.

□ The logic test was performed with the MicroBlaze processor.

□ I/O data transfer is carried out through the ETH interface with the TCP/IP core.

12/02/23



# Beam test at FermiLab May 2023







12/02/23

Sergey Furletov

Workshop on Streaming readout XI

#### Beam test at FermiLab





- JANA2 has been tested to receive data from the DAQ and save it as evio and/or root files.
  JANA2 also used for offline conversion and processing data.
- DAQ rates during spill :
  - Raw data rate:

- ~100 MB/s; Trigger: 1.5 kHz
- Pulse mode data rate (SRS raw) : ~45 MB/s ; Trigger: 2.5 kHz
- > Data collected: 1.1 TB



12/02/23

### GEMTRD online event display











12/02/23

Workshop on Streaming readout XI



## Preliminary ML-FPGA results

The data recorded during the beam test is currently being used to develop ML-FPGA algorithms and performance tests.



### ML-FPGA performance for GEMTRD





□ Top rows: show ionization along the track in GEMTRD detector.

- <u>Red circles</u> are reconstructed clusters using some dE/dx threshold. The size is proportional to energy.
- Middle rows: after filtering out the noisy clusters, the coordinates of the clusters are sent to the FPGA/GNN for pattern recognition.

30 z pos,mm

30 z pos,mm

z pos.mr

П

- Bottom rows: GNN provides labeling of clusters (by color in the figure), the same colors belong to the same track.
- Then clusters of the same color (tag) are sent to the track fitting module: LSTM.
- □ <u>The results of track fitting</u> are represented by lines in the figures.
- The next step is to count all the ionization in the corridor around the track and send it to the PID module (DNN).
- As a bonus, GEMTRD provides a track segment for the global tracking system.

Sergey Furletov



### JANA2 for ML on FPGA

Pre-processed data from the FPGA is transferred over the network (TCP/IP) to a computer running JANA2 software.



#### JANA4ML4FPGA



#### JANA2

(JLab ANAlysis framework)

- JANA2 is a multi-threaded modular event reconstruction framework being developed at Jlab for online and offline processing

- JANA2 is a rewrite based on modern coding and CS practices. Developed for modern NP experiments with streaming readout, heterogeneous computing and AI

- JANA2 is the main framework chosen for EIC. Used for ePIC collaboration reconstruction and further Detector 2. Used in multiple Jlab experiments and prototypes





# Validation software

#### JANA4ML4FPGA





#### Goals:

- Read and write EVIO
- Write flat ROOT files
- Receive EVIO by TCP (and save)
- Receive network streams
- Receive FPGA data
- Simulate sending detector data
- Data Quality Monitor
- Al streaming preprocessing
- Conventional preprocessing

#### 12/02/23



## Other detectors on the beam test.

#### At the beam test was also recorded data from e/m calorimter. the



#### Other detectors that were used in beam tests.



e/m Calorimeter
 Micromegas TRD
 µRWELL TRD
 GEM tracker
 Work continues on processing data from these detectors.



#### Calorimeter parameters reconstruction

By Dmitry Romanov



Geant 4 simulation



Convolutional VAE as a backbone

Per cluster output of multiple values:

Energy, e/  $\pi$ , coordinates, features

Modules deposits as inputs







Examples of events with e and  $\pi^-$  showers and  $\mu^-$  passing through.

12/02/23

•

•

### **CNN for calorimeter reconstruction**



- This was done to minimize a network size in FPGA and due to current limitation of HSL4ML of supported network layer types.
- FPGA synthesis with reuse factor of 2 has a latency of 1.3µs and an interval of 245 clocks. It uses 71% of DPS resources

| Actual values | Predicted results |        |  |  |  |  |
|---------------|-------------------|--------|--|--|--|--|
| Actual values | e                 | $\pi$  |  |  |  |  |
| e             | 98.8 %            | 1.2 %  |  |  |  |  |
| $\pi$         | 2.9 %             | 97.1 % |  |  |  |  |









#### == Utilization Estimates \_\_\_\_\_ \* Summarv: Name BRAM\_18K| DSP48E FF LUT URAM DSP Expression 0 20 IFIF0 202 8191 14048 \_ Instance 61 63801 239028 4862 -Memory \_ Multiplexer 36 Register ITotal 71998 263 4862 253132 |Available SLR 1440 2280 788160 394080 |Utilization SLR (%) 18 213 91 64 4320 6840 IAvailable 2364480 1182240 |Utilization (%) 71| 61 31

12/02/23

Sergey Furletov

Workshop on Streaming readout XI



# Other activities and projects

#### Xilinx VPK180 board





### **GEMTRD** for GlueX



- **Q** Real-time data processing is a frontier field in experimental particle physics.
- □ The growing computational power of modern FPGA boards allows us to add more sophisticated algorithms for real-time data processing.
- Many tasks, such as tracking and particle identification, could be solved using modern Machine Learning (ML) algorithms which are naturally suited for FPGA architectures.
- The work described in this report aims to test ML-FPGA algorithms in a triggered data acquisition system, as well as in streaming data acquisition, such as in the future EIC collider.
- The first target is the GlueX experiment, with a plan to build a Transition Radiation Detector (TRD) based on GEM technology (GEM-TRD), to improve the electron-pion separation in the GlueX experiment. It will allow to study precisely reactions with electron-positron pairs in the final states.





- GEM-TRD will be installed in front of DIRC detector.
- □ Hall D is dedicated to the operation with a linearly-polarized photon beam produced by ~12 GeV electrons from CEBAF at Jefferson Lab.
- □ Typical trigger rate 40-70 kHz
- □ Data rate 0.7 1.2 GB/s
- Trigger latency 3.5 us.

12/02/23

Sergey Furletov

Workshop on Streaming readout XI

### ADC based DAQ for PANDA STT



#### Level 0 Open VPX Crate

ADC based DAQ for PANDA STT (one of approaches):

- 160 channels (shaping, sampling and processing) per payload slot, 14 payload slots+2 controllers;
- totally 2200 channels per crate;
- time sorted output data stream (arrival time, energy,...)
- noise rejection, pile up resolution, base line correction, ...







(configurable up to 1 GSPS); Single Virtex7 FPGA

160 Amplifiers; 5 connectors for 32pins samtec cables

- All information from the straw tube tracker is processed in one unit.
- Allows to build a complete STT event.
- This unit can also be used for calorimeters readout and processing.



https://doi.org/10.1088/1748-0221/17/04/C04022 2022 JINST 17 C04022





Sergey Furletov

Workshop on Streaming readout XI



### Outlook



- An FPGA-based Neural Network application would offer online event preprocessing and allow for data reduction based on physics at the early stage of data processing.
- □ The ML-on-FPGA solution complements the purely computer-based solution and mitigates DAQ performance risks.
- **FPGA** provides extremely low-latency neural-network inference.
- Open-source HLS4ML software tool with Xilinx<sup>®</sup> Vivado<sup>®</sup> High Level Synthesis (HLS) accelerates machine learning neural network algorithm development.
- The ultimate goal is to build a real-time event filter based on physics signatures.



Figure 2.1: Feynman diagrams of the Quark Parton Model, QCD-Compton and Boson Gluon Fusion processes in NC DIS.

#### Published in 2007

Measurement of multijet events at low \$x\_{Bj}\$ and low \$Q^2\$ with the ZEUS detector at HERA

T. Gosau



Jennifer Ngadiuba - hls4ml; deep neural networks in FPGA

#### Workshop on Streaming readout XI

11.01.2019

<sup>12/02/23</sup> 



# Backup



### Moving forward : ML on FPGA



- Offline analysis using ML looks promising.
- Can it be done in real time ?
- Here are some of the possible solutions :
  - Computer farm.
  - CPU + GPU
  - CPU + FPGA
  - FPGA only

#### Inference on an FPGA





Image: https://nurseslabs.com/nervous-system/



IRIS-HEP th Febraury 13, 2019 Dylan Rankin [MIT]

- Modern FPGAs have DSP slices specialized hardware blocks placed between gateways and routers that perform mathematical calculations.
- The number of DSP slices can be up to 6000-12000 per chip.

12/02/23

### Optimization with hls4ml package



• A package hls4ml is developed based on High-Level Synthesis (HLS) to build machine learning models in FPGAs.



#### Sergey Furletov

#### Workshop on Streaming readout XI





#### 12/02/23

Sergey Furletov

- 32

### MLP neural network for PID





#### **Developing ethernet interface**





12/02/23

### FPGA test board for ML



- At an early stage in this project, as hardware to test ML algorithms on FPGA, we use a standard Xilinx evaluation boards rather than developing a customized FPGA board. These boards have functions and interfaces sufficient for proof of principle of ML-FPGA.
- The Xilinx evaluation board includes the Xilinx XCVU9P and 6,840 DSP slices. Each includes a hardwired optimized multiply unit and collectively offers a peak theoretical performance in excess of 1 Tera multiplications per second.
- Second, the internal organization can be optimized to the specific computational problem. The internal data processing architecture can support deep computational pipelines offering high throughputs.
- Third, the FPGA supports high speed I/O interfaces including Ethernet and 180 high speed transceivers that can operate in excess of 30 Gbps.

Featuring the Virtex® UltraScale+™ XCVU9P-L2FLGA2104E FPGA



Xilinx Virtex<sup>®</sup> UltraScale+<sup>™</sup>

#### **GEM-TRD** offline analysis





**G** For data analysis we used a neural network library provided by root /TMVA package :

- MultiLayerPerceptron (MLP)
- **Top left plot shows ionization difference for e/pi in several bins along the track**
- □ Top right plot shows neural network output for single TRD module:
  - Red electrons with radiator
  - Blue electrons without radiator.

#### Xilinx HLS: C++ to Verilog





The C/C++ code of the trained network is used as input for Vivado\_HLS.

The Xilinx Vivado HLS (High-Level Synthesis) tool provides a higher level of abstraction for the user by synthesizing functions written in C,C++ into IP blocks, by generating the appropriate ,low-level, VHDL and Verilog code. Then those blocks can be integrated into a real hardware system.

|                                                                                                          | 1//                                                                                               |
|----------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
|                                                                                                          | 2// RTL generated by Vivado(TM) HLS - High-Level Synthesis from C, C++ and SystemC                |
| 2 // float_regex.sh:: converted to (tx_t)                                                                | 3 // Version: 2019.1                                                                              |
| 3 //                                                                                                     | 4// Copyright (C) 1986-2019 Xilinx, Inc. All Rights Reserved.                                     |
| 4 // cxx file                                                                                            |                                                                                                   |
| 5 <b>#include "trd_ann.h"</b>                                                                            | 5 //<br>6 // ===============================                                                      |
| 6 #include <cmath></cmath>                                                                               | 6 //<br>7                                                                                         |
| 7/*                                                                                                      |                                                                                                   |
| <pre>8 fx_t ann(int index,fx_t in0,fx_t in1,fx_t in2,fx_t in3,fx_t in4,fx_t in5,fx_t in6,fx_t in7,</pre> | 8 timescale 1 ns / 1 ps                                                                           |
| 9 input0 = (in0 - (fx_t)1.96805)/(fx_t)7.63362;                                                          |                                                                                                   |
| <pre>10 input1 = (in1 - (fx_t)4.75766)/(fx_t)11.9138;</pre>                                              | <pre>10 (* CORE_GENERATION_INFO="trdann,hls_ip_2019_1,{HLS_INPUT_TYPE=cxx,HLS_INPUT_FLOAT=1</pre> |
| <pre>11 input2 = (in2 - (fx_t)4.40589)/(fx_t)11.4831;</pre>                                              |                                                                                                   |
| 12 input3 = (in3 - (fx_t)4.24519)/(fx_t)11.2533;                                                         | 12 module trdann (                                                                                |
| <pre>13 input4 = (in4 - (fx_t)4.30175)/(fx_t)11.2252;</pre>                                              | 13 ap_clk,                                                                                        |
| 14 input5 = (in5 - (fx t)3.87414)/(fx t)10.1781;                                                         | 14 ap_rst_n,                                                                                      |
| 15 input6 = (in6 - (fx t)3.75959)/(fx t)9.69367;                                                         | 15 s_axi_AXILiteS_AWVALID,                                                                        |
| 16 input7 = (in7 - (fx t)3.84352)/(fx t)9.66213;                                                         | 16 s_axi_AXILiteS_AWREADY,                                                                        |
| 17 input8 = (in8 - (fx t)3.65047)/(fx t)9.09565;                                                         | <pre>17 s_axi_AXILiteS_AWADDR,</pre>                                                              |
| 18 input9 = (in9 - (fx_t)5.96775)/(fx_t)11.3203;                                                         | 18 s_axi_AXILiteS_WVALID,                                                                         |
| 19 switch(index) {                                                                                       | 19 s_axi_AXILiteS_WREADY,                                                                         |
| 20 case 0:                                                                                               | 20 s_axi_AXILiteS_WDATA,                                                                          |
| 21 return neuron0x32b4c90();                                                                             | 21 s_axi_AXILiteS_WSTRB,                                                                          |
| 22 default:                                                                                              | 22 s_axi_AXILiteS_ARVALID,                                                                        |
| 23 return (fx_t)0.;                                                                                      | 23 saxi AXILiteS ARREADY. → Verilog                                                               |
|                                                                                                          |                                                                                                   |
| 25 }                                                                                                     | 25 s_axi_AXILiteS_RVALID,                                                                         |
| 26 */                                                                                                    | 26 s_axi_AXILiteS_RREADY,                                                                         |
| 27@fout_t trdann(int index, finp_t input[10]) {                                                          | 27 s_axi_AXILiteS_RDATA,                                                                          |
| 28 input0 = $(fx t(input[0]) - (fx t)1.96805)/(fx t)7.63362;$                                            | <pre>28 s_axi_AXILiteS_RRESP,</pre>                                                               |
| 29 input1 = $(fx t(input[1]) - (fx t)4.75766)/(fx t)11.9138;$                                            | 29 s_axi_AXILiteS_BVALID,                                                                         |
| <pre>30 input2 = (fx_t(input[2]) - (fx_t)4.40589)/(fx_t)11.4831;</pre>                                   | <pre>30 s_axi_AXILiteS_BREADY,</pre>                                                              |
| <pre>31 input3 = (fx_t(input[3]) - (fx_t)4.24519)/(fx_t)11.2533;</pre>                                   | <pre>31 s_axi_AXILiteS_BRESP,</pre>                                                               |
| <pre>32 input4 = (fx t(input[4]) - (fx t)4.30175)/(fx t)11.2252;</pre>                                   | 32 interrupt                                                                                      |
| <pre>33 input5 = (fx_t(input[5]) - (fx_t)3.87414)/(fx_t)10.1781;</pre>                                   | 33);                                                                                              |
| <pre>34 input6 = (fx_t(input[6]) - (fx_t)3.75959)/(fx_t)9.69367;</pre>                                   | 34                                                                                                |
| <pre>35 input7 = (fx t(input[7]) - (fx t)3.84352)/(fx t)9.66213;</pre>                                   | 35 parameter ap_ST_fsm_state1 = 23'd1;                                                            |
| <pre>36 input8 = (fx_t(input[8]) - (fx_t)3.65047)/(fx_t)9.09565;</pre>                                   | 36 parameter ap_ST_fsm_state2 = 23'd2;                                                            |
| <pre>37 input9 = (fx_t(input[9]) - (fx_t)5.96775)/(ftotal)11.3203;</pre>                                 | 37 parameter ap_ST_fsm_state3 = 23'd4;                                                            |
| 38 switch(index) {                                                                                       | 38 parameter ap_ST_fsm_state4 = 23'd8;                                                            |
| 39 case 0:                                                                                               | 39 parameter ap_ST_fsm_state5 = 23'd16;                                                           |
| 40 return neuron0x32b4c90();                                                                             | 40 parameter ap_ST_fsm_state6 = 23'd32;                                                           |
| 41 default:                                                                                              | 41 parameter ap_ST_fsm_state7 = 23'd64;                                                           |
| 42 return (fx_t)0.;                                                                                      | 42 parameter ap_ST_fsm_state8 = 23'd128;                                                          |
| 43                                                                                                       | 43 parameter ap_ST_fsm_state9 = 23'd256;                                                          |
| <sup>44</sup> ) Note: fixed point calculation                                                            | 44 parameter ap_ST_fsm_state10 = 23'd512;                                                         |
|                                                                                                          | 45 parameter ap_ST_fsm_statel1 = 23'd1024;                                                        |
| 46⊜ fx_t neuron0x32bf850() {                                                                             | 46 parameter ap_ST_fsm_state12 = 23'd2048;                                                        |
| 47 return input0;                                                                                        | 47 parameter ap_ST_fsm_statel3 = 23'd4096;                                                        |
| 48 }                                                                                                     | 48 parameter ap_ST_fsm_state14 = 23'd8192;                                                        |
| 49                                                                                                       | 49 parameter ap_ST_fsm_state15 = 23 'd16384;                                                      |
| 50⊜fx_t neuron0x32bf190() {                                                                              | 50 parameter ap_ST_fsm_state16 = 23'd32768;                                                       |
| 51 return input1;                                                                                        | 51 parameter ap_ST_fsm_state17 = 23'd65536;                                                       |
| E0 1                                                                                                     | 52 parameter ap_ST_fsm_state18 = 23'd131072;                                                      |
| Thanks to Ben Raydo for help.                                                                            | 53 parameter ap_ST_fsm_state19 = 23'd262144;                                                      |
| 54@fx_t neuron0x32bf4d0() {                                                                              | 54 parameter ap_ST_fsm_state20 = 23'd524288;                                                      |
| 55 return input2;                                                                                        | 55 parameter ap_ST_fsm_state21 = 23'd1048576;                                                     |
| 56 }                                                                                                     |                                                                                                   |
|                                                                                                          |                                                                                                   |

#### 12/02/23

Sergey Furletov

#### Test NN IP in FPGA



#### Test tools:

- 1. Vivado SDK
- 2. Petalinux





#### C++ code for test : XTrdann ann; // create an instance of ML core.



Sergey Furletov

#### Workshop on Streaming readout XI

#### Ser/gez Burletov

### Beam test @ FermiLab



#### FermiLab test beam :

- > Primary beam: protons 120 GeV
- 4.2 seconds = length of spill
- 60 seconds = approximate rep rate of spill
- ➢ Beam intensity: Particles per spill : 10K − 1M (pps)

#### DAQ rates during spill :

- Raw data rate: ~100 MB/s; Trigger: 1.5 kHz
- Pulse mode data rate (SRS raw) : ~45 MB/s ; Trigger: 2.5 kHz
- Data collected: 1.1 TB