

### Fast ML on FPGA for Particle Identification and Tracking

Sergey Furletov (*Jefferson Lab*)

Streaming Readout Workshop SRO-XII

University of Tokyo, 1-4 Dec 2024

12/03/24

Sergey Furletov

Streaming Readout Workshop SRO-XII, University of Tokyo



# Electron Ion Collider (EIC)



- The Electron-Ion Collider, a new facility for nuclear physics research to be located at Brookhaven Lab, will allow scientists from across the nation and around the globe to peer inside protons and atomic nuclei to reveal secrets of the strongest force in nature.
- Research at the EIC will take our understanding of matter to the next level—beyond the interactions of atomic nuclei with their orbiting electrons, which power the electronic and information technologies we now use every day, to the forces acting inside the nucleus.





# The Electron-Ion Collider

A machine for delving deeper than ever before into the building blocks of matter

### EIC streaming readout as motivation for ML-FPGA

Figure: EIC streaming readout chain. Labels recovered from the diagram: Detector, FEB (Front End Board), FEP (Front End Processor), DAQ (Data Acquisition); ASIC, FPGA, switch, server, storage; bandwidths O(100 Tbps), O(10 Tbps), and a goal of O(100 Gbps); links: analog ~20 m, LVDS ~5 m, fiber ~100 m; global timing, busy and sync; beam collision clock input; configuration and control; power supply system (HV, LV, bias); cooling systems.

The correct location for the ML on the FPGA filter is called "FEP" in this figure.


- This gives us a chance to reduce traffic earlier.
- Allows us to touch physics: ML brings intelligence to L1.
- However, it is not yet clear how far we can go with physics on the FPGA.
- Initially, we can start in pass-through mode.
- Then we can add background rejection.
- Later we can add filtering of the processes with the largest cross section.
- In case of problems with output traffic, we can add a selector for low cross section processes.
- The ML-on-FPGA solution complements the purely computer-based solution and mitigates DAQ performance risks.
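
The staged rollout above can be pictured as a filter whose mode is switched as confidence grows. The sketch below is illustrative only: the mode names, the `Event` fields, the score cut, and the prescale factor are hypothetical and do not come from the EIC DAQ design.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Mode(Enum):
    PASS_THROUGH = auto()       # stage 1: forward everything
    REJECT_BACKGROUND = auto()  # stage 2: drop events tagged as background
    PRESCALE_COMMON = auto()    # stage 3: also thin out large cross-section processes


@dataclass
class Event:
    background_score: float  # hypothetical NN output in [0, 1]
    is_common_process: bool  # hypothetical tag for a large cross-section process


class StagedFilter:
    def __init__(self, mode, bkg_cut=0.9, prescale=10):
        self.mode = mode
        self.bkg_cut = bkg_cut
        self.prescale = prescale
        self._common_seen = 0

    def accept(self, ev):
        if self.mode is Mode.PASS_THROUGH:
            return True
        if ev.background_score > self.bkg_cut:
            return False
        if self.mode is Mode.PRESCALE_COMMON and ev.is_common_process:
            self._common_seen += 1
            # keep the first of every `prescale` common events
            return (self._common_seen - 1) % self.prescale == 0
        return True


events = [Event(0.1, True) for _ in range(20)] + [Event(0.95, False)]
for mode in Mode:
    kept = sum(StagedFilter(mode).accept(e) for e in events)
    print(f"{mode.name}: kept {kept} of {len(events)}")
```

Starting in `PASS_THROUGH` keeps the FPGA path transparent while the networks are validated; the later modes only change which events are forwarded.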

### Generic EIC R&D project RD15, ML-(on)-FPGA



- The goal is to build a demonstrator that can operate in real time under beam test conditions.
- The setup consists of several PID and tracking detectors: emCAL, GEMTRD, and a GEM tracker.
- Preprocessed data from each detector, including a decision on the particle type, will be transferred to another ML-FPGA board with a neural network that makes the global PID decision.
- The global filter transfers data to an offline computer farm running JANA2 software.


### <u>Team:</u>

F. Barbosa, L. Belfore, N. Branson, N. Brei, C. Dickover, C. Fanelli, D. Furletov, L. Jokhovets, D. Lawrence, C. Mei, D. Romanov, K. Shivu



### FPGA test board for ML



- At this early stage of the project, we test ML algorithms on standard Xilinx evaluation boards rather than developing a customized FPGA board. These boards have functions and interfaces sufficient for a proof of principle of ML-FPGA.
- First, the Xilinx evaluation board carries the Xilinx XCVU9P FPGA with 6,840 DSP slices. Each slice includes a hardwired, optimized multiply unit, and collectively they offer a peak theoretical performance in excess of 1 tera-multiplications per second.
- Second, the internal organization can be optimized to the specific computational problem. The internal data processing architecture can support deep computational pipelines offering high throughput.
- Third, the FPGA supports high-speed I/O interfaces, including Ethernet and 180 high-speed transceivers that can operate in excess of 30 Gbps.
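
The "in excess of 1 tera-multiplications per second" figure can be checked with a back-of-the-envelope estimate. The sketch assumes one multiplication per DSP slice per clock cycle and a 200 MHz clock, which is the 5 ns period implied by the Vitis HLS reports shown later in this talk; it is an order-of-magnitude check, not a vendor specification.

```python
DSP_SLICES = 6840   # DSP slice count quoted for the XCVU9P
CLOCK_HZ = 200e6    # assumption: 5 ns clock period


def peak_multiplications_per_second(dsp_slices, clock_hz):
    """Upper bound assuming every DSP slice completes one multiply per cycle."""
    return dsp_slices * clock_hz


peak = peak_multiplications_per_second(DSP_SLICES, CLOCK_HZ)
print(f"{peak / 1e12:.3f} tera-multiplications per second")
```

Even at this modest clock the estimate lands above 1 tera-multiplication per second, consistent with the statement above.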

Featuring the Virtex® UltraScale+™ XCVU9P-L2FLGA2104E FPGA




# GEM-TRD prototype for EIC R&D

- To demonstrate the operating principle of ML on FPGA, we use the existing setup from the EIC detector R&D project.
- A test module was built at the University of Virginia.
- The GEMTRD/T prototype module has a size of 10 cm × 10 cm, corresponding to a total of 512 channels for X/Y coordinates.
- The readout is based on a flash ADC system developed at JLab (fADC125), sampling at 125 MHz.
- GEM-TRD provides e/hadron separation and tracking.










# **GEM-TRD** principle



- The e/pion separation in the GEM-TRD detector is based on counting the ionization along the particle track.
- For electrons, the ionization is higher due to the absorption of transition radiation photons.
- Particle identification with the TRD therefore consists of several steps:
  - The first step is to cluster the incoming signals and create "hits".
  - The next is "pattern recognition": sorting hits by track.
  - Finding a track.
  - Measuring the ionization along the track.
  - As a bonus, the TRD provides a track segment for the global tracking system.
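
The first step, turning digitized waveforms into hits, can be illustrated with a simple threshold clustering of one channel's samples. This is a minimal sketch, not the firmware algorithm: the threshold, the pedestal-subtracted sample values, and the function name `find_hits` are made up for illustration. The 8 ns sample spacing follows from the 125 MHz sampling rate.

```python
SAMPLE_NS = 8.0  # 1 / 125 MHz


def find_hits(samples, threshold):
    """Group consecutive samples above threshold into hits.

    Returns a list of (time_ns, amplitude_sum), where the time is the
    amplitude-weighted mean of the sample times in the cluster.
    """
    hits = []
    total = 0.0
    weighted = 0.0
    for i, s in enumerate(samples):
        if s > threshold:
            total += s
            weighted += s * i * SAMPLE_NS
        elif total > 0.0:
            hits.append((weighted / total, total))
            total = weighted = 0.0
    if total > 0.0:  # cluster that runs to the end of the readout window
        hits.append((weighted / total, total))
    return hits


waveform = [0, 1, 12, 30, 14, 2, 0, 0, 25, 20, 1]
for time_ns, amplitude in find_hits(waveform, threshold=10):
    print(f"hit at {time_ns:.1f} ns, summed amplitude {amplitude:.0f}")
```

In a micro-TPC readout the hit time maps onto the drift coordinate, so each hit carries both a position estimate and an ionization estimate.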

### GEM-TRD can work as micro TPC, providing 3D track segments



### GEMTRD tracks



- In a real experiment, GEMTRD will have multiple tracks.
- So we also need a fast algorithm for pattern recognition, as well as for track fitting.
- The decision was made to try a Graph Neural Network (GNN) for pattern recognition and a recurrent neural network (LSTM) for track fitting.
- PID is based on measuring ionization along the track.



Javier Duarte arXiv:2012.01249v2 [hep-ph] 7 Dec 2020

HEP advanced tracking algorithms at the exascale (Project Exa.TrkX)

<u>https://exatrkx.github.io/</u>





# **GNN** for pattern recognition



- Graph Neural Networks (GNNs) are designed for the tasks of hit classification and segment classification.
  - These models read a graph of connected hits and compute features on the nodes and edges.
- The input and output of a GNN is a graph with a number of features for nodes and edges.
  - In our case we use edge classification.
- A complete graph on $N$ vertices contains $N(N-1)/2$ edges.
  - This would require a lot of resources, which are limited in an FPGA.
- To keep resources under control, we can construct the graph for a specific geometry and limit the minimum particle momentum.
- In our case we have straight track segments with a fairly narrow angular distribution of about 15 degrees.
- Thus, for the input hits (left), we connect only those edges that satisfy our geometry and the momentum of most tracks (middle).
- The trained GNN processes the input graph and outputs a probability for each edge.
- The right plot shows edges with a probability greater than 0.7.
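
The effect of the geometric constraint on graph size can be sketched as follows. Hits are assumed to be (z, x) points in a small number of detector slices; the slice layout, the hit coordinates, and the helper names are hypothetical, while the ~15 degree acceptance and the 0.7 probability cut come from the text above. A trained GNN would supply the edge probabilities; here they are simply passed in.

```python
import math
from itertools import combinations

MAX_ANGLE_DEG = 15.0  # angular acceptance quoted above
EDGE_CUT = 0.7        # probability threshold quoted above


def build_edges(hits, max_angle_deg=MAX_ANGLE_DEG):
    """Connect hit pairs whose slope is compatible with the track angle cut.

    `hits` is a list of (z, x) tuples; each edge points toward larger z.
    """
    max_slope = math.tan(math.radians(max_angle_deg))
    edges = []
    for i, j in combinations(range(len(hits)), 2):
        dz = hits[j][0] - hits[i][0]
        dx = hits[j][1] - hits[i][1]
        if dz == 0:
            continue  # same slice: cannot lie on one straight track
        if abs(dx / dz) <= max_slope:
            edges.append((i, j) if dz > 0 else (j, i))
    return edges


def select_edges(edges, probabilities, cut=EDGE_CUT):
    """Keep edges whose GNN output probability exceeds the cut."""
    return [e for e, p in zip(edges, probabilities) if p > cut]


# Two roughly straight tracks crossing three slices at z = 0, 10, 20 mm
hits = [(0, 0.0), (0, 8.0), (10, 1.0), (10, 8.5), (20, 2.0), (20, 9.0)]
edges = build_edges(hits)
n = len(hits)
print(f"complete graph: {n * (n - 1) // 2} edges, constrained graph: {len(edges)} edges")
```

Because the number of candidate edges drives the size of the edge network, cutting them at graph-construction time is what keeps the design within the FPGA's resources.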




# **GNN performance**



- This type of graph neural network is not yet supported in HLS4ML.
- So we did a manual conversion, first to C++ and then to Verilog using Vitis\_HLS.
- This neural network has not been optimized or pruned, so it consumes a lot of resources: about 70% of the DSPs (4651 of 6840).
  - The network uses precision `ap_fixed<16,9>`.
  - At the moment it can serve up to 21 hits and 42 edges, which in our case (GEM-TRD) corresponds to 3-5 tracks.
- However, it performs all calculations in ~3 μs (left plot) (thanks to Ben Raydo), providing good purity and efficiency (right plot).
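
For readers unfamiliar with the notation, `ap_fixed<16,9>` is a 16-bit signed fixed-point type with 9 integer bits (sign included) and therefore 7 fractional bits. The Python sketch below emulates its default behavior (truncation toward minus infinity and wrap-around on overflow) to show the resolution and range available to weights and activations. It is an illustration, not part of the HLS code.

```python
import math


def to_ap_fixed(value, width=16, int_bits=9):
    """Emulate ap_fixed<width,int_bits> with the default AP_TRN / AP_WRAP modes."""
    frac_bits = width - int_bits
    raw = math.floor(value * (1 << frac_bits))  # truncate toward -infinity
    raw &= (1 << width) - 1                     # wrap to `width` bits
    if raw >= 1 << (width - 1):                 # reinterpret as signed
        raw -= 1 << width
    return raw / (1 << frac_bits)


print(to_ap_fixed(0.3))      # representable value at or just below 0.3
print(to_ap_fixed(255.999))  # close to the positive limit
print(to_ap_fixed(300.0))    # overflows and wraps to a negative value
```

The step size is 1/128 and the representable range is [-256, 256), so inputs and intermediate sums have to be scaled to stay inside it.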



| Modules & Loops                | Issue Type | Slack | Latency(cycles) | Latency(ns) | Iteration Latency | Interval | Trip Count | Pipelined | BRAM | DSP  | FF     | LUT     | URAM |
|--------------------------------|------------|-------|-----------------|-------------|-------------------|----------|------------|-----------|------|------|--------|---------|------|
| gnn2dfs2                       |            | -     | 589             | 2.945E3     | -                 | 590      | -          | no        | 42   | 4424 | 394036 | 2519454 | 0    |
|                                |            |       | 499             | 2.495E3     | -                 | 497      |            | dataflow  | 42   | 4424 | 391308 | 2515320 | 0    |
| fromGraph                      |            |       | 331             | 1.655E3     |                   | 1        |            | yes       | 0    | 0    | 197686 | 1673583 | 0    |
| gnn2dfs_loc_1                  |            |       | 496             | 2.480E3     |                   | 496      |            | no        | 42   | 4422 | 172620 | 785082  | 0    |
| toGraph_Block_split100_proc205 |            |       | 480             | 2.400E3     |                   | 480      |            | no        | 0    | 2    | 7226   | 49627   | 0    |
| VITIS_LOOP_1365_1              |            |       | 63              | 315.000     | 3                 |          | 21         | no        |      |      |        |         | -    |
| VITIS_LOOP_1400_3              |            |       | 22              | 110.000     | 3                 | 1        | 21         | yes       |      |      |        |         | -    |


# RNN/LSTM for track fit






# MLP neural network for PID





## **Board design**

- All data I/O operations are performed by the Control IP.
- MicroBlaze is only used to configure the board and monitor data processing.
- The Aurora interface provides communication with a second FPGA board that processes the calorimeter data (CNN).
- 10 Gigabit Ethernet uses TCP/IP; it receives data from the detectors (DAQ) and sends pre-processed data to the computer farm.




### Latency and rates (very preliminary)



- **Control IP** manages data traffic between NN-IP and the Ethernet interface.
- The IP block was synthesized directly using Vitis\_HLS; the total latency is about 20 μs (~50 kHz).
- The Control IP block primarily performs serial I/O.
  - Therefore, it consists of long loops designed to accommodate the maximum data size.
- In reality, the average data size is much smaller, so the actual speed should be higher.
- This was confirmed in measurements: peak performance reached 80 kHz.
- This is the first version; it is not yet optimized, and the II violations have not been fixed.

| Modules & Loops                    | Issue Type   | Slack | Latency(cycles) | Latency(ns) | Iteration Latency | Interval | Trip Count | Pipelined | BRAM | DSP | FF   | LUT   | URAM |
|------------------------------------|--------------|-------|-----------------|-------------|-------------------|----------|------------|-----------|------|-----|------|-------|------|
| ctrl_s64s                          | II Violation | -     | 4178            | 2.089E4     | -                 | 4179     | -          | no        | 8    | 5   | 4184 | 22984 | 0    |
| VITIS_LOOP_399_2                   |              |       | 4               | 20.000      |                   | 1        | 4          | yes       |      |     |      |       | -    |
| VITIS_LOOP_443_3                   |              |       | 1024            | 5.120E3     | 1                 | 1        | 1024       | yes       |      |     |      |       | -    |
| VITIS_LOOP_464_4                   |              |       | 1025            | 5.125E3     | 3                 | 1        | 1024       | yes       |      |     |      |       | -    |
| VITIS_LOOP_475_5                   | II Violation | -     | 45              | 225.000     | 6                 |          | 21         | yes       |      |     |      |       | -    |
| VITIS_LOOP_479_7                   | II Violation | -     | 43              | 215.000     | 4                 |          | 21         | yes       |      |     |      |       | -    |
| VITIS_LOOP_484_9_VITIS_LOOP_484_10 |              |       | 45              | 225.000     | 5                 | 1        | 42         | yes       |      |     |      |       | -    |
| VITIS_LOOP_503_11                  |              |       | 7               | 35.000      | 5                 | 1        | 4          | yes       |      |     |      |       | -    |
| VITIS_LOOP_508_12                  |              |       | 21              | 105.000     | 1                 | 1        | 21         | yes       |      |     |      |       | -    |
| VITIS_LOOP_523_13                  |              |       | 27              | 135.000     | 3                 | 1        | 26         | yes       |      |     |      |       | -    |
| VITIS_LOOP_540_14                  |              |       | 21              | 105.000     | 1                 | 1        | 21         | yes       |      |     |      |       | -    |
| VITIS_LOOP_542_15                  |              |       | 22              | 110.000     | 3                 | 1        | 21         | yes       |      |     |      |       | -    |
| VITIS_LOOP_562_16                  | II Violation | -     | 804             | 4.020E3     | 45                |          | 20         | yes       |      |     |      |       |      |
| VITIS_LOOP_626_20                  |              |       | 44              | 220.000     | 3                 | 2        | 21         | yes       |      |     |      |       |      |
| VITIS_LOOP_642_21                  |              | -     | 1025            | 5.125E3     | 3                 | 1        | 1024       | yes       | -    | -   | -    | -     | -    |
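
The latency and rate figures follow directly from the cycle counts in the HLS reports. The sketch below assumes the 5 ns clock period implied by the tables (for example, 4178 cycles reported as 2.089E4 ns) and treats the initiation interval as the limit on the event rate.

```python
CLOCK_NS = 5.0  # assumption: clock period implied by cycles vs. ns in the reports


def latency_us(cycles, clock_ns=CLOCK_NS):
    """Latency in microseconds for a given cycle count."""
    return cycles * clock_ns / 1e3


def max_rate_khz(interval_cycles, clock_ns=CLOCK_NS):
    """Highest sustainable rate if a new input is accepted every `interval_cycles`."""
    return 1e6 / (interval_cycles * clock_ns)


# Top-level rows of the two reports: (latency in cycles, interval in cycles)
reports = {"gnn2dfs2": (589, 590), "ctrl_s64s": (4178, 4179)}
for name, (lat, ii) in reports.items():
    print(f"{name}: {latency_us(lat):.3f} us latency, up to {max_rate_khz(ii):.1f} kHz")
```

The worst-case Control IP interval corresponds to just under 50 kHz, which matches the quoted estimate; the measured 80 kHz peak is plausible because typical events are shorter than the maximum loop length.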

### FPGA board resources for GEMTRD



- Neural networks use a lot of FPGA resources.
- Therefore, one VCU118 board can only process data from GEMTRD.






### Test setup at CERN SPS/H8 beam line







### Detectors



Electronics rack


### Beam structure and rate





- Spill duration: 4.8 s
- Repetition rate: 10-40 s
- Energy: 20 GeV
- Trigger rate during spill: 300-400 Hz
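
These beam parameters set the average load on the readout. The following is a quick estimate, assuming the trigger rate is flat during the spill and zero between spills (an idealization):

```python
def events_per_spill(trigger_hz, spill_s=4.8):
    """Number of triggers collected in one spill."""
    return trigger_hz * spill_s


def average_rate_hz(trigger_hz, cycle_s, spill_s=4.8):
    """Trigger rate averaged over one full spill cycle."""
    return events_per_spill(trigger_hz, spill_s) / cycle_s


for trig in (300, 400):
    for cycle in (10, 40):
        print(f"{trig} Hz in spill, {cycle} s cycle: "
              f"{events_per_spill(trig):.0f} events/spill, "
              f"{average_rate_hz(trig, cycle):.0f} Hz average")
```

Under these assumptions each spill delivers between 1440 and 1920 triggers, and the time-averaged rate is well below the in-spill rate.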

Screenshot: run control GUI (rcGui) during run 5233, started 07/20/24 15:05:36, with components PEBTRD and ROCTRD1 active and an event-rate plot on a 0-400 scale.


### **Tracking performance**





Description:

- Top rows: ionization along the track in the GEMTRD detector.
- <u>Red circles</u> are clusters reconstructed with a dE/dx threshold; their size is proportional to the energy.
- Middle rows: after filtering out the noisy clusters, the coordinates of the clusters are sent to the FPGA/GNN for pattern recognition.
- Bottom rows: the GNN labels the clusters (by color in the figure); clusters of the same color belong to the same track.
- Then clusters of the same color (tag) are sent to the track fitting module (LSTM).
- The results of track fitting are represented by lines in the figures.
- The next step is to count all the ionization in the corridor around the track and send it to the PID module (DNN).
- As a bonus, GEMTRD provides a track segment for the global tracking system.
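
The chain described above (edge selection, cluster labeling, line fit, ionization sum) can be mimicked offline in a few lines. In this sketch a least-squares straight line stands in for the LSTM fit, and connected components stand in for however the firmware turns accepted edges into track labels; the cluster coordinates, energies, and corridor width are invented for illustration.

```python
def label_tracks(n_clusters, accepted_edges):
    """Give clusters joined by accepted edges the same label (union-find)."""
    parent = list(range(n_clusters))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in accepted_edges:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n_clusters)]


def fit_line(points):
    """Least-squares fit x = slope * z + intercept."""
    n = len(points)
    mz = sum(z for z, _ in points) / n
    mx = sum(x for _, x in points) / n
    szz = sum((z - mz) ** 2 for z, _ in points)
    szx = sum((z - mz) * (x - mx) for z, x in points)
    slope = szx / szz
    return slope, mx - slope * mz


def corridor_energy(clusters, slope, intercept, half_width):
    """Sum the energy of all clusters within `half_width` of the fitted line."""
    return sum(e for z, x, e in clusters
               if abs(x - (slope * z + intercept)) <= half_width)


# (z mm, x mm, energy) for two tracks plus one unlabeled cluster near track 0
clusters = [(0, 0.0, 5.0), (10, 1.0, 7.0), (20, 2.0, 6.0),
            (0, 8.0, 3.0), (10, 8.5, 4.0), (20, 9.0, 2.0),
            (10, 1.4, 9.0)]
edges = [(0, 1), (1, 2), (3, 4), (4, 5)]  # edges with probability > 0.7
labels = label_tracks(len(clusters), edges)
track0 = [(z, x) for (z, x, _), lab in zip(clusters, labels) if lab == labels[0]]
slope, intercept = fit_line(track0)
energy = corridor_energy(clusters, slope, intercept, half_width=0.5)
print(labels, slope, intercept, energy)
```

Note that the corridor sum picks up the unlabeled cluster close to the first track, which is the point of counting all ionization around the fitted line rather than only the clusters used in the fit.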




# **ML for Calorimeter**


### Calorimeter parameters reconstruction

By Dmitry Romanov





- Convolutional VAE as a backbone
- Module deposits as inputs
- Per-cluster output of multiple values: energy, $e/\pi$, coordinates, features







Examples of events with $e$ and $\pi^-$ showers and a $\mu^-$ passing through.


## **CNN for calorimeter reconstruction**



- In this work we used a convolutional encoder with a decoder consisting of dense layers, which provide $e$-$\pi$ separation scores as the output.
- It was synthesized with HLS4ML for a calorimeter of 11x11 cells.
- This was done to minimize the network size in the FPGA and because of current limitations in the network layer types supported by HLS4ML.
- FPGA synthesis with a reuse factor of 1 has a latency of 0.7 μs and an interval of 125 clocks. It uses 74% of the DSP resources.
- The network uses precision `ap_fixed<20,10>`.

| Actual \ Predicted | e      | $\pi$  |
|--------------------|--------|--------|
| e                  | 98.8 % | 1.2 %  |
| $\pi$              | 2.9 %  | 97.1 % |
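
The confusion matrix translates into the figures of merit usually quoted for e/π separation. A short worked calculation using only the numbers in the table:

```python
# Rows: true particle, columns: predicted particle (fractions from the table)
confusion = {
    "e":  {"e": 0.988, "pi": 0.012},
    "pi": {"e": 0.029, "pi": 0.971},
}

electron_efficiency = confusion["e"]["e"]
pion_misid = confusion["pi"]["e"]
pion_rejection = 1.0 / pion_misid  # pions are suppressed by this factor

print(f"electron efficiency: {electron_efficiency:.1%}")
print(f"pion rejection factor: {pion_rejection:.1f}")
```

In other words, at 98.8% electron efficiency roughly one pion in 34 is misidentified as an electron.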









| Name                | BRAM_18K | DSP48E | FF      | LUT     | URAM |
|---------------------|----------|--------|---------|---------|------|
| DSP                 | -        | -      | -       | -       | -    |
| Expression          | -        | -      | 0       | 2       | -    |
| FIFO                | 404      | -      | 8999    | 15698   | -    |
| Instance            | 61       | 5124   | 55854   | 243846  | -    |
| Memory              | -        | -      | -       | -       | -    |
| Multiplexer         | -        | -      | -       | -       | -    |
| Register            | -        | -      | -       | -       | -    |
| Total               | 465      | 5124   | 64853   | 259546  | 0    |
| Available SLR       | 1440     | 2280   | 788160  | 394080  | 320  |
| Utilization SLR (%) | 32       | 224    | 8       | 65      | 0    |
| Available           | 4320     | 6840   | 2364480 | 1182240 | 960  |
| Utilization (%)     | 10       | 74     | 2       | 21      | 0    |




### hls\_config['Model']['Precision'] = 'ap\_fixed<20,10>'

- Layer prune\_low\_magnitude\_conv\_0: % of zeros = 0.5
- Layer prune\_low\_magnitude\_conv\_1: % of zeros = 0.5
- Layer prune\_low\_magnitude\_conv\_2: % of zeros = 0.5
- Layer prune\_low\_magnitude\_dense\_0: % of zeros = 0.5
- Layer prune\_low\_magnitude\_dense\_1: % of zeros = 0.5
- Layer prune\_low\_magnitude\_output\_dense: % of zeros = 0.5
- Layer prune\_low\_magnitude\_fused\_convbn\_0: % of zeros = 0.0
- Layer prune\_low\_magnitude\_fused\_convbn\_1: % of zeros = 0.0
- Layer prune\_low\_magnitude\_fused\_convbn\_2: % of zeros = 0.0
- Layer prune\_low\_magnitude\_fused\_convbn\_3: % of zeros = 0.0
- Layer prune\_low\_magnitude\_fused\_convbn\_3: % of zeros = 0.0
- Layer prune\_low\_magnitude\_fused\_convbn\_3: % of zeros = 0.0
- Layer prune\_low\_magnitude\_dense\_0: % of zeros = 0.0
- Layer prune\_low\_magnitude\_dense\_0: % of zeros = 0.0
- Layer prune\_low\_magnitude\_dense\_1: % of zeros = 0.0
- Layer prune\_low\_magnitude\_dense\_0: % of zeros = 0.0
- Layer output\_dense: % of zeros = 0.0
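
The per-layer zero fractions above are the kind of output one gets by counting zero-valued weights after magnitude pruning. The following is a dependency-free sketch of that bookkeeping, with toy weight lists standing in for the real layers (the layer names are taken from the log, the values are invented):

```python
def zero_fraction(weights):
    """Fraction of exactly-zero entries in a flat list of weights."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


toy_layers = {
    "prune_low_magnitude_conv_0": [0.0, 0.31, 0.0, -0.12, 0.0, 0.07, 0.0, 0.5],
    "output_dense": [0.2, -0.4, 0.1, 0.9],
}
for name, weights in toy_layers.items():
    print(f"Layer {name}: % of zeros = {zero_fraction(weights)}")
```

A value of 0.5 means half of the layer's weights were pruned to zero, which is what allows the synthesized design to skip the corresponding multiplications.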





# JANA2 for ML on FPGA

Pre-processed data from the FPGA is transferred over the network (TCP/IP) to a computer running JANA2 software.

### JANA4ML4FPGA



### JANA2

(JLab ANAlysis framework)

- JANA2 is a multi-threaded, modular event reconstruction framework being developed at JLab for online and offline processing.

- JANA2 is a rewrite based on modern coding and CS practices, developed for modern NP experiments with streaming readout, heterogeneous computing, and AI.

- JANA2 is the main framework chosen for the EIC. It is used for ePIC collaboration reconstruction and, further on, for Detector 2, as well as in multiple JLab experiments and prototypes.





# Validation software


### JANA4ML4FPGA





### Goals:

- Read and write EVIO
- Write flat ROOT files
- Receive EVIO by TCP (and save)
- Receive network streams
- Receive FPGA data
- Simulate sending detector data
- Data Quality Monitor
- AI streaming preprocessing
- Conventional preprocessing




# Tracking for GlueX experiment

### **GlueX** experiment



- GlueX is a particle physics experiment located at the Thomas Jefferson National Accelerator Facility (JLab) in Newport News, Virginia.
- Its primary purpose is to better understand the nature of confinement in quantum chromodynamics (QCD) by identifying a spectrum of hybrid and exotic mesons generated by the excitation of the gluonic field binding the quarks.
- Hall D is dedicated to operation with a linearly polarized photon beam produced by ~12 GeV electrons from CEBAF at Jefferson Lab.
- Typical L1 trigger rate: 40-70 kHz
- Data rate: 0.7-1.2 GB/s
- L1 trigger latency: 3.5 μs
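
Dividing the data rate by the trigger rate gives a feel for the event size that any FPGA-side processing has to absorb. The pairing of the low ends and the high ends of the two ranges is an assumption; the slide does not say which data rate corresponds to which trigger rate.

```python
def event_size_kb(data_rate_gb_s, trigger_khz):
    """Average event size in kB (decimal units) for a given data and trigger rate."""
    return data_rate_gb_s * 1e9 / (trigger_khz * 1e3) / 1e3


for rate, trig in ((0.7, 40), (1.2, 70)):
    print(f"{rate} GB/s at {trig} kHz -> {event_size_kb(rate, trig):.1f} kB/event")
```

Under that pairing both ends of the range give an average event size of roughly 17 kB.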



# Tracking for GlueX experiment



- The first target for implementing neural-network-based tracking is the Forward Drift Chamber (FDC).
- The GlueX experiment has relatively low occupancy:
  - Number of hits per event: (Q25, Q75, Max) = (50, 70, 558)
  - Number of tracks per event: (Q25, Q75, Max) = (4, 6, 11)
- This, in principle, makes it possible to fit a neural network in existing FPGAs.
- The FDC consists of 4 modules of 6 planes each, providing up to 24 points per track.
  - At the Q75 occupancy of 6 tracks per event: 6 tracks × 24 hits/track = 144 hits.
- The FDC is placed in a magnetic field, so particles follow helical trajectories.
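To illustrate the helical trajectory, the sketch below evaluates the transverse position of a charged particle at a set of z planes. It assumes a uniform field along +z, a particle starting at the origin, and no energy loss or scattering; it uses the standard relation R [m] = pT [GeV/c] / (0.3 · |q| · B [T]). The function name, field value, momenta and plane positions are placeholders chosen for illustration, not GlueX parameters.

```python
import numpy as np

def helix_xy(pt_gev, pz_gev, charge, b_tesla, z_m, phi0=0.0):
    """Transverse position (x, y) of a charged particle at the given z values.

    Assumes a uniform field along +z and a particle starting at the origin
    with initial transverse direction phi0.
    """
    q = np.sign(charge)
    radius = pt_gev / (0.3 * abs(charge) * b_tesla)   # bending radius in metres
    z = np.asarray(z_m, dtype=float)
    # Transverse path length is z * pt/pz; the direction turns by path/radius.
    phi = phi0 - q * z * pt_gev / (pz_gev * radius)
    # Centre of the circle in the transverse plane.
    xc = q * radius * np.sin(phi0)
    yc = -q * radius * np.cos(phi0)
    x = xc - q * radius * np.sin(phi)
    y = yc + q * radius * np.cos(phi)
    return x, y, radius, (xc, yc)

# Placeholder values: pT = 0.3 GeV/c, pz = 1 GeV/c, B = 2 T, four z positions.
x, y, radius, (xc, yc) = helix_xy(0.3, 1.0, +1, 2.0, [0.0, 0.5, 1.0, 1.5])
print(radius, np.round(x, 3), np.round(y, 3))
```

Because the radius scales linearly with pT, halving the transverse momentum halves the bending radius, so low-momentum tracks curve more strongly between modules.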

### <u>Team:</u>

Ahmed Mohammed, Kishansingh Rajput, Simon Taylor, Sergey Furletov, Denis Furletov, Malachi Schram





# **Event Display**



- The FDC geometry, with 6 closely spaced planes per module and large distances between modules, makes it difficult to use a GNN directly for pattern recognition in a magnetic field (see the event display on the right).
- Moreover, a large graph uses too many FPGA resources: more than 150 hits need to be processed.
- Better results are achieved with a two-stage reconstruction:
  - in the first GNN, the track segments in each module are reconstructed and fitted with a straight line,
  - the resulting vectors are then fed into a second GNN to reconstruct the full track.
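The straight-line fit of a segment within one module can be sketched as follows. The example assumes that the first stage has already assigned hits to a segment, and it uses an ordinary least-squares fit, which may differ from the fit actually used. The function name `fit_segment` and the plane positions are hypothetical.

```python
import numpy as np

def fit_segment(z, x):
    """Least-squares line x = x0 + slope * (z - z_ref) through the hits of one segment.

    Returns (x0, slope, z_ref), i.e. the position at the module centre and the slope,
    which together form the vector passed on to the second stage.
    """
    z = np.asarray(z, dtype=float)
    x = np.asarray(x, dtype=float)
    z_ref = z.mean()
    slope, x0 = np.polyfit(z - z_ref, x, 1)   # coefficients: highest power first
    return x0, slope, z_ref

# Six planes of one module (placeholder positions) and hits on an exact line.
z_planes = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x_hits = 2.0 + 0.1 * z_planes
x0, slope, z_ref = fit_segment(z_planes, x_hits)
print(x0, slope, z_ref)
```

Reducing each segment to a position and a slope means the second stage works on at most one vector per track per module instead of on individual hits.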









# Processing with FPGA



- The FDC geometry, with 6 closely spaced planes per module and large distances between modules, makes it difficult to use a GNN directly for pattern recognition in a magnetic field (see the event display on the right).
- Better results are achieved with a two-stage reconstruction:
  - in the first GNN, the track segments in each module are reconstructed and fitted with a straight line,
  - the resulting vectors are then fed into a second GNN to reconstruct the full track.
- In this way, the FDC modules are processed in parallel and the FPGA resource usage is significantly reduced.



### Reconstruction of track segments in FDC






# **GNN tracking performance**



- The bottom left figure shows the efficiency of segment reconstruction.
- The bottom right figure shows the efficiency of full track reconstruction.
- The relatively low efficiency for full tracks is explained by the presence of low-momentum, and hence high-curvature, tracks, for which a single projection is not efficient (top right).
- In the future we plan to work in 3D.
- For now we will move forward with the implementation of the current 2D model on the FPGA.








# New GNN for FDC tracking



- The results shown look good, but the network is still limited to 30 hits (nodes), while the FDC requires at least 100 nodes.
- We started designing a new GNN capable of handling 150 nodes and 256 edges.
- **The new GNN design uses the layer library from HLS4ML with a custom wrapper and aggregation functions.**
- We also removed all dependencies on the external libraries HEP.TrkX and Sonnet (from DeepMind).
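To make the fixed capacity of 150 nodes and 256 edges concrete, the NumPy sketch below runs one edge-scoring and node-update iteration on arrays padded to those sizes, since FPGA firmware needs static array dimensions. It is only an illustration under stated assumptions, not the HLS4ML-based design described above: the number of features per node, the single-layer networks, the random untrained weights, and all function names are placeholders.

```python
import numpy as np

MAX_NODES, MAX_EDGES = 150, 256   # capacities quoted on the slide
N_FEAT = 3                        # assumed number of features per hit

rng = np.random.default_rng(0)
W_EDGE = rng.normal(size=2 * N_FEAT)              # placeholder, untrained weights
W_NODE = rng.normal(size=(3 * N_FEAT, N_FEAT))

def edge_network(nodes, src, dst, edge_mask):
    """Score every edge from the features of its two end nodes (sigmoid output)."""
    pair = np.concatenate([nodes[src], nodes[dst]], axis=1)
    return edge_mask / (1.0 + np.exp(-(pair @ W_EDGE)))   # padded edges score 0

def node_network(nodes, src, dst, scores):
    """Update node features from score-weighted sums over connected edges."""
    msg_in = np.zeros_like(nodes)
    msg_out = np.zeros_like(nodes)
    np.add.at(msg_in, dst, scores[:, None] * nodes[src])   # aggregate into receivers
    np.add.at(msg_out, src, scores[:, None] * nodes[dst])  # aggregate into senders
    return np.tanh(np.concatenate([msg_in, msg_out, nodes], axis=1) @ W_NODE)

# Toy event: 4 hits and 3 edges; everything else is zero padding.
nodes = np.zeros((MAX_NODES, N_FEAT))
nodes[:4] = [[0.1, 0.0, 0.0], [0.2, 0.1, 0.1], [0.3, 0.2, 0.2], [0.4, 0.3, 0.3]]
src = np.zeros(MAX_EDGES, dtype=int)
dst = np.zeros(MAX_EDGES, dtype=int)
edge_mask = np.zeros(MAX_EDGES)
src[:3], dst[:3], edge_mask[:3] = [0, 1, 2], [1, 2, 3], 1.0

scores = edge_network(nodes, src, dst, edge_mask)
nodes = node_network(nodes, src, dst, scores)
scores = edge_network(nodes, src, dst, edge_mask)   # edge scores after one iteration
print(scores[:3])
```

The mask keeps padded edges from contributing to the aggregation, so the same fixed-size computation can serve events with any number of hits up to the capacity.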



### **Optimized GNN IP**



- The GlueX trigger rate is up to 70 kHz, so on average we have ~14 µs to process each event.
- We optimized the GNN to a latency of ~10 µs, which allows it to operate at 70 kHz.
- At the same time, the neural network fits in an FPGA and supports 150 nodes and 256 edges.
- Next, we plan to test it on hardware.

| Modules & Loops | Issue Type | Slack | Latency (cycles) | Latency (ns) | Iteration Latency | Interval | Trip Count | Pipelined | BRAM (%) | DSP (%) | FF (%) | LUT (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| runGraphNetwork (6) | | -0.24 | 1991 | 9.955E3 | - | 1992 | - | no | ~0 | 11 | 10 | 25 |
| edge_network (1) | | | 271 | 1.355E3 | | 271 | | no | ~0 | 3 | ~0 | 3 |
| node_runner (1) | ⚠ | | 363 | 1.815E3 | | 363 | | no | 0 | 5 | 7 | 21 |
| runGraphNetwork_Pipeline_INPUT_HIT_LOOP (1) | | | 159 | 795.000 | | 159 | | no | 0 | 2 | 1 | ~0 |
| runGraphNetwork_Pipeline_VITIS_LOOP_42_1 (1) | | | 302 | 1.510E3 | | 302 | | no | 0 | 0 | ~0 | ~0 |
| runGraphNetwork_Pipeline_VITIS_LOOP_72_2 (1) | | | 514 | 2.570E3 | | 514 | | no | 0 | 0 | ~0 | ~0 |
| runGraphNetwork_Pipeline_VITIS_LOOP_136_3 (1) | | - | 258 | 1.290E3 | - | 258 | - | no | 0 | 0 | ~0 | ~0 |
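The timing budget can be checked directly from the top-level `runGraphNetwork` row of the report. The short calculation below derives the clock period from the reported cycles and nanoseconds, then compares the latency and initiation interval with the mean time between triggers at 70 kHz. The last line only reports how many top-level cycles are not covered by the six listed sub-blocks if each were executed exactly once in sequence, which the report does not guarantee.

```python
# Numbers taken from the synthesis report (top-level runGraphNetwork row).
LATENCY_CYCLES = 1991
LATENCY_NS = 9955.0
INTERVAL_CYCLES = 1992
TRIGGER_RATE_HZ = 70e3            # maximum GlueX L1 rate quoted in the text

clock_ns = LATENCY_NS / LATENCY_CYCLES          # clock period implied by the report
budget_us = 1e6 / TRIGGER_RATE_HZ               # mean time between triggers
latency_us = LATENCY_NS / 1e3
interval_us = INTERVAL_CYCLES * clock_ns / 1e3  # time before the next event can start
max_rate_khz = 1e3 / interval_us                # sustainable event rate

# Latency in cycles of the sub-blocks listed in the report.
sub_blocks = {
    "edge_network": 271,
    "node_runner": 363,
    "INPUT_HIT_LOOP": 159,
    "VITIS_LOOP_42_1": 302,
    "VITIS_LOOP_72_2": 514,
    "VITIS_LOOP_136_3": 258,
}
unattributed = LATENCY_CYCLES - sum(sub_blocks.values())

print(f"clock {clock_ns:.1f} ns, budget {budget_us:.1f} us, latency {latency_us:.2f} us")
print(f"interval {interval_us:.2f} us -> max rate {max_rate_khz:.1f} kHz")
print(f"cycles outside the listed sub-blocks: {unattributed}")
```

With a 5 ns clock, the interval of 1992 cycles corresponds to about 100 kHz, which leaves some margin above the 70 kHz trigger rate.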

### Outlook



- An FPGA-based neural-network application would offer online event preprocessing and enable physics-based data reduction at an early stage of data processing.
- The ML-on-FPGA solution complements the purely computer-based solution and mitigates DAQ performance risks.
- An **FPGA** provides extremely low-latency neural-network inference.
- The open-source HLS4ML software tool, together with Xilinx<sup>®</sup> Vivado<sup>®</sup> High-Level Synthesis (HLS), accelerates the development of machine-learning neural-network algorithms.
- The ultimate goal is to build a real-time event filter based on physics signatures.



### Published in 2007

Measurement of multijet events at low $x_{Bj}$ and low $Q^2$ with the ZEUS detector at HERA

T. Gosau








# Backup


### Xilinx VPK180 board






# ADC based DAQ for PANDA STT



### Level 0 Open VPX Crate


ADC-based DAQ for the PANDA STT (one of the approaches):

- 160 channels (shaping, sampling and processing) per payload slot; 14 payload slots + 2 controllers;
- about 2200 channels per crate in total;
- time-sorted output data stream (arrival time, energy, ...);
- noise rejection, pile-up resolution, baseline correction, ...
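A time-sorted output stream can be pictured as a merge of per-channel hit streams that are each already ordered in time. The sketch below shows that idea in software using the standard-library `heapq.merge`. The `Hit` record and its field names are hypothetical, and the actual firmware implementation is not described in the slides.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple

class Hit(NamedTuple):
    """Hypothetical hit record: arrival time, channel number, energy."""
    time_ns: float
    channel: int
    energy: float   # arbitrary units

def time_sorted(streams: Iterable[Iterable[Hit]]) -> Iterator[Hit]:
    """Merge per-channel streams, each already ordered in time, into one ordered stream."""
    return heapq.merge(*streams, key=lambda h: h.time_ns)

ch0 = [Hit(10.0, 0, 1.2), Hit(42.0, 0, 0.7)]
ch1 = [Hit(5.0, 1, 0.9), Hit(30.0, 1, 2.1)]
merged = list(time_sorted([ch0, ch1]))
print([h.time_ns for h in merged])   # [5.0, 10.0, 30.0, 42.0]
```

The merge is lazy, so it only needs to hold one pending hit per channel at a time.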







- All information from the straw tube tracker is processed in one unit.
- This allows a complete STT event to be built.
- This unit can also be used for calorimeter readout and processing.



https://doi.org/10.1088/1748-0221/17/04/C04022 (2022 JINST 17 C04022)







### Step 2a: Iterations (edge)








### Step 3: Final edge output





### Simple Overview





## Event display, single track





## **Tracking performance**





- Top rows: ionization along the track in the GEMTRD detector.
  - <u>Red circles</u> are clusters reconstructed with a dE/dx threshold; the circle size is proportional to the energy.
- Middle rows: after the noisy clusters are filtered out, the cluster coordinates are sent to the FPGA/GNN for pattern recognition.
- Bottom rows: the GNN labels the clusters (shown by color in the figure); clusters with the same color belong to the same track.
- Clusters with the same color (tag) are then sent to the track-fitting module (LSTM).
- The track-fitting results are shown as lines in the figures.
- The next step is to sum all the ionization in a corridor around the track and send it to the PID module (DNN).
- As a bonus, the GEMTRD provides a track segment for the global tracking system.
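The corridor step can be sketched as a simple selection around the fitted line. The example below assumes a straight-line track x = x0 + slope · z and measures the distance of each cluster along x at fixed z rather than perpendicular to the track. The function name, the corridor half-width and the cluster values are placeholders for illustration.

```python
import numpy as np

def corridor_energy(z, x, energy, x0, slope, half_width):
    """Sum the energy of clusters within +/- half_width of the line x = x0 + slope * z.

    Returns the summed energy and the boolean mask of selected clusters.
    """
    z, x, energy = (np.asarray(a, dtype=float) for a in (z, x, energy))
    residual = np.abs(x - (x0 + slope * z))   # distance along x at each cluster's z
    inside = residual <= half_width
    return float(energy[inside].sum()), inside

# Four clusters; the third one lies far from the track and is excluded.
z = [0.0, 1.0, 2.0, 3.0]
x = [0.0, 0.1, 1.5, 0.3]
e = [1.0, 2.0, 5.0, 3.0]
total, inside = corridor_energy(z, x, e, x0=0.0, slope=0.1, half_width=0.2)
print(total, inside)
```

The summed energy, or the list of selected clusters, is what would then be handed to the PID stage.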




