

# **GEMTRD** readout

Sergey Furletov Jefferson Lab

on behalf of the eRD22 group

Joint Streaming readout VII meeting

18 Nov 2020


### Outline


- Experimental setup
- Readout electronics
- Offline PID analysis with ML (ROOT/TMVA)
- Moving on to FPGA (\*)
- Outlook

(\*) Field Programmable Gate Array

### **GEMTRD** at **EIC**





# GEM-TRD prototype



- A test module was built at the University of Virginia.
- The prototype GEM-TRD/T module has a size of 10 cm × 10 cm, corresponding to a total of 512 readout channels for the X/Y coordinates.
- The readout is based on a flash ADC system developed at JLab (fADC125) with 125 MHz sampling.
- The GEM-TRD provides e/hadron separation and tracking.










### Beam setup at JLab Hall-D



Tests were carried out using electrons with an energy of 3-6 GeV, produced in the converter of the pair spectrometer upstream of the GlueX detector.



### Readout electronics for GEMTRD








### **GEMTRD** clusters on the track



### GEM-TRD can work as a mini TPC, providing 3D track segments




### GAS-II preamp and shaper






### **Readout electronics**



| Readout                           | Sampling (MHz) | ns/bin | Peaking time  | Pipeline / stream  | Channels per chip, cost                    | ADC bits | Remarks                                                                         |
|-----------------------------------|----------------|--------|---------------|--------------------|--------------------------------------------|----------|---------------------------------------------------------------------------------|
| FADC125 + GAS-II preamp (JLab)    | 125            | 8      | 30 ns         | 8 μs or stream     | \$50/channel                               | 12       | External preamps (GAS-II): undershoot, no baseline restorer                     |
| VMM3 (ATLAS)                      | 4              | 250    | 25-200 ns     |                    | 64 channels/chip                           | 10       | L0 or continuous                                                                |
| SAMPA (ALICE)                     | 10-20          | 100-50 | 80 ns, 160 ns | Stream, 3.2 Gbit/s | 32 channels/chip, \$30/chip, \$1/channel   | 10       | 500 ns return to baseline; baseline restorer; DSP (zero suppression, threshold) |
| ALPHACORE                         |                |        |               |                    |                                            |          |                                                                                 |
| Minimal requirements              | 80             |        | 30 ns         | Stream             |                                            | 10       |                                                                                 |
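
The ns/bin column follows directly from the sampling rate. The short sketch below makes that arithmetic explicit; the function names are hypothetical and chosen only for illustration. With it, the 80 MHz minimal requirement corresponds to 12.5 ns per bin, and an 8 μs pipeline at 125 MHz holds 1000 samples.

```cpp
#include <cassert>

// Width of one ADC sample in nanoseconds for a sampling rate given in MHz.
double ns_per_bin(double sampling_mhz) {
    assert(sampling_mhz > 0.0);
    return 1000.0 / sampling_mhz;
}

// Number of samples that fit into a readout window of the given length.
// (microseconds times MHz gives a sample count; round to the nearest integer)
int samples_in_window(double sampling_mhz, double window_us) {
    assert(sampling_mhz > 0.0 && window_us >= 0.0);
    return static_cast<int>(window_us * sampling_mhz + 0.5);
}
```

For example, `ns_per_bin(4.0)` reproduces the 250 ns/bin quoted for the VMM3 row.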




### SAMPA ASIC



- The SAMPA chip works very well with a regular GEM for tracking.
- For the GEM-TRD, its integration time is too long.



### ALICE TPC upgrade and the SAMPA ASIC

The fastest peaking time of 80 ns is too slow for cluster separation and counting.



### Data analysis



- TR photons are emitted forward at a small angle, within $1/\gamma$, practically along the path of the original particle, and are detected together with the dE/dx of the particle.
- Several methods are used to discriminate TR photons from the dE/dx of the particle:
  1. <u>Cluster counting method</u>
     - Uses one threshold on the ionization amplitude (just above the average dE/dx), assuming that the energy deposition from a TR photon is point-like and produces a cluster with high amplitude. The method is widely used with straw-based TRDs.
  2. Total energy deposition
  3. <u>Separation in space</u>
     - Requires a high-resolution detector (silicon pixels) to see the natural angular distribution of the TR photons, or a magnetic field to deflect the particle away from the TR photons.
  4. For measurements of <u>ionization along the track</u>, likelihood or neural network methods can be used to separate electrons and pions.
- For this test we used the ionization along the track and a neural network (machine learning).
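
As an illustration of the cluster counting method listed above, the following is a minimal sketch, assuming a pedestal-subtracted waveform stored as integer ADC samples. The function name, the sample values and the threshold are hypothetical and do not come from the test-beam analysis.

```cpp
#include <cassert>
#include <vector>

// Count clusters in a pedestal-subtracted waveform.
// A cluster is a run of consecutive samples strictly above the threshold.
int count_clusters(const std::vector<int>& samples, int threshold) {
    assert(threshold >= 0);
    int clusters = 0;
    bool inside = false;  // true while we are inside a run above threshold
    for (int s : samples) {
        const bool above = s > threshold;
        if (above && !inside) ++clusters;  // rising edge starts a new cluster
        inside = above;
    }
    return clusters;
}
```

A track with more high-amplitude clusters than expected from dE/dx alone would then be treated as a candidate for carrying TR photons.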

### Input parameters for ML





### **GEMTRD** offline analysis





- For data analysis we used the neural network library provided by the ROOT/TMVA package: MultiLayerPerceptron (MLP).
- All data were divided into two samples: training and test.
- The top right plot shows the neural network output for a single module:
  - Red: electrons with radiator
  - Blue: electrons without radiator

### **NN** input parameters






### Moving forward



- Offline analysis using ML looks promising.
- Can it be done in real time?
- Here are some of the possible solutions:
  - Computer farm
  - CPU + GPU
  - CPU + FPGA
  - FPGA only
- Steps for beginners to implement an FPGA solution:
  - Select an FPGA suitable for ML applications.
  - Export the offline-trained neural network (NN) from ROOT to a C++ file.
  - Convert the logical topology of the NN coded in C++ to an RTL structure for the FPGA in VHDL or Verilog.
  - Optimize the NN for use in the FPGA.
  - Create an I/O interface and configure the FPGA.
  - Perform the test with hardware.
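
To give an idea of what an exported network looks like as plain C++, here is a minimal sketch of a one-hidden-layer perceptron. It is not the code generated by ROOT: the struct, the function names, the hidden-layer size and the sigmoid activation are all assumptions made for illustration. Only the number of inputs (10) follows the test code shown later in this talk.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kInputs = 10;  // input parameters per track
constexpr std::size_t kHidden = 2;   // hypothetical hidden-layer size

// Weights and biases of a tiny MLP (values would come from offline training).
struct TinyMlp {
    std::array<std::array<float, kInputs>, kHidden> w1;
    std::array<float, kHidden> b1;
    std::array<float, kHidden> w2;
    float b2;
};

inline float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Forward pass: inputs -> hidden layer -> single output in (0, 1).
float evaluate(const TinyMlp& net, const std::array<float, kInputs>& x) {
    float out = net.b2;
    for (std::size_t j = 0; j < kHidden; ++j) {
        float sum = net.b1[j];
        for (std::size_t i = 0; i < kInputs; ++i) sum += net.w1[j][i] * x[i];
        out += net.w2[j] * sigmoid(sum);
    }
    return sigmoid(out);
}

// Classify a track by cutting on the network output (cut value is a free choice).
bool is_electron_like(float nn_output, float cut) {
    assert(cut > 0.0f && cut < 1.0f);
    return nn_output > cut;
}
```

All loop bounds are compile-time constants and no memory is allocated dynamically, which is the shape of code that maps most directly onto FPGA logic.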




### **FPGA** structure





- The first FPGAs had only programmable gates and routing, hence the name Field Programmable Gate Array.
- Such a device can perform logical operations in parallel using LUTs and FFs.
- The math operations required by a neural network are a problem for this kind of fabric.



Image from: https://www.embeddedrelated.com/showarticle/195.php

### Modern FPGA



- Modern FPGAs have DSP slices: specialized hardware blocks, placed between the gates and the routing, that perform mathematical calculations.
- The number of DSP slices can reach 6000-12000 per chip.
- In addition, they often have ARM cores implemented using non-programmable gates.



Modern FPGA: lots of hard, not-field-programmable gates

Image from: https://www.embeddedrelated.com/showarticle/195.php

### Xilinx Virtex<sup>®</sup> UltraScale+™



- At this early stage of the project we use standard Xilinx evaluation boards as the hardware for testing ML algorithms on FPGA, rather than developing a customized FPGA board. These boards have functions and interfaces sufficient for a proof of principle of ML-FPGA.
- First, the Xilinx evaluation board carries the Xilinx XCVU9P FPGA with 6,840 DSP slices. Each slice includes a hardwired, optimized multiply unit, and collectively they offer a peak theoretical performance in excess of 1 tera multiplications per second.
- Second, the internal organization can be optimized for the specific computational problem. The internal data processing architecture can support deep computational pipelines offering high throughput.
- Third, the FPGA supports high-speed I/O interfaces, including Ethernet and 180 high-speed transceivers that can operate in excess of 30 Gbps.

Featuring the Virtex® UltraScale+™ XCVU9P-L2FLGA2104E FPGA
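
The quoted peak performance can be checked with simple arithmetic, assuming each DSP slice completes one multiplication per clock cycle. The sketch below is only that estimate; the function name is hypothetical and the 250 MHz clock in the note is an assumption (it corresponds to a 4 ns clock period).

```cpp
#include <cassert>

// Peak multiplications per second if every DSP slice completes one
// multiplication per clock cycle.
double peak_multiplications_per_second(int dsp_slices, double clock_mhz) {
    assert(dsp_slices >= 0 && clock_mhz > 0.0);
    return dsp_slices * clock_mhz * 1.0e6;
}
```

Under this assumption, 6,840 slices at 250 MHz give about 1.7 tera multiplications per second, and the 1 tera mark is passed at a clock of roughly 146 MHz.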

### Xilinx Virtex<sup>®</sup> UltraScale+™

The XCVU9P evaluation board includes a software license (node-locked and device-locked) with 1 year of updates.




# Xilinx High-Level Synthesis



The Xilinx Vivado HLS (High-Level Synthesis) tool provides a higher level of abstraction for the user: it synthesizes functions written in C or C++ into IP blocks by generating the appropriate low-level VHDL and Verilog code. Those blocks can then be integrated into a real hardware system.
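
As a hedged illustration of the kind of C++ function such a tool takes as input, the sketch below computes a weighted sum with fixed-size arrays and a constant loop bound. The function and constant names are hypothetical. `#pragma HLS UNROLL` is a Vivado HLS directive; an ordinary compiler ignores it (possibly with a warning), so the same source can also be tested on a CPU.

```cpp
#include <cassert>

constexpr int kN = 10;  // number of inputs, fixed at compile time

// Weighted sum of kN inputs, written the way HLS tools prefer:
// fixed-size arrays, a loop with a constant trip count, no dynamic memory.
float weighted_sum(const float x[kN], const float w[kN], float bias) {
    assert(x != nullptr && w != nullptr);
    float acc = bias;
    for (int i = 0; i < kN; ++i) {
        // Ask the HLS tool to replicate the loop body in hardware.
#pragma HLS UNROLL
        acc += w[i] * x[i];
    }
    return acc;
}
```

Whether unrolling pays off depends on the DSP budget and the latency target; the directive here only marks where such a choice would be made.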

[Screenshot: Vivado HLS 2019.1 synthesis report (`trdann_csynth.rpt`) for project `trd_ann`, solution1, target device xcvu9p-flga2104-2L-e. Clock `ap_clk`: target 4.00 ns, estimated 3.466 ns, uncertainty 0.50 ns. Latency: 15 to 381 clock cycles. The utilization estimates are tabulated under "Vivado implementation report". Labels `@par0` to `@par9` appear in the figure.]

The C/C++ code of the trained network is used as input for Vivado HLS.



### Xilinx HLS: C++ to Verilog






### ML FPGA Core for TRD



- Using HLS significantly decreases development time, at the cost of less efficient use of FPGA resources.




### Vivado implementation report



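#### Performance Estimates

Timing (ns): clock `ap_clk`, target 4.00, estimated 3.466, uncertainty 0.50.

Latency (clock cycles): min 15, max 381 (interval: min 15, max 381).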

Initial latency estimate: from 60 ns to 1.5 μs.
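
This range follows from the HLS synthesis report shown earlier in this talk, which lists a 4.00 ns clock target and a latency of 15 to 381 clock cycles. The conversion is sketched below; the function name is hypothetical.

```cpp
#include <cassert>

// Convert a latency in clock cycles to nanoseconds for a given clock period.
double latency_ns(int cycles, double clock_period_ns) {
    assert(cycles >= 0 && clock_period_ns > 0.0);
    return cycles * clock_period_ns;
}
```

With a 4 ns period, 15 cycles give 60 ns and 381 cycles give 1524 ns, that is about 1.5 μs.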

#### Summary

| Name                | BRAM_18K | DSP48E | FF      | LUT     | URAM |
|---------------------|----------|--------|---------|---------|------|
| DSP                 | -        | 7      | -       | -       | -    |
| Expression          | -        | 40     | 40      | 8082    | -    |
| FIFO                | -        | -      | -       | -       | -    |
| Instance            | 510      | 1415   | 142176  | 199915  | -    |
| Memory              | -        | -      | -       | -       | -    |
| Multiplexer         | -        | -      | -       | 181     | -    |
| Register            | -        | -      | 2350    | -       | -    |
| Total               | 510      | 1462   | 144566  | 208178  | 0    |
| Available           | 4320     | 6840   | 2364480 | 1182240 | 960  |
| Available SLR       | 1440     | 2280   | 788160  | 394080  | 320  |
| Utilization (%)     | 11       | 21     | 6       | 17      | 0    |
| Utilization SLR (%) | 35       | 64     | 18      | 52      | 0    |
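
The utilization rows can be reproduced from the "Total" and "Available" rows. The sketch below assumes the percentages are truncated to whole numbers, which matches every entry in the table; the function name is hypothetical.

```cpp
#include <cassert>

// Resource utilization as a whole-number percentage, truncated toward zero.
int utilization_percent(long used, long available) {
    assert(used >= 0 && available > 0);
    return static_cast<int>(100 * used / available);
}
```

For example, 1462 of 6840 DSP48E slices give 21 % of the whole device, and 1462 of the 2280 slices in a single SLR give 64 %.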

[Figure: device floorplan of the XCVU9P, showing clock regions X0Y0 to X5Y14 across the three super logic regions SLR0, SLR1 and SLR2.]


### Test ML FPGA



### Test tools

1. Vivado SDK
2. PetaLinux





### C++ code for test

```cpp
// Excerpt: params, data and out0 are declared elsewhere in the test program.
XTrdann ann;  // create an instance of the ML core

int ret = XTrdann_Initialize(&ann, 0);
xil_printf(" XTrdann Initialize =%d \n\r", ret);
XTrdann_Start(&ann);
xil_printf(" XTrdann Started \n\r");

for (int i = 0; i < 8; i++) {
    // Copy the 10 input parameters of event i; column 10 holds the reference output.
    for (int k = 0; k < 10; k++) params[k] = data[i][k];
    out0 = data[i][10];
    ann_stat(&ann);

    // Write the inputs to the core, start it and wait until the result is ready.
    int offset = 0;
    int retw = XTrdann_Write_input_r_Words(&ann, offset, (u32*)&params[0], 10);
    xil_printf("Set Input ret=%d \n\r", retw);
    XTrdann_Set_index(&ann, 0);
    XTrdann_Start(&ann);
    while (!XTrdann_IsReady(&ann)) ann_stat(&ann);
    ann_stat(&ann);

    // Split the floats into whole part and thousandths for integer-only printing.
    int h1 = out0;
    int d1 = (out0 - h1) * 1000;
    float *xout;  // *xin0, *xin1, *xin2;
    u32 iout = XTrdann_Get_return(&ann);
    xout = (float*)&iout;
    int whole = *xout;
    int thousandths = (*xout - whole) * 1000;
    if (whole == 0 && thousandths < 0)
        xil_printf("xout=-%d.%03d out0=%d.%03d\n\r", whole, -thousandths, h1, d1);
    else
        xil_printf("xout=+%d.%03d out0=%d.%03d\n\r", whole, thousandths, h1, d1);

    //u32 in0 = XTrdann_Get_in0(&ann); xin0 = (float*) &in0; int hin0 = *xin0; int din0=(*xin0-hin0)*1000;
    //u32 in1 = XTrdann_Get_in1(&ann); xin1 = (float*) &in1; int hin1 = *xin1; int din1=(*xin1-hin1)*1000;
    //u32 in2 = XTrdann_Get_in2(&ann); xin2 = (float*) &in2; int hin2 = *xin2; int din2=(*xin2-hin2)*1000;
    //xil_printf(" XTrdann in0=%d.%03d", hin0,din0);
    //xil_printf(" in1=%d.%03d ",hin1,din1);
    //xil_printf(" in2=%d.%03d ",hin2,din2);
    xil_printf(" ev=%d out=%d.%03d out0=%d.%03d\n\r", i, whole, thousandths, h1, d1);
}
```
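
The test loop reads the network output as a 32-bit word, reinterprets it as a float and prints it as a whole part plus thousandths, presumably because the print routine is used without floating-point format support. The following is a standalone sketch of the same idea for a desktop compiler, not the code used on the board: the helper names are hypothetical, `std::memcpy` replaces the pointer cast, and the sign is handled separately from the digits.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Reinterpret a 32-bit register value as an IEEE-754 float without
// pointer casts (memcpy keeps this well-defined in C++).
float bits_to_float(std::uint32_t bits) {
    float value;
    static_assert(sizeof(value) == sizeof(bits), "float must be 32 bits");
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Format a float as sign, whole part and thousandths using integer
// conversions only, for print routines without %f support.
std::string format_thousandths(float value) {
    const char sign = value < 0.0f ? '-' : '+';
    const float magnitude = value < 0.0f ? -value : value;
    assert(magnitude < 2.0e9f);  // whole part must fit into an int
    const int whole = static_cast<int>(magnitude);
    const int thousandths = static_cast<int>((magnitude - whole) * 1000.0f);
    char buffer[32];
    std::snprintf(buffer, sizeof(buffer), "%c%d.%03d", sign, whole, thousandths);
    return buffer;
}
```

For instance, the register value `0x3F400000` decodes to 0.75, and `format_thousandths` prints values truncated, not rounded, to three decimal places.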

# hls4ml software



- **hls4ml** is a software package for creating HLS implementations of neural networks
- Supports common layer architectures and model software
- Highly customizable output for different latency and size needs
- Simple workflow to allow quick translation to HLS

https://fastmachinelearning.org/hls4ml/



### Optimization with hls4ml package

The hls4ml package builds on High-Level Synthesis (HLS) to implement machine learning models in FPGAs.
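
One typical optimization when moving a network to an FPGA is to replace floating-point numbers with fixed-point numbers of limited width. The talk does not state which precision was used, so the sketch below is only an illustration with hypothetical names. It rounds to the nearest grid point and saturates at the range limits, which is a simplification and not the behaviour of any particular HLS fixed-point type.

```cpp
#include <cassert>
#include <cmath>

// Round a value to a signed fixed-point grid with total_bits bits, of which
// integer_bits (including the sign bit) lie in front of the binary point.
// Values outside the representable range saturate.
double quantize_fixed(double value, int total_bits, int integer_bits) {
    assert(total_bits > 0 && total_bits <= 32);
    assert(integer_bits >= 1 && integer_bits <= total_bits);
    const double step = std::ldexp(1.0, -(total_bits - integer_bits));
    const double max_value = std::ldexp(1.0, integer_bits - 1) - step;
    const double min_value = -std::ldexp(1.0, integer_bits - 1);
    const double rounded = std::round(value / step) * step;
    if (rounded > max_value) return max_value;
    if (rounded < min_value) return min_value;
    return rounded;
}
```

With 16 bits in total and 6 integer bits, the step is 1/1024 and the range is from -32 to just below 32; comparing network outputs before and after such quantization shows how much precision the chosen width costs.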




# Combined PID / Filter




- An FPGA-based Neural Network application would offer online event preprocessing and allow for data reduction based on physics at the early stage of data processing.
- The open-source hls4ml software tool, together with Xilinx<sup>®</sup> Vivado<sup>®</sup> High-Level Synthesis (HLS), accelerates the development of machine learning neural network algorithms.



# Backup





### **GEMTRD** signals with Fadc125






### Xilinx Vivado



Vivado Design Suite is a software suite produced by Xilinx for synthesis and analysis of HDL designs.




# Motivation ML on FPGA




- The growing computational power of modern FPGA boards allows us to add more sophisticated algorithms for real time data processing.
- Many tasks could be solved using modern Machine Learning (ML) algorithms which are naturally suited for FPGA architectures.


Level 1 works with regional and sub-detector trigger primitives.

Using ML on FPGA, many tasks from Level 2 and/or Level 3 can be performed at Level 1.



### GEMTRD in front of DIRC



