WEBVTT

1
00:09:53.840 --> 00:10:00.380
James (Jim) Ang: Let's see, Paul, I know we're at the top of the hour. Should we go ahead and get started? I know we're also recording.

2
00:10:01.730 --> 00:10:08.540
Paul McIntyre: Yes, we should get started. Can you, can you hear me?

3
00:10:08.830 --> 00:10:10.399
James (Jim) Ang: I can hear you fine.

4
00:10:10.590 --> 00:10:12.479
James (Jim) Ang: Okay, yes, let's see.

5
00:10:14.230 --> 00:10:15.260
Paul McIntyre: Thanks. Thanks, Jim.

6
00:10:15.510 --> 00:10:16.390
James (Jim) Ang: Okay.

7
00:10:18.510 --> 00:10:23.890
James (Jim) Ang: All right, well, welcome to, all of the Meerkat,

8
00:10:24.050 --> 00:10:42.780
James (Jim) Ang: PIs and their personnel, their extended team, for the research center. Today we have an opportunity to give an overview of the Decode project, and many of the,

9
00:10:43.220 --> 00:10:56.749
James (Jim) Ang: I guess the leadership team for our project is online, and we'll be sharing the presentation. I'm going to kick off with a few introductory slides, and

10
00:10:56.750 --> 00:11:03.349
James (Jim) Ang: And I think each of our thrust leads will take over from me for sharing their slides, so…

11
00:11:03.690 --> 00:11:09.829
James (Jim) Ang: Let me… let me see if I can find the link for sharing.

12
00:11:33.110 --> 00:11:36.350
James (Jim) Ang: Okay, are you all seeing this in presentation mode?

13
00:11:37.710 --> 00:11:38.410
Ryan Coffee: Yep.

14
00:11:38.710 --> 00:11:51.020
James (Jim) Ang: All right, very good. So, for this overview, Antonino and I will give kind of an introduction to the democratization of co-design, and

15
00:11:51.210 --> 00:12:00.930
James (Jim) Ang: Then to present, our Thrust 1, David Brooks from Harvard will, will be giving that. And, for Thrust 2…

16
00:12:01.140 --> 00:12:11.009
James (Jim) Ang: I know John Leidel is on. I guess I wasn't sure if I saw Andreas Olofsson, but one or both of them can present Thrust 2.

17
00:12:11.140 --> 00:12:16.210
James (Jim) Ang: Luca Carloni will, from Columbia, will present Thrust 3.

18
00:12:16.730 --> 00:12:23.030
James (Jim) Ang: And, as shown in this graphic, Thrust 1 is co-design of energy-efficient

19
00:12:23.660 --> 00:12:25.330
James (Jim) Ang: System in a package.

20
00:12:25.560 --> 00:12:32.039
James (Jim) Ang: Technology. Thrust 2 is open source co-design tools.

21
00:12:32.670 --> 00:12:35.890
James (Jim) Ang: Thrust 3 is prototyping and chiplet integration.

22
00:12:37.320 --> 00:12:39.550
James (Jim) Ang: Let me start off by just describing

23
00:12:39.800 --> 00:12:44.199
James (Jim) Ang: really the core of our objective for Decode.

24
00:12:46.180 --> 00:12:58.909
James (Jim) Ang: I won't read all the text here for you, but basically, our approach is to democratize co-design through the realization that design of R&D prototype hardware

25
00:12:59.100 --> 00:13:05.060
James (Jim) Ang: is analogous to and can apply lessons from the development of R&D prototype software.

26
00:13:05.380 --> 00:13:11.499
James (Jim) Ang: We leverage open source hardware design, generation, and analysis tools.

27
00:13:11.800 --> 00:13:23.669
James (Jim) Ang: That can be used to target prototype and low-volume fabrication versus production. And this is an important concept for us to be aware of.

28
00:13:23.950 --> 00:13:27.400
James (Jim) Ang: We try to delay the use of commercial EDA tools.

29
00:13:27.540 --> 00:13:40.150
James (Jim) Ang: Because they are expensive, and they introduce barriers with vendor-proprietary design data. And delay that until the conceptual designs, the research

30
00:13:40.210 --> 00:13:54.389
James (Jim) Ang: That we pursue shows promise that hardware design approaches are ready for, non-recurring engineering investments, and that's where, you know, costs really significantly ramp up.

31
00:13:55.210 --> 00:14:09.680
James (Jim) Ang: We try to leverage CHIPS Act investments that are really establishing an infrastructure for hardware prototyping, and things like an open chiplet ecosystem that helps us focus on domain-specialized hardware design and generation.

32
00:14:09.680 --> 00:14:17.649
James (Jim) Ang: While we're using, chiplet designs for conventional general-purpose processors, memory interfaces, network interfaces, etc.

33
00:14:18.020 --> 00:14:27.240
James (Jim) Ang: We introduced this concept. It's actually an old concept from the product world of MVP, Minimum Viable Product.

34
00:14:27.330 --> 00:14:40.919
James (Jim) Ang: And the idea here is to use MVP test hardware to provide concrete measures of performance improvement, power and energy utilization, where we're exercising our concepts with real application workloads.

35
00:14:42.960 --> 00:14:58.829
James (Jim) Ang: In an MVP, a lot of the focus is on core functionality, where we're targeting novel or innovative architecture designs, and for us, that really is based on heterogeneous computing concepts.

36
00:14:59.080 --> 00:15:05.830
James (Jim) Ang: The most important feedback for us comes from, early adopters, and…

37
00:15:05.960 --> 00:15:16.970
James (Jim) Ang: The most critical early adopters for us are software developers. Software developers for algorithms, applications, and the software stack, compilers and runtimes.

38
00:15:17.260 --> 00:15:22.480
James (Jim) Ang: And I guess, again, a lesson from the software world.

39
00:15:22.690 --> 00:15:38.330
James (Jim) Ang: is that we're trying to support agile technology development, and really to support a path to having rapid design cycles to help explore a design space, to help guide

40
00:15:38.330 --> 00:15:46.700
James (Jim) Ang: the development of future, advanced computing concepts, and specifically for Decode, energy-efficient computing concepts.

41
00:15:48.040 --> 00:16:00.149
James (Jim) Ang: Very quickly, this is a summary of the entire team. You'll be hearing from our leadership team that includes representation from

42
00:16:00.150 --> 00:16:08.799
James (Jim) Ang: Most of these institutions, with the exception of Washington State University, Rice University.

43
00:16:08.800 --> 00:16:19.999
James (Jim) Ang: and, University of Minnesota. But, this table, shows all the contributions that the institutions provide to our team.

44
00:16:20.230 --> 00:16:22.610
James (Jim) Ang: Huh?

45
00:16:22.820 --> 00:16:23.860
James (Jim) Ang: Question?

46
00:16:26.740 --> 00:16:29.899
James (Jim) Ang: If you can mute yourself, that'd be great.

47
00:16:30.310 --> 00:16:31.290
Chun-Long Chen (PNNL): Well, today.

48
00:16:32.930 --> 00:16:34.180
James (Jim) Ang: Do you have a question?

49
00:16:36.650 --> 00:16:43.670
James (Jim) Ang: Alright, let me move on. Let's see, where am I?

50
00:16:44.080 --> 00:17:01.130
James (Jim) Ang: Oh, here it goes. So our approach really is focused on the idea that we could, target heterogeneous computing. This arises naturally when you talk about energy-efficient accelerators combined with, general-purpose processors.

51
00:17:01.220 --> 00:17:11.239
James (Jim) Ang: But we actually have some history at, at PNNL in, internal investments, LDRD investments in something called the Data Model Convergence Initiative.

52
00:17:11.349 --> 00:17:29.959
James (Jim) Ang: Where we think about converged workloads being traditional scientific computing, AI methods and applications, as well as data analytics applications. And the,

53
00:17:30.110 --> 00:17:42.809
James (Jim) Ang: The target then for us can span scales from the large-scale, high-performance computing systems to edge computing with integrated sensors.

54
00:17:43.040 --> 00:18:01.029
James (Jim) Ang: Skipping down here, a lot of our innovation in Decode is focused on some concepts for custom analog accelerator designs that are very energy efficient, but

55
00:18:01.250 --> 00:18:16.649
James (Jim) Ang: the examples that we use are really exemplars that we hope should be generalizable to a range of other accelerator types, and really many of the candidates that are being pursued within Meerkat and other projects are

56
00:18:16.760 --> 00:18:21.949
James (Jim) Ang: opportunities that we think could be relevant for us.

57
00:18:22.470 --> 00:18:29.420
James (Jim) Ang: Let's see… And, Maybe with that, I will,

58
00:18:29.870 --> 00:18:33.449
James (Jim) Ang: Pass the microphone over to Antonino.

59
00:18:37.060 --> 00:18:50.180
Antonino Tumeo (PNNL): Thank you very much, Jim. So, I'll briefly go detailing a little bit the objectives of the three thrusts, and then, obviously, you will hear from the leadership team the details.

60
00:18:50.570 --> 00:19:06.170
Antonino Tumeo (PNNL): But, basically our three thrusts, as Jim has introduced, right, are looking at the chiplet ecosystem on one side, including the novel technologies and analog chiplet accelerators.

61
00:19:06.390 --> 00:19:25.509
Antonino Tumeo (PNNL): And, the second thrust would be about the co-design tools, and then we'll close the loop, basically, with the prototyping. So, for the first one, really, the focus is on the co-design process for this heterogeneous system in a package.

62
00:19:25.680 --> 00:19:26.900
Antonino Tumeo (PNNL): that…

63
00:19:27.480 --> 00:19:45.490
Antonino Tumeo (PNNL): Should include the conventional digital accelerators, the processors, and basically create this catalog of open chiplets, but also prepare, like, the system and…

64
00:19:45.490 --> 00:19:52.120
Antonino Tumeo (PNNL): Bring the co-design to the level of, can we, implement and leverage

65
00:19:52.520 --> 00:20:08.600
Antonino Tumeo (PNNL): new types of accelerators that are energy efficient. And in particular with our exemplars for hyperdimensional computing, which can be implemented both in digital or in near-memory analog versions.

66
00:20:08.620 --> 00:20:24.879
Antonino Tumeo (PNNL): And, the resistively coupled dynamical system machine, which is a sort of electrical Ising machine, able to deal with

67
00:20:25.270 --> 00:20:29.770
Antonino Tumeo (PNNL): spins that are not anymore just 0/1.
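
NOTE
[Illustrative aside] The resistively coupled dynamical system machine is described above as an Ising-style machine whose spins relax continuously rather than taking only the values 0/1. The Python below is a toy digital sketch of that kind of relaxation, not the actual RDSM design; the couplings, step size, and clamping used here are invented for illustration.

```python
import numpy as np

# Toy sketch only (NOT the actual RDSM hardware): an Ising energy
#   E(s) = -1/2 * s.T @ J @ s
# minimized by gradient-flow dynamics  ds/dt = J @ s,
# with "spins" taking continuous values in [-1, 1] rather than
# only the binary 0/1 (+1/-1) values of a classical Ising model.

rng = np.random.default_rng(0)
n = 8
J = rng.standard_normal((n, n))
J = (J + J.T) / 2          # symmetric couplings
np.fill_diagonal(J, 0.0)   # no self-coupling

def energy(s):
    return -0.5 * s @ J @ s

s0 = rng.uniform(-0.1, 0.1, n)   # small random initial state
s = s0.copy()
dt = 0.05
for _ in range(500):
    # Euler step of the dynamical system, projected back onto [-1, 1]
    s = np.clip(s + dt * (J @ s), -1.0, 1.0)

print(energy(s) <= energy(s0))   # prints: True (relaxation lowered the energy)
```

Letting the dynamics run and then thresholding the spins is the optimization such a machine would perform physically, in continuous time.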

68
00:20:31.430 --> 00:20:44.329
Antonino Tumeo (PNNL): computing will be led by Tajana Rosing at UC San Diego, and the RDSM work is led by Tom Jiang from Rice.

69
00:20:44.510 --> 00:20:48.669
Antonino Tumeo (PNNL): And, yeah, like, the… the… the…

70
00:20:49.420 --> 00:21:07.600
Antonino Tumeo (PNNL): There is a part, obviously, on the tools, because, as you'll see, we have available a number of generators together with the co-design tools, so we want to be able to use them, but we also want to be able to create

71
00:21:07.980 --> 00:21:17.300
Antonino Tumeo (PNNL): A low-cost, agile system in a package that can be used, and be available also to the Meerkat

72
00:21:17.360 --> 00:21:28.469
Antonino Tumeo (PNNL): teams, so that we can really, as Jim highlighted, enable this flow from the concept to the prototype.

73
00:21:28.620 --> 00:21:31.989
Antonino Tumeo (PNNL): And I think we can move to the second thrust.

74
00:21:32.140 --> 00:21:44.790
Antonino Tumeo (PNNL): Again, as Jim was highlighting, we have available a number of tools already, a number of capabilities of open source tools that come from our various partners.

75
00:21:45.490 --> 00:22:06.900
Antonino Tumeo (PNNL): the focus of Decode will be integrating these tools to create a much larger ecosystem of co-design tools that will really enable us to take the concept of a system in a package, evaluate and generate it, and then tape it out, right? Really, we want

76
00:22:06.900 --> 00:22:23.340
Antonino Tumeo (PNNL): to provide the toolset and also the compiler side, to be able to quickly go from the idea, assemble the system, and enable the design of the system in a package.

77
00:22:23.340 --> 00:22:24.560
Antonino Tumeo (PNNL): They, they…

78
00:22:24.890 --> 00:22:25.690
James (Jim) Ang: Bust?

79
00:22:25.780 --> 00:22:45.540
Antonino Tumeo (PNNL): Thrust 3 is bringing everything together with the prototyping and chiplet integration. Again, from the thrust leads, but the team has a lot of experience in prototyping and tape-outs.

80
00:22:45.570 --> 00:22:58.520
Antonino Tumeo (PNNL): So, we bring internal prototyping capabilities, including FPGA platforms that are already available to kind of integrate custom accelerators and do the test. We'll…

81
00:22:58.520 --> 00:23:18.059
Antonino Tumeo (PNNL): We'll enable that to emulate our system in a package, but also, like, with the EDA tools, we'll enable integration with the standard capabilities. Now, we'd need to see what happens with the NSTC, the National Semiconductor Technology Center, but one of the

82
00:23:18.320 --> 00:23:28.959
Antonino Tumeo (PNNL): focuses will be trying to establish connections with the NSTC and foundries so that we can enable this.

83
00:23:30.810 --> 00:23:39.200
Antonino Tumeo (PNNL): And I think the last slide is the list of our background repositories.

84
00:23:39.670 --> 00:23:53.390
Antonino Tumeo (PNNL): I think it's switching slowly, but… it's a long list, right? So these are already available, right? So we start from a strong foundation.

85
00:23:53.410 --> 00:24:11.860
Antonino Tumeo (PNNL): These are all the tools that you can already go use and leverage, and hopefully you already knew that. But with Decode, we will bring this together, right, and that will enable

86
00:24:11.900 --> 00:24:20.889
Antonino Tumeo (PNNL): a co-design, really a co-design toolchain available for the fast prototyping cycle. And,

87
00:24:21.360 --> 00:24:23.620
Antonino Tumeo (PNNL): I think this was the last slide, Jim, right?

88
00:24:23.620 --> 00:24:27.810
James (Jim) Ang: Right. I will stop sharing, and then pass… we can pass the,

89
00:24:27.920 --> 00:24:30.259
James (Jim) Ang: We'll feed over to David Brooks.

90
00:24:34.620 --> 00:24:36.380
James (Jim) Ang: David, take it away.

91
00:24:45.120 --> 00:24:45.960
James (Jim) Ang: Great.

92
00:24:47.410 --> 00:24:49.670
David Brooks: Are you able to see my slide okay?

93
00:24:49.670 --> 00:24:50.430
James (Jim) Ang: Yes.

94
00:24:50.430 --> 00:24:51.010
David Brooks: Great.

95
00:24:52.480 --> 00:24:56.599
David Brooks: Let's see… here we go. Okay,

96
00:24:56.750 --> 00:25:05.730
David Brooks: So yes, I'm happy to present Thrust 1 on behalf of Tajana Rosing and myself, the Thrust 1 co-leads for the project.

97
00:25:07.480 --> 00:25:18.759
David Brooks: If you think about the overarching motivation for a lot of what we're looking at these days, it's this notion of scale of AI compute and the cost growing at an enormous rate.

98
00:25:18.760 --> 00:25:42.249
David Brooks: And, this is just a graph from Epoch AI. If you're ever interested in looking at trends, they have a lot of really interesting data on their website. This is showing the training compute needed for these frontier LLMs that are available today, and you can see the growth rate has gone up by between 4 and 5x per year over the last decade or so.

99
00:25:44.150 --> 00:26:03.759
David Brooks: We have, of course, seen increases in the flops and the memory bandwidth to go with that, but not nearly at the same rate. And this is really increasing the training costs at a pretty hefty rate of 3.5x per year. So the training compute requirements are going up,

100
00:26:03.760 --> 00:26:28.689
David Brooks: the ability to provide the flops and the computation is not scaling at the same rate, so we're in this deficit that is part of the reason we're seeing such huge investments in AI, because people still want to proceed, but the costs are going up. And as Jim mentioned, we want to think about lower-cost ways of driving computing, and we believe chiplets are one of those solutions that, particularly for DOE

101
00:26:28.690 --> 00:26:34.119
David Brooks: applications could allow drastically improved performance without this enormous cost.

102
00:26:34.910 --> 00:26:43.159
David Brooks: In Thrust 1, we're thinking about this in the context of the co-design of energy-efficient chiplets for the system and packages.

103
00:26:43.160 --> 00:26:57.079
David Brooks: And the goal is to explore a mix of different types and styles of accelerators for AI applications. So…

104
00:26:57.080 --> 00:27:16.289
David Brooks: The goal really is to come up with this mix of heterogeneous chiplets that combine computing paradigms, so some of them conventional, and some of them new and addressing specific aspects of AI that have been less well explored. So, for example, Tajana's work on hyperdimensional computing.

105
00:27:16.520 --> 00:27:40.180
David Brooks: The work that we have in the next task on Ising machines and dynamical systems are really new approaches to addressing machine learning. And likely what we will see in the future is mixtures of these approaches using mixtures of different technologies, some digital, some analog, in these systems in packages. And that's the goal of the thrust: to also understand how to bring those different, diverse

106
00:27:40.290 --> 00:27:44.649
David Brooks: types of chiplets together. So there's integration of different IP styles.

107
00:27:45.860 --> 00:27:57.830
David Brooks: I'll give you a flavor of how this would work, and why analog and HD computing are interesting.

108
00:27:57.830 --> 00:28:20.050
David Brooks: So just, we start from the observation that the human brain doesn't process information in simple low-dimensional forms, like pixel grids or individual signals. Instead, it converts dense sensory inputs, things like the images that come through our eyes, into high-dimensional sparse representation across neurons. And this mapping step is basically how we

109
00:28:20.050 --> 00:28:33.830
David Brooks: how we think human perception and cognition works. So if you look at the picture on the left that's showing light hitting a retina, the signal passing through these visual areas, the brain representing that information as a distributed pattern, basically a high-dimensional vector.

110
00:28:34.040 --> 00:28:58.990
David Brooks: And there's studies from about a decade ago that showed that this sparse, high-dimensional encoding makes the brain's computation efficient, robust, and flexible. HD computing takes that principle and applies it to machine learning. It encodes data in what are called hypervectors. These vectors have thousands of dimensions, and then uses simple algebraic operations to manipulate them. And that's really the key. These operations are fully parallelizable.

111
00:28:58.990 --> 00:29:03.970
David Brooks: They're very simple, which makes them much more energy efficient and resilient to noise.
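
NOTE
[Illustrative aside] The hypervector operations just described (random high-dimensional codes, superposition by majority vote, similarity by a normalized dot product) can be sketched in a few lines of Python. This is an illustrative toy, not any of the specific designs discussed here; the dimensionality, bipolar encoding, and similarity measure are assumptions.

```python
import numpy as np

# Minimal sketch of hyperdimensional (HD) computing: data become very
# high-dimensional random vectors, and learning/classification use only
# simple, noise-tolerant algebraic operations. (Binding, i.e. elementwise
# multiply, is another core op, omitted here for brevity.)

D = 10_000                              # hypervector dimensionality
rng = np.random.default_rng(1)

def random_hv():
    return rng.choice([-1, 1], size=D)  # random bipolar hypervector

def bundle(hvs):
    return np.sign(np.sum(hvs, axis=0)) # superpose by majority vote

def similarity(a, b):
    return (a @ b) / D                  # normalized dot product

def noisy(hv, flip=0.2):
    mask = rng.random(D) < flip         # flip 20% of components
    return np.where(mask, -hv, hv)

# Two class "prototypes" built by bundling noisy samples of a base pattern.
base_a, base_b = random_hv(), random_hv()
proto_a = bundle([noisy(base_a) for _ in range(5)])
proto_b = bundle([noisy(base_b) for _ in range(5)])

# Classify a fresh noisy query by nearest prototype.
query = noisy(base_a)
print(similarity(query, proto_a) > similarity(query, proto_b))  # prints: True
```

Even with 20% of the components corrupted, the query still lands much closer to its own class prototype, which is the noise resilience mentioned above.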

112
00:29:04.210 --> 00:29:06.239
David Brooks: HD computing has…

113
00:29:06.620 --> 00:29:15.609
David Brooks: seen a lot of research interest over the past decade, and people have shown that it can be used for online learning, symbolic reasoning, explainability, and so forth. Big challenges for today's AI.

114
00:29:15.770 --> 00:29:27.560
David Brooks: On the right, you can see the domains that this has been applied to, image classification, LiDAR segmentation, genomics, mass spectrometry, even recommendation systems, and so the point is that with this mathematical framework

115
00:29:29.610 --> 00:29:38.570
David Brooks: of hyperdimensional representation, we can handle data that spans very different kinds of representations, from images to molecules to graphs.

116
00:29:41.310 --> 00:29:56.880
David Brooks: So Tajana's research group is one of the leaders in this field. She's been working on this for quite a while. And, if you look at some of the progression, the original work in this area, which was looking at

117
00:29:57.140 --> 00:30:10.139
David Brooks: conventional image recognition applications using purely HD approaches, what you found is that the results work really well for small data sets like MNIST, but didn't do so well when you looked at the larger ones, like CIFAR and ImageNet.

118
00:30:10.460 --> 00:30:27.580
David Brooks: And, more recently, the algorithm development in this area has shown that HD combined with CNN layers can get the best of both worlds, meaning the accuracy of conventional

119
00:30:27.580 --> 00:30:51.389
David Brooks: image classification using CNNs, but with the efficiency of hyper-dimensional computing. And this hybrid architecture style works… you can see the picture on the right, works by pruning many of the CNN layers and inserting an HD encoding step that converts the extracted features into high-dimensional binary hypervectors. And then these hypervectors, are classified using the algebraic operation, so instead of using these

120
00:30:51.390 --> 00:30:57.600
David Brooks: These compute-heavy layers, it's using much simpler algebraic operations for those final steps.

121
00:30:57.830 --> 00:31:01.989
David Brooks: And you can see the results in the table showing that,

122
00:31:02.370 --> 00:31:11.270
David Brooks: we can achieve almost the same performance, using, the HDNN approach, but with,

123
00:31:11.370 --> 00:31:35.999
David Brooks: with basically the same accuracy as traditional CNNs. Tajana's group has also explored implementing these techniques in test chips, so she has 3 test chips that she's worked on in collaboration with TSMC. Some of these are digital, and some of these are analog, and I think that's one of the interesting approaches to this, is that there are opportunities to leverage things like resistive RAM

124
00:31:36.000 --> 00:31:41.360
David Brooks: And, do this analog-style computing using, hyperdimensional computing and

125
00:31:41.360 --> 00:31:58.739
David Brooks: Again, this is why this is a promising approach, because with our chiplet-style architecture, we can imagine including an HDC chiplet next to a CNN chiplet, next to some other kind of chiplet, and combine the best of both worlds in terms of the

126
00:31:58.740 --> 00:32:02.570
David Brooks: The neural network style, or machine learning style.

127
00:32:03.580 --> 00:32:21.650
David Brooks: So now I'm going to spend a little bit of time talking about how we imagine integrating that existing IP, and most of this work is building on the efforts from Andreas Olofsson's company called Zero ASIC, which has been one of the leaders in trying to establish

128
00:32:21.650 --> 00:32:25.509
David Brooks: standards for chiplet ecosystems, particularly for these

129
00:32:25.510 --> 00:32:36.809
David Brooks: styles of prototypes, like the one that I just showed. This is a little bit different from the NVIDIA style, where you have these massive chiplets that are taking up

130
00:32:37.070 --> 00:32:41.829
David Brooks: the reticle. These are looking at smaller chiplets and how to put them together.

131
00:32:42.660 --> 00:33:03.580
David Brooks: So the goal that Andreas' team has is to build a full-stack chiplet standard, with the protocol layer, basically the way that the memory is addressed using what's called the Universal Memory Interface, basically make everything addressable. That's the goal, is to be able to fully address all the different parts of

132
00:33:03.580 --> 00:33:19.379
David Brooks: the chiplets using this Universal Memory Interface. And, inspired by, sort of, the abstraction standards that have been built up over time in other areas of computer science, they have defined these different layers of the stack, from the protocol down to the electrical layer.

133
00:33:20.020 --> 00:33:30.119
David Brooks: They've also been looking at trying to come up with standardized footprints that allow you to imagine plugging into something like a breadboard, except this would be plugging in chiplets with

134
00:33:30.120 --> 00:33:49.459
David Brooks: very fine pitch for high connectivity. Rotational symmetry, so it doesn't matter how you put your chiplet there. Be able to use this for both analog with various power domains, passing through to other systems and whatnot. And then also defining some of the electrical standards of how you communicate between the chiplets.
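
NOTE
[Illustrative aside] The rotational-symmetry property mentioned above can be made concrete with a toy check: a pad footprint is orientation-independent if rotating the pad grid by 90 degrees maps the pad set onto itself. The grid size and pad coordinates below are invented for illustration.

```python
# Illustrative check (coordinates invented): a chiplet pad footprint has
# 4-fold rotational symmetry if rotating the pad grid by 90 degrees about
# its center maps the pad set onto itself -- so orientation doesn't matter
# when the chiplet is dropped onto a standardized package grid.

def rot90(pads, size):
    # rotate integer pad coordinates 90 degrees within a size x size grid
    return {(y, size - 1 - x) for (x, y) in pads}

def has_4fold_symmetry(pads, size):
    return rot90(pads, size) == set(pads)

# A 4x4 perimeter pad ring is symmetric; removing one corner pad breaks it.
ring = {(x, y) for x in range(4) for y in range(4)
        if x in (0, 3) or y in (0, 3)}
print(has_4fold_symmetry(ring, 4))             # prints: True
print(has_4fold_symmetry(ring - {(0, 0)}, 4))  # prints: False
```

A real footprint standard would of course constrain pitch, power pads, and keep-outs as well; this only captures the orientation-invariance idea.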

135
00:33:56.930 --> 00:34:17.259
David Brooks: Let me see, I messed up my screen. Here's the next slide. Okay, so standardized discrete chiplets. So the goal, again, building on what Zero ASIC has been pushing, is to come up with this LEGO brick standard, except for chiplets, where we define that standard, and then just like LEGO bricks have remained the same for…

136
00:34:17.260 --> 00:34:27.600
David Brooks: so many years, can we do that for discretized chiplets? So that I could come up with a grid, come up with standard-sized chiplets, and then as long as you build these chiplets with this size, you'd be able to plug it in

137
00:34:27.600 --> 00:34:37.840
David Brooks: to a package. This is how I imagine low-cost prototyping moving in the future, and also perhaps low-volume

138
00:34:37.840 --> 00:34:52.330
David Brooks: but high-impact computing strategies where, you know, you could imagine prototyping with thousands or tens of thousands of chiplets instead of needing the millions that would justify building a custom ASIC.

139
00:34:53.590 --> 00:34:57.130
David Brooks: And being able to do this across different technology nodes with

140
00:34:57.770 --> 00:35:12.859
David Brooks: different chiplet spacings and pitches, and ideally building this and thinking about all the constraints early so that we have a long-lasting legacy of the ability to carry forward.

141
00:35:14.130 --> 00:35:16.089
David Brooks: Zeroasic has built

142
00:35:16.200 --> 00:35:29.260
David Brooks: several composable chiplet prototypes already. This is just a mix of designs, from a CPU to an FPGA to a memory, and they show that using these standard approaches

143
00:35:29.260 --> 00:35:37.090
David Brooks: can drastically reduce the amount of design time and the number of designers needed to be able to design these.

144
00:35:37.090 --> 00:35:44.500
David Brooks: Very impressive results in some of these very diverse sets of applications.

145
00:35:44.590 --> 00:35:50.920
David Brooks: So with that, I will… Wrap things up, and happy to take any questions.

146
00:35:54.130 --> 00:36:04.700
James (Jim) Ang: Thank you, David. I think we might hold questions for the end. So, actually, now we'll hear next from, I'm not sure if it's Andreas or John Leidel,

147
00:36:04.990 --> 00:36:06.990
James (Jim) Ang: who are co-leading Thrust 2.

148
00:36:07.820 --> 00:36:10.799
aolofsson: I think we're… we're gonna share… share the burden here.

149
00:36:10.800 --> 00:36:11.960
James (Jim) Ang: Okay, okay, very good.

150
00:36:11.960 --> 00:36:14.140
aolofsson: I'll bring up the slides, and then,

151
00:36:14.280 --> 00:36:20.190
aolofsson: John's gonna speak on the first three tasks, and I'll speak about the last… 3.

152
00:36:20.380 --> 00:36:23.509
aolofsson: So let me just, let me share my screen here.

153
00:36:28.610 --> 00:36:38.630
James (Jim) Ang: And David, thank you for giving a very good overview, including that closing section on Zero ASIC's capabilities. Fantastic.

154
00:36:40.070 --> 00:36:42.239
aolofsson: Alright, here we go.

155
00:36:46.110 --> 00:36:48.090
aolofsson: Okay, you see that okay?

156
00:36:48.440 --> 00:36:48.850
James (Jim) Ang: Yes.

157
00:36:48.850 --> 00:36:49.640
John Leidel: Yes.

158
00:36:49.920 --> 00:36:51.830
aolofsson: Alright, so,

159
00:36:52.280 --> 00:37:03.379
aolofsson: So, you know, David spoke about actualizing, you know, real circuits. Thrust 2 is more about the tooling to create the circuits, and, you know, creating open source co-design tools.

160
00:37:03.600 --> 00:37:08.230
aolofsson: to, lift all boats, or, you know, lift all ships.

161
00:37:08.390 --> 00:37:12.410
aolofsson: So there's, there's 6 different tasks in this area,

162
00:37:12.560 --> 00:37:18.950
aolofsson: And, you know, ranging from efficient algorithms for analog computing, compiler frameworks,

163
00:37:19.080 --> 00:37:31.449
aolofsson: architectural simulations, and then, you know, heterogeneous SIP, thermal analysis, co-design, and then, IP generators. So it's sort of a stack from, like, a full-stack software, you know, algorithms to…

164
00:37:31.550 --> 00:37:34.610
aolofsson: to bits, is the way I would look at it.

165
00:37:34.740 --> 00:37:47.220
aolofsson: And, represented by participants from PNNL, from, from Harvard, from TCL, Wisconsin, Washington State, Minnesota, and Zero ASIC.

166
00:37:47.610 --> 00:37:54.430
aolofsson: So with that, I'm gonna hand over to… to John to speak about, 2.1 to 2.3.

167
00:37:57.270 --> 00:38:05.209
John Leidel: All right, perfect. So the first task in this is led by PNNL, and this is where everything sort of starts.

168
00:38:05.280 --> 00:38:14.259
John Leidel: One of the issues that we've had previously, in going from traditional, sort of instruction set-based digital designs into

169
00:38:14.290 --> 00:38:27.570
John Leidel: more chiplet ecosystems, system-in-package, and analog designs, is that there's really a lack of focus on the algorithm infrastructure necessary to port applications and actually get real kernels and

170
00:38:27.570 --> 00:38:38.260
John Leidel: real workloads executing efficiently, to realize all of the energy efficiency and programmatic efficiency of the various different devices.

171
00:38:38.260 --> 00:38:54.970
John Leidel: So this is very analogous to, sort of, the quantum world, where, they had to sort of crawl out of the primordial ooze, and start developing new programming models, new compiler techniques, and new algorithms sufficient to actually utilize real quantum systems.

172
00:38:55.280 --> 00:39:15.220
John Leidel: So as we move into the space of brain-inspired HDCs, there's a lot of interesting proposals with respect to porting things like neural networks, sparse linear algebraic solvers, and various different high-dimensional random vector algorithms, onto these analog computing platforms.

173
00:39:15.460 --> 00:39:34.729
John Leidel: And this thrust area is specifically focused on developing those algorithmic constructs in a way that we can move from a very high-level conceptual design down to more of a co-design ecosystem where we start cross-collaborating with the compiler.

174
00:39:34.730 --> 00:39:36.100
John Leidel: subtask.

175
00:39:36.100 --> 00:39:38.100
Keren Bergman: Whoa, what are you doing?

176
00:39:38.650 --> 00:39:58.339
John Leidel: As well as collaborating with some of the other hardware implementation thrusts, so that we can begin mapping, sort of, traditional instruction set concepts down to more analog-style computing devices. So… so this is going to be really key and very

177
00:39:58.700 --> 00:40:03.840
John Leidel: influential across all the different domains of the Decode project.

178
00:40:05.080 --> 00:40:07.740
John Leidel: Okay, on to the next slide, Andreas.

179
00:40:08.870 --> 00:40:26.560
John Leidel: So this is where things start to move more into traditional computer science. So PNNL is, again, leading the compiler framework and accelerator generators. I hate saying those two words together.

180
00:40:26.600 --> 00:40:34.050
John Leidel: So, with the advent of accelerated computer architectures, we've done a fairly good job of

181
00:40:34.110 --> 00:40:53.020
John Leidel: putting together very coarse-grained programming models like CUDA and SYCL and OpenCL, and even some of the OpenMP and OpenACC target offload specifications to take traditional, sort of, iterative approaches to algorithmic design and begin doing,

182
00:40:53.230 --> 00:41:07.540
John Leidel: parallel computing with host and target offload. That's been fairly simplistic. You know, accelerators traditionally live on the other end of a relatively inefficient bus or inefficient interconnect.

183
00:41:07.540 --> 00:41:16.479
John Leidel: And now with systems-in-package, as we start to standardize the interfaces across these various different heterogeneous devices,

184
00:41:16.480 --> 00:41:24.269
John Leidel: that may or may not live inside even reasonably similar clock domains or architectural models,

185
00:41:24.610 --> 00:41:43.430
John Leidel: we need to start raising the level of abstraction in the compiler frameworks such that we have any chance at all of actually taking traditional parallel applications and executing them with some degree of efficiency. So this particular thrust area is going to really focus on using LLVM and MLIR

186
00:41:43.800 --> 00:41:52.320
John Leidel: to start building dialects around performing explicitly heterogeneous computing across computing domains on a system-in-package.

187
00:41:52.360 --> 00:42:14.270
John Leidel: So this is going to, quite literally, reutilize a lot of the existing dialects that are being used for parallel computing in the AI and ML space. As well, we can hoist some of the interesting efforts being done in the quantum space for MLIR, as well as looking at things

188
00:42:14.760 --> 00:42:19.680
John Leidel: like some of the hardware generators that are built around MLIR,

189
00:42:19.940 --> 00:42:24.490
John Leidel: to literally start doing co-design

190
00:42:24.690 --> 00:42:42.199
John Leidel: at the system-in-package level, sort of taking the Pandas and Bamboo work previously done at PNNL and bringing that ecosystem into the top-level, application-level compilation and optimization frameworks.

191
00:42:42.200 --> 00:42:52.070
John Leidel: So this is going to be a very interesting effort that could not only bear fruit in the system-in-package world, but also bear fruit in…

192
00:42:52.320 --> 00:43:01.499
John Leidel: the future, you know, sort of digital-centric, large-scale, monolithic chip ecosystem as well.

193
00:43:03.730 --> 00:43:06.539
John Leidel: Okay, on to the next slide, Andreas.

194
00:43:09.080 --> 00:43:23.310
John Leidel: So Thrust 2.3, which is a co-effort between TCL and David Brooks' group at Harvard, is the architectural simulation of heterogeneous SiPs.

195
00:43:23.830 --> 00:43:29.290
John Leidel: So, architectural simulation is something that is both near and dear to my heart.

196
00:43:29.310 --> 00:43:43.199
John Leidel: And something that everyone loves to hate. Historically, architectural simulators are both fantastic, in that they allow you to accelerate co-design workflows, and also demonized, in the sense that

197
00:43:43.200 --> 00:43:52.480
John Leidel: If the architectural simulator is not well-formed and well-written, then it doesn't accurately represent the particular device you're attempting to simulate.

198
00:43:53.280 --> 00:44:05.550
John Leidel: So, what we seek to do with this is actually combine two efforts that have sort of started holistically at Harvard and TCL, the first of which is what's called Harvard Cascade,

199
00:44:05.590 --> 00:44:21.390
John Leidel: Which is a high-level design exploration infrastructure that allows users to explore application domains, whether they're dense or sparse, or whether you're targeting a specific numerical representation.

200
00:44:21.610 --> 00:44:29.910
John Leidel: And combine that with some kernel partitioning logic that allows you to then, using topological concerns.

201
00:44:29.910 --> 00:44:41.749
John Leidel: And energy and device scaling concerns, and start looking at the design space exploration of, okay, here's an application, here's what the kernels are specifically going to do.

202
00:44:41.970 --> 00:44:58.210
John Leidel: you know, what is the prescriptive level of chiplet or chiplet devices that best fit this protect… this particular application? You know, I have this much energy, this much device area, you know, what's the correct mixture of where these things go?

203
00:44:58.660 --> 00:45:14.569
John Leidel: The flip side of that is the TCL SST infrastructure, which is entirely based upon Sandia's Structural Simulation Toolkit, a large-scale parallel discrete-event simulator really designed to

204
00:45:14.610 --> 00:45:28.390
John Leidel: take applications or application traces, and then do very fine-grained analysis of how those interact with specific architectural components. So we have a very large catalog of existing components we've simulated.

205
00:45:28.690 --> 00:45:41.839
John Leidel: things like, RISC-V cores, network on chips, system and package devices that we're currently working on, memory devices, into that list, we're going to start adding analog computing devices.

206
00:45:42.020 --> 00:45:53.969
John Leidel: And the interesting thing about the extensions that we've made to SST is we actually have a series of SST components that allow you to directly replace an existing functional

207
00:45:53.970 --> 00:46:06.739
John Leidel: simulation component with an RTL component as well. So basically, we wrap up your, your, Verilog, or system Verilog RTL in some C++ library magic.

208
00:46:07.030 --> 00:46:12.879
John Leidel: And then we allow you to co-simulate this sort of multimodal simulation framework.

209
00:46:12.880 --> 00:46:35.389
John Leidel: So, what we seek to do with this particular thrust area is combine these high-level design space exploration concepts from Cascade with these very low-level mechanisms in TCL's SST infrastructure, and then do backward validation using a massive amount of telemetry that comes out of SST with, okay, here's your designs,

210
00:46:35.390 --> 00:46:49.610
John Leidel: design space exploration, and here's what the high-level tools are sort of prescribing in terms of SIP architecture. What does this actually do at a more functional level with, very high,

211
00:46:50.110 --> 00:47:03.579
John Leidel: high-fidelity functional simulation, or even RTL simulation. So, you'll be able to sort of rinse and repeat a large number of designs back and forth through simulation, and get a very, very,

212
00:47:03.700 --> 00:47:09.960
John Leidel: High degree of confidence in terms of, you know, what you seek to actually construct in real silicon.

213
00:47:11.400 --> 00:47:16.860
John Leidel: And I think I can pass it off to Andreas for Thrust 2.4.

214
00:47:18.770 --> 00:47:19.440
James (Jim) Ang: Thank you.

215
00:47:19.880 --> 00:47:31.600
aolofsson: Okay, all right. So, Thrusts 2.4 to 2.6, I think, can be best described as filling out the picture and tying the…

216
00:47:31.910 --> 00:47:47.330
aolofsson: the simulation to the bits and the atoms. So, as was discussed, you can have the best architectural simulator in the world, but if it's uncoupled from physics, it's still open loop, right? You need a cost function that's grounded.

217
00:47:47.330 --> 00:47:53.110
aolofsson: And that is incredibly important. It's something that's been missing from architectural simulations in the past.

218
00:47:53.350 --> 00:48:02.400
aolofsson: Sometimes, right? Not always, but sometimes. And so… and of course, one of the problems is that you'd like to run everything at the atomic level, but you can't afford to.

219
00:48:02.530 --> 00:48:14.019
aolofsson: And so you have to make choices. And, you know, do you have these knobs, you know, accuracy versus, performance, that are, you know, those are conflicting.

220
00:48:14.030 --> 00:48:22.820
aolofsson: And, and so, so in 2.4, one of the ideas here is that, for example, for thermal modeling, is that you're going to try to trade off

221
00:48:23.070 --> 00:48:29.890
aolofsson: and be able to leverage the full spectrum of accurate models, like FEM models, to analytical models that run very fast.

222
00:48:30.000 --> 00:48:44.969
aolofsson: And then, you know, being able to do trade-offs that span from days to milliseconds, right? From, you know, many, many degrees of, like, you know, 5 to 10 degrees, let's say, to, you know, sub-1 degree accuracy.

223
00:48:45.030 --> 00:49:00.339
aolofsson: And, you know, I can tell you that, you know, this is something that we've, we've run into in the past, where, especially early in the design process, we don't have all the information, to really run the, you know, the ANSYS console-type, physics,

224
00:49:00.390 --> 00:49:14.979
aolofsson: FEM solvers, or open source packages, but we'd like to have, you know, ballpark of what the system's like, and that's been a real gap, and so it's, it's great that the team is working on this area.

225
00:49:15.990 --> 00:49:31.929
aolofsson: Second, which is, you know, kind of rhymes with Thrust 2.2 in a way, is this idea of chiplet packaging and co-design optimization. And so, so as you're defining your chiplets, you're defining the structure and your accelerators.

226
00:49:31.930 --> 00:49:36.180
aolofsson: With… with heterogeneous, integration and system and package.

227
00:49:36.210 --> 00:49:48.220
aolofsson: you actually have a new design space that… that has not been fully fleshed out before. You know, there's been a lot of talk about co-design for SOCs, where you might have IP blocks, and

228
00:49:48.260 --> 00:49:53.250
aolofsson: The interfaces are completely fluid, and you do partitioning of logic into blocks.

229
00:49:53.310 --> 00:49:56.729
aolofsson: But once you start having discrete devices.

230
00:49:56.810 --> 00:50:00.609
aolofsson: That may be hard-coded, and you might be mixing discrete devices.

231
00:50:01.040 --> 00:50:09.879
aolofsson: With, with new devices, new designs, new core processors. You do have a pretty interesting search space,

232
00:50:10.180 --> 00:50:20.759
aolofsson: And especially when you start bringing economics into the picture. So, you know, you're not talking about not just size, weight, and power, and performance, but possibly also, you know, how many,

233
00:50:20.850 --> 00:50:38.169
aolofsson: How many dies are you going to get out of the wafer? Which technology node are you going to be working on? Are you going to be working at 0.18 micron or at 3 nanometer? Where do you put your I.O? Where do you put your, your compute? Where do you put your memory? Do you want to do 2D, 3D, organic substrate?

234
00:50:38.170 --> 00:50:46.420
aolofsson: It's such a, a squishy, search space, and it's something that most of us, as architects.

235
00:50:46.510 --> 00:50:50.519
aolofsson: We used to use… we resort to spreadsheets, still today.

236
00:50:50.620 --> 00:50:55.059
aolofsson: And, so that's a… that's a huge gap in the architect's toolbox that,

237
00:50:55.200 --> 00:50:58.760
aolofsson: I think, this team can address.

238
00:50:58.980 --> 00:51:12.430
aolofsson: And then, and then the last one is, is really, down to the metal, and this is a bit of a repeat, because David already showed this, so ignore the picture, I'll just talk about what we're actually doing in this task.

239
00:51:12.440 --> 00:51:24.180
aolofsson: So… so we're, we're, we're creating a full-stack, open-source, chiplet interface generator, that's portable. It's gonna be digital only.

240
00:51:24.180 --> 00:51:33.119
aolofsson: So, you know, imported any node, it's gonna be open source, and so if you look at and support the full-stack standard that Dave mentioned, that we're working on.

241
00:51:33.120 --> 00:51:42.280
aolofsson: So, so if you look at the chiplet interface base today, it is a, it is a place of the, of the folks with the money, the rich.

242
00:51:42.350 --> 00:51:56.880
aolofsson: You know, the big data center players, there's very few open implementations. There was one sort of opened, AIB, that was funded by the CHIPS program at DARPA, but, but since then, everything has pretty much been closed source.

243
00:51:57.100 --> 00:52:06.339
aolofsson: And most of it's been targeted at data center rates, so very, very high frequency with analog PHYs, which are very expensive to port, or complicated to port.

244
00:52:06.500 --> 00:52:21.420
aolofsson: And, so it's a huge barrier to access. So if you want to create a, low-barrier, democratized, chiplet-based design system, we need the artifacts, we need the IP for these, for these interfaces. So, so this task is all about creating a…

245
00:52:21.790 --> 00:52:27.410
aolofsson: A generator, an RTL generator that supports the standard that was discussed in Thrust 1.

246
00:52:27.430 --> 00:52:46.810
aolofsson: And so, you know, if you look at the, especially the right part, the electrical side, the protocol and footprints, those are pretty easy to define, it's just a spec, basically, and certainly protocols are very RTL-friendly, but when it comes to the electrical interface, you have to make some very careful choices, and, also.

247
00:52:46.850 --> 00:52:51.990
aolofsson: Have an ability to decouple things like clocking, synchronization.

248
00:52:52.080 --> 00:53:08.030
aolofsson: ESD and so forth from the technology node in a very light matter, or zero-cost matter. So that's going to be the bulk of the effort here. But the outcome is, you know, delivering an open-source, full-stack, piece of IP to the community.

249
00:53:08.760 --> 00:53:13.029
aolofsson: So I think that's… that's our thrust, too, so I'll hand it off to…

250
00:53:14.060 --> 00:53:16.949
aolofsson: Luca or… or Andrew for those three?

251
00:53:16.950 --> 00:53:19.410
James (Jim) Ang: It's Luca for Thrust 3.

252
00:53:22.400 --> 00:53:24.679
luca: Yes, good afternoon, everyone, can you hear me?

253
00:53:25.740 --> 00:53:26.430
James (Jim) Ang: Yes.

254
00:53:26.770 --> 00:53:32.509
luca: Okay, and let's see if I can share properly…

255
00:53:38.260 --> 00:53:39.460
luca: Did it work?

256
00:53:40.800 --> 00:53:41.430
James (Jim) Ang: Yep.

257
00:53:41.430 --> 00:53:44.999
luca: Let me maybe take the video out, so as to avoid issues.

258
00:53:45.500 --> 00:53:49.690
luca: So, I think we are running a few minutes late, so I'll try to,

259
00:53:49.890 --> 00:53:57.180
luca: Let's see if we can cover some time for Q&A. Task 3 is, the, about…

260
00:53:57.480 --> 00:54:08.369
luca: Prototyping and integration of this system in a package solution that we are pursuing, it consists of two tasks. The first one, called internal prototyping.

261
00:54:08.370 --> 00:54:24.159
luca: is led by myself and Guyon at Harvard, and tackles the challenge on how we enable in-house agile design, rapid prototyping, and early software development of this advanced chiplet-based system in a package.

262
00:54:24.180 --> 00:54:26.879
luca: With an emphasis on DOE applications.

263
00:54:26.950 --> 00:54:29.410
luca: And the approach that we will,

264
00:54:29.600 --> 00:54:45.180
luca: Portillo is to develop scalable, open-source platforms, which really are meant to make it possible to have a distributed, collaborative engineering of these components, which are highly heterogeneous, as you've seen from

265
00:54:45.420 --> 00:54:47.440
luca: The previous two presentations.

266
00:54:47.770 --> 00:55:03.129
luca: And, we are going to do that by leveraging quite a bit of experience that, we had accumulated both at Columbia and Harvard over the years, experience in design and developing a heterogeneous system, a chip.

267
00:55:03.130 --> 00:55:13.399
luca: So, a monolithic system, not cheap, so one of the challenges in terms of the research that we are going to do in this task is to extend this experience to a chiplet-based system in a package.

268
00:55:13.440 --> 00:55:32.720
luca: And similarly, we have over 10 years of experience in developing an open-source hardware platform for a system on chip that has been silicon proven, as I'll show you in a minute. And again, we want to extend that to a chiplet-based system. And Jim mentioned about… Antonino mentioned about

269
00:55:32.980 --> 00:55:52.589
luca: whether… which is still unclear which chips we will be able to develop as demonstration in this… as part of this project, but we certainly aim to have quite a bit of prototyping, early prototyping using FPGA's technology, something for, which we have, again, extensive experience.

270
00:55:52.630 --> 00:56:02.169
luca: So experience starts with, on the side of Harvard, with a variety of chips, that, Harvard has designed.

271
00:56:02.340 --> 00:56:21.150
luca: and fabricated and successfully tested over the years, and you'll see them here, and I'm going to focus on one of them, maybe a couple of them, but particularly one of them, which are these EPOX chip that you see at the bottom, EPOX0 and EPOX 1, as they were not designed just by Harvard, but they were designed by the Harvard and Columbia together, as well as IBM.

272
00:56:21.150 --> 00:56:35.909
luca: In particular, I'm gonna tell you, as a demonstration of our approach, this EPOX-1 system on ship, and… which, has been really, the ultimate, deliverable of a, five…

273
00:56:35.910 --> 00:57:00.030
luca: years plus project, supported by DARPA, the Domain Specific System on Ship project, led by IBM Research, and had, Columbia and, Harvard and UIUC Sky partners. And in fact, for the design of this chip, the PhD students at Columbia and Harvard really took the lead. This is a fairly complex chip to be designed in academia, 12 nanometer technology, 64mm square.

274
00:57:00.140 --> 00:57:04.900
luca: It has a fairly sophisticated frequency and voltage.

275
00:57:05.100 --> 00:57:16.439
luca: management, quite a bit of, SRAM memory, and is highly heterogeneous. And we already heard from Jim and others the importance of heterogeneous computing.

276
00:57:16.440 --> 00:57:27.249
luca: It contains, it has a tile-based architecture and contains, 23 different accelerators, which are very important for energy efficient computing, of 14 different types.

277
00:57:27.250 --> 00:57:41.339
luca: And some of them are composed in a cluster, which is the one with the red border there, which has a fairly, quite novel, actually, distributed hardware power management scheme developed at IBM, and for the first time tested on silicon.

278
00:57:41.340 --> 00:57:50.990
luca: And this chip, all the tiles are connected by a network on chip, in fact, a multiple network on chip, 6 planes of 2D mesh, delivering 74 terabits per second.

279
00:57:51.080 --> 00:58:07.359
luca: And this old chip was designed by a fairly small team of mostly PhD students, led by PhD students, a couple of industry researchers, in a period of about 3 months, using our approach, in particular the ESP open source platform for agile SOC design.

280
00:58:07.940 --> 00:58:21.320
luca: This was the ultimate chip for this project. Before that, we had the so-called pipe cleaner, which was this epoch Zero, which was a smaller version. You can see here the impact of our tile-based approach. Again, these are monolithic chips, not chipless, just to be clear.

281
00:58:21.330 --> 00:58:36.150
luca: And as you can go from epoch 0 to epoch 1, you can see, even at a quick glance, that epoch 1 is much more complex in term… along many different metrics, number of accelerators, numbers of tiles, power domains, and so forth.

282
00:58:36.150 --> 00:58:50.329
luca: And in something that is a demonstration of the scalability of our approach to design these integrated circuits, is that essentially the same team of about 8 people, in fact, a junior member replaced a more senior member.

283
00:58:50.330 --> 00:58:56.539
luca: Design a much more complex chip in, essentially the same amount of time, about 3 months.

284
00:58:56.580 --> 00:58:58.219
luca: How that was possible.

285
00:58:58.540 --> 00:59:10.639
luca: with various things, but in particular, when it comes to the design of the components and the RTL, by doing a lot of reuse of open-source intellectual property blocks.

286
00:59:10.740 --> 00:59:28.800
luca: In fact, the whole chip is based on hardware, which is open source, and not all the hardware was designed by the members of the team. In fact, in particular, the four RISC-V processors, which are capable of booting Linux SMP, were designed from, come from an open source project.

287
00:59:29.100 --> 00:59:34.700
luca: from ETH Zurich, the… for NVIDIA Deep Learning Accelerator come from NVIDIA Research.

288
00:59:34.700 --> 00:59:54.990
luca: But the other components, for the most part, were designed by the team, although fairly independently. So Harvard contributed with four accelerators, IBM contributed with one accelerator and the power management scheme, and Columbia really contributed to the accelerator, and the overall glue, if you will, the memory hark in the network on chip, and the approach to bring things together.

289
00:59:55.650 --> 01:00:10.300
luca: So, this was built on top of this open-source platform, ESP, which is this 12-year project that I've been leading at Columbia, and this is a snapshot of the website that I invite you to visit if you are interested.

290
01:00:10.300 --> 01:00:16.589
luca: And there are various things that can be said about ESP, but the things that I think is most important in this context

291
01:00:16.590 --> 01:00:29.939
luca: is how ESP was designed to begin with, with the idea of doing… of promoting collaborative engineering and reuse of design. In fact, even the components that are developed using ESP design flows.

292
01:00:29.940 --> 01:00:35.440
luca: And there are a variety of design flaws, are meant to be designed in a way that are highly reusable.

293
01:00:35.440 --> 01:00:56.910
luca: And when I say a variety of design flow, it's part of the philosophy of ESP that we don't use a particular language, or we don't dictate one language on one CAD tool, and we try to integrate a variety of design flows, starting from different languages, whether we start at RTL or a C++ system C with high-level synthesis to develop accelerator, for instance, for…

294
01:00:57.130 --> 01:01:13.919
luca: AI applications, so we even have domain-specific design flow leveraging other open-source projects, like HLS4ML. And then all these things can be brought together to the library, and through a fairly just push-button approach, we can arrive to FPGA prototyping, and

295
01:01:13.920 --> 01:01:23.430
luca: we could do a chip like that, maybe it wasn't exactly push-button, but in terms of design productivity, I think it was fairly impressive, what we were able to achieve.

296
01:01:23.440 --> 01:01:48.240
luca: So now, as we look at how we want to do things in the code, one of key components is to take this lesson and all this experience and, if you will, chipletize this approach, and we're gonna do it in collaboration with Harvard, that have already started working on a system quite scalable and modular for developed chiplets components, so we're gonna bring the two things together.

297
01:01:48.270 --> 01:01:54.329
luca: In a platform that really is meant to, not only allow,

298
01:01:54.990 --> 01:01:57.010
luca: Do demonstration of design.

299
01:01:57.010 --> 01:02:19.109
luca: with all the various components developing the other two trusts of the code, but also bring together the, some of the open source tools that you have heard more at the architectural and software level. And yes, this particular task that we have is meant to allow integration of this and going towards the physical design.

300
01:02:19.250 --> 01:02:30.780
luca: For the physical design, there will be the second task, 3.2, led by Andrew Kang and Kevin Cowell from UC San Diego and University of Minnesota.

301
01:02:30.840 --> 01:02:45.660
luca: Respectively, and I'm giving the presentation on their behalf. So the overarching challenge here is, again, leverage. Leverage the development of an energy-efficient, domain-specialized design, particularly for DOE scientific application.

302
01:02:45.660 --> 01:03:03.530
luca: And while doing this, in the context of the code, at the same time, making sure that we can also leverage a lot of interaction that many of us, and particularly the leaders of this task, have with other programs, ecosystem, and standards.

303
01:03:03.530 --> 01:03:15.920
luca: And here, again, we have quite a bit of experience that we can leverage in chiplet design and tape-out, in the development of EDA tool and infrastructure, particularly from RTL design all the way to tape-out.

304
01:03:15.940 --> 01:03:25.060
luca: And, in collaboration with, and in leading, effectively, many tasks, particularly in, coming with standards.

305
01:03:25.300 --> 01:03:39.369
luca: in EDA, that, particularly Andrew Kang has. So Minnesota brings, capability in terms of, chiplet prototyping, and these are examples of ongoing projects. San Diego brings, brings,

306
01:03:39.370 --> 01:04:02.870
luca: and really remarkable leadership in developing the open road infrastructure, which is meant to produce a completely open source CAD flow for design advanced integrated circuits. To be clear, the integrated circuits that I showed before, Epoch 1, was designed in part with open source tool, but when it comes to RTL to GDS2, meaning to Synergy and physical design.

307
01:04:02.870 --> 01:04:17.959
luca: it leveraged commercial tools. So one of the challenges that we plan to tackle in the code is to bring together what you have seen with ESPEED and what OpenRoad does to really enable a fully open source design of,

308
01:04:17.960 --> 01:04:23.560
luca: the chiplets, meaning the monolithic chips, as well as a collaboration also with the other.

309
01:04:23.700 --> 01:04:28.700
luca: trust the chiplet, base system in a package.

310
01:04:28.830 --> 01:04:37.000
luca: And in doing that, there is ongoing work in collaboration with SELA in terms of system by finding four chiplets.

311
01:04:37.140 --> 01:04:59.979
luca: And quite a bit of leverage of other projects led particularly by the two copies that lead to this trust from these two NSF Chip Design Hub, and this NSF POS Phase 2 project, which, is meant to support further development of the open road open source ecosystem. So it's really… this is about bringing everybody together, and, in fact, I would say, is a…

312
01:04:59.980 --> 01:05:10.160
luca: very proper lead for the Q&A, because what we are planning to achieve today is to show what we plan to do, and see if we can foster collaboration with all of you.

313
01:05:10.360 --> 01:05:11.250
luca: Thank you.

314
01:05:13.000 --> 01:05:14.649
James (Jim) Ang: All right, thank you, Luca.

315
01:05:14.760 --> 01:05:20.970
James (Jim) Ang: Alright, well, I know that was a whirlwind, but maybe we have a little bit of time here for…

316
01:05:21.320 --> 01:05:24.249
James (Jim) Ang: Questions from any of the audience?

317
01:05:28.700 --> 01:05:43.040
James (Jim) Ang: Maybe I'll just add Luca's comments about a lot of these external collaborations. I recognize that there are linkages across Meerkat with several of the other projects that also have engagement with NSF,

318
01:05:43.040 --> 01:05:53.659
James (Jim) Ang: with, maybe DOD, some of those microelectronics hubs. So there, there's many areas where, we can, I think, join forces and…

319
01:05:53.870 --> 01:06:05.149
James (Jim) Ang: And, speak with a common voice, and within the Meerkat Research Center for, for the need and use of these, harbor prototyping, capabilities.

320
01:06:08.920 --> 01:06:25.840
Paul McIntyre: Yeah, thanks so much. I know Valerie's got her hand up, so I'll try to keep this one quick. First of all, thanks for a great overview. It's very exciting, and great to see how the three thrusts are so well connected with each other in such a logical way.

321
01:06:25.890 --> 01:06:30.249
Paul McIntyre: I, I guess one question I had just sort of…

322
01:06:30.380 --> 01:06:33.460
Paul McIntyre: Cindy are taking a step back,

323
01:06:33.660 --> 01:06:39.530
Paul McIntyre: Democratization of co-design in this, context obviously is very important.

324
01:06:39.580 --> 01:06:59.330
Paul McIntyre: And a lot of the, innovations that are happening, sort of, in the hyperscale regime are, not democratized at all. But, of course, a lot of the really major energy challenges, societally, are at the data center.

325
01:06:59.470 --> 01:07:04.110
Paul McIntyre: level. So, to what extent are the,

326
01:07:04.250 --> 01:07:09.249
Paul McIntyre: the tools that are going to be developed by Decode, likely to…

327
01:07:09.360 --> 01:07:16.210
Paul McIntyre: be, adopted by the bigger players, in the,

328
01:07:16.520 --> 01:07:25.259
Paul McIntyre: In this… in this field, in the… in the data… data-rich, AI computing field, and particularly, you know,

329
01:07:25.260 --> 01:07:38.999
Paul McIntyre: companies like, just recently, you know, in the last couple of weeks, the announcement of, this, alliance between AMD and OpenAI, and they're actually quoting the compute, in terms of the number of gigawatts.

330
01:07:39.000 --> 01:07:50.709
Paul McIntyre: That they're going to be supporting to OpenAI. So it's really the energy… the energy consumption is the new currency, it's not really necessarily the computing power. So with that in mind.

331
01:07:50.770 --> 01:08:01.349
Paul McIntyre: How transferable or how, how… how much, influence would… do you think these tools will have on the broader, ecosystem?

332
01:08:01.710 --> 01:08:21.640
James (Jim) Ang: Yeah, I can share my thoughts, but I'll also invite my leadership team to add their feedback. For me, Paul, I don't view that our open-source tools will ever be, you know, a replacement for commercial tools. When it comes to productization, a lot of the

333
01:08:21.640 --> 01:08:30.210
James (Jim) Ang: a lot of the infrastructure and the expense for using commercial tools will still remain, but but I think there's an important…

334
01:08:30.210 --> 01:08:48.540
James (Jim) Ang: opportunity, especially as we enter this post-Moore's Law era, where… where architectural innovations are… are going to prevail over… are going to have much, much greater potential for impact than, than, innovations based on, you know, improvements in process technology.

335
01:08:48.720 --> 01:09:03.140
James (Jim) Ang: And, I come from the DOE exascale computing project world, where Path Forward and StarForward investments were not meant to lead to the development of new,

336
01:09:03.359 --> 01:09:19.729
James (Jim) Ang: you know, first-of-a-kind product designs. They were meant to provide a platform for hardware-software co-design between the national lab software developers, algorithm developers, and the industry developers that were pursuing

337
01:09:19.880 --> 01:09:36.150
James (Jim) Ang: what are maybe analogous to MVPs, again: you know, not products, but actual working prototypes that help provide a foundation for a co-design collaboration.

338
01:09:36.330 --> 01:09:56.009
James (Jim) Ang: So our hope is that what we can do is provide existence proofs for advanced architecture concepts that would be worthy of proceeding to NRE investments. At that stage, I can't speak for DOE, but I think the natural step, as was done in the Exascale Computing Project:

339
01:09:56.100 --> 01:10:06.060
James (Jim) Ang: You would actually look to competitive procurements for non-recurring engineering investments and follow-on platform procurements.

340
01:10:06.220 --> 01:10:23.769
James (Jim) Ang: That's for large scale, and that's an important step. For edge computing and many of the DOE application drivers that Meerkat's targeting, we're going to have to think about how low-volume production

341
01:10:23.840 --> 01:10:43.760
James (Jim) Ang: is actually going to be a viable outcome of our research center, because you'll never attract a commercial market to produce the kinds of microelectronics and edge computing devices that we need to deploy in our experimental user facilities.

342
01:10:43.760 --> 01:10:49.920
James (Jim) Ang: Because there's not enough production volume that would be interesting to NVIDIA and AMD, etc.

343
01:10:50.010 --> 01:11:00.190
James (Jim) Ang: So, before we move on to the other raised hands: leadership team, do you have any additional comments to add on Paul's question?

344
01:11:02.910 --> 01:11:04.080
James (Jim) Ang: Okay, let's…

345
01:11:04.080 --> 01:11:04.640
luca: Nice.

346
01:11:04.960 --> 01:11:06.399
James (Jim) Ang: Luca, go ahead.

347
01:11:06.540 --> 01:11:11.740
luca: No, no, I said that I think you captured the thought, so I think we can move on

348
01:11:12.160 --> 01:11:13.090
James (Jim) Ang: Very good, very good.

349
01:11:13.090 --> 01:11:14.160
luca: To the next questions.

350
01:11:14.350 --> 01:11:17.690
James (Jim) Ang: Okay, Valerie, I think you were next, and then Maya.

351
01:11:18.690 --> 01:11:32.070
Valerie Taylor: So, thanks, Jim, and thanks, Luca, Antonino, David. It was a really great presentation, and really nice to get to know a lot more about Decode.

352
01:11:32.070 --> 01:11:47.520
Valerie Taylor: My question has to do with the higher levels. I found it very interesting that you were looking at the compiler level as well, which I think is great, so much needed,

353
01:11:47.600 --> 01:11:57.080
Valerie Taylor: and often overlooked. So, I guess one question that comes up is around programming models.

354
01:11:57.100 --> 01:12:11.539
Valerie Taylor: If you can comment on what you're looking at with respect to programming models: are you taking into account different programming models? Are you looking at having a meta-programming model

355
01:12:11.540 --> 01:12:17.270
Valerie Taylor: to interoperate with different programming models? I was just wondering

356
01:12:17.400 --> 01:12:23.070
Valerie Taylor: about your thoughts on programming models with different applications.

357
01:12:23.070 --> 01:12:36.680
James (Jim) Ang: Sure, sure. We've done a lot of background work on compiler frameworks, programming models, and runtime systems. Maybe I'll actually let Antonino take this question, because…

358
01:12:36.710 --> 01:12:43.079
James (Jim) Ang: he's been very engaged in compiler frameworks for over a decade. So, Antonino?

359
01:12:44.980 --> 01:12:47.330
Antonino Tumeo (PNNL): Sure. So,

360
01:12:48.480 --> 01:13:10.249
Antonino Tumeo (PNNL): Long story short, we obviously are going to leverage something that is task-based and very amenable to what we can do right now with frameworks that are based on MLIR. One of the opportunities we have with MLIR is that we actually already have

361
01:13:10.440 --> 01:13:27.699
Antonino Tumeo (PNNL): infrastructure that was listed among the open-source tools, right? COMET, which was started by Gokcen Kestor and continues to be developed by other team members. And, the…

362
01:13:27.900 --> 01:13:39.340
Antonino Tumeo (PNNL): The approach we are taking there leverages domain-specific languages, because for scientific computing, that's potentially the way to go.

363
01:13:39.370 --> 01:13:51.590
Antonino Tumeo (PNNL): And as we explore opportunities, especially for the new types of accelerators, as you have seen in the presentation, the idea that we have is leveraging

364
01:13:51.590 --> 01:14:00.369
Antonino Tumeo (PNNL): potentially specific DSLs and specific libraries, right? And obviously, MLIR will enable us to do this mapping.

365
01:14:00.370 --> 01:14:07.399
Antonino Tumeo (PNNL): So, there are definitely a lot of opportunities there, and a lot of discussion that we can have, and…
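
[Editor's note] To make the progressive-lowering idea in Antonino's answer concrete, here is a toy, purely illustrative sketch of how an MLIR-style rewrite can map a domain-specific op down to loop-level ops. All op names (`dsl.matmul`, `loop.for`, `arith.mac`) are hypothetical, not actual COMET or MLIR dialect names; the real infrastructure would use MLIR dialects and pattern-rewriting, not this hand-rolled pass.

```python
# Toy sketch of MLIR-style progressive lowering (hypothetical op names).
from dataclasses import dataclass


@dataclass
class Op:
    name: str        # e.g. "dsl.matmul" at the domain level
    operands: tuple  # symbolic operand names


def lower(op: Op) -> list[Op]:
    """Rewrite a high-level domain op into lower-level loop ops."""
    if op.name == "dsl.matmul":
        a, b, c = op.operands
        # A domain-specific matmul lowers to an explicit loop nest.
        return [
            Op("loop.for", ("i",)),
            Op("loop.for", ("j",)),
            Op("loop.for", ("k",)),
            Op("arith.mac", (c, a, b)),  # c += a * b
        ]
    return [op]  # ops already at the target level pass through


program = [Op("dsl.matmul", ("A", "B", "C"))]
lowered = [out for op in program for out in lower(op)]
print([op.name for op in lowered])
# → ['loop.for', 'loop.for', 'loop.for', 'arith.mac']
```

The point of the sketch is only the shape of the flow: a DSL front end emits high-level ops, and successive rewrites map them toward a specific accelerator target.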

366
01:14:07.660 --> 01:14:10.459
Antonino Tumeo (PNNL): Oh, God.

367
01:14:10.460 --> 01:14:13.879
James (Jim) Ang: And Valerie, you can have a longer, deeper follow-up conversation.

368
01:14:13.880 --> 01:14:14.570
Antonino Tumeo (PNNL): Yes.

369
01:14:14.570 --> 01:14:15.750
James (Jim) Ang: with the team on this, okay?

370
01:14:15.750 --> 01:14:20.069
Valerie Taylor: Okay, that would be great. Thank you, and I enjoyed the presentation.

371
01:14:20.070 --> 01:14:21.989
James (Jim) Ang: Very good. Maya?

372
01:14:22.970 --> 01:14:38.639
Maya: So, I have a question that probably has a lot of different answers, but I know time is short, but it seems as if when you're doing chiplet and chip design that the majority of the time is spent in verification, and I wondered what

373
01:14:38.640 --> 01:14:45.329
Maya: how you were addressing that, or whether, for the MVP concept, that isn't as important as getting the ideas out there.

374
01:14:47.940 --> 01:14:54.470
James (Jim) Ang: Yeah, you know, part of our approach, I think, is to try and…

375
01:14:55.140 --> 01:15:04.920
James (Jim) Ang: maybe sidestep a lot of the verification tasks by rapidly and

376
01:15:05.260 --> 01:15:11.760
James (Jim) Ang: inexpensively creating test hardware, so that

377
01:15:11.830 --> 01:15:17.780
James (Jim) Ang: our verification of chiplet packaging and chiplet integration

378
01:15:17.780 --> 01:15:33.749
James (Jim) Ang: comes from the use of testing. But maybe this is something Andreas could talk about: how the definition of our chiplet interfaces will also help address the verification challenge.

379
01:15:35.750 --> 01:15:45.799
aolofsson: Yeah, hi Maya. I think there's a couple of things that we're doing. One is the open-source aspect of

380
01:15:45.840 --> 01:16:00.060
aolofsson: some of the IP that we're developing. And so, you know, there's certainly less verification when you don't have to write code: when you download, you know, TensorFlow or PyTorch, right, somebody else has done the verification.

381
01:16:00.170 --> 01:16:06.360
aolofsson: Yes, you do have to worry about how you compose those library functions and create the upper-level logic, but at least somebody has

382
01:16:06.640 --> 01:16:21.510
aolofsson: unit tested and fuzzed and verified the lower-level stuff, right? So that's a net gain. In hardware, we generally don't do that. You know, we do reuse some IP out there, but for the most part, we write a lot of code

383
01:16:21.730 --> 01:16:28.360
aolofsson: from scratch, and so we can't avoid that verification. But I think the much more important thing is the fact that once you go to chiplets,

384
01:16:28.870 --> 01:16:42.259
aolofsson: you just short-circuit the verification altogether, especially all the physical stuff: running DRC on the deck, running signal integrity and power analysis. Once you have a die, you have a physical device that you can write a data sheet for.

385
01:16:42.410 --> 01:16:45.270
aolofsson: And then, the verification goes down drastically.

386
01:16:45.270 --> 01:16:45.890
luca: So…

387
01:16:46.500 --> 01:16:47.720
aolofsson: The idea of turning

388
01:16:48.000 --> 01:16:59.340
aolofsson: IP into a hardened physical device is sort of raising the level of abstraction by a lot. Now we just have a box with pins on it, as opposed to,

389
01:16:59.540 --> 01:17:16.739
aolofsson: a white box that you really have to verify every time you tape it out. And we see this a lot, especially with complex IP like SerDes and memory controllers, where even though we license it from big vendors and pay them millions of dollars, we still have to spend six to nine months before we tape out.

390
01:17:16.880 --> 01:17:29.200
aolofsson: And imagine if, instead of that, you had a memory controller chiplet that just worked, and all you had to do was hook up the pins properly and you're done. You can imagine how much you can save with that. So, that's my spiel.
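
[Editor's note] Andreas's abstraction-raising argument can be sketched as a toy model. All class names, pin names, and task lists here are hypothetical, purely to illustrate the contrast: soft white-box IP exposes its internals and re-incurs the full verification list at every tape-out, while a hardened chiplet exposes only pins and a data sheet, so integration checks shrink to the interface.

```python
# Toy model (hypothetical names) of white-box IP vs. a hardened chiplet.

class WhiteBoxIP:
    """Soft IP: internals are exposed, so every tape-out re-verifies
    logic, physical rules, signal integrity, timing, and so on."""
    verification_tasks = ["unit", "fuzz", "DRC", "signal_integrity", "timing"]


class HardenedChiplet:
    """Physical die with a data sheet: only the pin-level interface
    needs checking at integration time."""
    pins = {"clk", "rst", "data_in", "data_out"}
    verification_tasks = ["pin_interface"]


def integration_effort(block) -> int:
    # Effort proxy: number of verification tasks the integrator owns.
    return len(block.verification_tasks)


print(integration_effort(WhiteBoxIP))       # → 5
print(integration_effort(HardenedChiplet))  # → 1
```

The numbers are arbitrary; the design point is that hardening collapses the integrator's verification surface from the IP's internals down to its interface.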

391
01:17:30.420 --> 01:17:31.479
James (Jim) Ang: Thank you. Thank you.

392
01:17:33.910 --> 01:17:48.950
Maya: So, let me just quickly say to Andreas: right now, at the OCP event, ARM is presenting their chiplet interface. But at this moment, it's on their screen, because Ron just sent it to me in a…

393
01:17:48.950 --> 01:17:52.850
aolofsson: I saw that. The more standards, the merrier, they say.

394
01:17:52.850 --> 01:17:59.340
James (Jim) Ang: Yes, I learned from Sadash Shankar that we're in conflict with the…

395
01:17:59.460 --> 01:18:03.560
James (Jim) Ang: the OCP workshop today.

396
01:18:06.180 --> 01:18:13.389
James (Jim) Ang: Okay, I think that's all. Any last questions or comments from anyone?

397
01:18:17.700 --> 01:18:30.200
James (Jim) Ang: All right, if not, the invitation's open to all the Meerkat teams. Feel free to reach out to me or any of our leadership team here at Decode. We're happy to have follow-up discussions.

398
01:18:32.300 --> 01:18:33.410
Paul McIntyre: Thanks so much, Jim.

399
01:18:34.010 --> 01:18:35.059
Paul McIntyre: Thank you. Thanks, everybody.

400
01:18:35.060 --> 01:18:35.810
luca: Everyone.

401
01:18:36.060 --> 01:18:38.539
James (Jim) Ang: Thanks, everyone, and we will provide our slides.

402
01:18:38.540 --> 01:18:39.260
John Leidel: So, thanks.

