I’ll present results from several publications that use mutual information for machine-learning tasks in HEP/NP. We developed a method for using mutual information as an upper limit on separability for any algorithm that attempts to separate events according to a set of classes. Most common among these in HEP/NP are binary classification problems, in which one attempts to determine whether an event (data point) originated from the distribution of interest (signal) or from some other distribution (background). ML algorithms are often deployed to tackle these problems by training on known samples. We show that the upper limit
(a) determines a priori how well any algorithm can do in principle, and hence provides a natural criterion for when to stop training.
(b) is quickly computable from data.
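As a generic illustration of point (b), not the authors' estimator, the mutual information between a single feature and a binary class label can be estimated directly from a sample with a simple plug-in histogram estimate (the binning choice and all names here are my own):

```python
import numpy as np

def mutual_information(x, y, bins=20):
    """Plug-in histogram estimate of I(X; Y) in bits for a 1-D
    feature x and integer class labels y."""
    # digitize can emit indices 0..bins+1 (under/overflow), hence bins + 2 rows
    x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    joint = np.zeros((bins + 2, int(y.max()) + 1))
    for xi, yi in zip(x_binned, y):
        joint[xi, yi] += 1
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0  # skip empty cells; 0 * log 0 contributes nothing
    return float((p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])).sum())

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 10000)
x = rng.normal(loc=2.0 * y, scale=1.0)   # informative feature: well above 0
x_noise = rng.normal(size=10000)         # uninformative feature: near 0
print(mutual_information(x, y), mutual_information(x_noise, y))
```

Because the estimate only requires binning and counting, it scales linearly with the sample size, which is what makes the limit cheap to compute compared with training a classifier.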
The second use of the upper limit is for feature/variable selection. In HEP/NP one is often given a set of variables to use in classification/event-reconstruction tasks, which can be large (∼10–100 variables) and difficult to work with. One way to reduce the number of variables used is to employ a feature/variable-selection algorithm that searches the subspaces of variable space for a maximally informative set of variables, keeping only a fixed number of variables from the full list. The standard approach involves training on each chosen subset and then comparing the results across subset choices. This can be computationally cumbersome, especially when the variable space is large.
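A back-of-the-envelope count shows why training on every subset becomes cumbersome; the numbers below are mine, taking 30 variables as in the HiggsML example later in the abstract:

```python
from math import comb

n = 30  # number of candidate variables
# Exhaustive subset search: every non-empty subset needs its own training run.
total_subsets = sum(comb(n, k) for k in range(1, n + 1))
print(total_subsets)  # 2**30 - 1 = 1073741823 non-empty subsets
print(comb(n, 9))     # 14307150 subsets of size 9 alone
```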
We’ve also developed an algorithm called the mutual information search tree (MIST) that not only
(a) calculates the upper limit for high-dimensional data sets, but
(b) allows one to sample the subspaces of variable space and find maximally informative subspaces without having to train on each subspace.
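The sketch below is NOT the MIST algorithm; it is a naive greedy forward selection by joint mutual information, shown only to illustrate the idea of scoring variable subsets by information content rather than by trained-model performance (the function names, binning, and toy data are all illustrative; MIST's search-tree strategy avoids this kind of per-subset scan):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def joint_mi(X_disc, y):
    """I(X_S; Y) = H(Y) - H(Y | X_S) for discretized feature columns."""
    # collapse each row of selected bins into one joint symbol
    _, sym = np.unique(X_disc, axis=0, return_inverse=True)
    sym = sym.ravel()
    h_y_given_x = 0.0
    for s in np.unique(sym):
        mask = sym == s
        h_y_given_x += mask.mean() * entropy(y[mask])
    return entropy(y) - h_y_given_x

def greedy_select(X, y, k, bins=8):
    """Toy greedy forward selection by joint MI (not MIST)."""
    D = np.stack([np.digitize(c, np.histogram_bin_edges(c, bins))
                  for c in X.T], axis=1)
    chosen = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(remaining, key=lambda j: joint_mi(D[:, chosen + [j]], y))
        chosen.append(best)
    return chosen

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 5000)
X = rng.normal(size=(5000, 5))
X[:, 2] += 2.0 * y              # only column 2 carries signal
print(greedy_select(X, y, 2))   # column 2 is picked first
```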
I’ll discuss results of using MIST on a widely analyzed data set, the Kaggle HiggsML Challenge data set, which concerns a binary classification problem associated with a mock Higgs search. MIST is able to find a subset of variables from the Higgs set that
(a) includes 9 of the 30 discriminating variables,
(b) contains the maximal amount of information (i.e. saturates the upper limit), and
(c) required only 20 minutes to compute.
Topic: BNL Particle Physics Seminar Thursday Mar 11 2021
Time: Mar 11, 2021 03:00 PM Eastern Time (US and Canada)
Join Zoom Meeting
Meeting ID: 945 5522 6448
Brett Viren, Hanyu Wei