I’ll present results from several publications that utilize mutual information for machine-learning tasks in HEP/NP. We developed a method for using mutual information as an upper limit on the separability achievable by any algorithm that attempts to separate events according to a set of classes. Most common among these in HEP/NP are binary classification problems, in which one attempts to determine whether an event (data point) originated from the distribution of interest (signal) or some other distribution (background). ML algorithms are often deployed to tackle these problems by training on known samples. We show that the upper limit:
(a) determines a priori how well any algorithm can perform in principle, and hence provides a natural criterion for when to stop training, and
(b) is quickly computable from data (the underlying definitions are sketched just below).
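For reference, the quantity behind this claim is the standard mutual information between the feature vector X and the class label C; what follows is a minimal LaTeX sketch of the textbook definition, with the binary-label Fano bound as one way to read the separability statement (the precise bound used in the publications may differ):

  I(X;C) = H(C) - H(C \mid X)
         = \sum_{c \in \{s,\,b\}} \int \mathrm{d}x \; p(x,c)\,\log\frac{p(x,c)}{p(x)\,p(c)},

where s and b denote signal and background. For binary labels, Fano's inequality gives

  h_b(P_e) \ge H(C \mid X) = H(C) - I(X;C),

with h_b the binary entropy and P_e the error probability of any classifier, so a larger I(X;C) is what permits a lower achievable error.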
The second use of the upper limit is for feature/variable selection. Often in HEP/NP one is given a set of variables to use in classification or event-reconstruction tasks, and this set can be large (∼10–100 variables) and difficult to work with. One way to reduce the number of variables is to employ a feature/variable-selection algorithm that searches the subspaces of variable space for a maximally informative set of variables, while keeping only a fixed number of variables from the full list. The standard approach involves training on each chosen subset and then comparing the results across subset choices; this can be computationally cumbersome, especially when the variable space is large.
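As a point of orientation (not the estimator from the publications), mutual information with a binary label can be estimated directly from samples; here is a minimal Python sketch using scikit-learn's k-nearest-neighbour estimator, with all data and variable names hypothetical:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical stand-in data: X is (n_events, n_variables) of
# reconstructed quantities, y holds 0/1 labels (background/signal).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=10_000) > 0).astype(int)

# Per-variable MI with the label, via a k-NN entropy estimator.
# Caveat: this scores each variable separately; the subset-selection
# task needs the *joint* MI of a variable set, which per-variable
# scores do not provide.
per_var_mi = mutual_info_classif(X, y, random_state=0)
print({f"var{j}": round(m, 3) for j, m in enumerate(per_var_mi)})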
We’ve also developed an algorithm called the mutual information search tree (MIST) that not only
(a) calculates the upper limit for high-dimensional data sets, but also
(b) allows one to sample the subspaces of variable space and find maximally informative subspaces without having to train on each subspace (a contrasting brute-force sketch follows below).
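The abstract does not spell out MIST's internals, so purely as a contrast, here is a minimal greedy forward-selection sketch in Python that chases the same objective, an estimate of the joint mutual information I(X_S; C), without training a classifier; the functions joint_mi and greedy_select and the quantile-binning estimator are hypothetical illustrations, not the published method:

import numpy as np

def joint_mi(X, y, bins=8):
    """Plug-in estimate of I(X; y) for a small set of continuous
    variables X (n_events, n_vars) and integer 0/1 labels y, via
    per-variable quantile binning. Coarse, and it degrades quickly
    as the number of variables grows."""
    # Discretize each column into quantile bins, then hash the joint cell.
    codes = np.zeros(len(y), dtype=np.int64)
    for j in range(X.shape[1]):
        edges = np.quantile(X[:, j], np.linspace(0, 1, bins + 1)[1:-1])
        codes = codes * bins + np.searchsorted(edges, X[:, j])

    def H(labels):
        # Empirical Shannon entropy (nats) of a discrete label array.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    # I(X; y) = H(X) + H(y) - H(X, y), with (cell, y) pairs hashed together.
    return H(codes) + H(y) - H(codes * 2 + y)

def greedy_select(X, y, k):
    """Forward selection: repeatedly add the variable that most
    increases the joint-MI estimate. Exhaustive per step, unlike
    a tree-structured search."""
    chosen = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(remaining, key=lambda j: joint_mi(X[:, chosen + [j]], y))
        chosen.append(best)
    return chosen

# Hypothetical demo data, as in the sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=10_000) > 0).astype(int)
print(greedy_select(X, y, k=2))  # picks the two informative variables here

Even this baseline avoids training per subset, but it still rescans every candidate variable at every step, and its plug-in estimator breaks down in higher dimensions; a tree-structured search over subspaces, as MIST is described above, is what would make the same objective tractable at ∼10–100 variables.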
I’ll discuss results of applying MIST to a widely analyzed data set, the Kaggle HiggsML Challenge data set, which poses a binary classification problem associated with a mock Higgs search. MIST finds a subset of variables from the Higgs set which
(a) includes 9 of the 30 discriminating variables,
(b) contains the maximal amount of information (i.e. the upper limit), and
(c) took only 20 minutes to find.
Zoom link:
Topic: BNL Particle Physics Seminar Thursday Mar 11 2021
Time: Mar 11, 2021 03:00 PM Eastern Time (US and Canada)
Join Zoom Meeting
https://fnal.zoom.us/j/94555226448?pwd=cnYvTGlRQ0diekg2YzVtdjJYTEtiZz09
Meeting ID: 945 5522 6448
Passcode: 644658
Brett Viren, Hanyu Wei