Energy Mover's Distance
The Energy Mover's Distance (EMD), also known as the Earth Mover's Distance, is a metric between particle collider events introduced in 1902.02346. This submodule contains convenient functions for computing EMDs between individual events and collections of events. The core of the computation is handled by either the Wasserstein library or the Python Optimal Transport (POT) library, one of which must be installed in order to use this submodule.
From Eqs. (1.2) and (1.3) in 2004.04159, the EMD between two events is the minimum ''work'' required to rearrange one event \mathcal E into the other \mathcal E' by movements of energy f_{ij} from particle i in one event to particle j in the other:
where E_i,E^\prime_j are the energies of the particles in the two events, \theta_{ij} is an angular distance between particles, and E_\text{min}=\min\left(\sum_{i=1}^ME_i,\,\sum_{j=1}^{M'}E^\prime_j\right) is the smaller of the two total energies. In a hadronic context, transverse momenta are used instead of energies.
emd
energyflow.emd.emd(*args, **kwargs)
Computes the EMD between two events. The emd
function is set equal to
emd_wasserstein
or emd_pot
, with the
former preferred unless the Wasserstein library is not available.
emds
energyflow.emd.emds(*args, **kwargs)
Computes the EMDs between collections of events. The emds
function is
set equal to emds_wasserstein
or
emds_pot
, with the former preferred unless the Wasserstein
library is not available.
emd_wasserstein
energyflow.emd.emd_wasserstein(ev0, ev1, dists=None, R=1.0, beta=1.0, norm=False, gdim=2, mask=False,
return_flow=False, do_timing=False,
n_iter_max=100000,
epsilon_large_factor=10000.0, epsilon_small_factor=1.0)
Compute the EMD between two events using the Wasserstein library.
Arguments
- ev0 : numpy.ndarray
- The first event, given as a two-dimensional array. The event is
assumed to be an
(M,1+gdim)
array of particles, whereM
is the multiplicity andgdim
is the dimension of the ground space in which to compute euclidean distances between particles (as specified by thegdim
keyword argument). The zeroth column is the weights of the particles, typically their energies or transverse momenta. For typical hadron collider jet applications, each particle will be of the form(pT,y,phi)
wherey
is the rapidity andphi
is the azimuthal angle. Ifdists
are provided, then the columns after the zeroth are ignored; alternatively a one-dimensional array consisting of just the particle weights may be passed in this case.
- The first event, given as a two-dimensional array. The event is
assumed to be an
- ev1 : numpy.ndarray
- The other event, same format as
ev0
.
- The other event, same format as
- dists : numpy.ndarray
- A distance matrix between particles in
ev0
andev1
. IfNone
, then the columns of the events after the zeroth are taken to be coordinates and thegdim
-dimensional Euclidean distance is used.
- A distance matrix between particles in
- R : float
- The R parameter in the EMD definition that controls the relative importance of the two terms. Must be greater than or equal to half of the maximum ground distance in the space in order for the EMD to be a valid metric satisfying the triangle inequality.
- beta : float
- The angular weighting exponent. The internal pairwsie distance matrix is raised to this power prior to solving the optimal transport problem.
- norm : bool
- Whether or not to normalize the particle weights to sum to one prior to computing the EMD.
- gdim : int
- The dimension of the ground metric space. Useful for restricting
which dimensions are considered part of the ground space when using
the internal euclidean distances between particles. Has no effect if
dists
are provided.
- The dimension of the ground metric space. Useful for restricting
which dimensions are considered part of the ground space when using
the internal euclidean distances between particles. Has no effect if
- return_flow : bool
- Whether or not to return the flow matrix describing the optimal transport found during the computation of the EMD. Note that since the second term in Eq. 1 is implemented by including an additional particle in the event with lesser total weight, this will be reflected in the flow matrix.
- mask : bool
- If
True
, masks out particles farther thanR
away from the origin. Has no effect ifdists
are provided.
- If
- n_iter_max : int
- Maximum number of iterations for solving the optimal transport problem.
- epsilon_large_factor : float
- Controls some tolerances in the optimal transport solver. This value is multiplied by the floating points epsilon (around 1e-16 for 64-bit floats) to determine the actual tolerance.
- epsilon_small_factor : float
- Analogous to
epsilon_large_factor
but used where the numerical tolerance can be stricter.
- Analogous to
Returns
- float
- The EMD value.
- [numpy.ndarray], optional
- The flow matrix found while solving for the EMD. The
(i,j)
th entry is the amount ofpT
that flows between particle i inev0
and particle j inev1
.
- The flow matrix found while solving for the EMD. The
emds_wasserstein
energyflow.emd.emds_wasserstein(events0, events1=None, R=1.0, beta=1.0, norm=False, gdim=2, mask=False,
external_emd_handler=None,
n_jobs=-1, print_every=0, verbose=0,
throw_on_error=True, n_iter_max=100000,
epsilon_large_factor=10000.0,
epsilon_small_factor=1.0)
Compute the EMDs between collections of events. This can be used to compute EMDs between all pairs of events in a set or between events in two different sets.
Arguments
- events0 : list
- Iterable collection of events. Each event is assumed to be an
(M,1+gdim)
array of particles, whereM
is the multiplicity andgdim
is the dimension of the ground space in which to compute euclidean distances between particles (as specified by thegdim
keyword argument). The zeroth column is the weights of the particles, typically their energies or transverse momenta. For typical hadron collider jet applications, each particle will be of the form(pT,y,phi)
wherey
is the rapidity andphi
is the azimuthal angle. Ifdists
are provided, then the columns after the zeroth are ignored; alternatively a one-dimensional array consisting of just the particle weights may be passed in this case.
- Iterable collection of events. Each event is assumed to be an
- events1 : list or
None
- Iterable collection of events in the same format as
events0
, orNone
. If the latter, the pairwise distances between events inevents0
will be computed and the returned matrix will be symmetric.
- Iterable collection of events in the same format as
- R : float
- The R parameter in the EMD definition that controls the relative importance of the two terms. Must be greater than or equal to half of the maximum ground distance in the space in order for the EMD to be a valid metric satisfying the triangle inequality.
- norm : bool
- Whether or not to normalize the particle weights to sum to one prior to computing the EMD.
- beta : float
- The angular weighting exponent. The internal pairwsie distance matrix is raised to this power prior to solving the optimal transport problem.
- gdim : int
- The dimension of the ground metric space. Useful for restricting which dimensions are considered part of the ground space when using the internal euclidean distances between particles.
- mask : bool
- If
True
, ignores particles farther thanR
away from the origin.
- If
- external_emd_handler : wasserstein.ExternalEMDHandler
- An instance of an external EMD handler from the wasserstein
module, e.g.
CorrelationDimension
.
- An instance of an external EMD handler from the wasserstein
module, e.g.
- n_jobs : int or
None
- The number of cpu cores to use. A value of
None
or-1
will use as many threads as there are CPUs on the machine.
- The number of cpu cores to use. A value of
- print_every : int
- The number of computations to do in between printing the progress. Even if the verbosity level is zero, this still plays a role in determining when the worker threads report the results back to the main thread and check for interrupt signals.
- verbose : int
- Controls the verbosity level. A value greater than
0
will print the progress of the computation at intervals specified byprint_every
.
- Controls the verbosity level. A value greater than
- throw_on_error : bool
- Whether or not to raise an exception when an issue is encountered. Can be useful when debugging.
- n_iter_max : int
- Maximum number of iterations for solving the optimal transport problem.
- epsilon_large_factor : float
- Controls some tolerances in the optimal transport solver. This value is multiplied by the floating points epsilon (around 1e-16 for 64-bit floats) to determine the actual tolerance.
- epsilon_small_factor : float
- Analogous to
epsilon_large_factor
but used where the numerical tolerance can be stricter.
- Analogous to
Returns
- numpy.ndarray
- The EMD values as a two-dimensional array, except if an external
EMD handler was provided, in which case no value is returned. If
events1
wasNone
, then the shape will be(len(events0), len(events0))
and the array will be symmetric, otherwise it will have shape(len(events0), len(events1))
.
- The EMD values as a two-dimensional array, except if an external
EMD handler was provided, in which case no value is returned. If
emd_pot
energyflow.emd.emd_pot(ev0, ev1, R=1.0, norm=False, beta=1.0, measure='euclidean', coords='hadronic',
return_flow=False, gdim=None, mask=False, n_iter_max=100000,
periodic_phi=False, phi_col=2, empty_policy='error')
Compute the EMD between two events using the Python Optimal Transport library.
Arguments
- ev0 : numpy.ndarray
- The first event, given as a two-dimensional array. The event is
assumed to be an
(M,1+gdim)
array of particles, whereM
is the multiplicity andgdim
is the dimension of the ground space in which to compute euclidean distances between particles (as specified by thegdim
keyword argument. The zeroth column is assumed to be the energies (or equivalently, the transverse momenta) of the particles. For typical hadron collider jet applications, each particle will be of the form(pT,y,phi)
wherey
is the rapidity andphi
is the azimuthal angle.
- The first event, given as a two-dimensional array. The event is
assumed to be an
- ev1 : numpy.ndarray
- The other event, same format as
ev0
.
- The other event, same format as
- R : float
- The R parameter in the EMD definition that controls the relative importance of the two terms. Must be greater than or equal to half of the maximum ground distance in the space in order for the EMD to be a valid metric satisfying the triangle inequality.
- beta : float
- The angular weighting exponent. The internal pairwsie distance matrix is raised to this power prior to solving the optimal transport problem.
- norm : bool
- Whether or not to normalize the pT values of the events prior to computing the EMD.
- measure : str
- Controls which metric is used to calculate the ground distances
between particles.
'euclidean'
uses the euclidean metric in however many dimensions are provided and specified bygdim
.'spherical'
uses the opening angle between particles on the sphere (note that this is not fully tested and should be used cautiously).
- Controls which metric is used to calculate the ground distances
between particles.
- coords : str
- Only has an effect if
measure='spherical'
, in which case it controls if'hadronic'
coordinates(pT,y,phi,[m])
are expected versus'cartesian'
coordinates(E,px,py,pz)
.
- Only has an effect if
- return_flow : bool
- Whether or not to return the flow matrix describing the optimal transport found during the computation of the EMD. Note that since the second term in Eq. 1 is implemented by including an additional particle in the event with lesser total pT, this will be reflected in the flow matrix.
- gdim : int
- The dimension of the ground metric space. Useful for restricting
which dimensions are considered part of the ground space. Can be
larger than the number of dimensions present in the events (in
which case all dimensions will be included). If
None
, has no effect.
- The dimension of the ground metric space. Useful for restricting
which dimensions are considered part of the ground space. Can be
larger than the number of dimensions present in the events (in
which case all dimensions will be included). If
- mask : bool
- If
True
, ignores particles farther thanR
away from the origin.
- If
- n_iter_max : int
- Maximum number of iterations for solving the optimal transport problem.
- periodic_phi : bool
- Whether to expect (and therefore properly handle) periodicity
in the coordinate corresponding to the azimuthal angle \phi.
Should typically be
True
for event-level applications but can be set toFalse
(which is slightly faster) for jet applications where all \phi differences are less than or equal to \pi.
- Whether to expect (and therefore properly handle) periodicity
in the coordinate corresponding to the azimuthal angle \phi.
Should typically be
- phi_col : int
- The index of the column of \phi values in the event array.
- empty_policy : float or
'error'
- Controls behavior if an empty event is passed in. When set to
'error'
, aValueError
is raised if an empty event is encountered. If set to a float, that value is returned is returned instead on an empty event.
- Controls behavior if an empty event is passed in. When set to
Returns
- float
- The EMD value.
- [numpy.ndarray], optional
- The flow matrix found while solving for the EMD. The
(i,j)
th entry is the amount ofpT
that flows between particle i inev0
and particle j inev1
.
- The flow matrix found while solving for the EMD. The
emds_pot
energyflow.emd.emds_pot(X0, X1=None, R=1.0, norm=False, beta=1.0, measure='euclidean', coords='hadronic',
gdim=None, mask=False, n_iter_max=100000,
periodic_phi=False, phi_col=2, empty_policy='error',
n_jobs=None, verbose=0, print_every=10**6)
Compute the EMDs between collections of events. This can be used to compute EMDs between all pairs of events in a set or between events in two different sets.
Arguments
- X0 : list
- Iterable collection of events. Each event is assumed to be an
(M,1+gdim)
array of particles, whereM
is the multiplicity andgdim
is the dimension of the ground space in which to compute euclidean distances between particles (specified by thegdim
keyword argument). The zeroth column is assumed to be the energies (or equivalently, the transverse momenta) of the particles. For typical hadron collider jet applications, each particle will be of the form(pT,y,phi)
wherey
is the rapidity andphi
is the azimuthal angle.
- Iterable collection of events. Each event is assumed to be an
- X1 : list or
None
- Iterable collection of events in the same format as
X0
, orNone
. If the latter, the pairwise distances between events inX0
will be computed and the returned matrix will be symmetric.
- Iterable collection of events in the same format as
- R : float
- The R parameter in the EMD definition that controls the relative importance of the two terms. Must be greater than or equal to half of the maximum ground distance in the space in order for the EMD to be a valid metric satisfying the triangle inequality.
- norm : bool
- Whether or not to normalize the pT values of the events prior to computing the EMD.
- beta : float
- The angular weighting exponent. The internal pairwsie distance matrix is raised to this power prior to solving the optimal transport problem.
- measure : str
- Controls which metric is used to calculate the ground distances
between particles.
'euclidean'
uses the euclidean metric in however many dimensions are provided and specified bygdim
.'spherical'
uses the opening angle between particles on the sphere (note that this is not fully tested and should be used cautiously).
- Controls which metric is used to calculate the ground distances
between particles.
- coords : str
- Only has an effect if
measure='spherical'
, in which case it controls if'hadronic'
coordinates(pT,y,phi,[m])
are expected versus'cartesian'
coordinates(E,px,py,pz)
.
- Only has an effect if
- gdim : int
- The dimension of the ground metric space. Useful for restricting
which dimensions are considered part of the ground space. Can be
larger than the number of dimensions present in the events (in
which case all dimensions will be included). If
None
, has no effect.
- The dimension of the ground metric space. Useful for restricting
which dimensions are considered part of the ground space. Can be
larger than the number of dimensions present in the events (in
which case all dimensions will be included). If
- mask : bool
- If
True
, ignores particles farther thanR
away from the origin.
- If
- n_iter_max : int
- Maximum number of iterations for solving the optimal transport problem.
- periodic_phi : bool
- Whether to expect (and therefore properly handle) periodicity
in the coordinate corresponding to the azimuthal angle \phi.
Should typically be
True
for event-level applications but can be set toFalse
(which is slightly faster) for jet applications where all \phi differences are less than or equal to \pi.
- Whether to expect (and therefore properly handle) periodicity
in the coordinate corresponding to the azimuthal angle \phi.
Should typically be
- phi_col : int
- The index of the column of \phi values in the event array.
- empty_policy : float or
'error'
- Controls behavior if an empty event is passed in. When set to
'error'
, aValueError
is raised if an empty event is encountered. If set to a float, that value is returned is returned instead on an empty event.
- Controls behavior if an empty event is passed in. When set to
- n_jobs : int or
None
- The number of worker processes to use. A value of
None
will use as many processes as there are CPUs on the machine. Note that for smaller numbers of events, a smaller value ofn_jobs
can be faster.
- The number of worker processes to use. A value of
- verbose : int
- Controls the verbosity level. A value greater than
0
will print the progress of the computation at intervals specified byprint_every
.
- Controls the verbosity level. A value greater than
- print_every : int
- The number of computations to do in between printing the progress. Even if the verbosity level is zero, this still plays a role in determining when the worker processes report the results back to the main process.
Returns
- numpy.ndarray
- The EMD values as a two-dimensional array. If
X1
wasNone
, then the shape will be(len(X0), len(X0))
and the array will be symmetric, otherwise it will have shape(len(X0), len(X1))
.
- The EMD values as a two-dimensional array. If