The Energy Mover's Distance (EMD), also known as the Earth Mover's Distance, is a metric between particle collider events introduced in 1902.02346. This submodule contains convenient functions for computing EMDs between individual events and collections of events. The core of the computation is done using the Python Optimal Transport (POT) library, which must be installed in order to use this submodule.

From Eq. 1 in 1902.02346, the EMD between two events is the minimum "work" required to rearrange one event $\mathcal E$ into the other $\mathcal E'$ by movements of energy $f_{ij}$ from particle $i$ in one event to particle $j$ in the other:

$$\text{EMD}(\mathcal E,\mathcal E')=\min_{\{f_{ij}\}}\sum_{ij}f_{ij}\frac{\theta_{ij}}{R}+\left|\sum_iE_i-\sum_jE^\prime_j\right|,$$

$$f_{ij}\ge 0,\quad \sum_jf_{ij}\le E_i,\quad \sum_if_{ij}\le E^\prime_j,\quad \sum_{ij}f_{ij}=E_\text{min},$$

where $E_i,E^\prime_j$ are the energies of the particles in the two events, $\theta_{ij}$ is an angular distance between particles, and $E_\text{min}=\min\left(\sum_iE_i,\,\sum_jE^\prime_j\right)$ is the smaller of the two total energies. In a hadronic context, transverse momenta are used instead of energies.
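As a rough illustration of the definition, the sketch below evaluates the quantity being minimized for one candidate flow: the transport work, i.e. the sum of $f_{ij}\theta_{ij}/R$ over all particle pairs, plus the difference between the two total energies. The helper name `emd_objective` and the toy numbers are assumptions made for this example. It is not how the submodule computes EMDs (the minimization itself is delegated to POT), and since the EMD is the minimum over all feasible flows, the value for any single feasible flow is in general only an upper bound.

```python
def emd_objective(E0, E1, theta, flow, R=1.0, tol=1e-9):
    """Cost of one feasible flow: transport work plus total-energy mismatch.

    E0, E1 : lists of particle energies (or pTs) in the two events
    theta  : theta[i][j] is the ground distance between particle i and particle j
    flow   : flow[i][j] is the energy moved from particle i to particle j
    """
    n0, n1 = len(E0), len(E1)
    row_sums = [sum(flow[i]) for i in range(n0)]
    col_sums = [sum(flow[i][j] for i in range(n0)) for j in range(n1)]
    e_min = min(sum(E0), sum(E1))

    # Feasibility: flows are non-negative, no particle sends more than it has
    # or receives more than it can hold, and exactly E_min is moved in total.
    feasible = (
        all(f >= -tol for row in flow for f in row)
        and all(r <= e + tol for r, e in zip(row_sums, E0))
        and all(c <= e + tol for c, e in zip(col_sums, E1))
        and abs(sum(row_sums) - e_min) <= tol
    )
    if not feasible:
        raise ValueError("flow violates the EMD constraints")

    work = sum(flow[i][j] * theta[i][j] for i in range(n0) for j in range(n1)) / R
    return work + abs(sum(E0) - sum(E1))


# Toy events: two particles each; the second event has 0.5 less total energy.
E0 = [2.0, 1.0]
E1 = [1.5, 1.0]
theta = [[0.1, 0.4],
         [0.3, 0.2]]
flow = [[1.5, 0.0],
        [0.0, 1.0]]

cost = emd_objective(E0, E1, theta, flow, R=1.0)
print(cost)  # work of about 0.35 plus an energy mismatch of 0.5
```

For these particular numbers the diagonal flow is also the cheapest feasible one, so the printed value (about 0.85) coincides with the EMD of the toy events.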


emd

energyflow.emd.emd(ev0, ev1, R=1.0, norm=False, beta=1.0, measure='euclidean', coords='hadronic', return_flow=False, gdim=None, mask=False, n_iter_max=100000, periodic_phi=False, phi_col=2, empty_policy='error')

Compute the EMD between two events.

Arguments

  • ev0 : numpy.ndarray
    • The first event, given as a two-dimensional array. The event is assumed to be an (M,1+gdim) array of particles, where M is the multiplicity and gdim is the dimension of the ground space in which to compute euclidean distances between particles (as specified by the gdim keyword argument). The zeroth column is assumed to be the energies (or equivalently, the transverse momenta) of the particles. For typical hadron collider jet applications, each particle will be of the form (pT,y,phi) where y is the rapidity and phi is the azimuthal angle.
  • ev1 : numpy.ndarray
    • The other event, same format as ev0.
  • R : float
    • The R parameter in the EMD definition that controls the relative importance of the two terms. Must be greater than or equal to half of the maximum ground distance in the space in order for the EMD to be a valid metric.
  • beta : float
    • The angular weighting exponent. The internal pairwise distance matrix is raised to this power prior to solving the optimal transport problem.
  • norm : bool
    • Whether or not to normalize the pT values of the events prior to computing the EMD.
  • measure : str
    • Controls which metric is used to calculate the ground distances between particles. 'euclidean' uses the euclidean metric in however many dimensions are provided and specified by gdim. 'spherical' uses the opening angle between particles on the sphere (note that this is not fully tested and should be used cautiously).
  • coords : str
    • Only has an effect if measure='spherical', in which case it controls if 'hadronic' coordinates (pT,y,phi,[m]) are expected versus 'cartesian' coordinates (E,px,py,pz).
  • return_flow : bool
    • Whether or not to return the flow matrix describing the optimal transport found during the computation of the EMD. Note that since the second term in Eq. 1 is implemented by including an additional particle in the event with lesser total pT, this will be reflected in the flow matrix.
  • gdim : int
    • The dimension of the ground metric space. Useful for restricting which dimensions are considered part of the ground space. Can be larger than the number of dimensions present in the events (in which case all dimensions will be included). If None, has no effect.
  • mask : bool
    • If True, ignores particles farther than R away from the origin.
  • n_iter_max : int
    • Maximum number of iterations for solving the optimal transport problem.
  • periodic_phi : bool
    • Whether to expect (and therefore properly handle) periodicity in the coordinate corresponding to the azimuthal angle $\phi$. Should typically be True for event-level applications but can be set to False (which is slightly faster) for jet applications where all $\phi$ differences are less than or equal to $\pi$.
  • phi_col : int
    • The index of the column of $\phi$ values in the event array.
  • empty_policy : float or 'error'
    • Controls behavior if an empty event is passed in. When set to 'error', a ValueError is raised if an empty event is encountered. If set to a float, that value is returned instead when an empty event is encountered.
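To make the roles of R, beta, periodic_phi, and phi_col more concrete, here is a minimal sketch of a euclidean ground distance for particles of the form (pT,y,phi). It is an illustration under stated assumptions, not the library's code: the helper `ground_cost` is hypothetical, the angle wrapping shown is just one common way to handle periodicity, and dividing by R before applying beta is an assumption of this sketch.

```python
import math


def ground_cost(p0, p1, R=1.0, beta=1.0, periodic_phi=False, phi_col=2):
    """Euclidean distance between two particles, scaled by R and raised to beta.

    p0, p1 are sequences such as (pT, y, phi); column 0 (the pT) is a weight,
    not a coordinate, so it is skipped when measuring the distance.
    """
    sq = 0.0
    for col in range(1, len(p0)):
        d = p0[col] - p1[col]
        if periodic_phi and col == phi_col:
            # Use the shorter way around the circle for the azimuthal difference.
            d = abs(d) % (2.0 * math.pi)
            d = min(d, 2.0 * math.pi - d)
        sq += d * d
    return (math.sqrt(sq) / R) ** beta


# Two particles pointing in nearly the same direction, on either side of the
# phi = 0 / 2*pi seam.
a = (10.0, 0.0, 0.1)
b = (12.0, 0.0, 2.0 * math.pi - 0.1)

naive = ground_cost(a, b)                       # treats phi as a plain coordinate
wrapped = ground_cost(a, b, periodic_phi=True)  # wraps the phi difference

print(naive, wrapped)
```

Without wrapping, the two particles appear to be almost a full turn apart; with wrapping, they are separated by roughly 0.2. This is the situation in which periodic_phi=True matters, and it cannot arise when all $\phi$ differences are already at most $\pi$.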

Returns

  • float
    • The EMD value.
  • [numpy.ndarray], optional
    • The flow matrix found while solving for the EMD. The (i,j)th entry is the amount of pT that flows between particle i in ev0 and particle j in ev1.
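A short usage sketch, assuming EnergyFlow and POT are installed; the import is guarded so that the snippet still runs (and simply does nothing) when they are not. The two single-particle events are made-up values chosen so the result can be checked by hand: both particles have pT equal to 1 and are separated by a euclidean distance of 0.5 in the (y,phi) plane, so with R=1.0 and beta=1.0 the EMD should come out as 0.5. The helper `single_particle_demo` is introduced only for this example.

```python
try:
    import numpy as np
    from energyflow.emd import emd
except ImportError:  # EnergyFlow or one of its dependencies is not available
    np = emd = None


def single_particle_demo():
    """EMD between two one-particle events given in (pT, y, phi) format."""
    if emd is None:
        return None
    ev0 = np.array([[1.0, 0.0, 0.0]])
    ev1 = np.array([[1.0, 0.3, 0.4]])
    # With return_flow=True the call returns the EMD value and the flow matrix.
    value, flow = emd(ev0, ev1, R=1.0, return_flow=True)
    return value, flow


result = single_particle_demo()
if result is not None:
    value, flow = result
    print("EMD:", value)
    print("flow matrix:")
    print(flow)
```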

emds

energyflow.emd.emds(X0, X1=None, R=1.0, norm=False, beta=1.0, measure='euclidean', coords='hadronic', gdim=None, mask=False, n_iter_max=100000, periodic_phi=False, phi_col=2, empty_policy='error', n_jobs=None, verbose=0, print_every=1000000)

Compute the EMD between collections of events. This can be used to compute EMDs between all pairs of events in a set or between events in two different sets.

Arguments

  • X0 : list
    • Iterable collection of events. Each event is assumed to be an (M,1+gdim) array of particles, where M is the multiplicity and gdim is the dimension of the ground space in which to compute euclidean distances between particles (specified by the gdim keyword argument). The zeroth column is assumed to be the energies (or equivalently, the transverse momenta) of the particles. For typical hadron collider jet applications, each particle will be of the form (pT,y,phi) where y is the rapidity and phi is the azimuthal angle.
  • X1 : list or None
    • Iterable collection of events in the same format as X0, or None. If the latter, the pairwise distances between events in X0 will be computed and the returned matrix will be symmetric.
  • R : float
    • The R parameter in the EMD definition that controls the relative importance of the two terms. Must be greater than or equal to half of the maximum ground distance in the space in order for the EMD to be a valid metric.
  • norm : bool
    • Whether or not to normalize the pT values of the events prior to computing the EMD.
  • beta : float
    • The angular weighting exponent. The internal pairwise distance matrix is raised to this power prior to solving the optimal transport problem.
  • measure : str
    • Controls which metric is used to calculate the ground distances between particles. 'euclidean' uses the euclidean metric in however many dimensions are provided and specified by gdim. 'spherical' uses the opening angle between particles on the sphere (note that this is not fully tested and should be used cautiously).
  • coords : str
    • Only has an effect if measure='spherical', in which case it controls if 'hadronic' coordinates (pT,y,phi,[m]) are expected versus 'cartesian' coordinates (E,px,py,pz).
  • gdim : int
    • The dimension of the ground metric space. Useful for restricting which dimensions are considered part of the ground space. Can be larger than the number of dimensions present in the events (in which case all dimensions will be included). If None, has no effect.
  • mask : bool
    • If True, ignores particles farther than R away from the origin.
  • n_iter_max : int
    • Maximum number of iterations for solving the optimal transport problem.
  • periodic_phi : bool
    • Whether to expect (and therefore properly handle) periodicity in the coordinate corresponding to the azimuthal angle $\phi$. Should typically be True for event-level applications but can be set to False (which is slightly faster) for jet applications where all $\phi$ differences are less than or equal to $\pi$.
  • phi_col : int
    • The index of the column of $\phi$ values in the event array.
  • empty_policy : float or 'error'
    • Controls behavior if an empty event is passed in. When set to 'error', a ValueError is raised if an empty event is encountered. If set to a float, that value is returned instead when an empty event is encountered.
  • n_jobs : int or None
    • The number of worker processes to use. A value of None will use as many processes as there are CPUs on the machine. Note that for smaller numbers of events, a smaller value of n_jobs can be faster.
  • verbose : int
    • Controls the verbosity level. A value greater than 0 will print the progress of the computation at intervals specified by print_every.
  • print_every : int
    • The number of computations to do in between printing the progress. Even if the verbosity level is zero, this still plays a role in determining when the worker processes report the results back to the main process.
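Two of the arguments above, norm and mask, act on each event before any distances are computed. The sketch below shows one plausible reading of that preprocessing for events in (pT,y,phi) format. The helper `preprocess` is hypothetical, and both the choice of the origin as (y,phi)=(0,0) and the order of operations (mask first, then normalize) are assumptions of this example rather than documented behavior.

```python
import math


def preprocess(event, R=1.0, norm=False, mask=False):
    """Return the particles of `event` (rows of (pT, y, phi)) after optional
    masking and pT normalization."""
    particles = [tuple(p) for p in event]
    if mask:
        # Keep only particles within distance R of the origin of the (y, phi) plane.
        particles = [p for p in particles if math.hypot(p[1], p[2]) <= R]
    if norm and particles:
        total = sum(p[0] for p in particles)
        if total > 0.0:
            # Rescale the pTs so that they sum to one.
            particles = [(p[0] / total,) + p[1:] for p in particles]
    return particles


event = [(3.0, 0.1, 0.2), (1.0, -0.3, 0.0), (4.0, 1.5, 0.0)]
kept = preprocess(event, R=1.0, norm=True, mask=True)
print(kept)  # the particle at y = 1.5 lies outside R and is dropped
```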

Returns

  • numpy.ndarray
    • The EMD values as a two-dimensional array. If X1 was None, then the shape will be (len(X0), len(X0)) and the array will be symmetric, otherwise it will have shape (len(X0), len(X1)).
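The shape and symmetry of the returned array can be pictured with a small pure-Python sketch. The function `pairwise_matrix` and the placeholder distance `toy_dist` are hypothetical stand-ins and are not part of EnergyFlow. In this sketch, when X1 is None each unordered pair is evaluated once and mirrored across the diagonal, which gives a symmetric matrix with zeros on the diagonal; otherwise every event in X0 is compared against every event in X1.

```python
def pairwise_matrix(dist, X0, X1=None):
    """Matrix of dist(a, b) values as a list of lists.

    Shape is (len(X0), len(X0)) and symmetric if X1 is None,
    otherwise (len(X0), len(X1)).
    """
    if X1 is None:
        n = len(X0)
        out = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                # Evaluate each unordered pair once and mirror the result.
                out[i][j] = out[j][i] = dist(X0[i], X0[j])
        return out
    return [[dist(a, b) for b in X1] for a in X0]


def toy_dist(ev_a, ev_b):
    """Placeholder distance: difference of total pT (not a real EMD)."""
    return abs(sum(p[0] for p in ev_a) - sum(p[0] for p in ev_b))


X0 = [
    [(1.0, 0.0, 0.0)],
    [(2.0, 0.0, 0.0), (1.0, 0.1, 0.1)],
    [(6.0, 0.0, 0.0)],
]
X1 = [
    [(4.0, 0.0, 0.0)],
    [(0.5, 0.0, 0.0)],
]

sym = pairwise_matrix(toy_dist, X0)        # 3 x 3, symmetric
rect = pairwise_matrix(toy_dist, X0, X1)   # 3 x 2
print(sym)
print(rect)
```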