Quark and Gluon Jets

Four datasets of quark and gluon jets, each having two million total jets, have been generated with Pythia and Herwig and are accessible through this submodule of EnergyFlow. The four datasets are:

  • Pythia 8.226 quark (uds) and gluon jets.
  • Pythia 8.235 quark (udscb) and gluon jets.
  • Herwig 7.1.4 quark (uds) and gluon jets.
  • Herwig 7.1.4 quark (udscb) and gluon jets.

To avoid downloading unnecessary samples, each dataset is split into twenty files of 100k jets, and only the files that are needed are downloaded. These are based on the samples used in 1810.05165. Splitting each dataset into 1.6M/200k/200k train/validation/test sets is recommended for standardized comparisons.
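The recommended 80%/10%/10% proportions can be sketched with plain NumPy as follows. This is only an illustration: the helper name `split_train_val_test`, the seed, and the synthetic arrays are assumptions made for the example and are not part of EnergyFlow.

```python
import numpy as np

def split_train_val_test(X, y, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then cut into train/validation/test pieces (hypothetical helper)."""
    n = len(y)
    n_val = int(round(val_frac * n))
    n_test = int(round(test_frac * n))
    n_train = n - n_val - n_test
    # A single fixed permutation keeps X and y aligned and the split reproducible.
    perm = np.random.default_rng(seed).permutation(n)
    train, val, test = np.split(perm, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# Synthetic stand-in for the real arrays: 20 "jets", 5 particle slots, 4 features.
X_demo = np.arange(20 * 5 * 4, dtype=float).reshape(20, 5, 4)
y_demo = np.arange(20) % 2

(X_train, y_train), (X_val, y_val), (X_test, y_test) = split_train_val_test(X_demo, y_demo)
print(X_train.shape, X_val.shape, X_test.shape)
```

With the full two million jets, the same fractions give the 1.6M/200k/200k split quoted above.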

Each dataset consists of two components:

  • X : a three-dimensional numpy array of the jets with shape (num_data,max_num_particles,4).
  • y : a numpy array of quark/gluon jet labels (quark=1 and gluon=0).

The jets are padded with zero-particles in order to make a contiguous array. The particles are given as (pt,y,phi,pid) values, where pid is the particle's PDG id. Quark jets either include or exclude $c$ and $b$ quarks depending on the with_bc argument.
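As a rough illustration of working with this layout, the sketch below removes the zero-padding and counts the particles in each jet. It uses a small synthetic array in place of the real `X` and `y`, and it assumes that a padded slot can be recognized by `pt == 0`.

```python
import numpy as np

# Synthetic stand-in for X: 2 jets, up to 3 particles, columns (pt, y, phi, pid).
X_demo = np.array([
    [[100.0, 0.1, 1.0, 211.0], [50.0, -0.2, 1.1, 22.0], [0.0, 0.0, 0.0, 0.0]],
    [[200.0, 0.3, 2.0, -211.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
])
y_demo = np.array([1, 0])  # quark=1, gluon=0

# Assumption: a padded slot has pt == 0, so pt > 0 marks a real particle.
mask = X_demo[:, :, 0] > 0
multiplicity = mask.sum(axis=1)

# Drop the padding from each jet, giving a list of (n_particles, 4) arrays.
jets = [jet[m] for jet, m in zip(X_demo, mask)]

# Separate a per-jet quantity by label.
quark_mult = multiplicity[y_demo == 1]
gluon_mult = multiplicity[y_demo == 0]
print(multiplicity)
```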

The samples are generated from $q\bar q\to Z(\to\nu\bar\nu)+g$ and $qg\to Z(\to\nu\bar\nu)+(uds[cb])$ processes in $pp$ collisions at $\sqrt{s}=14$ TeV. Hadronization and multiple parton interactions (i.e. underlying event) are turned on and the default tunings and shower parameters are used. Final-state non-neutrino particles are clustered into $R=0.4$ anti-$k_T$ jets using FastJet 3.3.0. Jets with transverse momentum $p_T\in[500,550]$ GeV and rapidity $|y|<1.7$ are kept. Particles are ensured to have $\phi$ values within $\pi$ of the jet (i.e. no $\phi$-periodicity issues). No detector simulation is performed.
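The $\phi$ condition can be illustrated with a short sketch. This is not necessarily how the samples were produced; the helper `wrap_phi_near` and the example values are assumptions made for illustration. It shifts each angle by a multiple of $2\pi$ so that it lands within $\pi$ of a reference (jet) angle.

```python
import numpy as np

def wrap_phi_near(phi, phi_ref):
    """Shift phi by multiples of 2*pi so it lies in [phi_ref - pi, phi_ref + pi)."""
    return phi_ref + (phi - phi_ref + np.pi) % (2 * np.pi) - np.pi

# A jet near phi = 0.1 with one particle on the other side of the 0 / 2*pi seam.
phis = np.array([0.05, 6.2, 3.0])
wrapped = wrap_phi_near(phis, phi_ref=0.1)
print(wrapped)
```

After wrapping, differences such as `wrapped - phi_ref` can be used directly without any further handling of periodicity.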

The samples are also hosted on Zenodo and we ask that you cite them appropriately if they are useful to your research. For BibTeX entries, see the FAQs.

DOI - Pythia samples
DOI - Herwig samples


load

energyflow.datasets.qg_jets.load(num_data=100000, generator='pythia', pad=True, with_bc=False, cache_dir='~/.energyflow')

Loads samples from the dataset (which in total is contained in twenty files). Any required file that has not been cached is downloaded automatically and then cached for later use. Basic checksums are performed.

Arguments

  • num_data : int
    • The number of events to return. A value of -1 means read in all events.
  • generator : str
    • Specifies which Monte Carlo generator the events should come from. Currently, the options are 'pythia' and 'herwig'.
  • pad : bool
    • Whether to pad the events with zeros to make them the same length. Note that if set to False, the returned X array will be an object array and not a 3-d array of floats.
  • with_bc : bool
    • Whether to include jets coming from bottom or charm quarks. Changing this flag does not mask out these jets but rather accesses an entirely different dataset. The datasets with and without b and c quarks should not be combined.
  • cache_dir : str
    • The directory in which to store and look for the files. Note that 'datasets' is automatically appended to the end of this path.

Returns

  • 3-d numpy.ndarray, 1-d numpy.ndarray
    • The X and y components of the dataset as specified above. If pad is False then X will be an object array holding the events, each of which is a 2-d ndarray.
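To make the relationship between the padded and unpadded forms concrete, here is a minimal sketch that zero-pads a ragged collection of events into a single 3-d array. The helper `pad_events` is hypothetical and is not an EnergyFlow function; the object array built here is a synthetic stand-in for what a call such as `qg_jets.load(pad=False)` is described as returning.

```python
import numpy as np

def pad_events(events, num_features=4):
    """Zero-pad a sequence of (n_i, num_features) arrays into one 3-d array."""
    max_len = max(len(ev) for ev in events)
    X = np.zeros((len(events), max_len, num_features))
    for i, ev in enumerate(events):
        # Real particles fill the leading slots; trailing slots stay zero.
        X[i, :len(ev)] = ev
    return X

# Synthetic stand-in for the object array of 2-d events (pad=False case).
events = np.empty(2, dtype=object)
events[0] = np.ones((3, 4))
events[1] = np.ones((1, 4))

X_padded = pad_events(events)
print(X_padded.shape)
```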

Quark and Gluon Nsubs

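A dataset consisting of 45 $N$-subjettiness observables for 100k quark and gluon jets generated with Pythia 8.230. Following 1704.08249, the observables are in the following order:

$$\{\tau_1^{(\beta=0.5)},\tau_1^{(\beta=1.0)},\tau_1^{(\beta=2.0)},\tau_2^{(\beta=0.5)},\tau_2^{(\beta=1.0)},\tau_2^{(\beta=2.0)},\ldots,\tau_{15}^{(\beta=0.5)},\tau_{15}^{(\beta=1.0)},\tau_{15}^{(\beta=2.0)}\}.$$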

The dataset contains two members: 'X' which is a numpy array of the nsubs that has shape (100000,45) and 'y' which is a numpy array of quark/gluon labels (quark=1 and gluon=0).
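Selecting a single observable from the 45 columns might look like the sketch below. It rests on an assumption about the column order that should be checked against 1704.08249: $N$ runs from 1 to 15 and, within each $N$, $\beta$ cycles through 0.5, 1.0, 2.0. The helper `nsub_column` and the constants are hypothetical names, and a synthetic array stands in for the real `X`.

```python
import numpy as np

BETAS = (0.5, 1.0, 2.0)  # assumed angular exponents, in assumed column order
MAX_N = 15               # assumed largest N, so that 15 * 3 = 45 columns

def nsub_column(N, beta):
    """Column index of tau_N^(beta), assuming beta cycles fastest within each N."""
    if not 1 <= N <= MAX_N:
        raise ValueError("N must be between 1 and %d" % MAX_N)
    return len(BETAS) * (N - 1) + BETAS.index(beta)

# Synthetic stand-in for X with shape (num_jets, 45).
X_demo = np.arange(2 * 45, dtype=float).reshape(2, 45)

# Under the assumed ordering, tau_2 with beta = 1.0 sits in column 4.
tau2_beta1 = X_demo[:, nsub_column(2, 1.0)]
print(tau2_beta1)
```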


load

energyflow.datasets.qg_nsubs.load(num_data=-1, cache_dir=None)

Loads the dataset. The first time this is called, it will automatically download the dataset. Future calls will attempt to use the cached dataset prior to redownloading.

Arguments

  • num_data : int
    • The number of events to return. A value of -1 means read in all events.
  • cache_dir : str
    • The directory in which to store and look for the file.

Returns

  • 2-d numpy.ndarray, 1-d numpy.ndarray
    • The X and y components of the dataset as specified above.