Datasets
CMS Open Data and the MOD HDF5 Format
Starting in 2014, the CMS Collaboration began to release research-grade recorded and simulated datasets on the CERN Open Data Portal. These fantastic resources provide a unique opportunity for researchers with diverse connections to experimental particle phyiscs world to engage with cutting edge particle physics by developing tools and testing novel strategies on actual LHC data. Our goal in making portions of the CMS Open Data available in a reprocessed format is to ease as best as possible the technical complications that have thus far been present when attempting to use Open Data (see also recent efforts by the CMS Collaboration to make the data more accessible).
To facilitate access to Open Data, we have developed a format utilizing the widespread HDF5 file format that stores essential information for some particle physics analyses. This "MOD HDF5 Format" is currently optimized for studies based on jets, but may be updated in the future to support other types of analyses.
To further the goals of Open Data, we have made our reprocessed samples
available on the Zenodo platform. Currently, the only
"collection" of datasets that is available is CMS2011AJets
, which was used in
Exploring the Space of Jets with CMS Open Data for EMD-based studies. More collections may be added
in the future as our research group completes more studies with the Open Data.
For now, this page focuses on the CMS2011AJets
collection. This collection
includes datasets of jets that are CMS-recorded (CMS), Pythia-generated (GEN),
and detector-simulated (SIM), or in code 'cms'
, 'gen'
, 'sim'
,
respectively. The datasets include all available jets above 375 GeV, which is
where the HLT_Jet300 trigger was found to be fully efficient in both data and
simulation. Note that the pT values referenced in the name of the SIM/GEN
datasets are those of the generator-level hard parton. The DOIs of
CMS2011AJets
MOD HDF5 datasets are:
- CMS 2011A Jets, pT > 375 GeV
- SIM/GEN QCD Jets 170-300 GeV
- SIM/GEN QCD Jets 300-470 GeV
- SIM/GEN QCD Jets 470-600 GeV
- SIM/GEN QCD Jets 600-800 GeV
- SIM/GEN QCD Jets 800-1000 GeV
- SIM/GEN QCD Jets 1000-1400 GeV
- SIM/GEN QCD Jets 1400-1800 GeV
- SIM/GEN QCD Jets 1800-\infty GeV
For more details regarding the creation of these samples, as well as for the
DOIs of the original CMS Open Datasets, see Exploring the Space of Jets with
CMS Open Data. To get started using the
samples, see the MOD Jet Demo which makes use of the
load
function.
load
energyflow.mod.load(*args, amount=1, cache_dir='~/.energyflow', collection='CMS2011AJets',
dataset='cms', subdatasets=None, validate_files=False,
store_pfcs=True, store_gens=True, verbose=0)
Loads samples from the specified MOD dataset. Any file that is needed that has not been cached will be automatically downloaded from Zenodo. Downloaded files are cached for later use. File checksums are optionally validated to ensure dataset fidelity.
Arguments
- *args : arbitrary positional arguments
- Used to specify cuts to be made to the dataset while loading; see
the detailed description of the positional arguments accepted by
MODDataset
.
- Used to specify cuts to be made to the dataset while loading; see
the detailed description of the positional arguments accepted by
- amount : int or float
- Approximate amount of the dataset to load. If an integer, this is the
number of files to load (a warning is issued if more files are
requested than are available). If a float, this is the fraction of the
number of available files to load, rounded up to the nearest whole
number. Note that since ints and floats are treated different, a value
of
1
loads one file whereas1.0
loads the entire dataset. A value of-1
also loads the entire dataset.
- Approximate amount of the dataset to load. If an integer, this is the
number of files to load (a warning is issued if more files are
requested than are available). If a float, this is the fraction of the
number of available files to load, rounded up to the nearest whole
number. Note that since ints and floats are treated different, a value
of
- cache_dir : str
- The directory where to store/look for the files. Note that
'datasets'
is automatically appended to the end of this path, as well as the collection name. For example, the default is to download/look for files in the directory'~/.energyflow/datasets/CMS2011AJets'
.
- The directory where to store/look for the files. Note that
- collection : str
- Name of the collection of datasets to consider. Currently the only
collection is
'CMS2011AJets'
, though more may be added in the future.
- Name of the collection of datasets to consider. Currently the only
collection is
- dataset : str
- Which dataset in the collection to load. Currently the
'CMS2011AJets'
collection has'cms'
,'sim'
, and'gen'
datasets.
- Which dataset in the collection to load. Currently the
- subdatasets : {tuple, list} of str or
None
- The names of subdatasets to use. A value of
None
uses all available subdatasets. Currently, for the'CMS2011AJets'
collection, the'cms'
dataset has one subdataset,'CMS_Jet300_pT375-infGeV'
, the'sim'
dataset has eight subdatasets arrange according to generator \hat p_T, e.g.'SIM470_Jet300_pT375-infGeV'
, and the'gen'
dataset also has eight subdatasets arranged similaraly, e.g.'GEN470_pT375-infGeV'
.
- The names of subdatasets to use. A value of
- validate_files : bool
- Whether or not to validate files according to their MD5 hashes. It
is a good idea to set this to
True
when first downloading the files from Zenodo in order to ensure they downloaded properly.
- Whether or not to validate files according to their MD5 hashes. It
is a good idea to set this to
- store_pfcs : bool
- Whether or not to store PFCs if they are present in the dataset.
- store_gens : bool
- Whether or not to store gen-level particles (referred to as "gens") if they are present in the dataset.
- verbose : int
- Verbosity level to use when loading.
0
is the least verbose,1
is more verbose, and2
is the most verbose.
- Verbosity level to use when loading.
Returns
- MODDataset
- A
MODDataset
object containing the selected events or jets from the specified collection, dataset, and subdatasets.
- A
filter_particles
energyflow.mod.filter_particles(particles, which='all', pt_cut=None, chs=False,
pt_i=0, pid_i=4, vertex_i=5)
Constructs a mask that will select particles according to specified properties. Currently supported are selecting particles according to their charge, removing particles associated to a pileup vertex, and implementing a minimum particle-level pT cut
Arguments
- particles : numpy.ndarray
- Two-dimensional array of particles.
- which : {
'all'
,'charged'
,'neutral'
}- Selects particles according to their charge.
- pt_cut : float or
None
- If not
None
, the minimum transverse momentum a particle can have to be selected.
- If not
- chs : bool
- Whether or not to implement charged hadron subtraction (CHS), which removes particles associated to a non-leading vertex (i.e. with vertex ids greater than or equal to 1).
- pt_i : int
- Column index of the transverse momentum values of the particles.
- pid_i : int
- Column index of the particle IDs (used to select by charge).
- vertex_i : int
- Column index of the vertex IDs (used to implement CHS).
Returns
- numpy.ndarray
- A boolean mask which selects the specified particles, i.e.
particles[filter_particles(particles, ...)]
will be an array of only those particles passing the specified cuts.
- A boolean mask which selects the specified particles, i.e.
kfactors
energyflow.mod.kfactors(dataset, pts, npvs=None, collection='CMS2011AJets',
apply_residual_correction=True)
Evaluates k-factors used by a particular collection. Currently, since CMS2011AJets is the only supported collection, some of the arguments are specific to the details of this collection (such as the use of jet pTs) and may change in future versions of this function.
Arguments
- dataset : {
'sim'
,'gen'
}- Specifies which type of k-factor to use.
'sim'
includes a reweighting to match the distribution of the number of primary vertices between the simulation dataset and the CMS data whereas'gen'
does not.
- Specifies which type of k-factor to use.
- pts : numpy.ndarray
- The transverse momenta of the jets, used to determine the pT-dependent k-factor due to using only leading order matrix elements in the event generation. For the CMS2011AJets collection, these are derived from Figure 5 of this reference.
- npvs : numpy.ndarray of integer type or
None
- The number of primary vertices of a simulated event, used to
reweight a simulated event to match the pileup distribution of data.
Should be the same length as
pts
and correspond to the same events. Not used ifdataset
is'gen'
.
- The number of primary vertices of a simulated event, used to
reweight a simulated event to match the pileup distribution of data.
Should be the same length as
- collection : str
- Name of the collection of datasets to consider. Currently the only
collection is
'CMS2011AJets'
, though more may be added in the future.
- Name of the collection of datasets to consider. Currently the only
collection is
- apply_residual_correction : bool
- Whether or not to apply a residual correction derived from the first bin of the pT spectrum that corrects for the remaining difference between data and simulation.
Returns
- numpy.ndarray
- An array of k-factors corresponding to the events specified by the
pts
and (optionally) thenpvs
arrays. These should be multiplied into any existing weight for the simulated or generated event.
- An array of k-factors corresponding to the events specified by the
MODDataset
Loads and provides access to datasets in MOD HDF5 format. Jets can be
selected when loading from file according to a number of kinematic
attributes. MOD HDF5 datasets are created via the save
method.
Currently, the MOD HDF5 format consists of an HDF5 file with the following
arrays, each of which are stored as properties of the MODDataset
:
/jets_i
- int64- An array of integer jet attributes, which are currently:
fn
: The file number of the jet, used to index thefilenames
array.rn
: The run number of the jet.lbn
: The lumiblock number (or lumisection) of the jet.evn
: The event number of the jet.npv
(CMS/SIM only) : The number of primary vertices of the event containing the jet.quality
(CMS/SIM only) : The quality of the jet, where0
means no quality,1
is "loose",2
is "medium", and3
is "tight".hard_pid
(SIM/GEN only) : The particle ID of the hard parton associated to the jet (0
if not associated).
- An array of integer jet attributes, which are currently:
/jets_f
- float64- An array of floating point jet attributes, which are currently:
jet_pt
: Transverse momentum of the jet.jet_y
: Rapidity of the jet.jet_phi
: Azimuthal angle of the jet.jet_m
: Mass of the jet.jet_eta
: Pseudorapidity of the jet.jec
(CMS/SIM only) : Jet energy correction.jet_area
(CMS/SIM only) : Area of the jet.jet_max_nef
(CMS/SIM only) : Maximum of the hadronic and electromagnetic energy fractions of the jet.gen_jet_pt
(SIM only) : Transverse momentum of an associated GEN jet.-1
if not associated.gen_jet_y
(SIM only) : Rapidity of an associated GEN jet.-1
if not associated.gen_jet_phi
(SIM only) : Azimuthal angle of an associated GEN jet.-1
if not associated.gen_jet_m
(SIM only) : Mass of an associated GEN jet.-1
if not associated.gen_jet_eta
(SIM only) : Pseudorapidity of an associated GEN jet.-1
if not associated.hard_pt
(SIM/GEN only) : Transverse momentum of an associated hard parton.-1
if not associated.hard_y
(SIM/GEN only) : Rapidity of an associated hard parton.-1
if not associated.hard_phi
(SIM/GEN only) : Azimuthal angle of an associated hard parton.-1
if not associated.weight
: Contribution of this jet to the cross-section, in nanobarns.
- An array of floating point jet attributes, which are currently:
/pfcs
- float64 (CMS/SIM only)- An array of all particle flow candidates, with attributes listed
below. There is a separate
/pfcs_index
array in the file which contains information forMODDataset
to separate these particles into separate jets. The columns of the array are currently:pt
: Transverse momentum of the PFC.y
: Rapidity of the PFC.phi
: Azimuthal angle of the PFC.m
: Mass of the PFC.pid
: PDG ID of the PFC.vertex
: Vertex ID of the PFC.0
is leading vertex,>0
is a pileup vertex, and-1
is unknown. Neutral particles are assigned to the leading vertex.
- An array of all particle flow candidates, with attributes listed
below. There is a separate
/gens
- float64 (SIM/GEN only)- An array of all generator-level particles, currently with the same
columns as the
pfcs
array (the vertex column contains all0
s). For the SIM dataset, these are the particles of jets associated to the SIM jets which are described in thejets_i
andjets_f
arrays. As withpfcs
, there is a separategens_index
array which tellsMODDataset
how to separate these gen particles into distinct jets.
- An array of all generator-level particles, currently with the same
columns as the
/filenames
- str- An array of strings indexed by the
fn
attribute of each jet. For CMS, this array is one dimensional and contains the CMS-provided filenames. For SIM/GEN, this array is two dimensional where the first column is the pT value that appears in the name of the dataset and the second column is the CMS-provided filename. In all cases, indexing this array with thefn
attribute of a jet gives the file information in which the event containing that jet is to be found.
- An array of strings indexed by the
Note that the column names of the jets_i
, jets_f
, pfcs
, and gens
arrays are stored as lists of strings in the attributes jets_i_cols
,
jets_f_cols
, pfcs_cols
, and gens_cols
.
For each of the above arrays, MODDataset
stores the index of the column
as an attribute with the same name as the column. For example, for an
instance called modds
, modds.fn
has a value of 0
since it is the
first column in the jets_i
array, modds.jet_phi
has a value of 2
,
modds.m
has a value of 3
, etc.
Even more helpfully, a view of each column of the jets arrays is stored
as an attribute as well, so that modds.jet_pts
is the same as
modds.jets_f[:,modds.jet_pt]
, modds.evns
is the same as
modds.jets_i[:,modds.evn]
, etc. Additionally, one special view is stored,
corr_jet_pts
, which is equal to the product of the jet pTs and the JECs,
i.e. modds.jet_pts*modds.jecs
.
MODDataset
supports the builtin len()
method, which returns the number
of jets currently stored in the dataset, as well as the print()
method,
which prints a summary of the dataset.
energyflow.mod.MODDataset(*args, datasets=None, path=None, num=-1, shuffle=True,
store_pfcs=True, store_gens=True)
MODDataset
can be initialized from a MOD HDF5 file or from a list
of existing MODDataset
s. In the first case, the filename should be
given as the first positional argument. In the second case, the
datasets
keyword argument should be set to a list of MODDataset
objects.
Arguments
- *args : arbitrary positional arguments
- Each argument specifies a requirement for an event to be selected
and kept in the
MODDataset
. All requirements are ANDed together. Each specification can be a string or a tuple/list of objects that will be converted to strings and concatenated together. Each string specifies the name of one of the columns of one of the jets arrays ('corr_jet_pts'
is also accepted, see above, as well as'abs_jet_eta'
,'abs_gen_jet_y'
, etc, which use the absolute values of the [pseudo]rapidities of the jets) as well as one or more comparisons to be performed using the values of that column and the given values in the string. For example,('corr_jet_pts >', 400)
, which is the same as'corr_jet_pts>400'
, will select jets with a corrected pT above 400 GeV. Ampersands may be used within one string to indicated multiple requirements, e.g.'corr_jet_pts > 400 & abs_jet_eta'
, which has the same effect as using multiple arguements each with a single requirement.
- Each argument specifies a requirement for an event to be selected
and kept in the
- datasets : {tuple, list} of
MODDataset
instances orNone
MODDataset
s from which to initialize this dataset. Effectively what this does is to concatenate the arrays held by the datasets. Should always beNone
when initializing from an existing file.
- path : str or
None
- If not
None
, thenpath
is prepended to the filename when initializing from file. Has no effect when initializing from existing datasets.
- If not
- num : int
- The number of events or jets to keep after subselections are
applied. A value of
-1
keeps the entire dataset. The weights are properly rescaled to preserve the total cross section of the selection.
- The number of events or jets to keep after subselections are
applied. A value of
- shuffle : bool
- When subselecting a fraction of the dataset (i.e.
num!=-1
), ifFalse
the firstnum
events passing cuts will be kept, ifTrue
then a random subset ofnum
events will be kept. Note that this has no effect whennum
is-1
, and also that this flag only affects which events are selected and does not randomize the order of the events that are ultimately stored by theMODDataset
object.
- When subselecting a fraction of the dataset (i.e.
- store_pfcs : bool
- Whether or not to store PFCs if they are present in the dataset.
- store_gens : bool
- Whether or not to store gen-level particles (referred to as "gens") if they are present in the dataset.
sel
sel(args, kwargs)
Returns a boolean mask that selects jets according to the specified requirements.
Arguments
- *args : arbitrary positional arguments
- Used to specify cuts to be made to the dataset while loading; see
the detailed description of the positional arguments accepted by
MODDataset
.
- Used to specify cuts to be made to the dataset while loading; see
the detailed description of the positional arguments accepted by
Returns
- numpy.ndarray of type bool
- A boolean mask that will select jets that pass all of the specified requirements.
apply_mask
apply_mask(mask, preserve_total_weight=False)
Subselects jets held by the MODDataset
according to a boolean
mask.
Arguments
- mask : numpy.ndarray or type bool
- A boolean mask used to select which jets are to be kept. Should
be the same length as the
MODDataset
object.
- A boolean mask used to select which jets are to be kept. Should
be the same length as the
- preserve_total_weight : bool
- Whether or not to keep the cross section of the
MODDataset
fixed after the selection.
- Whether or not to keep the cross section of the
save
save(filepath, npf=-1, compression=None, verbose=1, n_jobs=1)
Saves a MODDataset
in the MOD HDF5 format.
Arguments
- filepath : str
- The filepath (with or without the
'.h5'
suffix) where the saved file will be located.
- The filepath (with or without the
- npf : int
- The number of jets per file. If not
-1
, multiple files will be created withnpf
jets as the maximum number stored in each file, in which case'_INDEX'
, whereINDEX
is the index of that file, will be appended to the filename.
- The number of jets per file. If not
- compression : int or
None
- If not
None
, the gzip compression level to use when saving the arrays in the HDF5 file. If notNone
,'_compressed'
will be appended to the end of the filename.
- If not
- verbose : int
- Verbosity level to use when saving the files.
- n_jobs : int
- The number of processes to use when saving the files; only
relevant when
npf!=-1
.
- The number of processes to use when saving the files; only
relevant when
close
close()
Closes the underlying HDF5 file, if one is associated with the
MODDataset
object. Note that associated HDF5 files are closed by
default when the MODDataset
object is deleted.
properties
jets_i
jets_i
The jets_i
array, described under MODDataset
.
jets_f
jets_f
The jets_f
array, described under MODDataset
.
pfcs
pfcs
The pfcs
array, described under MODDataset
.
gens
gens
The gens
array, described under MODDataset
.
particles
particles
If this is a CMS or SIM dataset, particles
is the same as pfcs
;
for GEN it is the same as gens
.
filenames
filenames
The filenames
array, described under MODDataset
.
hf
hf
The underlying HDF5 file, if one is associated to the MODDataset
.
Z + Jets with Delphes Simulation
Datasets of QCD jets used for studying unfolding in OmniFold: A Method to Simultaneously Unfold All Observables. Four different datasets are present:
- Herwig 7.1.5 with the default tune
- Pythia 8.243 with tune 21 (ATLAS A14 central tune with NNPDF2.3LO)
- Pythia 8.243 with tune 25 (ATLAS A14 variation 2+ of tune 21)
- Pythia 8.243 with tune 26 (ATLAS A14 variation 2- of tune 21)
Z + jet events (with the Z set to be stable) were generated for each of the above generator/tune pairs with the Z boson \hat p_T^\text{min} > 150 GeV and \sqrt{s}=14 TeV. Events were then passed through the Delphes 3.4.2 fast detector simulation of the CMS detector. Jets with radius parameter R=0.4 were found with the anti-kt algorithm at both particle level ("gen"), where all non-neutrino, non-Z particle are used, and detector level ("sim"), where reconstructed energy flow objects (tracks, electromagnetic calorimeter cells, and hadronic calorimeter cells) are used. Only jets with transverse momentum greater than are kept (note that sim jets have a simple jet energy correction applied by Delphes). The hardest jet from events with a Z boson with a final transverse momentum of 200 GeV or greater are kept, yielding approximately 1.6 million jets at both gen-level and sim-level for each generator/tune pair.
Each zipped NumPy file consists of several arrays, the names of which begin with either 'gen_' or 'sim_' depending on which set of jets they correspond to. The name of each array ends in a key word indicating what it contains. With the exception of 'gen_Zs' (which contains the (p_T,\,y,\,\phi) of the final Z boson), there is both a gen and sim version of each array. The included arrays are (listed by their key words):
'jets'
- The jet axis four vector, as (p_T^\text{jet},\,y^\text{jet},\, \phi^\text{jet},\,m^\text{jet}) where y^\text{jet} is the jet rapidity, \phi^\text{jet} is the jet azimuthal angle, and m^\text{jet} is the jet mass.'particles'
- The rescaled and translated constituents of the jets as (p_T/100,\,y-y^\text{jet},\,\phi-\phi^\text{jet},\,f_\text{PID}) where f_\text{PID} is a small float corresponding to the PDG ID of the particle. The PIDs are remapped according to 22\to0.0, 211\to0.1, -211\to0.2, 130 \to0.3, 11\to0.4, -11\to0.5, 13\to0.6, -13\to0.7, 321\to0.8, -321 \to0.9, 2212\to1.0, -2212\to1.1, 2112\to1.2, -2112\to1.3. Note that ECAL cells are treated as photons (id 22) and HCAL cells are treated as K_L^0 (id 130).'mults'
- The constituent multiplicity of the jet.'lhas'
- The Les Houches (\beta=1/2) angularity.'widths'
- The jet width ($\beta=1 angularity).'ang2s'
- The \beta=2 angularity (note that this is very similar to the jet mass, but does not depend on particle masses).'tau2s'
- The 2-subjettiness value with \beta=1.'sdms'
- The groomed mass with Soft Drop parameters z_\text{cut}=0.1 and \beta=0.'zgs'
- The groomed momentum fraction (same Soft Drop parameters as above).
If you use this dataset, please cite the Zenodo record as well as the corresponding paper:
- Pythia/Herwig + Delphes samples
- A. Andreassen, P. T. Komiske, E. M. Metodiev, B. Nachman, J. Thaler, OmniFold: A Method to Simultaneously Unfold All Observables, arXiv:1911.09107.
load
energyflow.zjets_delphes.load(dataset, num_data=100000, pad=False, cache_dir='~/.energyflow',
source='zenodo', which='all',
include_keys=None, exclude_keys=None)
Loads in the Z+jet Pythia/Herwig + Delphes datasets. Any file that is needed that has not been cached will be automatically downloaded. Downloaded files are cached for later use. Checksum verification is performed.
Arguments
- datasets: {
'Herwig'
,'Pythia21'
,'Pythia25'
,'Pythia26'
}- The dataset (specified by which generator/tune was used to produce it) to load. Note that this argument is not sensitive to capitalization.
- num_data: int
- The number of events to read in. A value of
-1
means to load all available events.
- The number of events to read in. A value of
- pad: bool
- Whether to pad the particles with zeros in order to form contiguous arrays.
- cache_dir: str
- Path to the directory where the dataset files should be stored.
- source: {
'dropbox'
,'zenodo'
}- Which location to obtain the files from.
- which: {
'gen'
,'sim'
,'all'
}- Which type(s) of events to read in. Each dataset has corresponding generated events at particle-level and simulated events at detector-level.
- include_keys: list or_tuple_ of str, or
None
- If not
None
, load these keys from the dataset files. A value ofNone
uses all available keys (theKEYS
global variable of this module contains the available keys as keys of the dictionary and brief descriptions as values). Note that keys do not have 'sim' or 'gen' prepended to the front yet.
- If not
- exclude_keys: list or tuple or str, or
None
- Any keys to exclude from loading. Most useful when a small number of keys should be excluded from the default set of keys.
Returns
- dict of numpy.ndarray
- A dictionary of the jet, particle, and observable arrays for the specified dataset.
Quark and Gluon Jets
Four datasets of quark and gluon jets, each having two million total jets, have been generated with Pythia and Herwig and are accessible through this submodule of EnergyFlow. The four datasets are:
- Pythia 8.226 quark (uds) and gluon jets.
- Pythia 8.235 quark (udscb) and gluon jets.
- Herwig 7.1.4 quark (uds) and gluon jets.
- Herwig 7.1.4 quark (udscb) and gluon jets
To avoid downloading unnecessary samples, the datasets are contained in twenty files with 100k jets each, and only the required files are downloaded. These are based on the samples used in 1810.05165. Splitting the data into 1.6M/200k/200k train/validation/test sets is recommended for standardized comparisons.
Each dataset consists of two components:
X
: a three-dimensional numpy array of the jets with shape(num_data,max_num_particles,4)
.y
: a numpy array of quark/gluon jet labels (quark=1
and gluon=0
).
The jets are padded with zero-particles in order to make a contiguous array.
The particles are given as (pt,y,phi,pid)
values, where pid
is the
particle's PDG id. Quark jets either include or exclude c and b
quarks depending on the with_bc
argument.
The samples are generated from q\bar q\to Z(\to\nu\bar\nu)+g and qg\to Z(\to\nu\bar\nu)+(uds[cb]) processes in pp collisions at \sqrt{s}=14 TeV. Hadronization and multiple parton interactions (i.e. underlying event) are turned on and the default tunings and shower parameters are used. Final state non-neutrino particles are clustered into R=0.4 anti-k_T jets using FastJet 3.3.0. Jets with transverse momentum p_T\in[500,550] GeV and rapidity |y|<1.7 are kept. Particles are ensured have to \phi values within \pi of the jet (i.e. no \phi-periodicity issues). No detector simulation is performed.
The samples are also hosted on Zenodo and we ask that you cite them appropriately if they are useful to your research. For BibTex entries, see the FAQs.
- Pythia samples
- Herwig samples
load
energyflow.qg_jets.load(num_data=100000, pad=True, ncol=4, generator='pythia',
with_bc=False, cache_dir='~/.energyflow')
Loads samples from the dataset (which in total is contained in twenty files). Any file that is needed that has not been cached will be automatically downloaded. Downloading a file causes it to be cached for later use. Basic checksums are performed.
Arguments
- num_data : int
- The number of events to return. A value of
-1
means read in all events.
- The number of events to return. A value of
- pad : bool
- Whether to pad the events with zeros to make them the same length.
Note that if set to
False
, the returnedX
array will be an object array and not a 3-d array of floats.
- Whether to pad the events with zeros to make them the same length.
Note that if set to
- ncol : int
- Number of columns to keep in each event.
- generator : str
- Specifies which Monte Carlo generator the events should come from.
Currently, the options are
'pythia'
and'herwig'
.
- Specifies which Monte Carlo generator the events should come from.
Currently, the options are
- with_bc : bool
- Whether to include jets coming from bottom or charm quarks. Changing this flag does not mask out these jets but rather accesses an entirely different dataset. The datasets with and without b and c quarks should not be combined.
- cache_dir : str
- The directory where to store/look for the files. Note that
'datasets'
is automatically appended to the end of this path.
- The directory where to store/look for the files. Note that
Returns
- 3-d numpy.ndarray, 1-d numpy.ndarray
- The
X
andy
components of the dataset as specified above. Ifpad
isFalse
then these will be object arrays holding the events, each of which is a 2-d ndarray.
- The
Quark and Gluon Nsubs
A dataset consisting of 45 N-subjettiness observables for 100k quark and gluon jets generated with Pythia 8.230. Following 1704.08249, the observables are in the following order:
The dataset contains two members: 'X'
which is a numpy array of the nsubs
that has shape (100000,45)
and 'y'
which is a numpy array of quark/gluon
labels (quark=1
and gluon=0
).
load
energyflow.qg_nsubs.load(num_data=-1, cache_dir='~/.energyflow')
Loads the dataset. The first time this is called, it will automatically download the dataset. Future calls will attempt to use the cached dataset prior to redownloading.
Arguments
- num_data : int
- The number of events to return. A value of
-1
means read in all events.
- The number of events to return. A value of
- cache_dir : str
- The directory where to store/look for the file.
Returns
- 3-d numpy.ndarray, 1-d numpy.ndarray
- The
X
andy
components of the dataset as specified above.
- The