Chapter 15
New features in MCFM9
15.1 Download of MCFM9

MCFM9.1.tar.gz (April 7th, 2020)
 New implementation of H+2j virtual matrix elements with mass effects.

MCFM9.0.tar.gz (released September 25th, 2019)
 Overhaul of the code offering many new features such as an efficient calculation of scale and PDF uncertainties and a robust estimate of resi dual slicingdependence (for NNLO calculations). For a detailed description of many of the improvements, please refer to the companion publication arXiv:1909.09117 [hepph].
15.2 Description of MCFM9
In this section we present the new and modified features in MCFM9.0 and describe how to use them on a technical level. This serves mostly as a quickstart for users already familiar with MCFM. With all reimplemented and newly implemented components we strive for Fortran 2008 compliance, making explicit use of its features. Following the Fortran standard furthermore allows us to achieve compatibility with not just the GNU compiler. In previous versions of MCFM the licensing was unclear, since none was specified. We now license all code under the GNU GPL 3 license^{1}. For supporting material we recommend studying the release paper of MCFM9.0 in ref. [62], from which this section is taken.
Improved input file mechanism.
We have implemented a new input file mechanism based on the configuration file parser
config_fortran
[112]. This INIlike file format no longer depends on a strict ordering
of configuration elements, allows easy access to configuration elements through
a single global configuration object, and makes it easy to add new configuration
options of scalar and array numerical and string types. Using the parser package
also allows one to override or specify all configuration options as command line
arguments to MCFM, for example running MCFM like ./mcfm_omp input.ini
general%nproc=200 general%part=nlo
. This is useful for batch parameter run scripts.
Settings can also be overridden with additional input files that specify just a subset of
options.
New histogramming. We replaced the previous Fortran77 implementation of histograms, that used routines from 1988 by M. Mangano, with a new suite of routines. The new histogram implementation allows for any number of histograms with any number of bins, each of which is dynamically allocated. Furthermore, everything is also handled in a fully multithreaded approach with the integration. For each OMP thread temporary histograms are allocated which are then reduced to a single one after each integration iteration, so that no OMP locks (critical regions) are required.
New Vegas integration, partadaptive and resumable.
The previous implementation of the Vegas routine was based on Numerical Recipes code.
We have reimplemented Vegas and the surrounding integration routines. All parts of a
NLO or NNLO calculation are now chosen adaptively based on the largest absolute
numerical uncertainty. A precision goal can be set in the input file as well as a
${\chi}^{2}\u2215\text{it}$ goal
and a precision goal for the warmup run. If the goals for the warmup are not reached, the
warmup repeats with twice the number of calls. With the setting writeintermediate
one can
control whether histograms are written in intermediate stages during the integration.
Enabling the setting readin
allows one to resume the integration from any point from a
previous run. Snapshots saving the whole integration state are saved automatically. When
resuming, the only parameter that the user can safely officially change is the precisiongoal
.
Further tweak configuration options to control the stages of the integration have
been introduced, which can provide benefits over the default settings in certain
situations.
The section integration
in the configuration file allows for tweaks in the
following way. The precision goal can be adjusted by setting precisiongoal
to a relative precision that should be reached. Similarly, the settings
warmupprecisiongoal
and warmupchisqgoal
control the minimum relative precision and
${\chi}^{2}\u2215\text{it}$ for the
warmup phase of iterbatchwarmup
(default 5) iterations. If the warmup criterion fails, the
number of calls is increased by a factor of two. The calls per iteration get increased by a
factor of callboost
(default 4) after the warmup. From then on the number of calls per
iteration is increased by a factor of itercallmult
(default 1.4) for a total of iterbatch1
iterations. After these first iterbatch1
iterations, the increase happens for every iterbatch2
iterations. The setting maxcallsperiter
controls the cap for the number of calls per
iteration. The number of Vegas grid subdivisions can be controlled with ndmx
(default
100).
The purpose of these settings is a fine control in certain situations. For example
to compute expensive PDF uncertainties, one wants a relatively precise warmup
run (where additional PDF sets are not sampled) and as few calls as necessary
afterwards: For the plots in this paper we thus chose a relative warmup precision goal of
$10\%$, and set
callboost
to $0.25$.
This means that the first iterbatch1
iterations after the warmup run only with a
quarter of the calls than during the warmup. This precision is sufficient to compute
precise PDF uncertainties, when making use of the strong correlations as in
MCFM9.0. Any further iterations come in batches of iterbatch2
, which we set to
$1$. It
allows for a quick switching to parts of the NNLO cross section that have the largest
uncertainty. For normal applications one wants to boost the number of calls after the warmup
significantly, so a default value of callboost=4
is chosen.
We provide default settings for the initial number of calls per iteration for all components of a
NNLO calculation. They can be overridden with the following settings in the integration
section: initcallslord
, initcallsnlovirt
, initcallsnloreal
, initcallsnlofrag
for
parts of a NLO calculations, initcallssnlobelow
, initcallssnloabove
for parts of a SCET
based NLO calculation, and initcallsnnlobelow
, initcallsnnlovirtabove
, as well as
initcallsnnlorealabove
for the parts of the NNLO coefficient.
Low discrepancy sequence.
MCFM8.0 and prior relied on a linear congruential generator implementation from
Numerical Recipes for the generation of a pseudorandom sequence. With newer versions the
MT19937 implementation of the C++ standard library is used, and with this version of
MCFM we include an implementation of the Sobol low discrepancy sequence based on the
code sobseq [113] with initialization numbers from ref. [114]. The Sobol sequence is used by
default and can be toggled using the flag usesobol = .true.
in the integration
section of
the input file, see ref. [62]. When running in MPI mode, the number of nodes has to
be a power of two for the Sobol sequence, because we use it in a strided manner.
Otherwise the code will automatically fall back to using the MT19937 sequence
with seed value seed
in the integration section of the input file. A seed
value of
$0$
denotes a randomly initialized seed.
Fully parallelized OMP+MPI use of LHAPDF.
In previous versions of MCFM calls to LHAPDF were forced to access from only a single
OMP thread through a lock. This is because the interface was based on the old
LHAglue interface, part of LHAPDF. We have written an interface to LHAPDF from
scratch based on the new object oriented treatment of PDFs in LHAPDF 6. For each
OMP thread we initialize a copy of the used PDF members which can be called
fully concurrently. The amount of PDF sets with or without PDF uncertainties is
only limited by the available system memory. The memory usage of MCFM can
then range from roughly 20MB when only one central PDF grid is being used, to
$\sim 7.4$ GB
when 32 OMP threads fully load all members of the PDF sets CT14nnlo
, MMHT2014nnlo68cl
,
ABMP16als118_5_nnlo
, NNPDF30_nnlo_as_0118
, NNPDF31_nnlo_as_0118
and
PDF4LHC15_nnlo_30
for PDF uncertainties. The total number of members for these grids is
371, each loaded for every of the 32 OMP threads.
Since each OMP thread allocates its own copy of PDF members and histograms we have no need to introduce any OMP locks. On the other hand the memory usage increases and one runs into being CPU cache or DRAM bandwidth bound earlier. In practice, we find that this is still faster than having OMP locks, which directly decrease the speedup in the spirit of Amdahl’s law. Ideally the LHAPDF library should be improved to allow for threadsafe calls with just one memory allocation.
Histograms for additional values of $\tau \text{cut}$,
${\mu}_{R},{\mu}_{F}$ and
multiple PDFs.
When using the automatic scale variation, in addition to the normal histograms,
additional histograms with filenames _scale_XY_
are generated, where X
is a placeholder for
the renormalization scale variation and Y
for the factorization scale variation. X
and Y
can
either be u
for an upwards variation by a factor of two, d
for a downwards variation by a
factor of two, or just 
if no change of that scale was made. The envelope of maximum and
minimum can then easily be obtained.
For the sampling of additional values of
$\tau \text{cut}$ for
NLO and NNLO calculations using jettiness subtractions, additional histograms
with filenames _taucut_XXX_
are written. Here XXX
is a placeholder for the chosen
$\tau \text{cut}$ values
in the optional array taucutarray
, if specified, or one of the five automatically chosen
values. These additional files only contain the differences to the nominal choice of
$\tau \text{cut}$, so
that $\mathrm{\Delta}\sigma ({\tau}_{\text{cut,nominal}})\mathrm{\Delta}\sigma ({\tau}_{\text{cut,i}})$ is
stored. If taucutarray
has not been specified, the automatic choice of additional
$\tau \text{cut}$ values is enabled based
on the default nominal $\tau \text{cut}$
for the process or the users choice of the nominal
$\tau \text{cut}$
value as specified in taucut
. In addition a file with _taucutfit_
is generated,
which in addition to the fitted corrections and its uncertainty includes columns
for the maximum relative integration uncertainty for the additionally sampled
$\tau \text{cut}$ values and the reduced
${\chi}^{2}$ of the fit. The fit, together
with the individual $\tau \text{cut}$
histograms, allows the user to assess the systematic
$\tau \text{cut}$ error
and possibly improve results.
When multiple PDF sets are chosen, additional files with the names of the PDF sets are generated. In case PDF uncertainties are enabled, the histograms also include the upper and lower bounds of the PDF uncertainties.
User cuts, histograms and reweighting.
Modifying the plotting routines in the files src/User/nplotter*.f
allows for modification
of the predefined histograms and addition of any number of arbitrary observables. The
routine gencuts_user
can be adjusted in the file src/User/gencuts_user.f90
for
additional cuts after the jet algorithm has performed the clustering. In the same file the
routine reweight_user
can be modified to include a manual reweighting for all
integral contributions. This can be used to obtain improved uncertainties in, for
example, tails of distributions. One example is included in the subdirectory examples
,
where the reweight_user
function approximately flattens the Higgs transverse
momentum distribution, leading to equal relative uncertainties even in the tail at
1TeV.
Compatibility with the Intel compiler and benchmarks
Previous versions of MCFM were developed using gfortran
as a compiler. MCFM
contained code that did not follow a specific Fortran standard, and was only
compatible with using gfortran
. We fixed code that did not compile or work with
the recent Intel Fortran compiler ifort
19.0.1. This does not mean that we claim
to be strictly standards compliant with a specific Fortran version, but we aim to
be compliant with Fortran 2008. We now fully support GCC versions newer than
$7$ and Intel compilers
newer than $19$.
There might still be compatibility issues with other Fortran compilers, but we are happy to
receive bug reports for any issues regarding compilation, that are not due to a lack of modern
Fortran 2008 features. To use the Intel compiler one has to change the USEINTEL flag in the
files Install
and makefile
to YES
.
To see whether MCFM can make use of potential Intel compiler improvements over the GNU compiler collection (GCC) we benchmarked the double real emission component of Higgs production at NNLO. We perform tests on our cluster with Intel Xeon 64bit X5650 2.67 GHz Westmere CPUs, where two sixcore CPUs are run in a dualsocket mode with a total of twelve cores. Similarly, we have an AMD 6128 HE Opteron 2GHz quadsocket eightcore setup, thus each having 32 cores per node.
We benchmark both the Intel and GCC compilers on both the Intel and AMD systems. On the Intel system we use 16 MPI processes each with 12 OMP threads, and on the AMD system we have 8 MPI processes using 32 OMP threads. With this we have the same total clockrate of 512GHz for each setup. For all benchmarks we find that the scaling is perfect up to this size, that is if we use half the number of MPI or OMP threads we double our runtime.
We first try both the Intel fortran compiler 19.0.1 and GCC 9.1.0 on the
Intel system with the highest generic optimization flags O3 xsse4.2
and O3
march=westmere
, respectively. Furthermore, we lower the optimizations to O2
each and
remove the processor specific optimization flags xsse4.2
and march=westmere
,
respectively. All our benchmark runtimes in the following are consistent within
$\pm 0.5$ seconds.
We do not support enabling unsafe math operations with ffastmath
, since the code
relies on the knowledge of NaN values and checks on those. Such checks would be skipped
with the meta flagffastmath
which sets ffinitemathonly
.
Compiler/flags  wall time $\text{\xb1}$ 0.5s

ifort O3 xsse4.2  90s 
ifort O2 xsse4.2  86s 
ifort O2  90s 
ifort O1  103s 
gfortran O3 march=westmere  101s 
gfortran O2 march=westmere  105s 
gfortran O2  105s 
gfortran O1  110s 
The benchmark results in 15.1 show that using the Intel compiler, performance benefits of
$\simeq 1020\%$ can be
achieved. Our goal here is not to go beyond this and check whether exactly equivalent
optimization flags have been used in both cases. Enabling optimizations beyond
O2
have little impact, but come with a penalty for the Intel compiler and with a
slight benefit for gfortran. We also notice that processor specific optimizations play
no significant role. This might also be in part due to the fact that MCFM does
not offer much space for (automatic) vectorization optimizations. To summarize,
the default optimization flags of O2
should be sufficient in most cases. We do not
expect that the conclusions from these benchmarks change for different processes. On
the other hand if computing PDF uncertainties, the majority of time is used by
LHAPDF and different optimization flags for LHAPDF might play a role then. We
performed the same benchmark with an older version of GCC, version 7.1.0 using
O2
optimizations, and found that the runtimes are the same as for the newer
version.
Finally, we performed some benchmarks on our AMD setup and found that it is $\simeq 2.5$ times slower for the same total clockrate. Using the Intel compiler for the AMD setup decreased the performance by another $\simeq 30\%$. This is likely due to the fact that the Intel compiler already optimizes for the general Intel architecture.
These benchmarks try to give a general impression and might depend in detail on the process, the number of histograms and whether to compute PDF uncertainties, for example. Especially when computing PDF uncertainties the perfect scaling we tested here might break down since the computation can become memory bound. We discuss this caveat in more detail in 15.2.
Remarks on memory bound performance issues To get numerically precise predictions at the per mille level for NNLO cross sections, already hundreds of million of calls are necessary. Obtaining PDF uncertainties using those NNLO matrix elements significantly increases the computational time. In a simplified view, the total computational time composes as ${N}_{\text{calls}}\ast (T+{N}_{\mathit{PDF}}\cdot {T}_{\mathit{PDF}})$, where $T$ is the computational effort for the matrix element piece, and the PDF part is proportional to the time calling the PDF evolution ${N}_{\mathit{PDF}}$ times and code related to performing the convolutions. For tree level matrix element evaluations, usually also $T\ll {T}_{P}\mathit{DF}$ holds, so the computational cost grows linearly with the number of PDFs.
This naive picture breaks down in practice when a lot of PDFs are sampled together with a lot of histograms or histogram bins. The total memory necessary to store all the histogram information grows like ${N}_{\mathit{PDF}}\cdot {N}_{\text{bins}}\cdot {N}_{\text{thr.}}$, where ${N}_{\mathit{PDF}}$ is the number of PDF members, ${N}_{\text{bins}}$ the number of histogram bins summed over all histograms and ${N}_{\text{thr.}}$ is the number of OMP threads. The factor ${N}_{\text{thr.}}$ enters since we have threadlocal storage to avoid OMP locks. The values are stored in double precision, so the total memory used is ${N}_{\mathit{PDF}}\cdot {N}_{\text{bins}}\cdot {N}_{\text{thr.}}\cdot 8\text{bytes}$.
Assuming for example, 300 PDF members, 10 histograms with each 20 bins and 12 threads, this sums up to 720kb of memory. For the virtual corrections and LO pieces, one has to update this amount of memory once for each call. For the real emission matrix elements one has to accumulate all dipole contributions, so this number additionally scales with the number of dipole contributions. All the histogram updates are usually fully vectorized for modern superscalar processors with SSE and/or AVX extensions. But if this used memory is too large and does not easily fit into the CPU core caches anymore, a transfer to and from DRAM happens, which now is the limiting factor and significantly slows down the computation. Because for that reason, one should work with a minimal number of necessary histograms when working with a lot of PDF members. This is especially important for cluster setups that are not optimized towards memory bound applications, nonNUMA systems. For example in our cluster we have relatively old AMD Opteron quadsocket eightcore nodes with little CPU cache, and with above numbers we are already limited in walltime improvements with using $\sim 16$ cores. Then reducing the number of histograms will significantly improve the performance. In principle one can reduce the histogram precision to single precision and cut memory transfer and storage in half, while doubling the computational speed. This might lead to problems with accumulated rounding errors though, and we have not investigated this further, since in practice one can sufficiently limit the number of histograms or PDF sets.