Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 25 additions & 0 deletions L1Trigger/L1CaloTrigger/test/egid_hgcal/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# HGCal L1-Trigger e/gamma ID
This repository contains a series of scripts which are used to train the e/gamma ID BDT for the HGCal L1T. The workflow has been separated into subfolder (see below), which must be ran in the correct order. The output is two .xml files, which contain the info of the trained BDT in the low eta (1.5 < |eta| < 2.7) and high eta (2.7 < |eta| < 3.0) regions. These .xml files can then be placed directly in the HGCAL L1T TPG software.


## Installation
Follow the instructions for installation (users) of the [HGCAL L1T TPG software](https://twiki.cern.ch/twiki/bin/viewauth/CMS/HGCALTriggerPrimitivesSimulation#Installation_for_users)

The next step is to clone this repository:

```
cd L1Trigger
git clone git@github.com:jonathon-langford/hgcal_l1t_egid
```
In each new terminal, must set environment: `source setup.sh`

## Contents
Each subfolder contains instructions in the form of a `README.md` which details how to run the scripts:

* `ntuples`: contains the CMSSW config to create ntuples from the HGCAL L1T TPG software. The ntuple production uses CRAB and thus requires a grid certificate. After generating the signal ntuples, you should create the efficiency plots using `plotting/make_eff_plot.py` to check the drop in efficiency for the latest clustering algorithm. This will indicate how strongly a new e/gamma ID BDT is needed.
* `cl3d_selection`: takes as input the ntuples produced in the `ntuples` subfolder and outputs a flat ntuple of 3d clusters passing the selection criteria for training the e/gamma ID.
* `training`: scripts for training the e/gamma ID BDT, converting into .xml format, evaluating the newly-trained BDTs and summaries of the performance (working points,ROC curve)
* `plotting`: scripts to plot the efficiency curves and the 3D cluster variables.

For the full workflow users should: `ntuples` -> `cl3d_selection` -> `training`

29 changes: 29 additions & 0 deletions L1Trigger/L1CaloTrigger/test/egid_hgcal/cl3d_selection/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
# 3D Cluster selection
The next step is to run the cluster selection on the ntuples. To run on a single file, over all events...

```
python hgcal_l1t_cl3d_selection.py --sampleType electron_200PU --fileNumber <number of file to process>
```

If the ntuples are not stored in the default location, then specify the path to the files using the `--inputPath` option. Also, if using a different clustering algorithm then add the long TDirectory name to the `clusteringAlgoDirDict`.

The script matches 3D clusters to gen particles e.g. for electrons requires a gen particle with pdgID=11, with the requirement dR between the cluster and the gen particle is less than 0.2. There is no gen matching for background (neutrino). Additionally clusters are required to have a pT > 20 (10) GeV for signal (background).

The output of this script is a flat ntuple with clusters corresponding to the input sampleType i.e. for signal you obtain an ntuple of gen-matched electron clusters, and for background you obtain an ntuple of pile-up initiated clusters.

## Running in parallel
There is an additional script, `submit_cl3d_selection.py`, which allows you to run the cluster selection over each file in parallel. The number of files to run over can be set as an option, but the default is to run over all (specified in `totalFilesDict`). The jobs are submitted using the HTCondor batch.

```
python submit_cl3d_selection.py --sampleType electron_200PU
```

These jobs should not take long. You can specify the queue using the `--queue` option.

## Adding the cluster ntuples: test, train and all
The final step is to combine the output flat ntuples, into test (10%), train (90%) and all (100%) samples. This has been automated by the `add_files.py` script. The script calculates the number of files in the relevant directory and combines accordingly.

```
python add_files.py --sampleType electron_200PU --deleteIndividualFiles 1
```
The `--deleteIndividualFiles` option deletes all the individual flat ntuples that have been used to create the test, train and all samples. This should be used to avoid taking up to much space.
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
# Python script for hadd-ing files for input sample type
# Combines files to have: full (100%), train (90%) and test (10%)

import ROOT
import sys
import os
from optparse import OptionParser

print "~~~~~~~~~~~~~~~~~~~~~~~~ ADD FILES ~~~~~~~~~~~~~~~~~~~~~~~~"

def get_options():
parser = OptionParser()
parser = OptionParser( usage="python add_files.py --sampleType=<sampleType>" )
parser.add_option("--sampleType", dest='sampleType', default='electron_200PU', help="Sample to process, default signal is electron_200PU, default bkg is neutrino_200PU")
parser.add_option("--clusteringAlgo", dest="clusteringAlgo", default="Histomaxvardr", help="Clustering algorithm used in ntuple production")
parser.add_option("--deleteIndividualFiles", dest="deleteIndividualFiles", default=0, type="int", help="Delete individual files after hadding [1=yes,0=no (default)]")
return parser.parse_args()

(opt,args) = get_options()

#Check if directory exists:
if not os.path.isdir("./%s"%opt.sampleType):
print " --> [ERROR] Directory %s does not exist. Try running the cl3d selection first"
print "~~~~~~~~~~~~~~~~~~~~~ ADD FILES (END) ~~~~~~~~~~~~~~~~~~~~~"
sys.exit(1)

#Get number of files in folder and separate into test and train sample sizes
N_files = len(next(os.walk('./%s'%opt.sampleType))[2])
if N_files == 0:
print " --> [ERROR] %s is empty. No files to add"%opt.sampleType
print "~~~~~~~~~~~~~~~~~~~~~ ADD FILES (END) ~~~~~~~~~~~~~~~~~~~~~"
sys.exit(1)
N_test = int(0.1*N_files)
N_train = int(N_files-N_test)

print " --> Making all (%g), test (%g) and train (%g) samples"%(N_files,N_test,N_train)

#first add all files for full
os.system('mkdir %s/all'%opt.sampleType)
os.system('hadd %s/all/%s_%s_all.root %s/%s_%s_*.root'%(opt.sampleType,opt.sampleType,opt.clusteringAlgo,opt.sampleType,opt.sampleType,opt.clusteringAlgo))

#make test and train samples
os.system('mkdir %s/test %s/train'%(opt.sampleType,opt.sampleType))
os.system('for fileNum in {1..%g}; do mv %s/%s_%s_${fileNum}.root %s/train; done'%(N_train,opt.sampleType,opt.sampleType,opt.clusteringAlgo,opt.sampleType))
os.system('mv %s/%s_%s_*.root %s/test'%(opt.sampleType,opt.sampleType,opt.clusteringAlgo,opt.sampleType))
os.system('hadd %s/%s_%s_train.root %s/train/*.root'%(opt.sampleType,opt.sampleType,opt.clusteringAlgo,opt.sampleType))
os.system('hadd %s/%s_%s_test.root %s/test/*.root'%(opt.sampleType,opt.sampleType,opt.clusteringAlgo,opt.sampleType))
os.system('mv %s/all/%s_%s_all.root %s/'%(opt.sampleType,opt.sampleType,opt.clusteringAlgo,opt.sampleType))
os.system('rm -Rf %s/all'%opt.sampleType)
print " --> Successfully made files"

if opt.deleteIndividualFiles:
print " --> Deleting individual files..."
os.system('rm -Rf %s/train'%opt.sampleType)
os.system('rm -Rf %s/test'%opt.sampleType)

print "~~~~~~~~~~~~~~~~~~~~~ ADD FILES (END) ~~~~~~~~~~~~~~~~~~~~~"
Original file line number Diff line number Diff line change
@@ -0,0 +1,216 @@
# Python script for cluster selection: gen-matching + selection cuts

import ROOT
import sys
import os
import math
from array import array
from optparse import OptionParser

print "~~~~~~~~~~~~~~~~~~~~~~~~ Cl3D Selection ~~~~~~~~~~~~~~~~~~~~~~~~"

def get_options():
parser = OptionParser()
parser = OptionParser( usage="usage: python hgcal_l1t_cl3d_selection.py <options>" )
parser.add_option("--sampleType", dest='sampleType', default='electron_200PU', help="Sample to process, default signal is electron_200PU, default bkg is neutrino_200PU")
parser.add_option("--inputPath", dest="inputPath", default="%s/ntuples"%os.environ['HGCAL_L1T_BASE'], help="Path to directories which hold input ntuples")
parser.add_option("--fileNumber", dest="fileNumber", default=1, type="int", help="Input ntuple number")
parser.add_option("--maxEvents", dest="maxEvents", default=-1, type="int", help="Maximum number of events to process")
parser.add_option("--clusteringAlgo", dest="clusteringAlgo", default="Histomaxvardr", help="Clustering algorithm used in ntuple production")
# IF want to process with a different clustering alg: need to change the cms config script in ntuples/ directory
return parser.parse_args()

(opt,args) = get_options()

# Mapping to extract TDirectory for different clustering algo
# Add full chain name if using a different alg
clusteringAlgoDirDict = {"gen":"Floatingpoint8ThresholdDummyHistomaxvardrGenclustersntuple","Histomaxvardr":"Floatingpoint8ThresholdDummyHistomaxvardrGenclustersntuple","Histomaxvardr_stc":"Floatingpoint8SupertriggercellDummyHistomaxvardrClustersntuple"}

# Mapping: sample type to pdgid used for gen matching
pdgIdDict = {
"electron":[11],
"photon":[22],
"pion":[211],
"neutrino":[]
}
pdgid = pdgIdDict[ opt.sampleType.split("_")[0] ]

#Check: clustering exists in dir
clusteringAlgo = opt.clusteringAlgo
if clusteringAlgo not in clusteringAlgoDirDict:
print " --> [ERROR] not configured for %s clustering. Leaving..."%clusteringAlgo
print "~~~~~~~~~~~~~~~~~~~~ Cl3D Selection (END) ~~~~~~~~~~~~~~~~~~~~"
sys.exit(1)

# Check: if ntuple exists
if os.path.exists("%s/%s/ntuple_%g.root"%(opt.inputPath,opt.sampleType,opt.fileNumber)):
print " --> Processing %s/%s/ntuple_%g.root"%(opt.inputPath,opt.sampleType,opt.fileNumber)
f_in_name = "%s/%s/ntuple_%g.root"%(opt.inputPath,opt.sampleType,opt.fileNumber)
print " --> Clustering algorithm: %s"%clusteringAlgo
print " --> Events to be processed: %g"%opt.maxEvents
else:
print " --> [ERROR] %s/%s/ntuple_%g.root does not exist. Leaving..."%(opt.inputPath,opt.sampleType,opt.fileNumber)
print "~~~~~~~~~~~~~~~~~~~~ Cl3D Selection (END) ~~~~~~~~~~~~~~~~~~~~"
sys.exit(1)

# Open trees to read from
f_in = ROOT.TFile.Open( f_in_name )
gen_tree = f_in.Get("%s/HGCalTriggerNtuple"%clusteringAlgoDirDict["gen"])
cl3d_tree = f_in.Get("%s/HGCalTriggerNtuple"%clusteringAlgoDirDict[clusteringAlgo])

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# CLASS DEFINITIONS

# class for 3d cluster variable: only initiate if cl3d passes selection
class Cluster3D:

#Constructor method: takes event and cluster number as input
def __init__(self, _event, _ncl3d):
#initialise TLorentzVector
_p4 = ROOT.TLorentzVector()
_p4.SetPtEtaPhiE( _event.cl3d_pt[_ncl3d], _event.cl3d_eta[_ncl3d], _event.cl3d_phi[_ncl3d], _event.cl3d_energy[_ncl3d] )
self.P4 = _p4
self.clusters_n = _event.cl3d_clusters_n[_ncl3d]
self.showerlength = _event.cl3d_showerlength[_ncl3d]
self.coreshowerlength = _event.cl3d_coreshowerlength[_ncl3d]
self.firstlayer = _event.cl3d_firstlayer[_ncl3d]
self.maxlayer = _event.cl3d_maxlayer[_ncl3d]
self.seetot = _event.cl3d_seetot[_ncl3d]
self.seemax = _event.cl3d_seemax[_ncl3d]
self.spptot = _event.cl3d_spptot[_ncl3d]
self.sppmax = _event.cl3d_sppmax[_ncl3d]
self.szz = _event.cl3d_szz[_ncl3d]
self.srrtot = _event.cl3d_srrtot[_ncl3d]
self.srrmax = _event.cl3d_srrmax[_ncl3d]
self.srrmean = _event.cl3d_srrmean[_ncl3d]
self.emaxe = _event.cl3d_emaxe[_ncl3d]
self.bdteg = _event.cl3d_bdteg[_ncl3d]
self.quality = _event.cl3d_quality[_ncl3d]

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# FUNCTION DEFINITIONS

#function to fill tree
def fill_cl3d( _cl3d, _out_var ):
_out_var['pt'][0] = _cl3d.P4.Pt()
_out_var['eta'][0] = _cl3d.P4.Eta()
_out_var['phi'][0] = _cl3d.P4.Phi()
_out_var['clusters_n'][0] = _cl3d.clusters_n
_out_var['showerlength'][0] = _cl3d.showerlength
_out_var['coreshowerlength'][0] = _cl3d.coreshowerlength
_out_var['firstlayer'][0] = _cl3d.firstlayer
_out_var['maxlayer'][0] = _cl3d.maxlayer
_out_var['seetot'][0] = _cl3d.seetot
_out_var['seemax'][0] = _cl3d.seemax
_out_var['spptot'][0] = _cl3d.spptot
_out_var['sppmax'][0] = _cl3d.sppmax
_out_var['szz'][0] = _cl3d.szz
_out_var['srrtot'][0] = _cl3d.srrtot
_out_var['srrmax'][0] = _cl3d.srrmax
_out_var['srrmean'][0] = _cl3d.srrmean
_out_var['emaxe'][0] = _cl3d.emaxe
_out_var['bdteg'][0] = _cl3d.bdteg
_out_var['quality'][0] = _cl3d.quality
out_tree.Fill()

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# CONFIGURE OUTPUT
print " --> Configuring output ntuple"

#Check if output dir exists
if not os.path.isdir( "%s/cl3d_selection/%s"%(os.environ['HGCAL_L1T_BASE'],opt.sampleType) ):
print " --> Making directory: %s/cl3d_selection/%s"%(os.environ['HGCAL_L1T_BASE'],opt.sampleType)
os.system("mkdir %s/cl3d_selection/%s"%(os.environ['HGCAL_L1T_BASE'],opt.sampleType) )

# Create output file name
f_out_name = "%s/cl3d_selection/%s/%s_%s_%g.root"%(os.environ['HGCAL_L1T_BASE'],opt.sampleType,opt.sampleType,clusteringAlgo,opt.fileNumber)
print " --> Creating output file: %s"%f_out_name
f_out = ROOT.TFile( f_out_name, "RECREATE" )

# Initialise ttree and define output variables
sampleToTreeDict = {
"electron":"e_sig",
"photon":"g_sig",
"pion":"pu_sig",
"neutrino":"pu_bkg"
}
out_tree = ROOT.TTree( sampleToTreeDict[opt.sampleType.split("_")[0]], sampleToTreeDict[opt.sampleType.split("_")[0]] )
out_var_names = ['pt','eta','phi','clusters_n','showerlength','coreshowerlength','firstlayer','maxlayer','seetot','seemax','spptot','sppmax','szz','srrtot','srrmax','srrmean','emaxe','bdteg','quality']
out_var = {}
for var in out_var_names: out_var[var] = array('f',[0.])
#Create branches in output tree
for var_name, var in out_var.iteritems(): out_tree.Branch( "cl3d_%s"%var_name, var, "cl3d_%s/F"%var_name )

print " --> Output configured"

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# CL3D SELECTION: loop over events and write clusters passing selection to output file
for ev_idx in range(cl3d_tree.GetEntries()):

if ev_idx == opt.maxEvents: break
if opt.maxEvents == -1:
if ev_idx % 100 == 0: print " --> Processing event: %g/%g"%(ev_idx+1,cl3d_tree.GetEntries())
else:
if ev_idx % 100 == 0: print " --> Processing event: %g/%g"%(ev_idx+1,opt.maxEvents)

#Extract event info from both gen and cluster tree
gen_tree.GetEntry( ev_idx )
cl3d_tree.GetEntry( ev_idx )

#Extract number of gen particles + cl3d in event
N_gen = gen_tree.gen_n
N_cl3d = cl3d_tree.cl3d_n

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# GEN-MATCHED CLUSTERS
if opt.sampleType.split("_")[0] in ['electron','photon','pion']:

#Loop over gen-e/gamma in event
for gen_idx in range( N_gen ):
if abs( gen_tree.gen_pdgid[gen_idx] ) in pdgid:
#define TLorentzVector for gen particle
gen_p4 = ROOT.TLorentzVector()
gen_p4.SetPtEtaPhiE( gen_tree.gen_pt[gen_idx], gen_tree.gen_eta[gen_idx], gen_tree.gen_phi[gen_idx], gen_tree.gen_energy[gen_idx] )
# require gen e/g/pi pT > 20 GeV
if gen_p4.Pt() < 20.: continue

# loop overi 3d clusters: save index of max pt cluster if in
cl3d_genMatched_maxpt_idx = -1
cl3d_genMatched_maxpt = -999
for cl3d_idx in range( N_cl3d ):
#requre that cluster pt > 10 GeV
if cl3d_tree.cl3d_pt[cl3d_idx] < 10.: continue
#define TLorentxVector for cl3d
cl3d_p4 = ROOT.TLorentzVector()
cl3d_p4.SetPtEtaPhiE( cl3d_tree.cl3d_pt[cl3d_idx], cl3d_tree.cl3d_eta[cl3d_idx], cl3d_tree.cl3d_phi[cl3d_idx], cl3d_tree.cl3d_energy[cl3d_idx] )
#Require cluster to be dR < 0.2 within gen particle
if cl3d_p4.DeltaR( gen_p4 ) < 0.2:
#If pT of cluster is > present max then set
if cl3d_p4.Pt() > cl3d_genMatched_maxpt:
cl3d_genMatched_maxpt = cl3d_p4.Pt()
cl3d_genMatched_maxpt_idx = cl3d_idx

# if cl3d idx has been set then add fill cluster to tree
if cl3d_genMatched_maxpt_idx >= 0:
cl3d = Cluster3D( cl3d_tree, cl3d_genMatched_maxpt_idx )
fill_cl3d( cl3d, out_var )

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#BACKGROUND CLUSTERS: PU
else:

#Loop over 3d clusters: if pT > 20 GeV then fill as background
for cl3d_idx in range(0, N_cl3d ):

if cl3d_tree.cl3d_pt[cl3d_idx] > 20.:
cl3d = Cluster3D( cl3d_tree, cl3d_idx )
fill_cl3d( cl3d, out_var )

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# END OF EVENTS LOOP

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
f_out.Write()
f_out.Close()
print "~~~~~~~~~~~~~~~~~~~~ Cl3D Selection (END) ~~~~~~~~~~~~~~~~~~~~"
18 changes: 18 additions & 0 deletions L1Trigger/L1CaloTrigger/test/egid_hgcal/cl3d_selection/run.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
#!/bin/bash

# Set up environment
export HGCAL_L1T_BASE=$1
cd $HGCAL_L1T_BASE

#Script to run the HGCal L1T cluster definition
cd cl3d_selection
eval `scramv1 runtime -sh`

#Input to cl3d selection
sample_type=$2
input_path=$3
file_number=$4
clustering_algo=$5

# Run selection
python hgcal_l1t_cl3d_selection.py --sampleType $sample_type --inputPath $input_path --fileNumber $file_number --clusteringAlgo $clustering_algo
Loading