A PROOF interface to the AliEn file catalog
===========================================

Datasets were introduced to give PROOF users cleaner access to sets of
uniform data: each dataset has a name that helps identify the kind of
data stored, plus some meta-information, such as:

- number of events in the default tree

- integrity information: *is my file corrupted?*

- locality information: *is my remote file available on a local

Datasets are also used by the [staging daemon
*i.e.* to request that some data be transferred from a remote storage
to the local analysis facility disks.

PROOF datasets are handled by the *dataset manager*, a generic catalog of
datasets which has been historically implemented by the class

This dataset manager was conceived for a small number *(i.e., hundreds)*
of datasets that reflected data stored on the local analysis facility
disks. As the PROOF analysis model became popular in ALICE, the number
of datasets grew, posing many problems.

- To make it possible to process remote data, current datasets
  mimic file catalog functionalities by also including lists of files
  currently not staged on the local analysis facility.

- Since users can create their own datasets, in many cases containing
  duplicate data, it has become demanding to provide maintenance and

- Locality information in datasets is static: this means that, if a
  file gets deleted from a disk, the corresponding dataset(s) must be
  synchronized manually.

### An interface to the AliEn file catalog

as an intermediate layer between PROOF datasets and the AliEn file

A dataset name no longer represents a static list of files:
instead, it represents a **query string** to the AliEn file catalog that
creates a dataset dynamically.

**Locality information** is also filled on the fly by contacting the local
file server: for instance, if an *xrootd* pool of disks is used,
fresh online information, along with the exact host (endpoint) where each
file is located, is provided dynamically in a reasonable amount of time.

Both file catalog queries and locality information are cached on ROOT
files: cache is shared between users and its expiration time is

Since dataset information is now volatile, a separate and more
straightforward method for issuing staging requests has also been

Using the new dataset manager requires the `xpd.datasetsrc` directive in
the xproofd configuration file:

    xpd.datasetsrc alien cache:/path/to/dataset/cache urltemplate:http:

alien
: Tells PROOF that the dataset manager is the AliEn interface (as

cache
: Specify a path *on the local filesystem* of the host running user's

> This path is not a URL but just a local path. Moreover, the path
> must be visible from the host that will run each user's master,
> since a separate dataset manager instance is created per user.
>
> If the cache directory does not exist, it is created, if possible,
> with open permissions (`rwxrwxrwx`). In a production environment
> it is advisable to create the cache directory manually beforehand
> with the same permissions.

urltemplate
: Template used for translating between an `alien:

`<path>` is written literally and will be substituted with the full
AliEn path without the protocol.

> An example of how URL translation works:
>
>     root://alice-caf.cern.ch/<path>
>
>     alien:///alice/data/2012/LHC12b/000178209/ESDs/pass1/12000178209061.17/AliESDs.root
>
>     root://alice-caf.cern.ch//alice/data/2012/LHC12b/000178209/ESDs/pass1/12000178209061.17/AliESDs.root

cacheexpiresecs
: Number of seconds before cached information is considered expired
and refetched *(e.g., 86400 for one day)*.

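
The `urltemplate` substitution described above can be sketched in a few lines of standard C++. This is only an illustration of the string rewriting, not the dataset manager's actual code: the function name `ApplyUrlTemplate` is hypothetical, and the sketch assumes that "without the protocol" means dropping the leading `alien://`.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of ROOT): expands a urltemplate for one
// alien:// URL by replacing the literal "<path>" placeholder.
std::string ApplyUrlTemplate(const std::string &urlTemplate,
                             const std::string &alienUrl) {
  const std::string protocol = "alien://";
  const std::string placeholder = "<path>";

  // Assumption: only URLs that start with the alien:// protocol are valid.
  if (alienUrl.compare(0, protocol.size(), protocol) != 0)
    throw std::invalid_argument("not an alien:// URL: " + alienUrl);

  // Full AliEn path without the protocol, e.g. "/alice/data/...".
  const std::string path = alienUrl.substr(protocol.size());

  const std::string::size_type pos = urlTemplate.find(placeholder);
  if (pos == std::string::npos)
    throw std::invalid_argument("template has no <path> placeholder");

  // Replace the first occurrence of the placeholder with the path.
  std::string result = urlTemplate;
  result.replace(pos, placeholder.size(), path);
  return result;
}
```

Called with the template and the `alien://` URL from the example above, the function returns the translated `root://` URL shown there, including the double slash after the host name.
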
One of the advantages of such a dynamic AliEn catalog interface is that
it can be used with PROOF-Lite.

By default, PROOF-Lite creates a file-based dataset manager in the
client session (which also acts as a master). To enable the AliEn
dataset manager in a PROOF-Lite session, run:

    gEnv->SetValue("Proof.DataSetManager",
      "alien cache:/path/to/dataset/cache "
      "urltemplate:root://alice-caf.cern.ch/<path> "
      "cacheexpiresecs:86400");

where the meaning of the parameters is described in the previous section.

> Please note that the environment must be set **before** opening the
> PROOF-Lite session!

The new dataset manager is backwards-compatible with the legacy
interface: each time you want to process or obtain a dataset, instead of
specifying a string containing a dataset name you will specify a query
string to the file catalog.

### Query string format

The query string is the string you will use in place of the dataset
name. It does not correspond to a static dataset: instead, it represents
a virtual dataset whose information is filled in on the fly.

There are two different formats you can use:

- specify data features (such as period and run numbers) for **official
  data or Monte Carlo**

- specify the **AliEn find** command parameters directly

In the query string it is also possible to specify whether you want to
process data from AliEn, only staged data or data from AliEn in "cache

#### Official data and Monte Carlo format

These are the string formats to be used for official data and official
Monte Carlo productions, respectively:

    Data;Period=<LHCPERIOD>;Variant=[ESDs|AODXXX];Run=<RUNLIST>;Pass=<PASS>

    Sim;Period=<LHCPERIOD>;Variant=[ESDs|AODXXX];Run=<RUNLIST>;

Example of valid values for `Period`: `LHC10h`, `LHC11h_2`, `LHC11f_Technical`

Variant
: Data variant, which might be `ESDs` (or `ESD`) for ESDs and `AODXXX`
for AODs corresponding to the *XXX* set.

Example of valid values: `ESDs`, `AOD073`, `AOD086`

Run
: Runs to be processed, in the form of a single run (`130831`), an
inclusive range (`130831-130833`), or a list of runs and/or ranges
(`130831-130835,130840,130842`).

Duplicate runs are automatically removed, so if you specify
`130831-130835,130833`, run number 130833 will be processed only once.

Pass *(only for data, not for Monte Carlo)*
: The pass number or name. If you specify only a number `X`, it
will be expanded to `passX`.

Example of valid values: `1`, `pass1`, `pass2`, `cpass1_muon`

This is an example of a full valid string:

    Data;Period=LHC10h;Variant=AOD086;Run=130831-130833;Pass=pass1

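
To make the run list and pass rules concrete, here is a minimal sketch in standard C++ of how such strings could be expanded and assembled on the client side. It is not the dataset manager's own parser: the names `ExpandRunList`, `NormalizePass` and `MakeDataQuery` are hypothetical, and rejecting reversed ranges is an assumption made for this example.

```cpp
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: expands "130831-130835,130840" into a sorted set of
// run numbers. std::set silently drops duplicate runs.
std::set<int> ExpandRunList(const std::string &runList) {
  std::set<int> runs;
  std::istringstream items(runList);
  std::string item;
  while (std::getline(items, item, ',')) {
    if (item.empty()) continue;
    const std::string::size_type dash = item.find('-');
    const int first = std::stoi(item.substr(0, dash));
    const int last =
        (dash == std::string::npos) ? first : std::stoi(item.substr(dash + 1));
    // Assumption: a range must be written in ascending order.
    if (last < first) throw std::invalid_argument("reversed range: " + item);
    for (int run = first; run <= last; ++run) runs.insert(run);
  }
  return runs;
}

// A purely numeric pass "X" becomes "passX"; names are left unchanged.
std::string NormalizePass(const std::string &pass) {
  const bool numeric =
      !pass.empty() &&
      pass.find_first_not_of("0123456789") == std::string::npos;
  return numeric ? "pass" + pass : pass;
}

// Assembles a query string in the official data format shown above.
std::string MakeDataQuery(const std::string &period,
                          const std::string &variant,
                          const std::string &runList,
                          const std::string &pass) {
  return "Data;Period=" + period + ";Variant=" + variant +
         ";Run=" + runList + ";Pass=" + NormalizePass(pass);
}
```

For instance, `ExpandRunList("130831-130835,130833")` yields five distinct runs, and `MakeDataQuery("LHC10h", "AOD086", "130831-130833", "1")` produces the full valid string shown above.
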
#### AliEn find format

Whenever a user would like to process data which has not been produced
officially, or whose directory structure in the AliEn file catalog is
non-standard, an interface to the AliEn shell's `find` command is

This is the command format:

    Find;BasePath=<BASEPATH>;FileName=<FILENAME>;Anchor=<ANCHOR>;TreeName=<TREENAME>;Regexp=<REGEXP>

Parameters `BasePath` and `FileName` are passed as-is to the AliEn [find

Parameters `Anchor`, `TreeName` and `Regexp` are optional.

Here's a detailed description of the parameters.

BasePath
: Start search under the specified path on the AliEn file catalog.

Wildcards are supported: the asterisk (`*`) and the
percent sign (`%`) are interchangeable.

Examples of valid values are:

    /alice/data/2010/LHC10h/000123456/*.*
    /alice/cern.ch/user/d/dummy/my_pp_production/%.%

FileName
: File name to look for.

Examples of valid values are: `root_archive.zip`, `aod_archive.zip`,
`custom_archive.zip`, `AliAOD.root`.

Anchor *(optional)*
: In case `FileName` is a zip archive, the anchor is the name of a
ROOT file inside the archive to point to.

Examples of valid values are: `AliAOD.root`, `AliESDs.root`,

> Using the AliEn file catalog it is possible to point directly to a
> ROOT file stored in an archive without using the anchor.
>
> There is however a substantial difference in how data is
> retrieved, especially during staging: auxiliary ROOT files
> *(friends)* are stored inside the archive along with the "main"
> file, so when you use the archive as `FileName` with the
> proper `Anchor` you are still referencing the same file, but
> you are giving the instruction to download the whole archive.
>
> Using the ROOT file name directly should be done only in very special
> cases (*e.g.*, to save space) and only when one is completely sure
> that no other files in the archive are required for analysis.

TreeName *(optional)*
: Name of each file's default tree.

Examples of valid values are: `/aodTree`, `/esdTree`, `/myCustomTree`,
`/TheDirectory/TheTree`.

Regexp *(optional)*
: Additional extended regular expression applied after the find command is
run, to refine the search results.

considered, others are discarded.

Examples of valid values are:

> used to perform regular expression matching.

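
The filtering step performed by `Regexp` can be illustrated with a small standalone sketch. It uses `std::regex` with POSIX extended syntax as a stand-in, so the dialect may differ from the one the dataset manager actually uses. The function name `FilterByRegexp` and the file paths in the usage note are hypothetical, and the sketch assumes a path is kept when the pattern matches anywhere in it.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: keeps only the paths matched by the pattern,
// mimicking a filter applied to the results of a find command.
std::vector<std::string> FilterByRegexp(const std::vector<std::string> &paths,
                                        const std::string &pattern) {
  // POSIX extended syntax, chosen here only for illustration.
  const std::regex re(pattern, std::regex::extended);
  std::vector<std::string> kept;
  for (const std::string &path : paths) {
    // Assumption: a partial match anywhere in the path is enough.
    if (std::regex_search(path, re)) kept.push_back(path);
  }
  return kept;
}
```

With the made-up paths `/alice/sim/a/001/AliESDs.root`, `/alice/sim/a/002/AliESDs.root` and `/alice/sim/a/010/AliESDs.root`, the pattern `/00[0-9]/` keeps the first two and discards the third.
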
Example of an AliEn raw find dataset string:

    Find;BasePath=/alice/data/2010/LHC10h/000139505/ESDs/pass1
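
A query string in the `Find` format can also be assembled programmatically. The sketch below is a plain C++ illustration with hypothetical names (`FindQuery`, `MakeFindQuery`). It assumes that an optional parameter is omitted by leaving out its `Key=value` pair, which the text above suggests but does not spell out.

```cpp
#include <string>

// Hypothetical container for the parameters of the Find format.
struct FindQuery {
  std::string basePath;  // required
  std::string fileName;  // required
  std::string anchor;    // optional: leave empty to omit
  std::string treeName;  // optional: leave empty to omit
  std::string regexp;    // optional: leave empty to omit
};

// Builds "Find;BasePath=...;FileName=..." and appends only the optional
// parameters that were actually set.
std::string MakeFindQuery(const FindQuery &q) {
  std::string s = "Find;BasePath=" + q.basePath + ";FileName=" + q.fileName;
  if (!q.anchor.empty()) s += ";Anchor=" + q.anchor;
  if (!q.treeName.empty()) s += ";TreeName=" + q.treeName;
  if (!q.regexp.empty()) s += ";Regexp=" + q.regexp;
  return s;
}
```

The resulting string can then be used wherever a dataset name is expected, in the same way as the official data format.
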