ROOT 6.10/00
Reference Guide
proof/doc/confman/TDataSetManagerAliEn.md
1 A PROOF interface to the AliEn file catalog
2 ===========================================
3 
4 Overview
5 --------
6 
7 Datasets were introduced to give PROOF users cleaner access to
8 sets of uniform data: each dataset has a name which helps identify
9 the kind of data stored, plus some meta-information, such as:
10 
11 - default tree name
12 
13 - number of events in the default tree
14 
15 - file size
16 
17 - integrity information: *is my file corrupted?*
18 
19 - locality information: *is my remote file available on a local
20  storage?*
21 
22 Datasets are also used by the [staging daemon
23 afdsmgrd](http://afdsmgrd.googlecode.com/) to trigger data staging,
24 *i.e.* to request that some data be transferred from a remote storage
25 to the local analysis facility disks.
26 
27 PROOF datasets are handled by the *dataset manager*, a generic catalog of
28 datasets which has been historically implemented by the class
29 `TDataSetManagerFile`, which stored each dataset inside a ROOT file.
30 
31 This dataset manager was conceived for a small number *(i.e., hundreds)*
32 of datasets reflecting data stored on the local analysis facility
33 disks. As the PROOF analysis model became popular in ALICE, the number
34 of datasets grew, posing several problems:
35 
36 - To make it possible to process remote data, current datasets
37  mimic file catalog functionality by also including lists of files
38  currently not staged on the local analysis facility.
39 
40 - Since users can create their own datasets, in many cases containing
41  duplicate data, it has become demanding to provide maintenance and
42  support.
43 
44 - Locality information in datasets is static: this means that, if a
45  file gets deleted from a disk, the corresponding dataset(s) must be
46  synchronized manually.
47 
48 ### An interface to the AliEn file catalog
49 
50 The `TDataSetManagerAliEn` class is a new dataset manager which acts
51 as an intermediate layer between PROOF datasets and the AliEn file
52 catalog.
53 
54 A dataset name no longer represents a static list of files:
55 instead, it represents a **query string** to the AliEn file catalog that
56 creates a dataset dynamically.
57 
58 **Locality information** is also filled on the fly by contacting the local
59 file server: for instance, in case a *xrootd* pool of disks is used,
60 fresh online information along with the exact host (endpoint) where each
61 file is located is provided dynamically in a reasonable amount of time.
62 
63 Both file catalog queries and locality information are cached in ROOT
64 files: the cache is shared between users and its expiration time is
65 configurable.
66 
67 Since dataset information is now volatile, a separate and more
68 straightforward method for issuing staging requests has also been
69 provided.
70 
71 Configuration
72 -------------
73 
74 ### PROOF
75 
76 Using the new dataset manager requires the `xpd.datasetsrc` directive in
77 the xproofd configuration file:
78 
79  xpd.datasetsrc alien cache:/path/to/dataset/cache urltemplate:http://myserver:1234/data<path> cacheexpiresecs:86400
80 
81 alien
82 : Tells PROOF that the dataset manager is the AliEn interface (as
83  opposed to `file`).
84 
85 cache
86 : Specifies a path *on the local filesystem* of the host running the user's
87  PROOF master.
88 
89  > This path is not a URL but just a local path. Moreover, the path
90  > must be visible from the host that will run each user's master,
91  > since a separate dataset manager instance is created per user.
92 
93  > If the cache directory does not exist, it is created, if possible,
94  > with open permissions (`rwxrwxrwx`). In a production environment
95  > it is advisable to create the cache directory manually beforehand
96  > with the same permissions.
97 
98 urltemplate
99 : Template used for translating between an `alien://` URL and the
100  local storage's URL.
101 
102  `<path>` is written literally and will be substituted with the full
103  AliEn path without the protocol.
104 
105  > An example on how URL translation works:
106  >
107  > - Template URL:
108  >
109  > root://alice-caf.cern.ch/<path>
110  >
111  > - Source URL:
112  >
113  > alien:///alice/data/2012/LHC12b/000178209/ESDs/pass1/12000178209061.17/AliESDs.root
114  >
115  > - Resulting URL:
116  >
117  > root://alice-caf.cern.ch//alice/data/2012/LHC12b/000178209/ESDs/pass1/12000178209061.17/AliESDs.root
118  >
119 cacheexpiresecs
120 : Number of seconds before cached information is considered expired
121  and refetched *(e.g., 86400 for one day)*.
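
The `urltemplate` substitution described above can be sketched in plain C++. This is only an illustration of the documented rule (strip the `alien://` protocol, then replace the literal `<path>`). The helper name `TranslateAlienUrl` is hypothetical and is not part of ROOT.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not a ROOT API): expands a urltemplate by replacing
// the literal "<path>" placeholder with the AliEn path, i.e. the source URL
// without its "alien://" protocol prefix.
std::string TranslateAlienUrl(const std::string &urlTemplate,
                              const std::string &alienUrl) {
  const std::string proto = "alien://";
  const std::string placeholder = "<path>";

  // The source URL must start with the alien:// protocol.
  if (alienUrl.compare(0, proto.size(), proto) != 0)
    throw std::invalid_argument("not an alien:// URL: " + alienUrl);

  // The template must contain the literal placeholder.
  const std::string::size_type pos = urlTemplate.find(placeholder);
  if (pos == std::string::npos)
    throw std::invalid_argument("template has no <path> placeholder");

  std::string result = urlTemplate;
  result.replace(pos, placeholder.size(), alienUrl.substr(proto.size()));
  return result;
}
```

Because the AliEn path keeps its leading slash, a template ending in `/<path>` yields a double slash after the host name, as in the resulting URL of the example above.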
122 
123 ### PROOF-Lite
124 
125 One of the advantages of such a dynamic AliEn catalog interface is that
126 it is possible to use it with PROOF-Lite.
127 
128 By default, PROOF-Lite creates on the client session (which acts as a
129 master as well) a file-based dataset manager. To enable the AliEn
130 dataset manager in a PROOF-Lite session, run:
131 
132 ``` {.cpp}
133 gEnv->SetValue("Proof.DataSetManager",
134  "alien cache:/path/to/dataset/cache "
135  "urltemplate:root://alice-caf.cern.ch/<path> "
136  "cacheexpiresecs:86400");
137 TProof::Open("");
138 ```
139 
140 where the meaning of the parameters is described in the previous section.
141 
142 > Please note that the environment must be set **before** opening the
143 > PROOF-Lite session!
144 
145 Usage
146 -----
147 
148 The new dataset manager is backwards-compatible with the legacy
149 interface: each time you want to process or obtain a dataset, instead of
150 specifying a string containing a dataset name you will specify a query
151 string to the file catalog.
152 
153 ### Query string format
154 
155 The query string is the string you will use in place of the dataset
156 name. It does not correspond to a static dataset: instead it represents
157 a virtual dataset whose information is filled in on the fly.
158 
159 There are two different formats you can use:
160 
161 - specify data features (such as period and run numbers) for **official
162  data or Monte Carlo**
163 
164 - specify the **AliEn find** command parameters directly
165 
166 In the query string it is also possible to specify if you want to
167 process data from AliEn, only staged data or data from AliEn in "cache
168 mode".
169 
170 #### Official data and Monte Carlo format
171 
172 These are the string formats to be used respectively for official data
173 and official Monte Carlo productions:
174 
175  Data;Period=<LHCPERIOD>;Variant=[ESDs|AODXXX];Run=<RUNLIST>;Pass=<PASS>
176 
177  Sim;Period=<LHCPERIOD>;Variant=[ESDs|AODXXX];Run=<RUNLIST>;
178 
179 Period
180 : The LHC period.
181 
182  Example of valid values: `LHC10h`, `LHC11h_2`, `LHC11f_Technical`
183 
184 Variant
185 : Data variant, which might be `ESDs` (or `ESD`) for ESDs and `AODXXX`
186  for AODs corresponding to the *XXX* set.
187 
188  Example of valid values: `ESDs`, `AOD073`, `AOD086`
189 
190 Run
191 : Runs to be processed, in the form of a single run (`130831`), an
192  inclusive range (`130831-130833`), or a list of runs and/or ranges
193  (`130831-130835,130840,130842`).
194 
195  Duplicate runs are automatically removed, so in case you specify
196  `130831-130835,130833` run number 130833 will be processed only
197  once.
198 
199 Pass *(only for data, not for Monte Carlo)*
200 : The pass number or name. In case you specify only a number `X`, it
201  will be expanded to `passX`.
202 
203  Example of valid values: `1`, `pass1`, `pass2`, `cpass1_muon`
204 
205 This is an example of a full valid string:
206 
207  Data;Period=LHC10h;Variant=AOD086;Run=130831-130833;Pass=pass1
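
As an illustration of the `Run` and `Pass` rules above, here is a minimal C++ sketch. The helper names `ExpandRuns` and `NormalizePass` are hypothetical and are not part of ROOT. Returning the runs in ascending order is an assumption, since the text only states that duplicates are removed.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: expands a Run field such as
// "130831-130835,130840" into a list of unique run numbers.
// Assumption: the result is sorted in ascending order.
std::vector<int> ExpandRuns(const std::string &runField) {
  std::set<int> runs;  // a set drops duplicates and keeps runs ordered
  std::istringstream items(runField);
  std::string item;
  while (std::getline(items, item, ',')) {
    if (item.empty()) continue;  // tolerate stray commas
    const std::string::size_type dash = item.find('-');
    if (dash == std::string::npos) {
      runs.insert(std::stoi(item));  // single run
    } else {
      // Inclusive range "first-last".
      const int first = std::stoi(item.substr(0, dash));
      const int last = std::stoi(item.substr(dash + 1));
      if (first > last) throw std::invalid_argument("bad range: " + item);
      for (int run = first; run <= last; ++run) runs.insert(run);
    }
  }
  return std::vector<int>(runs.begin(), runs.end());
}

// Hypothetical helper: a purely numeric Pass value X becomes "passX";
// any other value (e.g. "cpass1_muon") is returned unchanged.
std::string NormalizePass(const std::string &pass) {
  if (pass.empty()) return pass;
  for (unsigned char c : pass)
    if (!std::isdigit(c)) return pass;
  return "pass" + pass;
}
```

With this sketch, `ExpandRuns("130831-130835,130833")` yields five runs, with 130833 listed once, and `NormalizePass("1")` yields `pass1`.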
208 
209 #### AliEn find format
210 
211 Whenever a user would like to process data which has not been produced
212 officially, or whose directory structure in the AliEn file catalog is
213 non-standard, an interface to the AliEn shell's `find` command is
214 provided.
215 
216 This is the command format:
217 
218  Find;BasePath=<BASEPATH>;FileName=<FILENAME>;Anchor=<ANCHOR>;TreeName=<TREENAME>;Regexp=<REGEXP>
219 
220 Parameters `BasePath` and `FileName` are passed as-is to the AliEn [find
221 command](http://alien2.cern.ch/index.php?option=com_content&view=article&id=53&Itemid=99#Searching_for_files),
222 and are mandatory.
223 
224 Parameters `Anchor`, `TreeName` and `Regexp` are optional.
225 
226 Here's a detailed description of the parameters.
227 
228 BasePath
229 : Start search under the specified path on the AliEn file catalog.
230 
231  Wildcard characters are supported: the asterisk (`*`) and the
232  percent sign (`%`) are interchangeable.
233 
234  Examples of valid values are:
235 
236  /alice/data/2010/LHC10h/000123456/*.*
237  /alice/cern.ch/user/d/dummy/my_pp_production/%.%
238 
239 FileName
240 : File name to look for.
241 
242  Examples of valid values are: `root_archive.zip`, `aod_archive.zip`,
243  `custom_archive.zip`, `AliAOD.root`.
244 
245 Anchor *(optional)*
246 : In case `FileName` is a zip archive, the anchor is the name of a
247  ROOT file inside the archive to point to.
248 
249  Examples of valid values are: `AliAOD.root`, `AliESDs.root`,
250  `MyRootFile.root`.
251 
252  > Using the AliEn file catalog it is possible to point directly to a
253  > ROOT file stored in an archive without using the anchor.
254  >
255  > There is however a substantial difference in how data is
256  > retrieved, especially during staging: auxiliary ROOT files
257  > *(friends)* are stored inside the archive along with the "main"
258  > file, so when you use the archive as `FileName` with the
259  > proper `Anchor` you still reference the same file, but
260  > you request that the archive be downloaded.
261  >
262  > Use the ROOT file name directly only in very special
263  > cases (*e.g.*, to save space) and only when you are completely sure
264  > that no other files in the archive are required for analysis.
265 
266 TreeName *(optional)*
267 : Name of each file's default tree.
268 
269  Examples of valid values are: `/aodTree`, `/esdTree`, `/myCustomTree`,
270  `/TheDirectory/TheTree`.
271 
272 Regexp *(optional)*
273 : Additional extended regular expression applied after the find command is
274  run, to refine search results.
275 
276  Only `alien://` paths matching the regular expression are
277  considered, others are discarded.
278 
279  Examples of valid values are:
280 
281  /[0-9]{6}/[0-9]{3,4}
282  \.root$
283 
284  > ROOT class
285  > [TPMERegexp](http://root.cern.ch/root/html/TPMERegexp.html) is
286  > used to perform regular expression matching.
287 
288 
289 Example of an AliEn raw find dataset string:
290 
291  Find;BasePath=/alice/data/2010/LHC10h/000139505/ESDs/pass1/*.*;FileName=root_archive.zip;Anchor=AliESDs.root
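
A query string in this format can be assembled programmatically. The following C++ sketch uses a hypothetical helper, `BuildFindQuery`, which is not part of ROOT. It enforces the two mandatory parameters and leaves out optional ones when they are empty. It does not escape values, so a value containing `;` is assumed not to occur.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: assembles a "Find;..." query string.
// BasePath and FileName are mandatory; empty optional fields are omitted.
// Assumption: no value contains the ';' separator.
std::string BuildFindQuery(const std::string &basePath,
                           const std::string &fileName,
                           const std::string &anchor = "",
                           const std::string &treeName = "",
                           const std::string &regexp = "") {
  if (basePath.empty() || fileName.empty())
    throw std::invalid_argument("BasePath and FileName are mandatory");

  std::string query = "Find;BasePath=" + basePath + ";FileName=" + fileName;
  if (!anchor.empty()) query += ";Anchor=" + anchor;
  if (!treeName.empty()) query += ";TreeName=" + treeName;
  if (!regexp.empty()) query += ";Regexp=" + regexp;
  return query;
}
```

Calling it with the base path, file name and anchor of the example above reproduces that example string.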
292 
293 #### Data access modes
294 
295 It is possible to append to the format string the `Mode` specifier that
296 affects the way URLs are generated.
297 
298  Mode=[local|remote|cache]
299 
300 This parameter is optional and defaults to `local`. Description of each
301 possible value follows:
302 
303 local
304 : Local storage is checked for the presence of data you requested.
305  Output URLs will be relative to your local storage. Also, locality
306  information *(i.e., is your file staged?)* is filled.
307 
308  If you run a PROOF analysis on a dataset with this mode specified,
309  only data marked as "staged" will be processed.
310 
311  This method is the preferred one, since it does not overload the
312  remote storage, and it enables users to process partially-staged
313  datasets, or partially-reconstructed runs, without the need to
314  manually update static datasets.
315 
316  > This is the default if no mode is specified, and it is also the
317  > most efficient one.
318  >
319  > Although it might take some time (up to a couple of minutes to
320  > locate ~4000 files), returned information is always reliable
321  > (because it's dynamic) and speeds up analysis (because analysis
322  > will always be run only on files having local copies).
323  >
324  > Moreover this information is cached for a configurable period of
325  > time, so that subsequent calls to the same dataset will be faster.
326 
327 remote
328 : Only AliEn URLs are returned.
329 
330  A PROOF analysis run on a dataset with this mode specified will
331  always obtain data from a remote storage, according to the AliEn
332  file catalog.
333 
334  > Tasks run on remote data are usually much slower than using local
335  > storage.
336 
337 cache
338 : URLs pointing to local copies of files are returned, without checking
339  whether each file is actually present locally.
340 
341  If local storage is configured for retrieving from AliEn files that
342  are not available locally (which is the case of xrootd with vMSS),
343  then data will be downloaded *while analysis is running*.
344 
345  It is called *cache mode* because it treats the local storage as a
346  cache for the remote storage.
347 
348  > This mode is usually very slow on a busy analysis facility since
349  > retrieving data in real time without any kind of scheduling is
350  > inefficient. It also conflicts with the preferred method, which is
351  > to stage data asynchronously using the [stager
352  > daemon](http://afdsmgrd.googlecode.com/).
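
Appending the `Mode` specifier to an existing query string can be sketched as follows. The helper name `AppendMode` is hypothetical and is not part of ROOT. Using `;` as the separator is an assumption based on the other fields of the query string.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: appends "Mode=<mode>" to a query string.
// Only the three documented values are accepted.
// Assumption: fields are separated by ';' like the rest of the string.
std::string AppendMode(const std::string &query, const std::string &mode) {
  if (mode != "local" && mode != "remote" && mode != "cache")
    throw std::invalid_argument("unknown mode: " + mode);

  std::string result = query;
  // Add a separator unless the query already ends with one.
  if (!result.empty() && result.back() != ';') result += ';';
  return result + "Mode=" + mode;
}
```

Since `local` is the default, appending the specifier is only needed for `remote` or `cache`.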
353 
354 #### Force cache refresh
355 
356 If the cached information for a certain AliEn file catalog query is wrong,
357 it is possible to force querying the catalog again by using the keyword
358 `ForceUpdate`:
359 
360  Data;ForceUpdate;Period=LHC10h;Variant=AOD086;Run=130831-130833;Pass=pass1
361 
362 ### Staging requests
363 
364 Issuing staging requests and keeping track of them requires an auxiliary
365 database that can be read and updated by the [data stager
366 daemon](http://afdsmgrd.googlecode.com/).
367 
368 Whenever a staging request is issued, a ROOT file containing the dataset
369 is saved in a special directory on the master's filesystem, monitored by
370 the file stager.
371 
372 #### PROOF configuration
373 
374 In the xproofd configuration file, there is a directive to specify the
375 directory used as repository for staging requests:
376 
377  xpd.stagereqrepo [dir:]/path/to/local/directory
378 
379 > The literal `dir:` prefix is optional.
380 
381 This directive is shared between PROOF and the stager daemon, so that the
382 same configuration file can be used for both.
383 
384 Permissions on this directory must be kept open.
385 
386 > Versions of the stager daemon prior to v1.0.7 do not support open
387 > permissions and the staging repository directive.
388 
389 #### Request and monitor staging
390 
391 Staging requests and monitoring can be done from within a PROOF session.
392 
393 `gProof->RequestStagingDataSet("QueryString")`
394 : Requests staging of the dataset specified via the query string.
395 
396  The staging request is honored if the stager daemon is running.
397 
398  > In order to avoid requesting to stage undesired data, it is
399  > advisable to check in advance the results of your query string:
400  >
401  > `gProof->ShowDataSet("QueryString")`
402 
403 `gProof->ShowStagingStatusDataSet("QueryString"[, "opts"])`
404 : Shows progress status of a previously given staging request with
405  data specified by the query string.
406 
407  Options are optional, and passed as-is to the `::Print()` method.
408 
409  > It is possible to show all the files marked as corrupted by the
410  > daemon:
411  >
412  > gProof->ShowStagingStatusDataSet("QueryString", "C")
413  >
414  > Or all the files successfully staged and not corrupted:
415  >
416  > gProof->ShowStagingStatusDataSet("QueryString", "Sc")
417 
418 `gProof->GetStagingStatusDataSet("QueryString")`
419 : Gets a `TFileCollection` containing information on the staging
420  request specified by the query string.
421 
422  Works exactly like `ShowStagingStatusDataSet()` but returns an
423  object instead of displaying information on the screen.
424 
425 `gProof->CancelStagingDataSet("QueryString")`
426 : Removes a dataset from the list of staging requests. Datasets used
427  as staging requests are usually removed automatically by the staging
428  daemon if everything went right, so this command is used mostly to
429  purge a completed staging request when it has some corrupted files.