Operation¶
The PACER currently supports the following operations:
- User and investigation synchronization
- Investigation minting
- Dataset ingestion
These operations are initiated through messages published to the PACER message broker. The message payloads, routing keys, and other relevant details for each operation are described below.
Dataset ingestion¶
This workflow is designed for ICAT deployments used together with DRAC / Data Portal. If you do not use DRAC / Data Portal, you can still use dataset ingestion, but some features of the default workflow may not apply to your implementation.
The dataset ingestion flow is divided into four independent stages. Splitting the work this way lets tasks run in parallel, which reduces the overall time required for dataset ingestion and avoids processing bottlenecks.
graph LR
A[dataset-ingestion] --> B[internal-dataset-ingestion]
B --> C[internal-statistics]
B --> D[internal-dataset-links]
Dataset ingestion messages must be sent only to the first stage. From there, messages are automatically routed by PACER to the subsequent stages. Each stage performs the following tasks:
- dataset-ingestion: Validates the dataset payload, checks for duplicates, and creates the dataset in ICAT.
- internal-dataset-ingestion: Creates dataset datafiles in ICAT, registers dataset parameters, and establishes dataset links (raw–processed) in ICAT.
- internal-statistics: Computes statistical metrics stored as dataset, sample, or investigation parameters. These metrics are used in the Data Portal for data visualization (e.g., total volume of raw datasets, volume of processed datasets per sample, etc.).
- internal-dataset-links: Establishes complete linkage between all raw and processed datasets.
If the ingestion process fails at any stage, the entire dataset creation process will be rolled back.
Note
RabbitMQ can work with priority queues. If you find that the processing of some messages needs to be prioritized, you can configure the PACER to enable priority queue support (see configuration).
Dataset ingestion¶
| Consumer | Exchange | Routing key | Payload support | Integrations |
|---|---|---|---|---|
| DatasetsConsumer | dataset-ingest-exchange | dataset.ingest | JSON, XML | ICAT |
{
  "investigation": "20250370148",
  "investigation_id": "3456",
  "instrument": "BL06",
  "name": "BSample-19X10_w1_01",
  "location": "20250370148/BSample-19X10_w1_01",
  "start_date": "2026-02-06T14:59:57.000+01:00",
  "end_date": "2026-02-06T14:59:57.000+01:00",
  "sample": {
    "name": "20260206_Sample-19-10",
    "type": "biological"
  },
  "datafiles": [
    { "location": "/data/bl06/20250370148/BSample-19X10_w1_01/data_1.h5" }
  ],
  "parameters": [
    { "name": "scanType", "value": "datacollection" }
  ]
}
- investigation_id: This parameter is optional. By default, the PACER tries to match an investigation in ICAT by the investigation's name. If the name alone does not identify a single investigation, it narrows the candidates using the dataset's start and end dates. If multiple investigations still match, ingestion is refused. To prevent this when multiple visits of the same investigation are scheduled at the same time, specify the investigation using the investigation_id field (ICAT's investigation.ID field).
- sample.type: Sample type enforcement is optional. If enforced, the sample type must exist in ICAT.
- datafiles: The PACER can automatically index all the files inside the dataset's root location (see config).
- parameters: Parameter types must exist in ICAT.
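The investigation-matching rule described in the notes above can be sketched as follows. This is an illustration only, not the PACER's actual code: the `Investigation` record, the `match_investigation` function, and the overlap test used to narrow candidates by date are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Investigation:
    """Hypothetical stand-in for one ICAT investigation (visit)."""
    id: int
    name: str
    start_date: datetime
    end_date: datetime


def match_investigation(
    candidates: List[Investigation],
    name: str,
    start: datetime,
    end: datetime,
    investigation_id: Optional[int] = None,
) -> Investigation:
    """Return the single matching investigation or raise ValueError."""
    if investigation_id is not None:
        # An explicit ID bypasses name and date matching entirely.
        matches = [c for c in candidates if c.id == investigation_id]
    else:
        matches = [c for c in candidates if c.name == name]
        if len(matches) > 1:
            # Assumed narrowing rule: keep visits whose time window
            # overlaps the dataset's start/end dates.
            matches = [
                c for c in matches
                if c.start_date <= end and start <= c.end_date
            ]
    if len(matches) != 1:
        raise ValueError(f"{len(matches)} investigations match; ingestion refused")
    return matches[0]
```

With two visits that overlap the dataset's time window, the function raises unless `investigation_id` is given, which mirrors the situation the note warns about.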
<?xml version="1.0" encoding="utf-8"?>
<dataset>
  <datafile>
    <location>/data/bl06/20250370148/BSample-19X10_w1_01/data_1.h5</location>
  </datafile>
  <instrument>bl06</instrument>
  <location>/data/bl06/20250370148/BSample-19X10_w1_01</location>
  <investigation>20250340249</investigation>
  <investigationId>3456</investigationId>
  <name>20250340249__POTENTIOSTAT__Ni3nm_0pt1MNaOH_4__Ni_Dummy_855_D134_C02</name>
  <startDate>2025-10-19 22:22:19.730125</startDate>
  <endDate>2025-10-19 22:22:19.730125</endDate>
  <sample>
    <name>Ni_Dummy_855_D134_C02</name>
    <type>biological</type>
  </sample>
  <parameter>
    <name>scanType</name>
    <value>datacollection</value>
  </parameter>
</dataset>
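A dataset message can be published to the first stage with any AMQP client. The sketch below uses the exchange and routing key from the table above and the third-party `pika` library. The broker URL, the `content_type` property, and the function names are assumptions; check your PACER configuration for the actual connection details.

```python
import json

EXCHANGE = "dataset-ingest-exchange"  # from the table above
ROUTING_KEY = "dataset.ingest"        # from the table above


def build_message(dataset: dict) -> dict:
    """Return the keyword arguments for a basic_publish call."""
    return {
        "exchange": EXCHANGE,
        "routing_key": ROUTING_KEY,
        "body": json.dumps(dataset).encode("utf-8"),
    }


def publish(dataset: dict, amqp_url: str) -> None:
    """Publish one JSON dataset message (requires the pika package)."""
    import pika  # imported lazily so build_message works without pika

    connection = pika.BlockingConnection(pika.URLParameters(amqp_url))
    try:
        channel = connection.channel()
        channel.basic_publish(
            # Assumption: the content type is set for clarity; the source
            # does not say how the PACER distinguishes JSON from XML.
            properties=pika.BasicProperties(content_type="application/json"),
            **build_message(dataset),
        )
    finally:
        connection.close()
```

Only the first stage is addressed here; the later stages are reached through the PACER's own routing.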
Considerations regarding initial dataset ingestion:
- A processed dataset can be associated with one or more input datasets through the input_datasets parameter. The value of this parameter must be a comma-separated list of the locations of all input datasets involved in its creation.
- A dataset is classified as processed if the input_datasets parameter is present. Otherwise, it is treated as a raw dataset.
- When a duplicate raw dataset is ingested, a new dataset is created with a timestamp appended to its name.
- Duplicate processed datasets are not renamed during initial ingestion; they are handled in a subsequent stage of the ingestion pipeline.
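The raw/processed classification above can be sketched as follows. The function names are hypothetical, and the parameter list is assumed to have the same shape as the `parameters` array of the JSON payload.

```python
from typing import List, Optional


def parse_input_datasets(parameters: List[dict]) -> Optional[List[str]]:
    """Return the input dataset locations, or None if the parameter is absent."""
    for parameter in parameters:
        if parameter["name"] == "input_datasets":
            # The value is a comma-separated list of dataset locations.
            return [loc.strip() for loc in parameter["value"].split(",") if loc.strip()]
    return None


def classify(parameters: List[dict]) -> str:
    """A dataset is 'processed' if input_datasets is present, else 'raw'."""
    return "raw" if parse_input_datasets(parameters) is None else "processed"
```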
Internal dataset ingestion¶
| Consumer | Exchange | Routing key | Payload support | Integrations |
|---|---|---|---|---|
| InternalDatasetsConsumer | dataset-internal-ingest-exchange | dataset.internal_ingest | JSON, XML | ICAT |
This stage creates the dataset's associated datafiles and parameters.
If the automaticDatasetLocationIndex setting is enabled, all
files under the dataset's root location are automatically indexed, up to the limit defined by
maxDatafilesPerDataset. Otherwise, only the explicitly specified
datafiles are linked to the dataset in ICAT.
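A minimal sketch of this selection logic is shown below. It assumes that the limit simply truncates the list of indexed files and that files are visited in sorted order; the source does not specify either behaviour, and the function names are hypothetical.

```python
import os
from typing import List


def index_datafiles(root: str, max_datafiles: int) -> List[str]:
    """Return up to max_datafiles file paths found under root."""
    found: List[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for filename in sorted(filenames):
            if len(found) >= max_datafiles:
                return found
            found.append(os.path.join(dirpath, filename))
    return found


def select_datafiles(dataset: dict, automatic_index: bool, max_datafiles: int) -> List[str]:
    """Pick indexed files or only the datafiles listed in the payload."""
    if automatic_index:
        return index_datafiles(dataset["location"], max_datafiles)
    return [f["location"] for f in dataset.get("datafiles", [])]
```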
For duplicate processed datasets, all associated datafiles are recreated; any existing datafiles are removed before the
new ones are created. Dataset parameters may also be updated if the newly ingested dataset has a higher processing
version than the existing dataset (the processing version is defined by the Process_sequence_index dataset parameter).
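The version comparison can be illustrated with the sketch below. It assumes that Process_sequence_index holds an integer and that a missing value counts as version 0; neither detail is stated in the source.

```python
def should_update_parameters(existing: dict, incoming: dict) -> bool:
    """Return True if the incoming dataset has a higher processing version."""

    def version(parameters: dict) -> int:
        raw = parameters.get("Process_sequence_index")
        # Assumption: a missing index is treated as version 0.
        return int(raw) if raw is not None else 0

    return version(incoming) > version(existing)
```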
Internal statistics¶
| Consumer | Exchange | Routing key | Payload support | Integrations |
|---|---|---|---|---|
| InternalStatisticsConsumer | dataset-internal-ingest-exchange | statistics.internal_ingest | JSON, XML | ICAT |
This stage computes the metrics used by the Data Portal to display dataset information. The full list of computed metrics is shown below.
| Scope | Metric |
|---|---|
| Sample | __datasetCount |
| Sample | __acquisitionDatasetCount |
| Sample | __processedDatasetCount |
| Sample | __fileCount |
| Sample | __volume |
| Sample | __acquisitionFileCount |
| Sample | __processedFileCount |
| Sample | __elapsedTime |
| Sample | __acquisitionVolume |
| Sample | __processedVolume |
| Scope | Metric |
|---|---|
| Investigation | __datasetCount |
| Investigation | __acquisitionInvestigationCount |
| Investigation | __processedInvestigationCount |
| Investigation | __sampleCount |
| Investigation | __volume |
| Investigation | __acquisitionVolume |
| Investigation | __processedVolume |
| Investigation | __elapsedTime |
| Investigation | __fileCount |
| Investigation | __acquisitionFileCount |
| Investigation | __processedFileCount |
| Scope | Metric |
|---|---|
| Dataset | datasetName |
| Dataset | __fileCount |
| Dataset | __volume |
| Dataset | __elapsedTime |
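To illustrate how such metrics aggregate, the sketch below derives some of the sample-level values from per-dataset figures. It assumes that "acquisition" refers to raw datasets and that volumes are plain numbers (for example bytes); the input record shape and the function name are hypothetical, and __elapsedTime is left out because the source does not define how it is computed.

```python
from typing import Dict, List


def _total(datasets: List[dict], key: str) -> int:
    return sum(d[key] for d in datasets)


def sample_metrics(datasets: List[dict]) -> Dict[str, int]:
    """Aggregate per-dataset figures into sample-level metrics."""
    raw = [d for d in datasets if not d["processed"]]
    processed = [d for d in datasets if d["processed"]]
    return {
        "__datasetCount": len(datasets),
        "__acquisitionDatasetCount": len(raw),
        "__processedDatasetCount": len(processed),
        "__fileCount": _total(datasets, "file_count"),
        "__acquisitionFileCount": _total(raw, "file_count"),
        "__processedFileCount": _total(processed, "file_count"),
        "__volume": _total(datasets, "volume"),
        "__acquisitionVolume": _total(raw, "volume"),
        "__processedVolume": _total(processed, "volume"),
    }
```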
Internal dataset links¶
| Consumer | Exchange | Routing key | Payload support | Integrations |
|---|---|---|---|---|
| InternalDatasetsLinksConsumer | dataset-internal-ingest-exchange | dataset.internal_links | JSON, XML | ICAT |
Processed datasets are linked to their input datasets through the input_datasets parameter. However, this mechanism
only captures direct relationships and does not account for transitive dependencies.
For example, if processed-dataset-3 lists processed-dataset-1 as an input, and processed-dataset-1 was itself
generated from raw-dataset-1, the relationship between processed-dataset-3 and raw-dataset-1 is not captured
directly by input_datasets.
graph LR
R1[raw-dataset-1] --> P1[processed-dataset-1]
R2[raw-dataset-2] --> P1[processed-dataset-1]
R3[raw-dataset-3] --> P2[processed-dataset-2]
R4[raw-dataset-4] --> P2[processed-dataset-2]
P1 --> P3[processed-dataset-3]
R3 --> P3
The Data Portal maintains the complete set of input and output datasets that are related to a given dataset, either directly or through transitive dependencies. This stage computes these relationships for all datasets and stores them in the following parameters:
- __full_input_datasetIds
- __full_output_datasetIds
- __full_input_datasetNames
- __full_output_datasetNames
These parameters provide a complete lineage view of each dataset, including both its upstream dependencies and downstream derivatives.
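Computing the full lineage amounts to a transitive closure over the direct input_datasets relationships. The sketch below shows one way to do it with a memoized depth-first traversal. It is not the PACER's implementation; the function names and the dictionary-based graph representation are assumptions.

```python
from typing import Dict, FrozenSet, Iterable, Set


def full_inputs(direct_inputs: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each dataset to every dataset it depends on, directly or transitively."""
    closure: Dict[str, Set[str]] = {}

    def visit(node: str, active: FrozenSet[str]) -> Set[str]:
        if node in closure:
            return closure[node]
        if node in active:
            raise ValueError(f"cycle detected at {node}")
        active = active | {node}
        result: Set[str] = set()
        for parent in direct_inputs.get(node, ()):
            result.add(parent)
            result |= visit(parent, active)
        closure[node] = result
        return result

    for node in direct_inputs:
        visit(node, frozenset())
    return closure


def full_outputs(inputs_closure: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert the closure: map each dataset to everything derived from it."""
    outputs: Dict[str, Set[str]] = {}
    for node, ancestors in inputs_closure.items():
        outputs.setdefault(node, set())
        for ancestor in ancestors:
            outputs.setdefault(ancestor, set()).add(node)
    return outputs
```

For a graph shaped like the example above, `full_inputs({"P1": ["R1", "R2"], "P2": ["R3", "R4"], "P3": ["P1", "R3"]})` gives P3 the inputs P1, R1, R2 and R3, even though P3 only lists P1 and R3 directly.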
User synchronization¶
| Consumer | Exchange | Routing key | Payload support | Integrations |
|---|---|---|---|---|
| UsersConsumer | uos-sync-exchange | uos_user.sync | JSON | ICAT, VISA |
{
  "first_name": "Aitor",
  "last_name": "Tilla",
  "ORCID": "0000-0000-0000-0000",
  "email": "aitortilla@email.test",
  "affiliation": {
    "id": 18212,
    "name": "University of Valencia",
    "code": "UV",
    "department_name": "Institute of Molecular Science (ICMol)",
    "department_code": "ICMol",
    "unit": null,
    "city": "Valencia",
    "country_code": "ES"
  },
  "is_staff": false,
  "enabled": true,
  "id": 998123,
  "user_list": [
    { "username": "tortilla" }
  ]
}
The PACER will synchronize users with both ICAT and VISA when both integrations are enabled. If only one integration is enabled, synchronization will be performed exclusively with the enabled system.
Investigation synchronization¶
| Consumer | Exchange | Routing key | Payload support | Integrations |
|---|---|---|---|---|
| InvestigationConsumer | uos-sync-exchange | uos_proposal.sync | JSON | ICAT, VISA |
{
  "name": "20260999999",
  "start_date": "2026-06-01T14:33:04",
  "end_date": "2026-06-01T14:33:04",
  "title": "<<Proposal title>>",
  "summary": "<<Proposal abstract>>",
  "type": "<<ICAT investigation.type.name>>",
  "instrument": {
    "name": "BL04 - MSPD",
    "code": "BL04"
  },
  "visit_count": 0,
  "is_reimbursed": false,
  "user_list": [
    {
      "username": "<<User's username>>",
      "email": "<<User's email>>",
      "role": "<<User's role in investigation>>"
    }
  ],
  "visa_sync": true,
  "icat_sync": true,
  "icat_visit_id": "uo_12384",
  "visa_visit_id": 10012384,
  "sample_acronyms": [
    "ABC"
  ],
  "is_industrial": false
}
- The instrument.name field is used to match the investigation with an instrument in VISA.
- The instrument.code field is used to match the investigation with an instrument in ICAT through the instrument's name field.
- The role field of each user_list entry supports the following roles: Principal investigator, Proposal scientist, Participant and Local contact.
- The sample_acronyms field is used to populate the 'Short names' column in the Data Portal's shipping window.
An investigation's release date is calculated from the investigation's date and the embargo period, in years, specified in the PACER's configuration. If the is_industrial flag is set to true, the investigation will not have a release date and will never be made public.
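The release date rule can be sketched as follows. Which investigation date serves as the reference is not specified here, so the sketch takes it as a parameter; the function name and the handling of 29 February are assumptions.

```python
from datetime import date
from typing import Optional


def release_date(reference: date, embargo_years: int, is_industrial: bool) -> Optional[date]:
    """Return the public release date, or None for industrial investigations."""
    if is_industrial:
        # Industrial investigations never become public.
        return None
    try:
        return reference.replace(year=reference.year + embargo_years)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 28 February
        # (assumed behaviour).
        return reference.replace(year=reference.year + embargo_years, day=28)
```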
The roles a user can have in an investigation are validated against the PACER's configuration. By default, the allowed values are: Principal investigator, Local contact, Proposal scientist and Participant.
If additional roles are required, they must be added to the PACER's helpers.static_settings in the following format (with the variable name prefixed by ICAT_USER_ROLE_):
ICAT_USER_ROLE_PRINCIPAL_INVESTIGATOR: str = "Principal investigator"
ICAT_USER_ROLE_PROPOSER: str = "Proposal scientist"
ICAT_USER_ROLE_PARTICIPANT: str = "Participant"
ICAT_USER_ROLE_LOCAL_CONTACT: str = "Local contact"
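One way such prefixed settings can drive validation is sketched below. The `StaticSettings` class is a hypothetical stand-in for helpers.static_settings, the "Data curator" role is an invented example of an additional role, and the lookup by prefix is an assumption about how the PACER collects the allowed values.

```python
class StaticSettings:
    """Hypothetical stand-in for the PACER's helpers.static_settings."""
    ICAT_USER_ROLE_PRINCIPAL_INVESTIGATOR: str = "Principal investigator"
    ICAT_USER_ROLE_PROPOSER: str = "Proposal scientist"
    ICAT_USER_ROLE_PARTICIPANT: str = "Participant"
    ICAT_USER_ROLE_LOCAL_CONTACT: str = "Local contact"
    # Hypothetical additional role, added with the required prefix.
    ICAT_USER_ROLE_DATA_CURATOR: str = "Data curator"


def allowed_roles(settings=StaticSettings) -> set:
    """Collect the values of every attribute prefixed with ICAT_USER_ROLE_."""
    return {
        value
        for name, value in vars(settings).items()
        if name.startswith("ICAT_USER_ROLE_")
    }


def validate_role(role: str, settings=StaticSettings) -> str:
    """Return the role unchanged if allowed, otherwise raise ValueError."""
    if role not in allowed_roles(settings):
        raise ValueError(f"unsupported investigation role: {role!r}")
    return role
```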
Investigation mint¶
| Consumer | Exchange | Routing key | Payload support | Integrations |
|---|---|---|---|---|
| InvestigationOperationsConsumer | investigation-ops-exchange | investigation.ops | JSON | ICAT, VISA, DataCite, PaNOSC |
The allowed values for the operations field are: mint-proposal and create-panosc-item.
If additional operations are required, they must be added to the PACER's helpers.static_settings as variables whose names are prefixed by INV_OPS_.
DOI creation¶
Before attempting to mint a DOI, PACER performs a series of validation checks on the investigation to ensure it is eligible:
- ❌ The investigation must not be of type industrial.
- ❌ The investigation must not already have a DOI assigned.
- ✅ The investigation must contain at least one dataset.
- ✅ The investigation’s end date must be in the past.
- ✅ The investigation must have valid associated users.
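The eligibility checks above can be expressed as a function that returns every reason blocking DOI creation. This is a sketch: the `InvestigationSummary` record is hypothetical, and "valid associated users" is reduced to "at least one user" because the source does not define the validity criteria.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class InvestigationSummary:
    """Hypothetical view of the investigation fields the checks need."""
    is_industrial: bool
    doi: Optional[str]
    dataset_count: int
    end_date: datetime
    users: List[str] = field(default_factory=list)


def doi_blockers(inv: InvestigationSummary, now: datetime) -> List[str]:
    """Return the reasons a DOI cannot be minted; empty means eligible."""
    problems: List[str] = []
    if inv.is_industrial:
        problems.append("investigation is industrial")
    if inv.doi:
        problems.append("investigation already has a DOI")
    if inv.dataset_count < 1:
        problems.append("investigation has no datasets")
    if inv.end_date >= now:
        problems.append("investigation has not ended yet")
    if not inv.users:
        # Simplification: only checks that at least one user exists.
        problems.append("investigation has no associated users")
    return problems
```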
Users associated with ICAT investigations are mapped to DataCite contributor types as follows:
| ICAT investigation role | DataCite contributor type |
|---|---|
| Principal investigator, Participant | Creator |
| Local contact | DataCollector |
| Principal investigator | ProjectManager |
| Proposal scientist | ProjectMember |
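The mapping in the table can be represented as a lookup from ICAT role to one or more DataCite types. Note that a Principal investigator appears in two rows and therefore yields two entries. The function name and the shape of the user records are assumptions made for illustration.

```python
from typing import Dict, List

# Transcribed from the table above.
ROLE_TO_CONTRIBUTOR_TYPES: Dict[str, List[str]] = {
    "Principal investigator": ["Creator", "ProjectManager"],
    "Participant": ["Creator"],
    "Local contact": ["DataCollector"],
    "Proposal scientist": ["ProjectMember"],
}


def contributor_entries(users: List[dict]) -> List[dict]:
    """Expand each user into one entry per mapped DataCite type."""
    entries: List[dict] = []
    for user in users:
        # Roles without a mapping produce no entries (assumed behaviour).
        for contributor_type in ROLE_TO_CONTRIBUTOR_TYPES.get(user["role"], []):
            entries.append({"name": user["name"], "type": contributor_type})
    return entries
```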
All remaining DOI metadata is populated using the configuration defined in the PACER DataCite integration settings (see configuration).
If VISA integration is enabled, the newly minted DOI will also be registered in the VISA database for the corresponding investigation.
PSS item creation¶
If enabled and specified in the message, the investigation will also have its corresponding item created in the PaNOSC Search Scoring service.