Operation

The PACER currently supports the following operations:

  • User and investigation synchronization
  • Investigation minting
  • Dataset ingestion

These operations are initiated through messages published to the PACER message broker. The message payloads, routing keys, and other relevant details for each operation are described below.
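
As a rough illustration of how a client might publish one of these messages, the sketch below serializes a payload as JSON and hands it to an AMQP channel. It assumes a channel object exposing `basic_publish(exchange, routing_key, body)`, as a pika `BlockingChannel` does; the helper name `publish_json` is hypothetical and is not part of the PACER.

```python
import json
from typing import Any, Mapping


def publish_json(
    channel: Any, exchange: str, routing_key: str, payload: Mapping[str, Any]
) -> bytes:
    """Serialize `payload` as JSON and publish it through an AMQP channel.

    `channel` is assumed to behave like pika's BlockingChannel, i.e. to
    provide basic_publish(exchange=..., routing_key=..., body=...).
    """
    body = json.dumps(payload).encode("utf-8")
    channel.basic_publish(exchange=exchange, routing_key=routing_key, body=body)
    return body


# Possible usage with pika (broker host is an assumption):
#   import pika
#   connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
#   publish_json(connection.channel(), "dataset-ingest-exchange",
#                "dataset.ingest", {"investigation": "20250370148"})
```

The exchange and routing key to use for each operation are the ones listed in that operation's table.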

Dataset ingestion

This workflow is designed for ICAT used in conjunction with DRAC / Data Portal. If you are not using DRAC / Data Portal, you can still use dataset ingestion, but some features of the default workflow may not apply to your implementation.

The dataset ingestion flow is divided into four independent stages. Splitting the work this way allows tasks to run in parallel, which reduces the overall ingestion time and avoids processing bottlenecks.

graph LR
  A[dataset-ingestion] --> B[internal-dataset-ingestion]
  B --> C[internal-statistics]
  B --> D[internal-dataset-links]

Dataset ingestion messages must be sent only to the first stage. From there, messages are automatically routed by PACER to the subsequent stages. Each stage performs the following tasks:

  • dataset-ingestion: Validates the dataset payload, checks for duplicates, and creates the dataset in ICAT.
  • internal-dataset-ingestion: Creates dataset datafiles in ICAT, registers dataset parameters, and establishes dataset links (raw–processed) in ICAT.
  • internal-statistics: Computes statistical metrics stored as dataset, sample, or investigation parameters. These metrics are used in the Data Portal for data visualization (e.g., total volume of raw datasets, volume of processed datasets per sample, etc.).
  • internal-dataset-links: Establishes complete linkage between all raw and processed datasets.

If the ingestion process fails at any stage, the entire dataset creation process will be rolled back.

Note

RabbitMQ supports priority queues. If some messages need to be processed ahead of others, you can configure the PACER to enable priority queue support (see configuration).

Dataset ingestion

Consumer | Exchange | Routing key | Payload support | Integrations
DatasetsConsumer | dataset-ingest-exchange | dataset.ingest | JSON, XML | ICAT

{
  "investigation": "20250370148",
  "investigation_id": "3456",
  "instrument": "BL06",
  "name": "BSample-19X10_w1_01",
  "location": "20250370148/BSample-19X10_w1_01",
  "start_date": "2026-02-06T14:59:57.000+01:00",
  "end_date": "2026-02-06T14:59:57.000+01:00",
  "sample": {
    "name": "20260206_Sample-19-10",
    "type": "biological"
  },
  "datafiles": [
    { "location": "/data/bl06/20250370148/BSample-19X10_w1_01/data_1.h5" }
  ],
  "parameters": [
    { "name": "scanType", "value": "datacollection" }
  ]
}

Notes on the payload fields:

  1. investigation_id: This field is optional. Without it, the PACER tries to match an investigation in ICAT by the investigation's name. If the name does not identify a single investigation, it narrows the candidates using the dataset's start and end dates. If multiple investigations still match, ingestion is refused.

    To prevent this from happening when multiple visits of the same investigation are scheduled at the same time, specify the investigation using the investigation_id field (ICAT's investigation.ID field).

  2. sample.type: Sample type enforcement is optional. If enforced, the sample type must exist in ICAT.

  3. datafiles: The PACER can automatically index all the files inside the dataset's root location (see config).

  4. parameters (name): Parameter types must exist in ICAT.

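
The investigation matching described in note 1 can be sketched roughly as follows. This is an illustration only, not the PACER's implementation: the `Investigation` record, the `InvestigationMatchError` exception, and the rule that "narrowing by dates" means the dataset interval lies inside the investigation interval are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Investigation:
    """Minimal stand-in for an ICAT investigation (hypothetical shape)."""
    id: int
    name: str
    start_date: datetime
    end_date: datetime


class InvestigationMatchError(Exception):
    """Raised when a dataset cannot be tied to exactly one investigation."""


def resolve_investigation(
    candidates: Sequence[Investigation],
    name: str,
    dataset_start: datetime,
    dataset_end: datetime,
    investigation_id: Optional[int] = None,
) -> Investigation:
    # An explicit id skips name and date matching entirely.
    if investigation_id is not None:
        for inv in candidates:
            if inv.id == investigation_id:
                return inv
        raise InvestigationMatchError(f"no investigation with id {investigation_id}")

    by_name = [inv for inv in candidates if inv.name == name]
    if len(by_name) == 1:
        return by_name[0]

    # Assumption: narrowing keeps investigations whose interval contains the dataset.
    by_date = [
        inv for inv in by_name
        if inv.start_date <= dataset_start and dataset_end <= inv.end_date
    ]
    if len(by_date) == 1:
        return by_date[0]
    raise InvestigationMatchError(
        f"{len(by_date)} investigations match {name!r}; specify investigation_id"
    )
```

Two visits with the same name and overlapping dates end in the error branch, which is the case the investigation_id field is meant to resolve.
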
<?xml version="1.0" encoding="utf-8"?>
<dataset>
  <datafile>
    <location>/data/bl06/20250370148/BSample-19X10_w1_01/data_1.h5</location>
  </datafile>
  <instrument>bl06</instrument>
  <location>/data/bl06/20250370148/BSample-19X10_w1_01</location>
  <investigation>20250340249</investigation>
  <investigationId>3456</investigationId>
  <name>20250340249__POTENTIOSTAT__Ni3nm_0pt1MNaOH_4__Ni_Dummy_855_D134_C02</name>
  <startDate>2025-10-19 22:22:19.730125</startDate>
  <endDate>2025-10-19 22:22:19.730125</endDate>
  <sample>
    <name>Ni_Dummy_855_D134_C02</name>
    <type>biological</type>
  </sample>
  <parameter>
    <name>scanType</name>
    <value>datacollection</value>
  </parameter>
</dataset>

Considerations regarding initial dataset ingestion:

  • A processed dataset can be associated with one or more input datasets through the input_datasets parameter. The value of this parameter must be a comma-separated list of the locations of all input datasets involved in its creation.
  • A dataset is classified as processed if the input_datasets parameter is present. Otherwise, it is treated as a raw dataset.
  • When a duplicate raw dataset is ingested, a new dataset is created with a timestamp appended to its name.
  • Duplicate processed datasets are not renamed during initial ingestion; they are handled in a subsequent stage of the ingestion pipeline.
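
The raw/processed classification above can be expressed in a few lines. The sketch assumes the JSON payload shape shown earlier, where parameters is a list of name/value objects; the function names are hypothetical.

```python
from typing import Any, Dict, List


def input_dataset_locations(payload: Dict[str, Any]) -> List[str]:
    """Return the locations listed in the input_datasets parameter, if any."""
    for param in payload.get("parameters", []):
        if param.get("name") == "input_datasets":
            value = str(param.get("value", ""))
            # The value is a comma-separated list of dataset locations.
            return [loc.strip() for loc in value.split(",") if loc.strip()]
    return []


def is_processed(payload: Dict[str, Any]) -> bool:
    """A dataset is treated as processed when input_datasets is present."""
    return any(
        param.get("name") == "input_datasets"
        for param in payload.get("parameters", [])
    )
```
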

Internal dataset ingestion

Consumer | Exchange | Routing key | Payload support | Integrations
InternalDatasetsConsumer | dataset-internal-ingest-exchange | dataset.internal_ingest | JSON, XML | ICAT
    -- Payload is sent over from previous stage --

This stage creates the dataset's associated datafiles and parameters.

If the automaticDatasetLocationIndex setting is enabled, all files under the dataset's root location are automatically indexed, up to the limit defined by maxDatafilesPerDataset. Otherwise, only the explicitly specified datafiles are linked to the dataset in ICAT.
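
A minimal sketch of this choice between automatic indexing and explicit datafiles is shown below. It assumes that the limit simply truncates the list of indexed files; the PACER may handle an exceeded limit differently. The function and argument names are hypothetical.

```python
import os
from typing import List, Sequence


def collect_datafiles(
    root: str,
    explicit: Sequence[str],
    automatic_index: bool,
    max_datafiles: int,
) -> List[str]:
    """Return the datafile paths to link to a dataset."""
    if not automatic_index:
        # Only the datafiles named in the payload are used.
        return list(explicit)

    found: List[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # keep the traversal order deterministic
        for filename in sorted(filenames):
            if len(found) >= max_datafiles:
                return found  # assumption: stop once the limit is reached
            found.append(os.path.join(dirpath, filename))
    return found
```
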

For duplicate processed datasets, all associated datafiles are recreated; any existing datafiles are removed before the new ones are created. Dataset parameters may also be updated if the newly ingested dataset has a higher processing version than the existing dataset (the processing version is defined by the Process_sequence_index dataset parameter).
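
The version comparison can be pictured as below. Treating a missing Process_sequence_index as 0 is an assumption, as are the function names.

```python
from typing import Mapping


def processing_version(parameters: Mapping[str, str]) -> int:
    """Read Process_sequence_index; a missing value counts as 0 (assumption)."""
    return int(parameters.get("Process_sequence_index", 0))


def should_update_parameters(
    existing: Mapping[str, str], incoming: Mapping[str, str]
) -> bool:
    """Update only when the incoming dataset has a higher processing version."""
    return processing_version(incoming) > processing_version(existing)
```
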

Internal statistics

Consumer | Exchange | Routing key | Payload support | Integrations
InternalStatisticsConsumer | dataset-internal-ingest-exchange | statistics.internal_ingest | JSON, XML | ICAT
    -- Payload is sent over from previous stage --

This stage computes the metrics used by the Data Portal to display dataset information. The full list of computed metrics is shown below.

Scope | Metric
Sample | __datasetCount
Sample | __acquisitionDatasetCount
Sample | __processedDatasetCount
Sample | __fileCount
Sample | __acquisitionFileCount
Sample | __processedFileCount
Sample | __elapsedTime
Sample | __volume
Sample | __acquisitionVolume
Sample | __processedVolume

Scope | Metric
Investigation | __datasetCount
Investigation | __acquisitionInvestigationCount
Investigation | __processedInvestigationCount
Investigation | __sampleCount
Investigation | __volume
Investigation | __acquisitionVolume
Investigation | __processedVolume
Investigation | __elapsedTime
Investigation | __fileCount
Investigation | __acquisitionFileCount
Investigation | __processedFileCount

Scope | Metric
Dataset | datasetName
Dataset | __fileCount
Dataset | __volume
Dataset | __elapsedTime
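
To give an idea of how the count and volume metrics relate to each other, the sketch below aggregates a sample's datasets into some of the sample-scope parameters. It assumes that "acquisition" refers to raw datasets and that volumes are expressed in bytes; `DatasetSummary` and `sample_metrics` are hypothetical names, and __elapsedTime is left out because its definition is not given here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class DatasetSummary:
    file_count: int
    volume: int      # assumed to be in bytes
    processed: bool  # False means raw (acquisition) dataset


def sample_metrics(datasets: Iterable[DatasetSummary]) -> Dict[str, int]:
    """Aggregate per-dataset figures into sample-scope metrics."""
    items = list(datasets)
    raw = [d for d in items if not d.processed]
    processed = [d for d in items if d.processed]
    return {
        "__datasetCount": len(items),
        "__acquisitionDatasetCount": len(raw),
        "__processedDatasetCount": len(processed),
        "__fileCount": sum(d.file_count for d in items),
        "__acquisitionFileCount": sum(d.file_count for d in raw),
        "__processedFileCount": sum(d.file_count for d in processed),
        "__volume": sum(d.volume for d in items),
        "__acquisitionVolume": sum(d.volume for d in raw),
        "__processedVolume": sum(d.volume for d in processed),
    }
```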

Internal dataset links

Consumer | Exchange | Routing key | Payload support | Integrations
InternalDatasetsLinksConsumer | dataset-internal-ingest-exchange | dataset.internal_links | JSON, XML | ICAT
    -- Payload is sent over from previous stage --

Processed datasets are linked to their input datasets through the input_datasets parameter. However, this mechanism only captures direct relationships and does not account for transitive dependencies.

For example, if processed-dataset-3 lists processed-dataset-1 as an input, and processed-dataset-1 was itself generated from raw-dataset-1, the relationship between processed-dataset-3 and raw-dataset-1 is not captured directly by input_datasets.

graph LR
  R1[raw-dataset-1] --> P1[processed-dataset-1]
  R2[raw-dataset-2] --> P1[processed-dataset-1]

  R3[raw-dataset-3] --> P2[processed-dataset-2]
  R4[raw-dataset-4] --> P2[processed-dataset-2]

  P1 --> P3[processed-dataset-3]
  R3 --> P3

The Data Portal maintains the complete set of input and output datasets that are related to a given dataset, either directly or through transitive dependencies. This stage computes these relationships for all datasets and stores them in the following parameters:

  • __full_input_datasetIds
  • __full_output_datasetIds
  • __full_input_datasetNames
  • __full_output_datasetNames

These parameters provide a complete lineage view of each dataset, including both its upstream dependencies and downstream derivatives.
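
Computing the full lineage amounts to a transitive closure over the direct input links. The sketch below is one way to do it with a depth-first traversal; it is not the PACER's code, and it uses the node identifiers from the graph above (R1, P1, and so on) in place of real dataset ids.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set, Tuple


def _closure(graph: Mapping[str, Set[str]], start: str) -> Set[str]:
    """Return every node reachable from `start` by following `graph` edges."""
    seen: Set[str] = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def full_lineage(
    direct_inputs: Mapping[str, Iterable[str]]
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Return (full inputs, full outputs) for every dataset in the graph."""
    inputs = {dataset: set(parents) for dataset, parents in direct_inputs.items()}
    # Invert the direct input links to obtain the direct output links.
    outputs: Dict[str, Set[str]] = defaultdict(set)
    for dataset, parents in inputs.items():
        for parent in parents:
            outputs[parent].add(dataset)
    nodes = set(inputs) | set(outputs)
    full_inputs = {n: _closure(inputs, n) for n in nodes}
    full_outputs = {n: _closure(outputs, n) for n in nodes}
    return full_inputs, full_outputs


# Direct links from the graph above: P3 lists P1 and R3 as inputs.
graph = {"P1": ["R1", "R2"], "P2": ["R3", "R4"], "P3": ["P1", "R3"]}
full_in, full_out = full_lineage(graph)
```

Here `full_in["P3"]` contains R1 and R2 in addition to the direct inputs P1 and R3, which is the transitive relationship that input_datasets alone does not capture.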

User synchronization

Consumer | Exchange | Routing key | Payload support | Integrations
UsersConsumer | uos-sync-exchange | uos_user.sync | JSON | ICAT, VISA

{
    "first_name": "Aitor",
    "last_name": "Tilla",
    "ORCID": "0000-0000-0000-0000",
    "email": "aitortilla@email.test",
    "affiliation": {
        "id": 18212,
        "name": "University of Valencia",
        "code": "UV",
        "department_name": "Institute of Molecular Science (ICMol)",
        "department_code": "ICMol",
        "unit": null,
        "city": "Valencia",
        "country_code": "ES"
    },
    "is_staff": false,
    "enabled": true,
    "id": 998123,
    "user_list": [
        {"username": "tortilla"}
    ]
}

The PACER will synchronize users with both ICAT and VISA when both integrations are enabled. If only one integration is enabled, synchronization will be performed exclusively with the enabled system.

Investigation synchronization

Consumer | Exchange | Routing key | Payload support | Integrations
InvestigationConsumer | uos-sync-exchange | uos_proposal.sync | JSON | ICAT, VISA

{
    "name": "20260999999",
    "start_date": "2026-06-01T14:33:04",
    "end_date": "2026-06-01T14:33:04",
    "title": "<<Proposal title>>",
    "summary": "<<Proposal abstract>>",
    "type": "<<ICAT investigation.type.name>>",
    "instrument": {
        "name": "BL04 - MSPD",
        "code": "BL04"
    },
    "visit_count": 0,
    "is_reimbursed": false,
    "user_list": [
        {
            "username": "<<User's username>>",
            "email": "<<User's email>>",
            "role": "<<User's role in investigation>>"
        }
    ],
    "visa_sync": true,
    "icat_sync": true,
    "icat_visit_id": "uo_12384",
    "visa_visit_id": 10012384,
    "sample_acronyms": [
        "ABC"
    ],
    "is_industrial": false
}

Notes on the payload fields:

  1. The instrument.name field is used to match the investigation with an instrument in VISA.
  2. The instrument.code field is used to match the investigation with an instrument in ICAT through the instrument's name field.
  3. The role field of each user_list entry must be one of the supported roles: Principal investigator, Proposal scientist, Participant and Local contact.
  4. The sample_acronyms field is used to populate the 'Short names' column in the Data Portal's shipping window.

An investigation's release date is calculated from the investigation's date and the embargo period, in years, specified in the PACER's configuration. If the is_industrial flag is set to true, the investigation has no release date and will never become public.
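
The calculation can be sketched as follows. Which investigation date serves as the reference is not stated here, so the function takes it as an argument; the 28 February fallback for leap days and the function name are assumptions.

```python
from datetime import date
from typing import Optional


def release_date(
    reference: date, embargo_years: int, is_industrial: bool
) -> Optional[date]:
    """Return the release date, or None for industrial investigations."""
    if is_industrial:
        return None  # industrial investigations are never released
    target_year = reference.year + embargo_years
    try:
        return reference.replace(year=target_year)
    except ValueError:
        # 29 February with a non-leap target year: use 28 February (assumption).
        return reference.replace(year=target_year, day=28)
```

For example, with a three-year embargo a reference date of 2026-06-01 gives a release date of 2029-06-01.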

The roles a user can have in an investigation are validated and set by configuration in the PACER. By default, the allowed values are: Principal investigator, Local contact, Proposal scientist and Participant.

If any additional roles are required, they must be added in the PACER's helpers.static_settings in the following format (with the variable name prefixed by ICAT_USER_ROLE_):

ICAT_USER_ROLE_PRINCIPAL_INVESTIGATOR: str = "Principal investigator"
ICAT_USER_ROLE_PROPOSER: str = "Proposal scientist"
ICAT_USER_ROLE_PARTICIPANT: str = "Participant"
ICAT_USER_ROLE_LOCAL_CONTACT: str = "Local contact"
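
One way such prefixed settings could be turned into a set of allowed roles is shown below. The `StaticSettings` class is only a stand-in for helpers.static_settings, and `allowed_roles` is a hypothetical helper; how the PACER actually reads these variables is not described here.

```python
from typing import FrozenSet


class StaticSettings:
    """Stand-in for helpers.static_settings (hypothetical shape)."""
    ICAT_USER_ROLE_PRINCIPAL_INVESTIGATOR: str = "Principal investigator"
    ICAT_USER_ROLE_PROPOSER: str = "Proposal scientist"
    ICAT_USER_ROLE_PARTICIPANT: str = "Participant"
    ICAT_USER_ROLE_LOCAL_CONTACT: str = "Local contact"
    UNRELATED_SETTING: str = "ignored"


def allowed_roles(settings: object, prefix: str = "ICAT_USER_ROLE_") -> FrozenSet[str]:
    """Collect the values of all string settings whose name starts with `prefix`.

    `settings` is expected to be a module or a class, so that vars() exposes
    the variables defined on it.
    """
    return frozenset(
        value
        for name, value in vars(settings).items()
        if name.startswith(prefix) and isinstance(value, str)
    )


def is_valid_role(role: str, settings: object = StaticSettings) -> bool:
    return role in allowed_roles(settings)
```

With this pattern, adding a new prefixed variable is enough to make the corresponding role pass validation.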

Investigation mint

Consumer | Exchange | Routing key | Payload support | Integrations
InvestigationOperationsConsumer | investigation-ops-exchange | investigation.ops | JSON | ICAT, VISA, DataCite, PaNOSC

{
    "name": "20260112299",
    "visit_id": "bl11",
    "operations": [
        "mint-proposal",
        "create-panosc-item"
    ]
}

The allowed values for the operations field are: mint-proposal and create-panosc-item.

If any additional operations are required, they must be added in the PACER's helpers.static_settings in the following format (with the variable name prefixed by INV_OPS_):

INV_OPS_MINT_PROPOSAL: str = "mint-proposal"
INV_OPS_CREATE_PANOSC_ITEM: str = "create-panosc-item"

DOI creation

Before attempting to mint a DOI, PACER performs a series of validation checks on the investigation to ensure it is eligible:

  • ❌ The investigation must not be of type industrial.
  • ❌ The investigation must not already have a DOI assigned.
  • ✅ The investigation must contain at least one dataset.
  • ✅ The investigation’s end date must be in the past.
  • ✅ The investigation must have valid associated users.
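
The checks above can be read as a list of blocking conditions. The sketch below collects them for a simplified investigation record; `InvestigationState`, its fields, and the interpretation of "valid associated users" as a positive count of valid users are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class InvestigationState:
    """Simplified view of an investigation (hypothetical shape)."""
    is_industrial: bool
    doi: Optional[str]
    dataset_count: int
    end_date: datetime
    valid_user_count: int


def doi_blockers(inv: InvestigationState, now: datetime) -> List[str]:
    """Return the reasons why a DOI cannot be minted; empty means eligible."""
    blockers: List[str] = []
    if inv.is_industrial:
        blockers.append("investigation is industrial")
    if inv.doi:
        blockers.append("investigation already has a DOI")
    if inv.dataset_count < 1:
        blockers.append("investigation has no datasets")
    if inv.end_date >= now:
        blockers.append("investigation end date is not in the past")
    if inv.valid_user_count < 1:
        blockers.append("investigation has no valid associated users")
    return blockers
```

Returning every failed check rather than stopping at the first one makes it easier to report why an investigation was rejected.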

Users associated with ICAT investigations are mapped to DataCite contributor types as follows:

ICAT investigation role | DataCite contributor type
Principal investigator, Participant | Creator
Local contact | DataCollector
Principal investigator | ProjectManager
Proposal scientist | ProjectMember
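
The mapping can be written as a lookup table in which a principal investigator yields two entries. The sketch below is illustrative; the names are hypothetical, and silently skipping unknown roles is an assumption.

```python
from typing import Dict, List, Tuple

# ICAT investigation role -> DataCite types, following the table above.
ROLE_TO_DATACITE_TYPES: Dict[str, Tuple[str, ...]] = {
    "Principal investigator": ("Creator", "ProjectManager"),
    "Participant": ("Creator",),
    "Local contact": ("DataCollector",),
    "Proposal scientist": ("ProjectMember",),
}


def datacite_entries(users: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Expand (user name, ICAT role) pairs into (user name, DataCite type) pairs."""
    entries: List[Tuple[str, str]] = []
    for user_name, role in users:
        # Assumption: roles without a mapping produce no entry.
        for datacite_type in ROLE_TO_DATACITE_TYPES.get(role, ()):
            entries.append((user_name, datacite_type))
    return entries
```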

All remaining DOI metadata is populated using the configuration defined in the PACER DataCite integration settings (see configuration).

If VISA integration is enabled, the newly minted DOI will also be registered in the VISA database in the corresponding investigation.

PSS item creation

If enabled and specified in the message, the investigation will also have its corresponding item created in the PaNOSC Search Scoring service.