Operation¶

The PACER currently supports the following operations:

User and investigation synchronization
Investigation minting
Dataset ingestion

These operations are initiated through messages published to the PACER message broker. The message payloads, routing keys, and other relevant details for each operation are described below.
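As a rough illustration of what publishing such a message can look like from a client, the sketch below serializes a payload and sends it with the third-party pika RabbitMQ client. The helper names, the broker URL, and the content type are assumptions made for this example and are not part of the PACER. The exchange, routing key, and payload fields come from the operation descriptions in the following sections.

```python
import json


def encode_message(payload):
    """Serialize a payload dict to the UTF-8 JSON body sent to the broker."""
    return json.dumps(payload).encode("utf-8")


def publish(exchange, routing_key, payload,
            broker_url="amqp://guest:guest@localhost:5672/%2F"):
    """Publish one message. Requires pika; the default URL is a placeholder."""
    import pika  # imported lazily so encode_message works without pika installed

    connection = pika.BlockingConnection(pika.URLParameters(broker_url))
    try:
        channel = connection.channel()
        channel.basic_publish(
            exchange=exchange,
            routing_key=routing_key,
            body=encode_message(payload),
            # content_type is an assumption; check what your deployment expects.
            properties=pika.BasicProperties(content_type="application/json"),
        )
    finally:
        connection.close()


# Example message using fields from the "Investigation mint" section.
message = {"name": "20260112299", "visit_id": "bl11", "operations": ["mint-proposal"]}
body = encode_message(message)
# publish("investigation-ops-exchange", "investigation.ops", message)
```

The publish call is left commented out because it needs a reachable broker. Opening one connection per message keeps the sketch short; a long-running producer would normally reuse the connection.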

Dataset ingestion¶

This workflow is designed for ICAT deployments used together with DRAC / Data Portal. If you do not use DRAC / Data Portal, you can still use dataset ingestion, but some features of the default workflow may not apply to your implementation.

The dataset ingestion flow is divided into four independent stages. Splitting the work this way allows tasks to run in parallel, which reduces the overall ingestion time and avoids processing bottlenecks.

graph LR
  A[dataset-ingestion] --> B[internal-dataset-ingestion]
  B --> C[internal-statistics]
  B --> D[internal-dataset-links]

Dataset ingestion messages must be sent only to the first stage. From there, messages are automatically routed by PACER to the subsequent stages. Each stage performs the following tasks:

dataset-ingestion: Validates the dataset payload, checks for duplicates, and creates the dataset in ICAT.
internal-dataset-ingestion: Creates dataset datafiles in ICAT, registers dataset parameters, and establishes dataset links (raw–processed) in ICAT.
internal-statistics: Computes statistical metrics stored as dataset, sample, or investigation parameters. These metrics are used in the Data Portal for data visualization (e.g., total volume of raw datasets, volume of processed datasets per sample, etc.).
internal-dataset-links: Establishes complete linkage between all raw and processed datasets.

If the ingestion process fails at any stage, the entire dataset creation process will be rolled back.

Note

RabbitMQ supports priority queues. If some messages need to be processed ahead of others, you can enable priority queue support in the PACER (see configuration).

Dataset ingestion¶

Consumer	Exchange	Routing key	Payload support	Integrations
`DatasetsConsumer`	dataset-ingest-exchange	dataset.ingest	JSON, XML	ICAT

JSON

{
  "investigation": "20250370148",
  "investigation_id": "3456",
  "instrument": "BL06",
  "name": "BSample-19X10_w1_01",
  "location": "20250370148/BSample-19X10_w1_01",
  "start_date": "2026-02-06T14:59:57.000+01:00",
  "end_date": "2026-02-06T14:59:57.000+01:00",
  "sample": {
    "name": "20260206_Sample-19-10",
    "type": "biological"
  },
  "datafiles": [
    { "location": "/data/bl06/20250370148/BSample-19X10_w1_01/data_1.h5" }
  ],
  "parameters": [
    { "name": "scanType", "value": "datacollection" }
  ]
}

Notes on the payload fields:

investigation_id: Optional. The PACER first tries to match an investigation in ICAT by the investigation's name. If the name alone does not identify a single investigation, it narrows the candidates using the dataset's start and end dates. If several investigations still match, ingestion is refused. To prevent this when multiple visits of the same investigation are scheduled at the same time, specify the investigation with the investigation_id field (ICAT's investigation.ID field).
sample.type: Sample type enforcement is optional. If enforced, the sample type must exist in ICAT.
datafiles: The PACER can automatically index all the files under the dataset's root location (see configuration).
parameters: Parameter types must exist in ICAT.

XML (legacy)

<?xml version="1.0" encoding="utf-8"?>
<dataset>
  <datafile>
    <location>/data/bl06/20250370148/BSample-19X10_w1_01/data_1.h5</location>
  </datafile>
  <instrument>bl06</instrument>
  <location>/data/bl06/20250370148/BSample-19X10_w1_01</location>
  <investigation>20250340249</investigation>
  <investigationId>3456</investigationId>
  <name>20250340249__POTENTIOSTAT__Ni3nm_0pt1MNaOH_4__Ni_Dummy_855_D134_C02</name>
  <startDate>2025-10-19 22:22:19.730125</startDate>
  <endDate>2025-10-19 22:22:19.730125</endDate>
  <sample>
    <name>Ni_Dummy_855_D134_C02</name>
    <type>biological</type>
  </sample>
  <parameter>
    <name>scanType</name>
    <value>datacollection</value>
  </parameter>
</dataset>

Considerations regarding initial dataset ingestion:

A processed dataset can be associated with one or more input datasets through the input_datasets parameter. The value of this parameter must be a comma-separated list of the locations of all input datasets involved in its creation.
A dataset is classified as processed if the input_datasets parameter is present. Otherwise, it is treated as a raw dataset.
When a duplicate raw dataset is ingested, a new dataset is created with a timestamp appended to its name.
Duplicate processed datasets are not renamed during initial ingestion; they are handled in a subsequent stage of the ingestion pipeline.
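The raw versus processed rule above can be sketched as follows. This is a minimal illustration, assuming that input_datasets arrives as an entry in the payload's parameters list (like scanType in the JSON example) and that whitespace around each location can be ignored. The function names are hypothetical and are not taken from the PACER.

```python
def get_parameter(payload, name):
    """Return the value of the first parameter called `name`, or None if absent."""
    for parameter in payload.get("parameters", []):
        if parameter.get("name") == name:
            return parameter.get("value")
    return None


def input_dataset_locations(payload):
    """Split the comma-separated input_datasets value into a list of locations."""
    raw_value = get_parameter(payload, "input_datasets")
    if raw_value is None:
        return []
    # Stripping whitespace and skipping empty items is an assumption of this sketch.
    return [item.strip() for item in raw_value.split(",") if item.strip()]


def classify(payload):
    """A dataset is 'processed' when input_datasets is present, otherwise 'raw'."""
    if get_parameter(payload, "input_datasets") is not None:
        return "processed"
    return "raw"


raw_payload = {"parameters": [{"name": "scanType", "value": "datacollection"}]}
processed_payload = {
    "parameters": [
        {"name": "input_datasets", "value": "/data/bl06/a, /data/bl06/b"},
    ]
}
```

With these two payloads, classify returns "raw" for the first and "processed" for the second, and input_dataset_locations yields the two locations for the processed one.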

Internal dataset ingestion¶

Consumer	Exchange	Routing key	Payload support	Integrations
`InternalDatasetsConsumer`	dataset-internal-ingest-exchange	dataset.internal_ingest	JSON, XML	ICAT

JSON

    -- Payload is sent over from previous stage --

This stage creates the dataset's associated datafiles and parameters.

If the automaticDatasetLocationIndex setting is enabled, all files under the dataset's root location are automatically indexed, up to the limit defined by maxDatafilesPerDataset. Otherwise, only the explicitly specified datafiles are linked to the dataset in ICAT.

For duplicate processed datasets, all associated datafiles are recreated; any existing datafiles are removed before the new ones are created. Dataset parameters may also be updated if the newly ingested dataset has a higher processing version than the existing dataset (the processing version is defined by the Process_sequence_index dataset parameter).
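The choice between automatic indexing and explicit datafiles might look like the sketch below. It is an illustration only: the function name is hypothetical, the two arguments stand in for the automaticDatasetLocationIndex and maxDatafilesPerDataset settings, and stopping at the limit (rather than rejecting the dataset) is an assumption about behaviour that is not specified here.

```python
import os
import tempfile


def select_datafiles(root, explicit_datafiles, automatic_index, max_datafiles):
    """Return the datafile paths that would be linked to the dataset."""
    if not automatic_index:
        # Only the explicitly listed datafiles are used.
        return list(explicit_datafiles)
    found = []
    for directory, subdirectories, filenames in os.walk(root):
        subdirectories.sort()  # makes the traversal order deterministic
        for filename in sorted(filenames):
            if len(found) >= max_datafiles:
                return found  # assumption: stop collecting once the limit is reached
            found.append(os.path.join(directory, filename))
    return found


# Small demonstration on a temporary directory with three files and a limit of two.
with tempfile.TemporaryDirectory() as root:
    for filename in ("a.h5", "b.h5", "c.h5"):
        open(os.path.join(root, filename), "w").close()
    indexed = select_datafiles(root, [], automatic_index=True, max_datafiles=2)
    indexed_names = [os.path.basename(path) for path in indexed]
    explicit = select_datafiles(root, ["/data/x.h5"], automatic_index=False, max_datafiles=2)
```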

Internal statistics¶

Consumer	Exchange	Routing key	Payload support	Integrations
`InternalStatisticsConsumer`	dataset-internal-ingest-exchange	statistics.internal_ingest	JSON, XML	ICAT

JSON

    -- Payload is sent over from previous stage --

This stage computes the metrics used by the Data Portal to display dataset information. The full list of computed metrics is shown below.

Scope	Metric
Sample	`__datasetCount`
Sample	`__acquisitionDatasetCount`
Sample	`__processedDatasetCount`
Sample	`__fileCount`
Sample	`__volume`
Sample	`__acquisitionFileCount`
Sample	`__processedFileCount`
Sample	`__elapsedTime`
Sample	`__acquisitionVolume`
Sample	`__processedVolume`

Scope	Metric
Investigation	`__datasetCount`
Investigation	`__acquisitionInvestigationCount`
Investigation	`__processedInvestigationCount`
Investigation	`__sampleCount`
Investigation	`__volume`
Investigation	`__acquisitionVolume`
Investigation	`__processedVolume`
Investigation	`__elapsedTime`
Investigation	`__fileCount`
Investigation	`__acquisitionFileCount`
Investigation	`__processedFileCount`

Scope	Metric
Dataset	`datasetName`
Dataset	`__fileCount`
Dataset	`__volume`
Dataset	`__elapsedTime`

Internal dataset links¶

Consumer	Exchange	Routing key	Payload support	Integrations
`InternalDatasetsLinksConsumer`	dataset-internal-ingest-exchange	dataset.internal_links	JSON, XML	ICAT

JSON

    -- Payload is sent over from previous stage --

Processed datasets are linked to their input datasets through the input_datasets parameter. However, this mechanism only captures direct relationships and does not account for transitive dependencies.

For example, if processed-dataset-3 lists processed-dataset-1 as an input, and processed-dataset-1 was itself generated from raw-dataset-1, the relationship between processed-dataset-3 and raw-dataset-1 is not captured directly by input_datasets.

graph LR
  R1[raw-dataset-1] --> P1[processed-dataset-1]
  R2[raw-dataset-2] --> P1[processed-dataset-1]

  R3[raw-dataset-3] --> P2[processed-dataset-2]
  R4[raw-dataset-4] --> P2[processed-dataset-2]

  P1 --> P3[processed-dataset-3]
  R3 --> P3

The Data Portal maintains the complete set of input and output datasets that are related to a given dataset, either directly or through transitive dependencies. This stage computes these relationships for all datasets and stores them in the following parameters:

__full_input_datasetIds
__full_output_datasetIds
__full_input_datasetNames
__full_output_datasetNames

These parameters provide a complete lineage view of each dataset, including both its upstream dependencies and downstream derivatives.
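Computing the full lineage amounts to a transitive closure over the direct input_datasets relationships. The sketch below shows one way to do this with a depth-first traversal. It works on dataset names held in memory and is not the PACER's implementation, which would operate on ICAT entities. The function name and the shape of the result are assumptions.

```python
from collections import defaultdict


def full_lineage(direct_inputs):
    """direct_inputs maps each dataset to the datasets listed in its input_datasets."""
    # Invert the direct input edges to obtain the direct output edges.
    direct_outputs = defaultdict(set)
    for dataset, inputs in direct_inputs.items():
        for parent in inputs:
            direct_outputs[parent].add(dataset)

    def reachable(start, edges):
        """Collect every dataset reachable from `start` by following `edges`."""
        seen = set()
        stack = list(edges.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(edges.get(node, ()))
        seen.discard(start)  # a dataset is not its own ancestor or descendant
        return seen

    datasets = set(direct_inputs) | set(direct_outputs)
    return {
        name: {
            "inputs": sorted(reachable(name, direct_inputs)),
            "outputs": sorted(reachable(name, direct_outputs)),
        }
        for name in datasets
    }


lineage = full_lineage({
    "processed-dataset-1": ["raw-dataset-1"],
    "processed-dataset-3": ["processed-dataset-1"],
})
```

For the example in the text, lineage["processed-dataset-3"]["inputs"] contains both processed-dataset-1 and raw-dataset-1, even though only processed-dataset-1 is listed as a direct input.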

User synchronization¶

Consumer	Exchange	Routing key	Payload support	Integrations
`UsersConsumer`	uos-sync-exchange	uos_user.sync	JSON	ICAT, VISA

JSON

{
    "first_name": "Aitor",
    "last_name": "Tilla",
    "ORCID": "0000-0000-0000-0000",
    "email": "aitortilla@email.test",
    "affiliation": {
        "id": 18212,
        "name": "University of Valencia",
        "code": "UV",
        "department_name": "Institute of Molecular Science (ICMol)",
        "department_code": "ICMol",
        "unit": null,
        "city": "Valencia",
        "country_code": "ES"
    },
    "is_staff": false,
    "enabled": true,
    "id": 998123,
    "user_list": [
        {"username": "tortilla"}
    ]
}

The PACER will synchronize users with both ICAT and VISA when both integrations are enabled. If only one integration is enabled, synchronization will be performed exclusively with the enabled system.

Investigation synchronization¶

Consumer	Exchange	Routing key	Payload support	Integrations
`InvestigationConsumer`	uos-sync-exchange	uos_proposal.sync	JSON	ICAT, VISA

JSON

{
    "name": "20260999999",
    "start_date": "2026-06-01T14:33:04",
    "end_date": "2026-06-01T14:33:04",
    "title": "<<Proposal title>>",
    "summary": "<<Proposal abstract>>",
    "type": "<<ICAT investigation.type.name>>",
    "instrument": {
        "name": "BL04 - MSPD",
        "code": "BL04"
    },
    "visit_count": 0,
    "is_reimbursed": false,
    "user_list": [
        {
            "username": "<<User's username>>",
            "email": "<<User's email>>",
            "role": "<<User's role in investigation>>"
        }
    ],
    "visa_sync": true,
    "icat_sync": true,
    "icat_visit_id": "uo_12384",
    "visa_visit_id": 10012384,
    "sample_acronyms": [
        "ABC"
    ],
    "is_industrial": false
}

Notes on the payload fields:

instrument.name: Used to match the investigation with an instrument in VISA.
instrument.code: Used to match the investigation with an instrument in ICAT through the instrument's name field.
user_list.role: Supported roles are: Principal investigator, Proposal scientist, Participant and Local contact.
sample_acronyms: Used to populate the 'Short names' column in the Data Portal's shipping window.

An investigation's release date is calculated from the investigation's date and the embargo period, in years, specified in the PACER's configuration. If the is_industrial flag is set to true, the investigation has no release date and is never made public.
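The calculation can be sketched as below. This is a minimal illustration under stated assumptions: which investigation date is used as the reference is not specified here, so the function takes it as a generic argument, and the handling of 29 February is a choice made for this sketch. The function name is hypothetical.

```python
from datetime import date


def release_date(reference_date, embargo_years, is_industrial):
    """Return the release date, or None when the investigation is industrial."""
    if is_industrial:
        return None  # industrial investigations are never released
    target_year = reference_date.year + embargo_years
    try:
        return reference_date.replace(year=target_year)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 28 February (assumption).
        return reference_date.replace(year=target_year, day=28)


public = release_date(date(2026, 6, 1), 3, is_industrial=False)
industrial = release_date(date(2026, 6, 1), 3, is_industrial=True)
leap = release_date(date(2024, 2, 29), 1, is_industrial=False)
```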

The roles a user can have in an investigation are validated and set by configuration in the PACER. By default, the allowed values are: Principal investigator, Local contact, Proposal scientist and Participant.

If any additional roles are required, they must be added in the PACER's helpers.static_settings in the following format (with the variable name prefixed by ICAT_USER_ROLE_):

ICAT_USER_ROLE_PRINCIPAL_INVESTIGATOR: str = "Principal investigator"
ICAT_USER_ROLE_PROPOSER: str = "Proposal scientist"
ICAT_USER_ROLE_PARTICIPANT: str = "Participant"
ICAT_USER_ROLE_LOCAL_CONTACT: str = "Local contact"
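One way a prefix convention like this can be consumed is to collect every setting whose name starts with the prefix and validate incoming roles against that set. The sketch below uses a stand-in class rather than the real helpers.static_settings module, whose layout may differ. The helper names and the "Data curator" role are hypothetical.

```python
class StaticSettings:
    """Stand-in for helpers.static_settings; the real module may be organised differently."""
    ICAT_USER_ROLE_PRINCIPAL_INVESTIGATOR: str = "Principal investigator"
    ICAT_USER_ROLE_PROPOSER: str = "Proposal scientist"
    ICAT_USER_ROLE_PARTICIPANT: str = "Participant"
    ICAT_USER_ROLE_LOCAL_CONTACT: str = "Local contact"
    ICAT_USER_ROLE_DATA_CURATOR: str = "Data curator"  # hypothetical additional role
    INV_OPS_MINT_PROPOSAL: str = "mint-proposal"  # different prefix, ignored below


def values_with_prefix(settings, prefix):
    """Collect the string values of all attributes whose name starts with `prefix`."""
    return {
        value
        for name, value in vars(settings).items()
        if name.startswith(prefix) and isinstance(value, str)
    }


def validate_role(role, settings=StaticSettings):
    """Return the role unchanged if it is allowed, otherwise raise ValueError."""
    allowed = values_with_prefix(settings, "ICAT_USER_ROLE_")
    if role not in allowed:
        raise ValueError(f"Unsupported role {role!r}; allowed: {sorted(allowed)}")
    return role


allowed_roles = values_with_prefix(StaticSettings, "ICAT_USER_ROLE_")
```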

Investigation mint¶

Consumer	Exchange	Routing key	Payload support	Integrations
`InvestigationOperationsConsumer`	investigation-ops-exchange	investigation.ops	JSON	ICAT, VISA, DataCite, PaNOSC

JSON

{
    "name": "20260112299",
    "visit_id": "bl11",
    "operations": [
        "mint-proposal",
        "create-panosc-item"
    ]
}

The allowed values for the operations field are: mint-proposal and create-panosc-item.

If any additional operations are required, they must be added in the PACER's helpers.static_settings in the following format (with the variable name prefixed by INV_OPS_):

INV_OPS_MINT_PROPOSAL: str = "mint-proposal"
INV_OPS_CREATE_PANOSC_ITEM: str = "create-panosc-item"

DOI creation¶

Before attempting to mint a DOI, PACER performs a series of validation checks on the investigation to ensure it is eligible:

❌ The investigation must not be of type industrial.
❌ The investigation must not already have a DOI assigned.
✅ The investigation must contain at least one dataset.
✅ The investigation’s end date must be in the past.
✅ The investigation must have valid associated users.
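The checks above can be expressed as a function that returns the list of failed conditions. This is a sketch, not the PACER's code: the InvestigationSummary type is a hypothetical container for the fields the checks need, the DOI in the example is a placeholder, and since what makes a user "valid" is not specified here, the sketch only checks that at least one user is present.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class InvestigationSummary:
    """Hypothetical view of an investigation; not an ICAT or PACER type."""
    is_industrial: bool
    doi: Optional[str]
    dataset_count: int
    end_date: datetime
    usernames: List[str] = field(default_factory=list)


def doi_eligibility_errors(investigation, now):
    """Return the failed checks; an empty list means the DOI can be minted."""
    errors = []
    if investigation.is_industrial:
        errors.append("investigation is industrial")
    if investigation.doi:
        errors.append("investigation already has a DOI")
    if investigation.dataset_count < 1:
        errors.append("investigation has no datasets")
    if investigation.end_date >= now:
        errors.append("investigation end date is not in the past")
    # Assumption: user validity is reduced to "at least one user is associated".
    if not investigation.usernames:
        errors.append("investigation has no associated users")
    return errors


now = datetime(2026, 7, 1)
eligible = InvestigationSummary(False, None, 3, datetime(2026, 6, 1), ["tortilla"])
rejected = InvestigationSummary(True, "10.1234/example", 0, datetime(2026, 8, 1), [])
```

Returning all failures instead of stopping at the first makes it easier to report why an investigation was rejected.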

Users associated with ICAT investigations are mapped to DataCite contributor types as follows:

ICAT investigation role	DataCite contributor type
Principal investigator, Participant	Creator
Local contact	DataCollector
Principal investigator	ProjectManager
Proposal scientist	ProjectMember
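The mapping in the table can be written as a lookup from ICAT role to one or more DataCite types. The sketch below is illustrative: the function name and the tuple-based input are assumptions, and silently skipping unknown roles is a choice made for this example.

```python
# ICAT investigation role -> DataCite types, as listed in the table above.
ROLE_TO_DATACITE_TYPES = {
    "Principal investigator": ["Creator", "ProjectManager"],
    "Participant": ["Creator"],
    "Local contact": ["DataCollector"],
    "Proposal scientist": ["ProjectMember"],
}


def datacite_entries(users):
    """Map (full_name, icat_role) pairs to (full_name, datacite_type) pairs."""
    entries = []
    for full_name, icat_role in users:
        # Roles missing from the table produce no entry (assumption of this sketch).
        for datacite_type in ROLE_TO_DATACITE_TYPES.get(icat_role, []):
            entries.append((full_name, datacite_type))
    return entries


entries = datacite_entries([
    ("User A", "Principal investigator"),
    ("User B", "Local contact"),
    ("User C", "Unknown role"),
])
```

A principal investigator appears twice in the result, once as Creator and once as ProjectManager, matching the two rows in the table.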

All remaining DOI metadata is populated using the configuration defined in the PACER DataCite integration settings (see configuration).

If VISA integration is enabled, the newly minted DOI will also be registered in the VISA database in the corresponding investigation.

PSS item creation¶

If PaNOSC item creation is enabled and the message includes the create-panosc-item operation, the corresponding item for the investigation is also created in the PaNOSC Search Scoring service.