Script Runner Guide

Introduction

About NLNZ Tools SIP Generation Fairfax

NLNZ Tools SIP Generation Fairfax is a specific set of tools for processing Fairfax-specific content. The ultimate output of these tools is a set of SIPs for ingestion into the Rosetta archiving system.

Most of the operations are run on the command line using a set of parameters and a spreadsheet of values. Combined with the operational code, these produce output that is ready for ingestion into Rosetta.

The purpose of these tools is to process the Fairfax files. The long-term goal would be to wrap these tools into a user interface. See the Future milestones section of the Developer Guide for more details.

About this document

This document is the NLNZ Tools SIP Generation Fairfax Script Runner Guide. It describes how to use the command-line tools provided by the project to perform various workflow operations.

The manual is divided into chapters, each of which deals with a particular scripting operation.

Contents of this document

Following this introduction, this guide includes the following sections:

  • ProcessorRunner general usage - Covers general processing parameters.
  • FTP stage - Covers the FTP stage.
  • Pre-processing stage - Covers the pre-processing stage.
  • Ready-for-ingestion stage - Covers the ready-for-ingestion stage.
  • Copying ingested loads to ingested folder - Covers copying ingested loads to their final ingested folder.
  • Additional tools - Covers additional scripting tools.
  • Converting the spreadsheet to JSON and vice-versa - Covers converting the parameters spreadsheet between formats.
  • Copying and moves - Covers how copying files and moving files ensure data integrity.

Relationships with other scripting code

Some of this scripting code is related to the codebase nlnz-tools-scripts-ingestion found in the github repository https://github.com/NLNZDigitalPreservation/nlnz-tools-scripts-ingestion . See the documentation for that codebase at https://nlnz-tools-sip-generation.readthedocs.io . There is an expectation that the two codebases will work together.

There are also some additional scripts in the github repository: https://github.com/NLNZDigitalPreservation/nlnz-tools-scripts-ingestion . See the documentation for those scripts found at https://nlnz-tools-scripts-ingestion.readthedocs.io .

ProcessorRunner general usage

ProcessorRunner runs different processors based on command-line options.

Processing for different processing stages

Processing stages are discussed in more detail in the Workflow Guide.

--preProcess
Group source files by date and titleCode. Output is used by readyForIngestion. Requires sourceFolder, targetPreProcessingFolder, forReviewFolder. Uses startingDate, endingDate. Optional createDestination, moveFiles, parallelizeProcessing, numberOfThreads. This is a processing operation and must run exclusively of other processing operations.
--readyForIngestion
Process the source files. Output is ready for ingestion by Rosetta. Requires sourceFolder, targetForIngestionFolder, forReviewFolder, processingType. Uses startingDate, endingDate. Optional createDestination. Note that moveFiles is not supported at this time. Optional parallelizeProcessing, numberOfThreads, maximumThumbnailPageThreads. This is a processing operation and must run exclusively of other processing operations.
--copyIngestedLoadsToIngestedFolder
Copy the ingested loads to the ingested folder. Requires sourceFolder, targetPostProcessedFolder, forReviewFolder. Uses startingDate, endingDate. Optional createDestination, moveFiles, moveOrCopyEvenIfNoRosettaDoneFile. Optional parallelizeProcessing, numberOfThreads, maximumThumbnailPageThreads. This is a processing operation and must run exclusively of other processing operations.

Other types of processing

--copyProdLoadToTestStructures
Copy the production load to test structures. Uses startingDate, endingDate. This is a processing operation and must run exclusively of other processing operations.
--generateThumbnailPageFromPdfs
Generate a thumbnail page from the PDFs in the given folder. Requires sourceFolder, targetFolder. Optional startingDate and endingDate will select directories that match dates in yyyyMMdd format. Generates a thumbnail page using the PDFs in the source folder. The name of the jpeg is based on the source folder. This is a processing operation and must run exclusively of other processing operations.

Reports

-l, --listFiles
List the source files in an organized way. Requires sourceFolder. This is a reporting operation and cannot be run with any other processing operations.
--extractMetadata
Extract and list the metadata from the source files. Requires sourceFolder. This is a reporting operation and cannot be run with any other processing operations.
--statisticalAudit
Statistical audit. Search through the source folder and provide a statistical audit of the files found. This is a reporting operation and cannot be run with any processing operations.

General parameters

-b, --startingDate=STARTING_DATE
Starting date in the format yyyy-MM-dd (inclusive). Dates are usually based on file name (not timestamp). Default is 2015-01-01.
-e, --endingDate=ENDING_DATE
Ending date in the format yyyy-MM-dd (inclusive). Default is today. Files after this date are ignored.
-s, --sourceFolder=SOURCE_FOLDER
Source folder in the format /path/to/folder. This folder must exist and must be a directory.
--targetFolder=TARGET_FOLDER
Target folder in the format /path/to/folder. This is the destination folder used when no other destination folders are specified. Use --createDestination to force its creation.
--targetPreProcessingFolder=TARGET_PRE_PROCESS_FOLDER
Target pre-processing folder in the format /path/to/folder. Use --createDestination to force its creation.
--targetPostProcessedFolder=TARGET_POST_PROCESSED_FOLDER
Target post-processed folder in the format /path/to/folder. Use --createDestination to force its creation.
-r, --forReviewFolder=FOR_REVIEW_FOLDER
For-review folder in the format /path/to/folder. For processing exceptions, depending on processor. Use --createDestination to force its creation.
--numberOfThreads=NUMBER_OF_THREADS
Number of threads when running operations in parallel. The default is 1.
--maximumThumbnailPageThreads=MAXIMUM_THUMBNAIL_PAGE_THREADS
Maximum number of threads that can be used to generate thumbnail pages when running operations in parallel. The default is 1. This limit is in place because in-memory thumbnail page generation can be quite resource intensive and can overload the JVM.
--generalProcessingOptions=GENERAL_PROCESSING_OPTIONS
General processing options. A comma-separated list of options. These options will override any contradictory options. These processing options may or may not be applied depending on the processing that takes place. See the class ProcessorOption for a list of what those options are.

Ready-for-ingestion parameters

--targetForIngestionFolder=TARGET_FOR_INGESTION_FOLDER
Target for-ingestion folder in the format /path/to/folder. Use --createDestination to force its creation.
--forIngestionProcessingTypes=PROCESSING_TYPES
Comma-separated list of for-ingestion processing types. A pre-processing titleCode folder should only be processed once for a single processing type. It may be possible for multiple processing types to apply to the same folder, producing different SIPs.
--forIngestionProcessingRules=PROCESSING_RULES
For-ingestion processing rules. A comma-separated list of rules. These rules will override any contradictory rules.
--forIngestionProcessingOptions=PROCESSING_OPTIONS
For-ingestion processing options. A comma-separated list of options. These options will override any contradictory options.

Options

-c, --createDestination
Whether destination (or target) folders will be created. Default is no creation (false).
--moveFiles
Whether files will be moved or copied. Default is copy (false).
--parallelizeProcessing
Run operations in parallel (if possible). Operations that currently have components that can run in parallel are: --preProcess, --readyForIngestion, --generateThumbnailPageFromPdfs.
--detailedTimings
Include detailed timings (for specific operations).
--moveOrCopyEvenIfNoRosettaDoneFile
Whether the move or copy takes place even if there is no Rosetta done file. The Rosetta done file is a file with a titleCode of ‘done’. Default is no move or copy unless there IS a Rosetta done file (false).
--verbose
Include verbose output.
-h, --help
Display a help message.

General processing options

General processing options are those options specified by the parameter --generalProcessingOptions=GENERAL_PROCESSING_OPTIONS. In the codebase they are represented by the enum ProcessorOption.

The options are as follows:

search_subdirectories
When finding files, also include subdirectories. Overridden by root_folder_only.
root_folder_only
When finding files, only use the specified folder (not subdirectories). Overridden by search_subdirectories.
use_source_subdirectory_as_target
Use the source folder as the target folder. This only works for certain kinds of processing.
show_directory_only
Used when converting a directory path to a file or folder name. In this case only the directory name (without any parent directories) is used. Overridden by show_directory_and_one_parent, show_directory_and_two_parents, show_directory_and_three_parents, show_full_path.
show_directory_and_one_parent
Used when converting a directory path to a file or folder name. In this case only the directory name and one parent directory are used. Overridden by show_directory_only, show_directory_and_two_parents, show_directory_and_three_parents, show_full_path.
show_directory_and_two_parents
Used when converting a directory path to a file or folder name. In this case only the directory name and two parent directories are used. Overridden by show_directory_only, show_directory_and_one_parent, show_directory_and_three_parents, show_full_path.
show_directory_and_three_parents
Used when converting a directory path to a file or folder name. In this case only the directory name and three parent directories are used. Overridden by show_directory_only, show_directory_and_one_parent, show_directory_and_two_parents, show_full_path.
show_full_path
Used when converting a directory path to a file or folder name. In this case the full path is used. Overridden by show_directory_only, show_directory_and_one_parent, show_directory_and_two_parents, show_directory_and_three_parents.
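
The show_directory_* options can be pictured with a small shell sketch that keeps the last few components of a directory path. This is an illustration only: the function name directory_name_with_parents is hypothetical, and the "_" used to join components is an assumption, since this guide does not say how the components are joined in the real tool.

```shell
# Sketch only: keep the directory name plus a given number of parent directories.
# ASSUMPTION: components are joined with "_"; the real tool may join them differently.
directory_name_with_parents() {
    path="${1%/}"      # drop a trailing slash, if any
    parents="$2"       # 0 = show_directory_only, 1 = one parent, and so on
    printf '%s\n' "$path" | awk -F/ -v keep="$((parents + 1))" '{
        start = NF - keep + 1
        if (start < 1) start = 1
        out = ""
        for (i = start; i <= NF; i++) {
            if ($i == "") continue          # skip the empty field before a leading slash
            out = (out == "" ? $i : out "_" $i)
        }
        print out
    }'
}

directory_name_with_parents "/path/to/pre-processing/20190603/TAT" 0
directory_name_with_parents "/path/to/pre-processing/20190603/TAT" 1
```

Under the stated assumption, the first call corresponds to show_directory_only and the second to show_directory_and_one_parent.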

FTP stage

All PDF files are placed in a single FTP folder by the file producer. There are no subfolders.

Pre-processing stage

The pre-processing stage moves the files found in the FTP folder to the pre-processing folder. In the FTP folder all the files sit in the same directory. In the pre-processing folder, the files are separated out by date and title_code, as in the following structure:

<targetPreProcessingFolder>/<date-in-yyyyMMdd>/<TitleCode>/{files for that titleCode and date}

This file structure prepares the files for ready-for-ingestion processing.
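
As a rough illustration of that layout, the following shell sketch composes the pre-processing folder for one date and title code. The function name pre_processing_path is hypothetical, and taking the date in yyyy-MM-dd form (the format used by the date parameters) is an assumption made for the example.

```shell
# Sketch only: builds <targetPreProcessingFolder>/<date-in-yyyyMMdd>/<TitleCode>.
pre_processing_path() {
    base="$1"           # the target pre-processing folder
    date_dashed="$2"    # ASSUMPTION: date supplied as yyyy-MM-dd
    title_code="$3"
    date_compact="$(printf '%s' "$date_dashed" | sed 's/-//g')"   # yyyyMMdd
    printf '%s/%s/%s\n' "$base" "$date_compact" "$title_code"
}

pre_processing_path "/path/to/LD_Sched/fairfax-processing/pre-processing" "2019-06-03" "TAT"
# prints /path/to/LD_Sched/fairfax-processing/pre-processing/20190603/TAT
```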

Example processing command

The sip-generation-fairfax-fat-all jar is executed with arguments as shown in the following example:

sourceFolder="/path/to/ftp/folder"
targetBaseFolder="/path/to/LD_Sched/fairfax-processing"
targetPreProcessingFolder="${targetBaseFolder}/pre-processing"
forReviewFolder="${targetBaseFolder}/for-review"

startingDate="2019-06-01"
endingDate="2019-06-15"

# Note that a higher number of threads increases processing speed because of the poor single-thread performance of ODS
numberOfThreads=800

maxMemory="2048m"
minMemory="2048m"

java -Xms${minMemory} -Xmx${maxMemory} \
    -jar fat/build/libs/sip-generation-fairfax-fat-all-<VERSION>.jar \
    --preProcess \
    --startingDate="${startingDate}" \
    --endingDate="${endingDate}" \
    --sourceFolder="${sourceFolder}" \
    --targetPreProcessingFolder="${targetPreProcessingFolder}" \
    --forReviewFolder="${forReviewFolder}" \
    --createDestination \
    --moveFiles \
    --parallelizeProcessing \
    --numberOfThreads ${numberOfThreads}

For-review

If a file or set of files is unable to be processed for some reason, it will be placed in the For-review folder. There is no processor that operates on the For-review stage. Processors that output to the For-review folder use the parameter forReviewFolder to set the location of the For-review folder.

FTP files with identifiable title_code

If the files come from the FTP folder and the TitleCode and date are identifiable from the filename, the files are in the following structure:

<forReviewFolder>/<date-in-yyyyMMdd>/<TitleCode>/{files}

FTP files without identifiable title_code but with identifiable date

If the files come from the FTP folder and the TitleCode is not identifiable from the filename (but the date is), the files are in the following structure:

<forReviewFolder>/UNKNOWN-TITLE-CODE/<date-in-yyyyMMdd>/{files-that-have-no-title-code-mapping-for-that-date}

FTP files without identifiable title_code and without identifiable date

If the files come from the FTP folder and the TitleCode and date are not identifiable from the filename, the files are in the following structure:

<forReviewFolder>/UNKNOWN-TITLE-CODE/UNKNOWN-DATE/{files-that-have-no-title-code-mapping-for-that-date}
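
The three for-review layouts above can be summarised in a short shell sketch. The function name for_review_path is hypothetical, and passing an empty string for a value that could not be identified is a convention chosen only for this example. The case of a known title code with an unknown date is not described in this guide, so the sketch reports it as an error rather than guessing.

```shell
# Sketch only: picks the for-review folder for files that came from the FTP folder.
# An empty argument stands for "not identifiable from the filename" (example convention).
for_review_path() {
    base="$1"
    date_compact="$2"   # yyyyMMdd, or empty
    title_code="$3"     # title code, or empty
    if [ -n "$title_code" ] && [ -n "$date_compact" ]; then
        printf '%s/%s/%s\n' "$base" "$date_compact" "$title_code"
    elif [ -z "$title_code" ] && [ -n "$date_compact" ]; then
        printf '%s/UNKNOWN-TITLE-CODE/%s\n' "$base" "$date_compact"
    elif [ -z "$title_code" ] && [ -z "$date_compact" ]; then
        printf '%s/UNKNOWN-TITLE-CODE/UNKNOWN-DATE\n' "$base"
    else
        echo "title code without a date is not covered by this guide" >&2
        return 1
    fi
}

for_review_path "/path/to/for-review" "20190603" "TAT"
for_review_path "/path/to/for-review" "20190603" ""
for_review_path "/path/to/for-review" "" ""
```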

Ready-for-ingestion stage

This is the second stage of processing, where files are aggregated into specific SIPs ready for ingestion into Rosetta.

Note that the --moveFiles option is currently not supported, as multiple processing types operate on the same set of files.

The Ready-for-ingestion folder structure is how Rosetta ingests the files. Magazines and newspapers have different Material Flows, so ingestion of those different IEEntity types must be in different folders.

Processing spreadsheet

The processing spreadsheet is used in the ready-for-ingestion stage to determine how a particular set of files associated with a title code are processed.

Default spreadsheet

A spreadsheet exists that determines how a given title code is processed for a given processing type. A default spreadsheet exists in the codebase under src/main/resources/nz/govt/natlib/tools/sip/generation/fairfax/default-fairfax-import-spreadsheet.csv. This spreadsheet uses a column delimiter of |.

Spreadsheet conversion to JSON

Build script tasks exist to convert a .csv spreadsheet to a .json file. See the section Converting the spreadsheet to JSON and vice-versa for an explanation of how that conversion is done.

The ready-for-ingestion processing operates on the JSON version of the spreadsheet information. For this reason, any changes to the csv spreadsheet must be converted to JSON for the processing to use those changes.

Spreadsheet structure

The structure of the spreadsheet is discussed in the Librarian Guide.

JSON file structure

The JSON-file structure lays out the same parameters in a JSON format. The actual processing uses the JSON file as its processing input. For example, the Taupo Times has the following entry:

{
    "row-0246": {
        "MMSID": "9917962373502836",
        "title_parent": "Taupo Times",
        "processing_type": "parent_grouping",
        "processing_rules": "",
        "processing_options": "numeric_before_alpha",
        "publication_key": "title_code",
        "title_code": "TAT",
        "edition_discriminators": "",
        "section_codes": "ED1+TAB+QFS",
        "Access": "200",
        "Magazine": "0",
        "ingest_status": "STA",
        "Frequency": "",
        "entity_type": "PER",
        "title_mets": "Taupo Times",
        "ISSN online": "",
        "Bib ID": "",
        "Access condition": "",
        "Date catalogued": "",
        "Collector_folder": "Taupo_Times",
        "Cataloguer": "",
        "Notes": "Fairfax updated title code",
        "first_issue_starting_page": "",
        "last_issue_starting_page": "",
        "has_volume_md": "0",
        "has_issue_md": "0",
        "has_number_md": "0",
        "previous_volume": "",
        "previous_volume_date": "",
        "previous_volume_frequency": "",
        "previous_issue": "",
        "previous_issue_date": "",
        "previous_issue_frequency": "",
        "previous_number": "",
        "previous_number_date": "",
        "previous_number_frequency": ""
    }
}

Folder structure

The structure of the ready-for-ingestion output is discussed in the Librarian Guide.

Deciding how to process: Processing types, spreadsheets and folders

When the ready-for-ingestion processing takes place, each folder that gets processed has a title_code (which is the name of the folder itself). The ready-for-ingestion processing takes that title_code and matches it with a spreadsheet row for the given processing_type. If there is no spreadsheet row that matches the title_code and processing_type, then no processing for that type takes place. There may be other processing types that match a specific spreadsheet row.

Processing types

There are different processing types that have slightly different ways of dealing with the files in a title_code folder. When multiple processing types are specified, the processing types are checked in order until a spreadsheet row is found that matches. Processing types themselves correspond to the class ProcessingType.

The processing types are checked in the following order: parent_grouping_with_edition, parent_grouping, supplement_grouping and finally create_sip_for_folder.
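
The checking order can be sketched in shell as follows. This is a simplified illustration, not the project's code: the function name first_matching_type and the two-column "processing_type|title_code" lookup file are hypothetical stand-ins for the spreadsheet, XYZ is a made-up title code, and the real parent_grouping_with_edition and supplement_grouping matches also involve edition discriminators and section codes, which are left out here.

```shell
# Fixed checking order, as listed in this guide.
PROCESSING_TYPE_ORDER="parent_grouping_with_edition parent_grouping supplement_grouping create_sip_for_folder"

# Sketch only: prints the first requested processing type that matches a title code.
# rows_file is a HYPOTHETICAL lookup with one "processing_type|title_code" entry per line.
first_matching_type() {
    title_code="$1"
    rows_file="$2"
    requested="$3"      # comma-separated list, as given to --forIngestionProcessingTypes
    for candidate in $PROCESSING_TYPE_ORDER; do
        case ",$requested," in
            *",$candidate,"*) ;;     # this type was requested, keep checking it
            *) continue ;;           # not requested, skip it
        esac
        if [ "$candidate" = "create_sip_for_folder" ]; then
            printf '%s\n' "$candidate"      # catch-all: needs no spreadsheet row
            return 0
        fi
        if grep -qxF "${candidate}|${title_code}" "$rows_file"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

rows_file="$(mktemp)"
printf '%s\n' 'parent_grouping|TAT' > "$rows_file"
types="parent_grouping,parent_grouping_with_edition,create_sip_for_folder"
first_matching_type "TAT" "$rows_file" "$types"    # prints parent_grouping
first_matching_type "XYZ" "$rows_file" "$types"    # prints create_sip_for_folder
rm -f "$rows_file"
```

If create_sip_for_folder is left out of the requested types, a title code without a matching row produces no match at all, which mirrors the "no processing for that type takes place" behaviour described above.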

parent_grouping_with_edition

This is for processing where the title code and edition discriminator combine to form a unique key. There are some publications where this is the case. One example is the title code ADM, which has two different editions, NEL and MEX, each with their own MMSID. The title_parent is used as the publication title.

parent_grouping_with_edition
The title_code is combined with the first edition_discriminators to produce a spreadsheet row match.
parent_grouping_with_edition default rules:
skip_ignored, skip_unrecognised, skip_invalid, automatic, required_all_sections_in_sip, missing_sequence_is_error, missing_sequence_double_wide_is_ignored, ignore_editions_without_files, zero_length_pdf_replaced_with_page_unavailable, do_not_force_skip, numeric_starts_in_hundreds_not_considered_sequence_skips, do_not_require_first_section_code_for_match.
parent_grouping_with_edition default options:
numeric_before_alpha, generate_processed_pdf_thumbnails_page, skip_generation_thumbnail_page_when_error_free, use_in_memory_pdf_to_thumbnail_generation.

parent_grouping

This is the most common grouping where the title code by itself is enough to determine the publication. The title_parent is used as the publication title.

parent_grouping
The title_code is used to produce a spreadsheet row match.
parent_grouping default rules:
skip_ignored, skip_unrecognised, skip_invalid, automatic, required_all_sections_in_sip, missing_sequence_is_error, missing_sequence_double_wide_is_ignored, ignore_editions_without_files, zero_length_pdf_replaced_with_page_unavailable, do_not_force_skip, numeric_starts_in_hundreds_not_considered_sequence_skips, do_not_require_first_section_code_for_match.
parent_grouping default options:
numeric_before_alpha, generate_processed_pdf_thumbnails_page, skip_generation_thumbnail_page_when_error_free, use_in_memory_pdf_to_thumbnail_generation.

supplement_grouping

For some publications we want to extract a subset of the title_parent publication into a separate publication that is loaded with its own separate MMSID. The title_mets is used as the publication title.

TODO The code for this extraction is not complete and will require some more tweaking and default spreadsheet changes. For example, some supplements are based on having certain sequence letters. There may be multiple supplements that match on the same set of files (for example, the TAB section code, which often maps to a different supplement). They may rely on being on a certain day of the week or month of the year. Much of the determination of what the publication maps to may rely on human intervention.

TODO One approach for dealing with extracting supplements that are specific to certain sequence letters is to add a new spreadsheet column sequence_letters and the supplement grouping would only select the files for processing if the given set of sequence letters existed in the files in the title code folder. This is similar to how parent_grouping_with_edition works with editions. In other words, if the sequence letters have been set in the spreadsheet row and they do exist in the set of files, then process the supplement grouping against the set of files. Otherwise, there isn’t a match and that supplement grouping is skipped. This would likely require an additional rule so that the sequence letters would be used as a filter for processing files.

TODO The use of sequence_letters could also be used to determine the ordering of the pages if a non-alphabetical ordering is required. This would likely require an additional rule so that ordering would be used.

supplement_grouping
The title_code and section_code are used to produce a spreadsheet row match. This is generally used for publications that are part of a parent publication (for example, a parent publication might have a special section that can be extracted with its own MMSID).
supplement_grouping default rules:
skip_ignored, skip_unrecognised, skip_invalid, automatic, optional_all_sections_in_sip, missing_sequence_is_error, missing_sequence_double_wide_is_ignored, ignore_editions_without_files, zero_length_pdf_replaced_with_page_unavailable, do_not_force_skip, numeric_starts_in_hundreds_not_considered_sequence_skips, require_first_section_code_for_match.
supplement_grouping default options:
numeric_before_alpha, generate_processed_pdf_thumbnails_page, skip_generation_thumbnail_page_when_error_free, use_in_memory_pdf_to_thumbnail_generation.

create_sip_for_folder

This is a catch-all for all the publications that don’t have a corresponding spreadsheet row. The mets.xml will still be created, but it will need to be edited to have the correct MMSID and publication title. It can be helpful to include this processing type in the set of processing types so that much of the work processing one-off publications can be done automatically without having to make changes to the parameters spreadsheet.

create_sip_for_folder
This is a catch-all for when there is no spreadsheet row match. The title_code is still used to produce an output folder structure with the given files. However, the mets.xml does not have the MMSID, publication name or access value. All those values would need editing before the folder could be ingested into Rosetta.
create_sip_for_folder default rules:
skip_ignored, skip_unrecognised, skip_invalid, automatic, required_all_sections_in_sip, missing_sequence_is_error, missing_sequence_double_wide_is_ignored, ignore_editions_without_files, zero_length_pdf_replaced_with_page_unavailable, do_not_force_skip, numeric_starts_in_hundreds_not_considered_sequence_skips, do_not_require_first_section_code_for_match.
create_sip_for_folder default options:
numeric_before_alpha, generate_processed_pdf_thumbnails_page, skip_generation_thumbnail_page_when_error_free, use_in_memory_pdf_to_thumbnail_generation.

Processing rules

Processing rules determine how certain aspects of the workflow take place. Each processing rule has an opposite rule that can be used to override its value.

handle_ignored
Ignored files are placed in a separate for-review folder called IGNORED/date/title_code. Override is skip_ignored.
skip_ignored
Ignored files are not placed in any separate folders. Override is handle_ignored.
handle_unrecognised
Unrecognised files are placed in a separate for-review folder called UNRECOGNIZED/date/title_code. Override is skip_unrecognised.
skip_unrecognised
Unrecognised files are not placed in any separate folders. Override is handle_unrecognised.
handle_invalid
Invalid files are placed in a separate for-review folder called INVALID/date/title_code. Override is skip_invalid.
skip_invalid
Invalid files are not placed in any separate folders. Override is handle_invalid.
manual
The generated file structure is always sent to for-review, even if there are no errors. Override is automatic.
automatic
The generated file structure is sent to ready-for-ingestion if there are no errors. Override is manual.
force_skip
Skips the processing of the given type/date/title_code combination. Useful for spreadsheet rows that are not being processed correctly. Override is do_not_force_skip.
do_not_force_skip
Processes the given type/date/title_code combination. Override is force_skip.
process_all_editions
Process all the editions for a given title_code, even if there are no specific edition files. Override is ignore_editions_without_files.
ignore_editions_without_files
Only processes editions for a given title_code that have actual edition-specific files. For example, there might be edition_discriminators ED1+ED2+ED3, but only ED1 and ED2 files exist. In that case, only ED1 and ED2 output would be created. Override is process_all_editions.
require_first_section_code_for_match
The sorted file list’s first file’s section code must match the first section code in the list of section_codes. Otherwise the spreadsheet row will not match. This rule only exists for situations where a particular section code for a supplement sometimes comes on its own and needs to be processed with its own MMSID. For example, MEXTAB. Use this rule carefully because of possible non-matching side effects. Override is do_not_require_first_section_code_for_match.
do_not_require_first_section_code_for_match
Does not require that the sorted file list’s first file’s section code match the first section code in the list of section_codes. This is the usual default. Override is require_first_section_code_for_match.
edition_discriminators_using_smart_substitute
For processing type parent_grouping_with_edition, the title_code and a specific section_code form the spreadsheet row key. edition_discriminators_using_smart_substitute is for something like the following situation: For the title_code QCM we want to make edition substitutions, but each edition discriminator has its own section code. We have title_code QCM with three separate editions: edition discriminator ED1, section_codes ED1; edition discriminator ED2, section_codes ED2; and edition discriminator ED3, section_codes ED3. We still want to substitute the pages in ED2 and ED3 over the ED1 pages. In order to do that, we find the FIRST edition discriminator and set the edition discriminators to the FIRST edition discriminator and the current edition (section code). That means for ED2, we would use the ED1 pages and substitute in the ED2 pages. Override is edition_discriminators_not_using_smart_substitute.
required_all_sections_in_sip
All sections are required to appear in the SIP. If they are not included based on the spreadsheet row, then an exception is generated. Override is optional_all_sections_in_sip.
optional_all_sections_in_sip
Not all sections are required to appear in the SIP. Override is required_all_sections_in_sip.
missing_sequence_is_ignored
Missing sequences in page numbering (such as skipping from page 1 to 3) are ignored. Override is missing_sequence_is_error.
missing_sequence_is_error
Missing sequences in page numbering (such as skipping from page 1 to 3) are treated as an error. Override is missing_sequence_is_ignored.
missing_sequence_double_wide_is_ignored
A missing sequence whose previous page is either double the width or half the width of the current page is treated as if there is no missing sequence. This is to handle the common situation of double-wide pages. Override is missing_sequence_double_wide_is_error.
missing_sequence_double_wide_is_error
Even if the previous page is double the width or half the width of the current page, the missing sequence is still treated as an error (if missing_sequence_is_error is a rule). Override is missing_sequence_double_wide_is_ignored.
zero_length_pdf_replaced_with_page_unavailable
A zero-length PDF file (a file with a size of 0) is replaced with the standard page unavailable PDF file. This file is found in the codebase under core/src/main/resources/page-unavailable.pdf. Override is zero_length_pdf_skipped.
zero_length_pdf_skipped
A zero-length PDF file (a file with a size of 0) is skipped (not replaced by any other file). Override is zero_length_pdf_replaced_with_page_unavailable.
numeric_starts_in_hundreds_not_considered_sequence_skips
There are some cases where a wrap starts in the 400’s. Normally this would be considered a skipped sequence, but with this option sequence numbering starting in the 400’s or more (so starting with 400 or 401, or 500 or 501, and so on) is not considered a sequence numbering skip. Override is numeric_starts_in_hundreds_considered_sequence_skips.
numeric_starts_in_hundreds_considered_sequence_skips
Sequence numbering skips that start with 400 or 401 or 500 or 501 and so on are still treated as a sequence numbering skip. Override is numeric_starts_in_hundreds_not_considered_sequence_skips.

Processing options

Processing options determine how certain aspects of the workflow take place. Each processing option has an opposite option that can be used to override its value. In general options don’t have side effects, but rules do.

alpha_before_numeric
Sequences that have sequence letters are sorted before sequences that have numbers only. So, we would have ordering A01, A02, B01, B02, 01, 02. Override is numeric_before_alpha.
numeric_before_alpha
Sequences that have numbers only are sorted before sequences that have sequence letters. So, we would have ordering 01, 02, A01, A02, B01, B02. Override is alpha_before_numeric.
generate_processed_pdf_thumbnails_page
Generates a thumbnail page of each PDF that is included in the SIP. This can be a resource (memory and CPU) intensive operation. Override is do_not_generate_processed_pdf_thumbnails_page.
do_not_generate_processed_pdf_thumbnails_page
Does not generate a thumbnail page of each PDF that is included in the SIP. Override is generate_processed_pdf_thumbnails_page.
skip_generation_thumbnail_page_when_error_free
Skip thumbnail page generation when there are no processing errors. Override is always_generate_thumbnail_page.
always_generate_thumbnail_page
Always generate thumbnail page. Override is skip_generation_thumbnail_page_when_error_free.
use_in_memory_pdf_to_thumbnail_generation
Use the in-memory pdf to thumbnail page generation. This can be a resource (memory and CPU) intensive operation. Override is use_command_line_pdf_to_thumbnail_generation.
use_command_line_pdf_to_thumbnail_generation
On Linux-based systems, this option will use the command-line tool pdftoppm to generate the pdf thumbnails. This is a much faster (and much higher quality) operation. Override is use_in_memory_pdf_to_thumbnail_generation.
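
The two orderings described by alpha_before_numeric and numeric_before_alpha can be reproduced with a small shell sketch. The function name sort_sequences is hypothetical, and the sketch assumes zero-padded sequence numbers, as in the examples above, so that a plain byte-order sort puts them in numeric order. It is not the sorting code used by the tools.

```shell
# Sketch only: orders sequence identifiers such as 01, 02, A01, B02.
# ASSUMPTION: numbers are zero-padded, so byte-order sorting matches numeric order.
sort_sequences() {
    mode="$1"
    shift
    case "$mode" in
        numeric_before_alpha|alpha_before_numeric) ;;
        *) echo "unknown option: $mode" >&2; return 1 ;;
    esac
    # Split the identifiers into those starting with a digit and those starting with a letter.
    numeric="$(printf '%s\n' "$@" | grep '^[0-9]' | LC_ALL=C sort)"
    alpha="$(printf '%s\n' "$@" | grep '^[[:alpha:]]' | LC_ALL=C sort)"
    if [ "$mode" = "numeric_before_alpha" ]; then
        printf '%s\n%s\n' "$numeric" "$alpha"
    else
        printf '%s\n%s\n' "$alpha" "$numeric"
    fi | sed '/^$/d'    # drop the blank line left by an empty group
}

sort_sequences numeric_before_alpha B01 02 A02 01 A01 B02
sort_sequences alpha_before_numeric B01 02 A02 01 A01 B02
```

The first call lists 01, 02, A01, A02, B01, B02 and the second lists A01, A02, B01, B02, 01, 02, one per line, matching the orderings given for the two options.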

Overrides for rules and options

Processing rules and options can be overridden on several different levels.

Each processing type has a set of default processing rules and processing options.

The processing type rules and options are overridden by the rules and options in the given spreadsheet row that is matched for processing a given title_code folder.

Finally, the command-line processing rules and processing options are applied and will override all previous options.

For example, the parent_grouping processing type has the default processing option numeric_before_alpha. When processing the title code DPT, this default option is overridden by alpha_before_numeric for the DPT row for parent_grouping. Finally, it is possible to specify the processing option numeric_before_alpha on the command line, which would mean that all processing sorts the ordering of PDFs as numeric_before_alpha.
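
The layering can be sketched in shell as three steps: start from the processing type defaults, apply the spreadsheet row, then apply the command line. The function names opposite_of and apply_overrides are hypothetical and this is not the project's implementation; only the processing options listed in this guide are mapped, and rules would follow the same pattern.

```shell
# Sketch only: maps each processing option to its documented opposite.
opposite_of() {
    case "$1" in
        numeric_before_alpha) echo alpha_before_numeric ;;
        alpha_before_numeric) echo numeric_before_alpha ;;
        generate_processed_pdf_thumbnails_page) echo do_not_generate_processed_pdf_thumbnails_page ;;
        do_not_generate_processed_pdf_thumbnails_page) echo generate_processed_pdf_thumbnails_page ;;
        skip_generation_thumbnail_page_when_error_free) echo always_generate_thumbnail_page ;;
        always_generate_thumbnail_page) echo skip_generation_thumbnail_page_when_error_free ;;
        use_in_memory_pdf_to_thumbnail_generation) echo use_command_line_pdf_to_thumbnail_generation ;;
        use_command_line_pdf_to_thumbnail_generation) echo use_in_memory_pdf_to_thumbnail_generation ;;
    esac
}

# Applies a comma-separated list of overrides to a comma-separated list of current options.
# Each override replaces itself and its opposite. Runs in a subshell so IFS stays local.
apply_overrides() (
    current="$1"
    IFS=','
    for override in $2; do
        opposite="$(opposite_of "$override")"
        kept=""
        for option in $current; do
            if [ "$option" != "$override" ] && [ "$option" != "$opposite" ]; then
                kept="${kept:+$kept,}$option"
            fi
        done
        current="${kept:+$kept,}$override"
    done
    printf '%s\n' "$current"
)

type_defaults="numeric_before_alpha,generate_processed_pdf_thumbnails_page,skip_generation_thumbnail_page_when_error_free,use_in_memory_pdf_to_thumbnail_generation"
after_row="$(apply_overrides "$type_defaults" "alpha_before_numeric")"    # the DPT row
effective="$(apply_overrides "$after_row" "numeric_before_alpha")"       # the command line
printf '%s\n' "$effective"
```

In this run the DPT row switches the sort order to alpha_before_numeric, and the command-line option switches it back to numeric_before_alpha, while the three thumbnail options keep their defaults.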

File processed indicator: ready-for-ingestion-FOLDER-COMPLETED file

Currently the ready-for-ingestion processing runs each separate title code folder on its own individual thread. When an exception occurs that halts processing for a specific thread, other threads will continue processing. It is possible for processing for many folders to be incomplete while at the same time others have completed. For example, the processing may lose its connection to the source and target folders in the middle of processing. To help determine which processing has successfully completed, the ready-for-ingestion processor will write an empty file ready-for-ingestion-FOLDER-COMPLETED in the target folder to indicate that all processing stages were successfully completed. If this file is not present it means that the processing for that folder was interrupted for some reason and will need to be re-run.
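A run can be checked for missing marker files with a short shell function. This is a minimal sketch: list_incomplete_folders is a hypothetical name, and it assumes the per-folder output sits two levels below the for-ingestion folder (a magazine or newspaper folder, then the dated title code folder). Adjust the glob if your layout differs.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the processor.
# Prints every folder two levels below the given for-ingestion folder that
# does not contain the ready-for-ingestion-FOLDER-COMPLETED marker file.
list_incomplete_folders() {
    local for_ingestion_folder="$1"
    local folder
    for folder in "${for_ingestion_folder}"/*/*/; do
        # Skip the unexpanded glob pattern when nothing matches.
        [ -d "${folder}" ] || continue
        if [ ! -e "${folder}ready-for-ingestion-FOLDER-COMPLETED" ]; then
            printf '%s\n' "${folder%/}"
        fi
    done
}
```

An empty result suggests that every folder found at that depth completed all processing stages.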

Example processing command

The following snippet illustrates a ready-for-ingestion processing command:

sourceFolder="/path/to/LD_Sched/fairfax-processing/pre-processing"
targetBaseFolder="/path/to/LD_Sched/fairfax-processing"
targetForIngestionFolder="${targetBaseFolder}/for-ingestion"
forReviewFolder="${targetBaseFolder}/for-review"

startingDate="2019-06-03"
endingDate="2019-06-09"

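forIngestionProcessingTypes="parent_grouping,parent_grouping_with_edition,create_sip_for_folder"
# No command-line rule overrides in this example, so the value is left empty
forIngestionProcessingRules=""
forIngestionProcessingOptions="use_command_line_pdf_to_thumbnail_generation"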

numberOfThreads=60
# Note we are using command-line pdf-to-thumbnail generation, which can handle higher throughput
maximumThumbnailPageThreads=60

maxMemory="3048m"
minMemory="3048m"

java -Xms${minMemory} -Xmx${maxMemory} \
    -jar fat/build/libs/sip-generation-fairfax-fat-all-<VERSION>.jar \
    --readyForIngestion \
    --startingDate="${startingDate}" \
    --endingDate="${endingDate}" \
    --sourceFolder="${sourceFolder}" \
    --targetForIngestionFolder="${targetForIngestionFolder}" \
    --forReviewFolder="${forReviewFolder}" \
    --createDestination \
    --parallelizeProcessing \
    --numberOfThreads=${numberOfThreads} \
    --maximumThumbnailPageThreads=${maximumThumbnailPageThreads} \
    --forIngestionProcessingTypes="${forIngestionProcessingTypes}" \
    --forIngestionProcessingRules="${forIngestionProcessingRules}" \
    --forIngestionProcessingOptions="${forIngestionProcessingOptions}"

Terminating or stopping ready-for-ingestion processing with ready-for-ingestion-STOP file

Sometimes it may be necessary to terminate the ready-for-ingestion processing prematurely, before it has completed processing all of its folders. There is some code in the processor that attempts to trap a ^C or kill signal and shut down gracefully, but that code does not appear to work at the moment.

The alternative is to create a file in the targetForIngestionFolder with the name ready-for-ingestion-STOP. When this file appears, all processing already in progress will complete and all subsequent processing will be skipped. At the end of all processing the log will provide a list of skipped folders.

Note that it is possible to delete the ready-for-ingestion-STOP file, in which case processing will continue. However, there is no attempt to run any processing that was skipped while the file was present.
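Creating and removing the stop file needs nothing more than touch and rm. The function names below are hypothetical conveniences; the only detail taken from this guide is the file name and its location in the targetForIngestionFolder.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the processor.

# Creates the empty stop file in the given targetForIngestionFolder.
request_stop() {
    touch "$1/ready-for-ingestion-STOP"
}

# Removes the stop file again. Folders that were skipped while the file
# existed are not picked up again by the running process.
cancel_stop() {
    rm -f "$1/ready-for-ingestion-STOP"
}
```

For example, request_stop "${targetForIngestionFolder}" asks a running process to finish its current folders and skip the rest.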

Managing errors in processing

Sometimes processing for a specific folder may fail for some reason. For example, if the source and/or target folders are NFS shares, the connection to the source or target may be interrupted, throwing some kind of IO exception. This exception will halt the processing for that particular source folder. However, if the problem is intermittent (in other words, the connection is lost but then comes back), then other processing may work fine.

At the end of a processing run, a list of failed folders is provided, with the reason each folder's processing failed. The suggestion is to copy those failed folders to a separate location and process them again.

Note as well that if there is a failure in processing a folder, the ready-for-ingestion-FOLDER-COMPLETED file will not be present in the target location. The folders that do not have the ready-for-ingestion-FOLDER-COMPLETED file will need to be deleted so that they are not ingested into Rosetta by mistake.

For-review

See the Librarian Guide for a discussion of the for-review output and how a librarian handles the different exceptions to processing.

Copying ingested loads to ingested folder

Once files have been ingested into Rosetta, a file with the name of done is placed in the root folder. The path of the root folder is of the format:

<magazine|newspaper>/<date-in-yyyyMMdd>_<title_code>_<processing_type>_<optional-edition>__<full-name-of-publication>

After the folder has been ingested into Rosetta the folder can be moved to the post-processed folder.

post-processed folder structure

The folder structure for the ingested (post-processed) stage is as follows:

<targetFolder>/<magazines|newspapers>/<title_code>/<yyyy>/<folder-containing-done-file>

The naming of the folder containing the done file is determined by the processing rules for the ready-for-ingestion processor. See Ready-for-ingestion stage for more details. In this folder, the file structure matches the structure that was ingested into Rosetta, namely:

<folder-specific-naming>
   |- done
   |- content/
           |- mets.xml
           |- streams/
                   |- <pdf-files>

Note that the mets.xml file is placed in the content folder. The done file is in the root folder.
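The layout above can be verified for a single folder with a few file tests. This is a minimal sketch; check_ingested_layout is a hypothetical name, and it checks only that the expected entries exist, not that their contents are valid.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the processor.
# Reports which parts of the expected layout are missing from the given folder.
# Returns 0 when done, content/mets.xml and content/streams are all present.
check_ingested_layout() {
    local root="$1"
    local status=0
    if [ ! -f "${root}/done" ]; then
        echo "missing: ${root}/done"
        status=1
    fi
    if [ ! -f "${root}/content/mets.xml" ]; then
        echo "missing: ${root}/content/mets.xml"
        status=1
    fi
    if [ ! -d "${root}/content/streams" ]; then
        echo "missing: ${root}/content/streams"
        status=1
    fi
    return "${status}"
}
```

A non-zero return status means at least one of the three entries was not found.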

Example processing command

The following snippet illustrates a --copyIngestedLoadsToIngestedFolder processing command:

baseFolder="/path/to/LD_Sched/fairfax-processing"
sourceFolder="${baseFolder}/for-ingestion"
targetPostProcessedFolder="${baseFolder}/post-processed"
forReviewFolder="${baseFolder}/for-review"

startingDate="2019-06-03"
endingDate="2019-06-09"

# Currently the processing is not multithreaded, but eventually it will be
numberOfThreads=60

maxMemory="2048m"
minMemory="2048m"

java -Xms${minMemory} -Xmx${maxMemory} \
    -jar fat/build/libs/sip-generation-fairfax-fat-all-<VERSION>.jar \
    --copyIngestedLoadsToIngestedFolder \
    --startingDate="${startingDate}" \
    --endingDate="${endingDate}" \
    --sourceFolder="${sourceFolder}" \
    --targetPostProcessedFolder="${targetPostProcessedFolder}" \
    --forReviewFolder="${forReviewFolder}" \
    --createDestination \
    --parallelizeProcessing \
    --numberOfThreads=${numberOfThreads}

Important notes

The --moveFiles option is not included in the example, but in general you would be moving the files to the post-processed location.

The done file must exist or the files will not be copied/moved. If files must be copied regardless of the existence of the done file, use the option --moveOrCopyEvenIfNoRosettaDoneFile.

For-review

If a file or set of files is unable to be processed for some reason, it will be placed in the For-review folder. There is no processor that operates on the For-review stage. Processors that output to the For-review folder use the parameter forReviewFolder to set the location of the For-review folder.

If the files come from the Ready-for-ingestion stage but are not ingested into Rosetta properly, then there is no done file placed in the root folder. There’s no other way to tell that the ingestion has failed. For this reason, the copyIngestedLoadsToIngestedFolder processing usually only moves/copies the folders that contain a done file.

After an ingestion takes place the ingested folders (those containing the done file) can be moved to the targetPostProcessedFolder. The folders that remain can be reviewed to determine the reason for failure.

Additional tools

listFiles: list files based on source folder

listFiles simply lists files by title code, section code and date:

java -jar sip-generation-fairfax-fat-all-<VERSION>.jar \
    --listFiles \
    --startingDate="yyyy-MM-dd" \
    --endingDate="yyyy-MM-dd" \
    --sourceFolder="/path/to/source/folder"

extractMetadata: extract metadata from the pdf files based on source folder

Extracts metadata from the pdf files:

java -jar sip-generation-fairfax-fat-all-<VERSION>.jar \
    --extractMetadata \
    --startingDate="yyyy-MM-dd" \
    --endingDate="yyyy-MM-dd" \
    --sourceFolder="/path/to/source/folder"

copyProdLoadToTestStructures: Copy production load files

Copies files from previous production loads (loads that went into Rosetta) into Pre-processing and Ready-for-ingestion structures for testing. The structures are as follows:

  1. preProcess structure. This is to mimic the input to readyForIngestion processing. The folder structures are the same as the output to preProcess, with the folder structure starting with <targetFolder>/preProcess.

  2. readyForIngestion structure. This is the structure that gets ingested into Rosetta. The folder structures are the same as the output to readyForIngestion, with the folder structure starting with <targetFolder>/readyForIngestion.

These structures provide for testing the Fairfax processor, to see if its outputs match the work done previously:

java -jar sip-generation-fairfax-fat-all-<VERSION>.jar \
    --copyProdLoadToTestStructures \
    --startingDate="yyyy-MM-dd" \
    --endingDate="yyyy-MM-dd" \
    --sourceFolder="/path/to/source/folder" \
    --targetFolder="/path/to/target/folder" \
    --createDestination

Converting the spreadsheet to JSON and vice-versa

From time to time the spreadsheet that defines how the Fairfax files are ingested will change based on new information. When this happens, the JSON file found at core/src/main/resources/default-fairfax-import-parameters.json needs updating to reflect the changes in the source spreadsheet.

Converting the csv spreadsheet to JSON

  1. First, export the original spreadsheet in .csv format with | as the field separator and save it.

  2. Copy the exported csv spreadsheet to: core/src/main/resources/nz/govt/natlib/tools/sip/generation/fairfax/default-fairfax-import-spreadsheet.csv.

  3. Execute the gradle task updateDefaultFairfaxImportParameters, which takes the csv spreadsheet and converts it to a JSON file, which is then used for the actual processing:

    gradle updateDefaultFairfaxImportParameters \
      -PfairfaxSpreadsheetImportFilename="core/src/main/resources/nz/govt/natlib/tools/sip/generation/fairfax/default-fairfax-import-spreadsheet.csv" \
      -PfairfaxSpreadsheetExportFilename="core/src/main/resources/nz/govt/natlib/tools/sip/generation/fairfax/default-fairfax-import-parameters.json"
    

Note that there is no requirement to use the filenames given in the example. The given filenames are the ones the code uses.
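Before running the conversion it can be worth confirming that the export really used | as its separator. The function below is a rough sketch under two assumptions that are not stated in this guide: that every row has the same number of columns as the header, and that no cell contains a | character or a line break. check_pipe_separated is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical check, not part of the build.
# Succeeds when the header has at least two |-separated fields and every
# non-empty line has the same number of fields as the header.
check_pipe_separated() {
    awk -F'|' '
        BEGIN { bad = 0 }
        NR == 1 { expected = NF; if (NF < 2) { print "header has no | separator"; bad = 1 } }
        NF > 0 && NF != expected {
            printf "line %d: %d fields, expected %d\n", NR, NF, expected
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

A failure here does not necessarily mean the file is wrong, since quoted cells may legitimately contain the separator. It is only a prompt to look at the reported lines.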

Converting the JSON parameters to csv spreadsheet

The JSON file can be converted to a csv spreadsheet using the build task exportDefaultFairfaxImportParameters:

gradle exportDefaultFairfaxImportParameters \
  -PfairfaxSpreadsheetImportFilename="core/src/main/resources/nz/govt/natlib/tools/sip/generation/fairfax/default-fairfax-import-parameters.json" \
  -PfairfaxSpreadsheetExportFilename="core/src/main/resources/nz/govt/natlib/tools/sip/generation/fairfax/default-fairfax-import-spreadsheet.csv"

Note that there is no requirement to use the filenames given in the example. The given filenames are the ones the code uses.

Check in the changes and build a new version of the jar

Once both the .csv and .json files have been updated, the changes should be checked in and a new version of the processor jar built, which will contain the new JSON processing resource file.

Copying and moves

File copying

File copies are done in 2 steps:

  • The file is copied to its new target with a file extension of .tmpcopy.
  • The file is renamed to the target name.

This means that the target does not have its correct name until the copy is complete. Subsequent runs on the same source check whether the existing target's MD5 hash matches the source's. If the hashes match, the copy is not done again.
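The same two-step idea can be sketched in shell. This is not the processor's implementation, only an illustration of the technique: the function names are hypothetical, md5sum from GNU coreutils is assumed to be available, and the handling of an existing target with a different hash (report it and stop) is a choice made for this sketch rather than something this guide describes.

```shell
#!/usr/bin/env bash
# Sketch only: illustrates the copy-then-rename technique in shell.

# Prints the MD5 hash of a file (assumes GNU coreutils md5sum).
md5_of() {
    md5sum "$1" | cut -d' ' -f1
}

copy_with_tmpcopy() {
    local source_file="$1"
    local target_file="$2"
    if [ -f "${target_file}" ]; then
        # An earlier run already produced an identical target: nothing to do.
        if [ "$(md5_of "${source_file}")" = "$(md5_of "${target_file}")" ]; then
            return 0
        fi
        # Assumption for this sketch: leave a differing target untouched.
        echo "target exists with different content: ${target_file}" >&2
        return 1
    fi
    # Step 1: copy under a temporary name, so a partial copy never carries
    # the final target name.
    cp "${source_file}" "${target_file}.tmpcopy"
    # Step 2: rename to the target name once the copy is complete.
    mv "${target_file}.tmpcopy" "${target_file}"
}
```

If the copy in step 1 is interrupted, only a .tmpcopy file is left behind, which makes incomplete copies easy to spot.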

Atomic file moves

Some processing has a --moveFiles option. Note that when moving files across file systems (in other words, from one file system to another), it’s not possible to have truly atomic operations. If the move operation is interrupted before it completes, what can happen is that a file of the same name will exist on both filesystems, with the target file system having an incomplete file.

With that in mind, file moves have the following characteristics:

  • If a file move can be done atomically (as determined by the Java runtime), it is done atomically.
  • If the file move cannot be done atomically (as determined by the Java runtime), the file move takes the following steps:
    1. The file is copied across to the target file system with a .tmpcopy extension.
    2. The file is renamed to the target file name.
    3. The source file is deleted.

This means that if at any point the operation is interrupted, a recovery can take place. A move when the file already exists in the target folder will trigger an MD5 hash comparison. If the source file and the target file are identical, the source file is deleted. Otherwise, the source file is moved across (using the steps above) with -DUPLICATE-# added to the target filename. These -DUPLICATE-# files need to be checked manually to determine which file is correct.
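The non-atomic path and the duplicate handling can be sketched in shell as follows. This is an illustration, not the processor's code: the function names are hypothetical, md5sum from GNU coreutils is assumed, and appending -DUPLICATE-<n> to the end of the target name is an assumption, since this guide only says the marker appears in the filename.

```shell
#!/usr/bin/env bash
# Sketch only: illustrates the non-atomic move steps and the duplicate check.

# Prints the MD5 hash of a file (assumes GNU coreutils md5sum).
md5_of() {
    md5sum "$1" | cut -d' ' -f1
}

move_with_duplicate_check() {
    local source_file="$1"
    local target_file="$2"
    local n=1
    if [ -f "${target_file}" ]; then
        if [ "$(md5_of "${source_file}")" = "$(md5_of "${target_file}")" ]; then
            # Identical target already in place: only the source needs removing.
            rm "${source_file}"
            return 0
        fi
        # Different content: use the first unused -DUPLICATE-<n> name.
        # Assumption: the marker is appended to the end of the target name.
        while [ -e "${target_file}-DUPLICATE-${n}" ]; do
            n=$((n + 1))
        done
        target_file="${target_file}-DUPLICATE-${n}"
    fi
    cp "${source_file}" "${target_file}.tmpcopy"    # 1. copy under a temporary name
    mv "${target_file}.tmpcopy" "${target_file}"    # 2. rename to the target name
    rm "${source_file}"                             # 3. delete the source
}
```

Because the source is deleted only after the rename succeeds, an interruption at any step leaves either the source or a complete target, so the move can be retried.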

We hope these mitigations will prevent any data loss.