Library

Library Status

This screen provides a detailed breakdown of all the document libraries currently configured in Document Searchability. For each document library, it shows detailed information about every document the library contains.

Library Settings

Setting: Description
Document Library Name: The name, title, or description of the document library.
Document Library Type: The type of the document library:
- File System
- SharePoint
- Office 365
- Azure Blob Storage
- Azure File Storage
Locations: One or more locations (of the same type) to be processed.
Excluded Specific Locations: Select this to exclude specific locations from being processed. Site collections, sites and libraries that match the specified URLs are excluded.
Filter Locations by Regular Expression: Select this to include only locations whose URLs match the specified regular expressions.
Choose Library Icon: Choose an icon to associate with the library.
Processing Mode:
- Audit Only: Analyse the document library to identify the documents that need to be converted, without actually converting them.
- Audit & OCR: Audit the document library and OCR the documents that have been identified as candidates for processing.
Cores: The maximum number of CPU cores that will be used when running the job.
Process SharePoint Lists: Whether or not to process SharePoint lists. NOTE: Processing SharePoint lists can be very time-consuming if the lists are very large.
SharePoint Versioning: This setting can be used to automatically turn versioning on.
Publish Major Version: Publish a major version after OCR.
Check-in Comment: The check-in comment applied to the updated SharePoint file version. The following templates can be used in the check-in comment:
- %DATE%: replaced by the date the document was OCRed
- %TIME%: replaced by the time the document was OCRed
Custom Check-in Column: Optionally, specify a SharePoint column to which a custom comment is added after OCR. NOTE: The column name is case sensitive.
Comment: The comment to add to the Custom Check-in Column. The following templates can be used in the comment:
- %DATE%: replaced by the date the document was OCRed
- %TIME%: replaced by the time the document was OCRed
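
The %DATE% and %TIME% templates can be pictured as a simple text substitution. The sketch below is illustrative only: the function name `expand_checkin_comment` is hypothetical, and the date and time formats are assumptions, since the formats the product actually uses are not stated here.

```python
from datetime import datetime


def expand_checkin_comment(template: str, ocr_time: datetime) -> str:
    """Replace %DATE% and %TIME% with the date and time the document was OCRed."""
    # Assumed formats (ISO date, 24-hour time); the product may format these differently.
    return (
        template
        .replace("%DATE%", ocr_time.strftime("%Y-%m-%d"))
        .replace("%TIME%", ocr_time.strftime("%H:%M:%S"))
    )


comment = expand_checkin_comment(
    "OCRed on %DATE% at %TIME%", datetime(2024, 3, 5, 14, 30, 0)
)
print(comment)
```

The same substitution would apply to the Comment written to the Custom Check-in Column.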

Document Settings

Setting: Description
Process PDF: Whether or not to process PDF documents.
Image Only: Whether or not to process image-only PDFs. An image-only PDF is a PDF that originated from a scanned document or other digital image. It does not contain any text, just pictures.
Partially Searchable: Whether or not to process PDF documents that are partially searchable, i.e., some pages are searchable and some are image-only.
Fully Searchable: Whether or not to process PDF documents that are fully searchable.
Hidden Text: Whether or not to process PDF documents that contain hidden text. A Hidden Text PDF has pages that are image-only with hidden (type 3) text.
- Such files are typically the output of running an OCR PDF process on an image-only PDF.
- Note: If you enable this setting, consider also enabling Remove Hidden Text under “OCR Settings > PDF Source Settings”; otherwise you will have multiple OCR text layers per page.
Process TIFF Files: Whether or not to process TIFF files.
Delete Original TIFF: Whether or not to delete the original TIFF files after they have been converted to searchable PDFs.
Process BMP Documents: Whether or not to process BMP files.
Delete Original BMP: Whether or not to delete the original BMP files after they have been converted to searchable PDFs.
Process JPEG Files: Whether or not to process JPEG files.
Delete Original JPEG: Whether or not to delete the original JPEG files after they have been converted to searchable PDFs.
Process PNG Files: Whether or not to process PNG files.
Delete Original PNG: Whether or not to delete the original PNG files after they have been converted to searchable PDFs.
Process PDF Attachments: Whether or not to process PDF attachments inside MSG files.
Temp Folder Location: The folder used to save documents temporarily for Audit and OCR processing.
Date Filter: Filter out documents by modified or creation date. Documents whose date falls within the specified “From” and “To” dates are excluded.
Exclude Specific Documents: Select this to exclude specific documents by their paths. Documents that match the specified paths are excluded.
Filter Documents by Regular Expression: Select this to include only documents whose properties match the specified regular expressions, e.g., only documents whose name matches a specific regular expression.
Document Error Rule: The operation to perform if a document fails to process:
- Copy to error folder
- Move to error folder (for the File System library type only)
Retain Folder Structure: Retain the document’s folder structure when it is copied to the error location.
Document Error Location: The path of the error location.
Document Error Location Type:
- File System
- SharePoint
- Office 365
- Azure Blob Storage
- Azure File Storage
Retry: Whether or not to re-process documents that have previously failed to convert.
OCR Document Limit: Limit the number of documents to OCR (not Audit) per run. Set to ‘0’ for no limit.
Retain Creation Date*: Retain the creation date of the source document (SharePoint creation date, File System creation date and created date in PDF properties).
Retain Modified Date*: Retain the modified date of the source document (SharePoint modified date, File System modified date and modified date in PDF properties).
Retain Created By*: Retain the created user of the source document (SharePoint created by, File System owner and author in PDF properties).
Retain Modified By*: Retain the modified user of the source document (SharePoint modified by).

* See sections 6.3.3.1, 6.3.3.2 and 6.3.3.3 for more details about these settings.
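
The Date Filter and Filter Documents by Regular Expression settings combine an exclusion rule with an inclusion rule. The following is a minimal sketch of that logic, not the product's implementation. The function name `is_selected`, the inclusive date boundaries, the "match any pattern" rule, and the use of `re.search` on the document name are all assumptions made for illustration.

```python
import re
from datetime import date


def is_selected(name, modified, include_patterns, exclude_from, exclude_to):
    """Return True if a document passes both the date filter and the regex filter."""
    # Date Filter: documents dated within the From/To range are excluded
    # (boundaries treated as inclusive here, which is an assumption).
    if exclude_from <= modified <= exclude_to:
        return False
    # Regex filter: keep only documents whose name matches at least one pattern
    # (assumed rule; the product may require a different kind of match).
    return any(re.search(pattern, name) for pattern in include_patterns)


patterns = [r"^invoice_\d+\.pdf$"]
window = (date(2023, 1, 1), date(2023, 12, 31))

print(is_selected("invoice_0042.pdf", date(2024, 6, 1), patterns, *window))
print(is_selected("invoice_0042.pdf", date(2023, 6, 1), patterns, *window))
print(is_selected("notes.pdf", date(2024, 6, 1), patterns, *window))
```

Only the first document is selected: the second falls inside the excluded date range and the third does not match the pattern.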

Retain Creation/Modified Date/User

Creation Date | Created User | Modified Date | Modified User
SharePoint metadata**
PDF metadata** N/A
Windows File System ✔* N/A

* “Created User” maps best to “Owner” in Windows File System metadata.

To modify this value, the Document Searchability service must be running with sufficient administrative privileges.

** SharePoint metadata vs. PDF metadata

SharePoint metadata refers to the ‘columns’ available in SharePoint that store information about each document.

PDF metadata refers to the document properties (File > Properties) of a PDF document.

SharePoint Libraries

The behaviour of Retain Creation/Modified Date/User can vary depending on the settings used in SharePoint and Document Searchability. The table below summarises when these will and will not be retained in SharePoint.

n/a* - To publish a major version, both major and minor versioning must be on in SharePoint.

SharePoint Lists

Document Archive Settings

Setting: Description
Archive Template: The template used to name the archived file. The default is: %FILENAME%%TIMESTAMP%.%EXT%
Archive Location: The folder location where original documents will be archived.
Archive source Images to Archive folder: If enabled, source images (TIFF, BMP, JPEG, PNG) are archived to the Archive folder specified above.
Archive source PDF & MSG files to Archive folder: If enabled, source PDFs and MSG files that have PDF attachments are archived to the Archive folder (even when versioning is enabled within SharePoint). A file is only archived before it is OCRed.
Archive Location Type:
- File System
- SharePoint
- Office 365
- Azure Blob Storage
- Azure File Storage
Retain Folder Structure: Retain the document’s folder structure when the file is archived.
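
As an illustration of how the default Archive Template could expand, the sketch below substitutes the three tokens for a given source file name. The function name `archive_name` is hypothetical and the timestamp format is an assumption; the format the product actually uses is not documented in this section.

```python
from datetime import datetime
from pathlib import PurePath


def archive_name(source_name, template="%FILENAME%%TIMESTAMP%.%EXT%", now=None):
    """Build an archive file name from a template and a source file name."""
    now = now or datetime.now()
    path = PurePath(source_name)
    values = {
        "%FILENAME%": path.stem,                       # file name without extension
        "%TIMESTAMP%": now.strftime("%Y%m%d%H%M%S"),   # assumed timestamp format
        "%EXT%": path.suffix.lstrip("."),              # extension without the dot
    }
    for token, value in values.items():
        template = template.replace(token, value)
    return template


print(archive_name("scan.tiff", now=datetime(2024, 3, 5, 14, 30, 0)))
```

Including the timestamp in the name keeps repeated archives of the same source file from overwriting each other.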

OCR Settings

As described in section 5.1.4, Document Searchability has two OCR engines. When a new library is created, the default OCR settings for each engine are loaded from that engine's Properties.xml file.

  • Nutrient engine: “[installation path]\ocr\Properties.xml”

  • Extended (IRIS) engine: “[installation path]\extendedocr\Properties.xml”

This can be useful if you have a set of OCR settings that work best for the type of documents you have and want to use the same OCR settings for all newly created document libraries.

Note: Document Searchability does not modify the Properties.xml file. To set default values, you need to manually update the relevant Properties.xml file.

Standard OCR Settings

General Settings

Setting: Description
Auto Rotate: Automatically rotate pages so that text flows left to right.
Deskew: Straighten the image.
Remove Lines: Remove lines and boxes during OCR processing to improve recognition, particularly in cases where characters touch lines.
Despeckle: Remove specks below the specified pixel size from the image.
Box/Graphics Processing: By default, if an area of the document is identified as a graphic area, no OCR processing is run on that area. However, certain documents may include areas or boxes that are identified as “graphic” or “picture” areas but that actually contain useful text. Two options are available to force the OCR engine to process such areas:
- “Treat all Graphics Areas as Text”: ensures the entire document is processed as text.
- “Remove Box Lines in OCR Processing”: ideal for forms, where boxes around text can cause an area to be identified as graphics. This option removes boxes from the temporary copy of the image used by the OCR engine. It does not remove boxes from the final image. Technically, this option removes connected elements with a minimum area (by default 100 pixels).
Advanced Flags: Command line flags to be passed through to the underlying executable. Contact our support team for details on using this field.

PDF Source Settings

Re-Image PDF: Each page of the source PDF is rasterized to an image and appended to a new PDF document.
DPI: Sets the DPI of rasterized images. If 'Re-Image PDF' is used, these images will be added to the output file.
Retain Bookmarks: Retains any bookmarks from the source file in the output PDF document when using 'Re-Image PDF'.
Retain Metadata: Retains any metadata from the source file in the output PDF document when using 'Re-Image PDF'.
Retain Viewer Prefs: Retains any PDF Viewer Preferences, Page Mode and Page Layout from the source file in the output when using 'Re-Image PDF'.
Compression: The image(s) in the output PDF file will be compressed using JBIG2 (for black and white images) or MRC (for color images), which can dramatically reduce the output size of PDFs.
Remove Hidden Text: Remove existing hidden text (text that was added as a result of a previous OCR) from the PDF file so that the resulting searchable PDF file does not have two layers of the same text.
Force Vector Check: This setting is useful when dealing with documents that contain vector objects (e.g., CAD drawings). By default, pages that contain only vector objects are rasterized. Pages that do not have any images but contain vector objects as well as electronic text are skipped from rasterization. However, a page may contain vector objects (CAD drawings) while its title is in electronic text. To force rasterization of pages like these, set this property to true.
PDF/A: Switch on to make sure the output PDF conforms to the PDF/A standards.
PDF/A Version: The PDF/A version of the generated PDF.
Validate PDF/A: Validate the PDF as conforming to PDF/A.

Image Source Settings

Compression: The image(s) in the output PDF file will be compressed using JBIG2 (for black and white images) or MRC (for color images), which can dramatically reduce the output size of PDFs.
PDF/A: Switch on to make sure the output PDF conforms to the PDF/A standards.
PDF/A Version: The PDF/A version of the generated PDF.

Extended OCR Settings

General Settings

Setting: Description
Auto Rotate: Detect page orientation and correct it if required.
Deskew: Rotates the image to correct its skew angle.
Remove Dark Borders: Removes the dark surround from bitonal, grayscale or color images. The dark surround of the image is whitened.
- Note: The dark border must touch the edge of the image/page for this to work.
Keep Original Image: Select Yes to keep the original image as it is. Select No to output the image generated after the selected pre-processing has been applied.
- Note: This only applies when the source document is an image (TIFF, BMP, JPEG, PNG) or when 'Re-Image PDF' is used with a PDF source document.
Despeckle: Removes all groups of connected pixels whose pixel count is below the specified parameter.
Advanced Despeckle: The size of the speckles to remove.
Remove White Pixels: By default, despeckle removes black pixels. If set to true, despeckle removes white pixels rather than black pixels.
Work Depth: This parameter (0 to 255) defines how deeply the OCR engine will analyse a page, with 255 being the deepest. For poorer quality documents, higher values can give better recognition results.
Remove Blank Pages: Set this to true to remove blank pages from output PDF documents. A value needs to be set for Sensitivity (see below).
Sensitivity: The sensitivity, from 1 to 100. With a high sensitivity, fewer blank pages are detected.
Language: Set the language(s) to use for OCR.
- Note: A maximum of 8 languages can be selected.
- Only the English language can be used in conjunction with an Asian language.
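
The Despeckle description (removing groups of connected pixels below a given size) corresponds to a connected-component filter. The sketch below shows the general technique on a tiny bitonal image and is not the OCR engine's implementation. The function name `despeckle`, the 0/1 pixel representation (1 = black), and the use of 4-connectivity are assumptions.

```python
from collections import deque


def despeckle(image, min_size):
    """Remove groups of connected black pixels (1s) containing fewer than min_size pixels."""
    rows, cols = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Collect one connected group with a breadth-first search
            # (4-connectivity is an assumption made for this sketch).
            group, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                group.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Groups smaller than the threshold are treated as specks and cleared.
            if len(group) < min_size:
                for y, x in group:
                    result[y][x] = 0
    return result


page = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
for row in despeckle(page, min_size=2):
    print(row)
```

Here the isolated pixel in the top-left corner is cleared, while the 2x2 block is kept. A Remove White Pixels variant would apply the same search to white pixels instead.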

PDF Source Settings
Re-Image PDF: Each page of the source PDF is rasterized to an image and appended to a new PDF document.
Output PDF Version: The PDF version of the generated PDF.
Retain Bookmarks: Retains any bookmarks from the source file in the output PDF document when using 'Re-Image PDF'.
Retain Metadata: Retains any metadata from the source file in the output PDF document when using 'Re-Image PDF'.
Remove Hidden Text: Remove existing hidden text (text that was added as a result of a previous OCR) from the PDF file so that the resulting searchable PDF file does not have two layers of the same text.
Remove Visible Text: Whether or not to re-OCR existing visible text.
DPI: Sets the DPI of rasterized images. If 'Re-Image PDF' is used, these images will be added to the output file. However, applying 'Image Compression' or 'iHQC Compression' may reduce the DPI in the output PDF.
Force Vector Check: This setting is useful when dealing with documents that contain vector objects (e.g., CAD drawings). By default, pages that contain only vector objects are rasterized. Pages that do not have any images but contain vector objects as well as electronic text are skipped from rasterization. However, a page may contain vector objects (CAD drawings) while its title is in electronic text. To force rasterization of pages like these, set this property to true.
Image Compression: Compress color JPEG images in generated PDFs.
JPEG Quality: This parameter (0 to 255) determines the compression/quality of color JPEG images in generated PDFs. 0 gives the smallest file size, whilst 255 gives the best quality.
JPEG2000 Compression: Use JPEG 2000 compression.
Compression Mode: The JPEG 2000 compression mode to use.
Compression Value: The value to use for the selected compression mode.
iHQC Compression: Apply intelligent High-Quality Compression.
Quality Factor: The iHQC quality factor.
Compression Level: The iHQC compression level to be used. Level 1 is the basic compression level. Level 3 is the most advanced intelligent High-Quality Compression mode.

Image Source Settings

Output PDF Version: The PDF version of the generated PDF.
Image Compression: Compress color JPEG images in generated PDFs.
JPEG Quality: This parameter (0 to 255) determines the compression/quality of color JPEG images in generated PDFs. 0 gives the smallest file size, whilst 255 gives the best quality.
JPEG2000 Compression: Use JPEG 2000 compression.
Compression Mode: The JPEG 2000 compression mode to use.
Compression Value: The value to use for the selected compression mode.
iHQC Compression: Apply intelligent High-Quality Compression.
Quality Factor: The iHQC quality factor.
Compression Level: The iHQC compression level to be used. Level 1 is the basic compression level. Level 3 is the most advanced intelligent High-Quality Compression mode.

Advanced Pre-processing Settings

Remove Lines: Whether or not to remove lines from an image (the image must be black and white).
Horizontal Clean X: The parameter for cleaning noisy pixels attached to the horizontal lines.
Horizontal Clean Y: The parameter for cleaning noisy pixels attached to the horizontal lines.
Vertical Clean X: The parameter for cleaning noisy pixels attached to the vertical lines.
Vertical Clean Y: The parameter for cleaning noisy pixels attached to the vertical lines.
Horizontal Dilate: The dilate parameter that helps the detection of horizontal lines.
Vertical Dilate: The dilate parameter that helps the detection of vertical lines.
Horizontal Max Gap: The maximum horizontal line gap to close. This is useful for removing broken lines.
Vertical Max Gap: The maximum vertical line gap to close. This is useful for removing broken lines.
Horizontal Max Thickness: The maximum thickness of the horizontal lines to remove. This is useful for keeping vertical lines larger than this parameter. It can also be useful for keeping vertical letter strokes.
Vertical Max Thickness: The maximum thickness of the vertical lines to remove. This is useful for keeping horizontal lines larger than this parameter. It can also be useful for keeping horizontal letter strokes.
Horizontal Min Length: The minimum length of the horizontal lines to remove.
Vertical Min Length: The minimum length of the vertical lines to remove.
Binarize: Whether or not to perform binarization on the document.
Brightness: The brightness (higher values will darken the result).
Contrast: The contrast (lower values will darken the result).
Smoothing Level: Smoothing may be useful when binarizing text on a colored background, in order to avoid noisy pixels (0 disables smoothing, higher values smooth more).
Threshold: Sets the threshold for fixed threshold binarization (0 for automatic threshold computation).
Interpolate: Whether or not to interpolate.
Interpolation Mode: Sets the interpolation mode.
Interpolation Value: Interpolates the source image to the given resolution. This value (the target resolution) must be greater than the source image's resolution.
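
To illustrate what fixed threshold binarization means for the Threshold setting, the sketch below maps grayscale pixel values to black or white. It is not the engine's algorithm. The function name `binarize` and the comparison direction are assumptions, and the mean-intensity fallback for a threshold of 0 is only a stand-in for the engine's automatic computation, whose method is not documented here.

```python
def binarize(gray, threshold):
    """Map grayscale values (0-255) to black (0) or white (255) using a fixed threshold."""
    if threshold == 0:
        # Stand-in for "automatic threshold computation": use the mean intensity.
        # This is an assumption; the engine's real method is not documented here.
        pixels = [value for row in gray for value in row]
        threshold = sum(pixels) / len(pixels)
    # Pixels at or above the threshold become white, the rest black (assumed direction).
    return [[255 if value >= threshold else 0 for value in row] for row in gray]


image = [
    [10, 200],
    [120, 130],
]
print(binarize(image, 128))
print(binarize(image, 0))
```

With a fixed threshold of 128 only the two brightest pixels become white. With the mean-based fallback the threshold drops to 115, so the pixel with value 120 also becomes white.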

Run Details

Previous runs carried out on a particular document library are listed under the Run History section. The Run Details list provides detailed information about each run. Both Run History and Run Details have columns to which filters can be applied to limit what is displayed.

Use Export to CSV to export the run details to a CSV file.

The View Full Log button can be used to display the full log file of a specific run.

Run Details Context Menu

Use the right-click context menu to:

  • Copy the file path of the selected document.

  • Open the file (File System and SharePoint only).

  • Open the location of the file (File System and SharePoint only).

Scheduler Settings

Setting: Description
Manual: The document library must be run manually by clicking the “Run” button on the dashboard.
Once per day: The document library is scheduled to run at a specified time each day.
Continuous: The document library is scheduled to run periodically between a start time and an end time each day. The periods may be minutes, hours, days, or months. For example, a document library may be set to run every 1 hour between 9:00 and 17:00.
Run Once: The document library is scheduled to run only once, at a specified time.
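
The Continuous example (every 1 hour between 9:00 and 17:00) can be worked through by listing the resulting run times for one day. This is a sketch of the arithmetic only. The function name `run_times` is hypothetical, and treating the end time as inclusive is an assumption.

```python
from datetime import date, datetime, time, timedelta


def run_times(day, start, end, every):
    """List the run times on one day for a periodic schedule between start and end."""
    current = datetime.combine(day, start)
    last = datetime.combine(day, end)
    times = []
    # The end time is treated as inclusive here, which is an assumption.
    while current <= last:
        times.append(current)
        current += every
    return times


schedule = run_times(date(2024, 3, 5), time(9, 0), time(17, 0), timedelta(hours=1))
print([t.strftime("%H:%M") for t in schedule])
```

Under these assumptions the library would run nine times that day, on the hour from 09:00 to 17:00.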

Alert Settings

Action

Setting: Description
Send an email: Select this to send an email.
Generate a CSV report: Select this to generate a report.
Attach the CSV report to the email: Whether or not to attach the CSV report to the email.
Save Report: Save the report locally.
Location Type: The type of storage used to save the report:
- File System
- SharePoint
- Office 365
- Azure Blob Storage
- Azure File Storage
Location: The location in which to save the report.

Email

From: The email address to send the email from.
To, Cc, Bcc: The email address(es) to send the email to. Multiple email addresses can be specified in each of the “To”, “Cc” and “Bcc” fields by separating them with semicolons.
Subject: The email subject. You can use the following templates:
- %LIBRARYNAME%: replaced by the name of the library
- %STATUS%: replaced by “success” or “error”, depending on whether the job ran successfully or not
Message: The email message to send. You can use the following templates within the email message:
- %LIBRARYNAME%: replaced by the name of the library
- %STATUS%: replaced by “success” or “error”, depending on whether the job ran successfully or not
- %LOGFILEPATH%: replaced by the path of the log file for the library
- %ERRORMESSAGE%: replaced by any error messages that occurred during the library run
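
The recipient fields and the subject/message templates can be sketched as two small text operations: splitting on semicolons and substituting %NAME% tokens. The helper names `split_recipients` and `fill_template` are hypothetical, and the sample addresses, library name and log path are made-up values for illustration.

```python
def split_recipients(field):
    """Split a To/Cc/Bcc field on semicolons, ignoring blanks and extra spaces."""
    return [address.strip() for address in field.split(";") if address.strip()]


def fill_template(text, values):
    """Replace each %NAME% token in text with its value."""
    for name, value in values.items():
        text = text.replace(f"%{name}%", value)
    return text


values = {
    "LIBRARYNAME": "Invoices",              # sample value
    "STATUS": "error",
    "LOGFILEPATH": r"C:\logs\invoices.log",  # sample path, not a product default
    "ERRORMESSAGE": "Connection timed out",
}

print(split_recipients("ops@example.com; audit@example.com;"))
print(fill_template("[%STATUS%] %LIBRARYNAME%", values))
print(fill_template("See %LOGFILEPATH%: %ERRORMESSAGE%", values))
```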

Report

Show library audit summary in report: The library audit summary contains statistics about the current searchability status of the library as a whole, as well as individual statistics about each document type in the library.
Run Details Summary (OCR only)
Show run details summary in report: The run details summary lists all the documents that were processed in a particular run, including:
- Number of documents OCRed
- Number of documents that failed to OCR
Show details of individual documents that were processed: Include individual document details in the report (see below for the columns that can be included).
Limit: The maximum number of documents reported. This value needs to be set by the user.
Choose columns that will appear in the report: The available columns are:
- Document Path
- Searchability
- Document Type
- Number of pages
- Number of searchable pages
- Number of image pages
- Conversion status
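
As a rough picture of how the Limit and column choices shape the individual document details, the sketch below writes a CSV with only the chosen columns and at most `limit` rows. The actual layout of the product's report is not documented here, so this is an assumption-laden illustration; the function name `build_report` and the sample document records are hypothetical.

```python
import csv
import io


def build_report(documents, columns, limit):
    """Write the chosen columns for at most `limit` documents as CSV text."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any fields that were not chosen as columns.
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for document in documents[:limit]:
        writer.writerow(document)
    return buffer.getvalue()


# Sample records for illustration only.
documents = [
    {"Document Path": "/docs/a.pdf", "Document Type": "PDF", "Conversion status": "Converted"},
    {"Document Path": "/docs/b.tiff", "Document Type": "TIFF", "Conversion status": "Failed"},
]

print(build_report(documents, ["Document Path", "Conversion status"], limit=1))
```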

Trigger

Alert is triggered:
- Every time the library runs successfully.
- Every time the library fails to run.
- Every time there is a SharePoint or Azure connection error.
Advanced Settings: Independent of the above trigger settings, the alert can be scheduled to run daily, weekly (on selected days), monthly or once.
Expires: Whether or not the trigger expires.
Expiry: The expiry date of the trigger. The alert task will not run after this date.