Compare PDF text using JavaScript

Text comparison visually compares the text on the pages of two documents. It’s helpful for comparing different versions of the same document or different documents with similar content, and it’s particularly useful for spotting edits quickly. The comparison is done on a per-page basis, and the differences are highlighted in the user interface (UI).

Information

Comparing documents and text is available when using the Web SDK in standalone operational mode.

Text comparison is possible in Nutrient Web SDK with the corresponding license component. Contact Sales if you’re interested in this functionality.

To process two documents for comparison, pass them to the loadTextComparison method as part of its configuration object:

PSPDFKit.loadTextComparison({
  ...defaultConfiguration,
  documentA: "text-comparison/static/documentA.pdf",
  documentB: "text-comparison/static/documentB.pdf"
});

In the configuration object above, set the following properties for the comparison:

  • documentA — The path to the first document to compare.

  • documentB — The path to the second document to compare.

Default UI

The default UI consists of the following components:

  • Primary toolbar, which contains the main actions, like showing the comparison sidebar, text comparison navigation, scroll-lock, and any other primary toolbar item from the allowed list.

  • Secondary toolbars, which contain the page navigation, pan mode, zooming, and any other secondary toolbar item from the allowed list.

  • Sidebar, which contains the text comparison navigation and the list of pages with differences.

Default UI of Text Comparison.

Customizing the UI

Every part of the UI is customizable and can be hidden or shown based on a user’s requirements using the options from PSPDFKit.textComparisonDefaultToolbarItems and PSPDFKit.textComparisonInnerToolbarItems.

To customize the primary toolbar, use the toolbarItems configuration option and pass the items you want to show in the toolbar. The toolbar items are defined in PSPDFKit.textComparisonDefaultToolbarItems. The secondary inner toolbar items are defined in PSPDFKit.textComparisonInnerToolbarItems and can be customized using the innerToolbarItems configuration option:

PSPDFKit.loadTextComparison({
  ...restOfConfigurations,
  toolbarItems: [
    { type: "prev-change" },
    { type: "next-change" },
    { type: "comparison-changes" },
    { type: "scroll-lock" }
  ]
});
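If you'd rather start from the defaults and hide a few items than list every item by hand, a small filter can do that. The sketch below is an illustration only: withoutToolbarItems is a hypothetical helper, the exampleItems list is a stand-in, and it assumes PSPDFKit.textComparisonDefaultToolbarItems is an array of objects with a type property.

```javascript
// Hypothetical helper (not part of the SDK): returns a copy of `items`
// without the entries whose `type` is listed in `typesToHide`.
function withoutToolbarItems(items, typesToHide) {
  const hidden = new Set(typesToHide);
  return items.filter((item) => !hidden.has(item.type));
}

// Stand-in list for illustration. In an application you would pass
// PSPDFKit.textComparisonDefaultToolbarItems here (assumed to be an array).
const exampleItems = [
  { type: "prev-change" },
  { type: "next-change" },
  { type: "comparison-changes" },
  { type: "scroll-lock" }
];

// Keep everything except the scroll-lock toggle.
const toolbarItems = withoutToolbarItems(exampleItems, ["scroll-lock"]);
```

The resulting array can then be passed as the toolbarItems option of loadTextComparison. The helper doesn't mutate the list it's given.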

Customizing comparison highlights

The default colors chosen by Nutrient Web SDK show a good contrast level when overlaid. However, it’s possible to choose which colors will be used to highlight the differences between the two documents. This can be done by setting the comparisonSidebarConfig.diffColors configuration option in the loadTextComparison method. The diffColors option accepts a DiffColors object with the following properties:

PSPDFKit.loadTextComparison({
  ...restOfConfigurations,
  comparisonSidebarConfig: {
    diffColors: {
      deletionColor: new PSPDFKit.Color({ r: 255, g: 218, b: 185 }),
      insertionColor: new PSPDFKit.Color({ r: 200, g: 255, b: 200 })
    }
  }
});
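Design systems often specify colors as hex strings, while the example above uses r, g, and b components. The following sketch converts one to the other; hexToRgb is a hypothetical helper written for this guide, not an SDK function, and it only handles six-digit hex values.

```javascript
// Hypothetical helper (not part of the SDK): converts "#RRGGBB" or "RRGGBB"
// into the { r, g, b } shape used by the color examples above.
function hexToRgb(hex) {
  const match = /^#?([0-9a-f]{6})$/i.exec(String(hex).trim());
  if (!match) {
    throw new TypeError(`Expected a 6-digit hex color, got "${hex}"`);
  }
  const value = parseInt(match[1], 16);
  return {
    r: (value >> 16) & 0xff,
    g: (value >> 8) & 0xff,
    b: value & 0xff
  };
}

// The same two colors as in the example above, written as hex strings.
const diffColorComponents = {
  deletionColor: hexToRgb("#FFDAB9"),
  insertionColor: hexToRgb("#C8FFC8")
};
```

Each converted value can be wrapped as new PSPDFKit.Color(hexToRgb("#FFDAB9")) before it's passed to diffColors.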

Programmatic text comparison

Text Comparison can also be used programmatically to compare the text of pages of different documents without loading the UI. To perform a text comparison operation, provide two documents and a set of options. The options are used to configure the comparison operation.

Describing your documents

The PSPDFKit.DocumentDescriptor class is used to provide all the necessary details about your documents for comparison:

  • filePath — Path to the document or an ArrayBuffer.

  • password — Optional password if the document is encrypted.

  • pageIndexes — An array of page indexes, or an array of page ranges where each range is a [min, max] array. If omitted, all pages will be staged for comparison.

const originalDocument = new PSPDFKit.DocumentDescriptor({
  filePath: "document-comparison/static/documentA.pdf",
  pageIndexes: [0]
});

const changedDocument = new PSPDFKit.DocumentDescriptor({
  filePath: "document-comparison/static/documentB.pdf",
  pageIndexes: [0]
});

Defining the comparison operation

The PSPDFKit.ComparisonOperation class outlines the comparison type and optional settings:

  • type — Type of comparison. The default is ComparisonOperationType.TEXT. Use PSPDFKit.ComparisonOperationType to check for available comparison types. As of now, only ComparisonOperationType.TEXT is supported.

  • options — The settings for the operation. Currently only numberOfContextWords, which specifies the number of context words for the comparison, is supported.

const textComparisonOperation = new PSPDFKit.ComparisonOperation(
  PSPDFKit.ComparisonOperationType.TEXT,
  {
    numberOfContextWords: 2
  }
);

Text comparison

The final step is to call the instance#compareDocuments method:

const comparisonResult = await instance.compareDocuments(
  { originalDocument, changedDocument },
  textComparisonOperation
);

console.log(comparisonResult);
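The three steps above can be bundled into one function so that the comparison is easy to reuse and to test. This is a sketch under assumptions: compareFirstPages is a hypothetical name, and the SDK namespace and the loaded instance are passed in as arguments rather than read from globals. It only uses the classes and the method already shown in this guide.

```javascript
// Hypothetical wrapper: compares the first page of two documents and
// returns the promise produced by instance.compareDocuments.
function compareFirstPages(PSPDFKit, instance, originalPath, changedPath, contextWords = 2) {
  // Describe both documents, restricting the comparison to page index 0.
  const originalDocument = new PSPDFKit.DocumentDescriptor({
    filePath: originalPath,
    pageIndexes: [0]
  });
  const changedDocument = new PSPDFKit.DocumentDescriptor({
    filePath: changedPath,
    pageIndexes: [0]
  });

  // Define a text comparison with the requested number of context words.
  const operation = new PSPDFKit.ComparisonOperation(
    PSPDFKit.ComparisonOperationType.TEXT,
    { numberOfContextWords: contextWords }
  );

  return instance.compareDocuments(
    { originalDocument, changedDocument },
    operation
  );
}
```

A caller would await the returned promise, ideally inside a try/catch block so that a failed comparison (for example, an unreadable file) can be reported to the user.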

Understanding the comparison result

The comparison provides a PSPDFKit.DocumentComparisonResult, which outlines:

  • type — The type of comparison (currently only ComparisonOperationType.TEXT is supported).

  • hunks — Hunks of detected text changes.

A hunk groups operations that describe how to transform the original text to the changed text. For instance, if a word is replaced, the hunk will include operations to delete the original word and insert the changed word. The structure of a hunk is:

  • originalRange — The range the hunk represents on the original page.

  • changedRange — The range the hunk represents on the changed page.

  • operations — The operations the hunk contains.

An operation represents a single insertion, a single deletion, or no change between the original and changed text. It’s composed of:

  • type — The operation type (“insert”, “delete”, or “equal”).

  • text — The text the operation is based upon.

  • originalTextBlocks — The rectangles the operation relates to in the original document.

  • changedTextBlocks — The rectangles the operation relates to in the changed document.

A text block relates text to a specific region in a document:

  • range — The range in the document page the text block relates to.

  • rects — The rectangles on the document page the text block refers to.

Example result

The result will be structured similarly to the following:

[{
  "documentComparisonResults": [{
    "changedPageIndex": 1,
    "comparisonResults": [{
      "hunks": [{
        "changedRange": {
          "length": 1,
          "position": 1
        },
        "operations": [{
          "changedTextBlocks": {
            "range": {
              "length": 1,
              "position": 0
            },
            "rects": [
              [
                341.1,
                265.2,
                0,
                0
              ]
            ]
          },
          "originalTextBlocks": {
            "range": {
              "length": 1,
              "position": 1
            },
            "rects": [
              [
                341.1,
                265.2,
                74.4,
                288.0
              ]
            ]
          },
          "text": "1",
          "type": "delete"
        }],
        "originalRange": {
          "length": 1,
          "position": 1
        }
      }],
      "type": "text"
    }],
    "originalPageIndex": 0
  }]
}]
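To turn a result like this into something a custom UI can display, you can walk the nested structure and collect the operations that aren't "equal". The sketch below assumes the result has the shape shown in the example above (a top-level array with documentComparisonResults, comparisonResults, hunks, and operations); summarizeTextChanges is a hypothetical helper, not an SDK function.

```javascript
// Hypothetical helper: counts operations by type and lists every
// insertion and deletion together with the page indexes it belongs to.
function summarizeTextChanges(result) {
  const counts = { insert: 0, delete: 0, equal: 0 };
  const changes = [];

  for (const entry of result ?? []) {
    for (const pagePair of entry.documentComparisonResults ?? []) {
      for (const comparison of pagePair.comparisonResults ?? []) {
        for (const hunk of comparison.hunks ?? []) {
          for (const operation of hunk.operations ?? []) {
            // Ignore operation types this sketch doesn't know about.
            if (!Object.prototype.hasOwnProperty.call(counts, operation.type)) {
              continue;
            }
            counts[operation.type] += 1;
            if (operation.type !== "equal") {
              changes.push({
                type: operation.type,
                text: operation.text,
                originalPageIndex: pagePair.originalPageIndex,
                changedPageIndex: pagePair.changedPageIndex
              });
            }
          }
        }
      }
    }
  }

  return { counts, changes };
}
```

The changes array can drive a list of differences in a sidebar, while counts gives a quick overview of how much was inserted and deleted.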

These steps allow you to pinpoint changes between documents with ease and to build your own custom user interface to display the results, as demonstrated in this sample project. Refer to our public API documentation to read more technical details about the Text Comparison API and learn how to use it in your implementation.