Indexing PDF Documents on iOS
PSPDFKit supports efficient and fast full-text search in PDF documents through PDFLibrary. This guide describes how to get started with PDFLibrary.
Getting Started
PDFLibrary relies on a data source to retrieve information about the documents that are to be indexed. The LibraryDataSource protocol specifies the methods the data source needs to implement. Generally, you won’t need to implement your own data source; you can instead use the provided LibraryFileSystemDataSource class. You use it as follows:

    guard let library = PSPDFKit.SDK.shared.library else {
        // FTS feature isn't enabled in your license.
        return
    }

    // Assume that you have a directory of PDF documents you want to index.
    let directoryURL = ...
    let dataSource = LibraryFileSystemDataSource(library: library, documentsDirectoryURL: directoryURL) { document, stopPointer in
        // If you want to skip a specific document, return `false` here.
        // If you want to stop the directory enumeration, set `stopPointer.pointee` to `true`.
        return true
    }
    library.dataSource = dataSource
    // Note that `PDFLibrary` holds the data source with a strong reference.

    // Begins the indexing operation. This method performs some initial work synchronously and then starts the indexing, which is asynchronous.
    // For large amounts of documents, even the initial work could be slow, which is why this should always be called on a background queue.
    DispatchQueue.global(qos: .background).async {
        library.updateIndex()
    }

    PSPDFLibrary *library = PSPDFKitGlobal.sharedInstance.library;
    if (!library) {
        // FTS feature isn't enabled in your license.
        return;
    }

    // Assume that you have a directory of PDF documents you want to index.
    NSURL *directoryURL = ...;
    PSPDFLibraryFileSystemDataSource *fileDataSource = [[PSPDFLibraryFileSystemDataSource alloc] initWithLibrary:library documentsDirectoryURL:directoryURL documentHandler:^(PSPDFDocument *document, BOOL *stop) {
        // If you want to skip a specific document, return `NO` here.
        // If you want to stop the directory enumeration, set `*stop` to `YES`.
        return YES;
    }];
    library.dataSource = fileDataSource;
    // Note that `PSPDFLibrary` holds the data source with a strong reference.

    // Begins the indexing operation. This method performs some initial work synchronously and then starts the indexing, which is asynchronous.
    // For large amounts of documents, even the initial work could be slow, which is why this should always be called on a background queue.
    // `dispatch_get_global_queue` takes a QoS class and a flags argument, which must be 0.
    dispatch_async(dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0), ^{
        [library updateIndexWithCompletionHandler:nil];
    });

Note that you should always set the library’s data source, and not just when you want to update the index. A good place to do this is your app delegate’s application(_:willFinishLaunchingWithOptions:).
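As a minimal sketch of that advice, the app delegate below sets the data source during launch. It assumes that the license key has already been set and that the PDFs live in the app's Documents directory; both are illustrative assumptions, not requirements. Apart from the UIKit and Foundation calls, it uses only the calls shown in the example above.

```swift
import UIKit
import PSPDFKit

class AppDelegate: UIResponder, UIApplicationDelegate {
    func application(_ application: UIApplication, willFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]? = nil) -> Bool {
        // Assumption: the license key has been set before this point.
        // `library` is `nil` if the FTS feature isn't enabled in your license.
        guard let library = PSPDFKit.SDK.shared.library else { return true }

        // Assumption: the documents to index are stored in the app's Documents directory.
        let directoryURL = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]

        // Set the data source on every launch, whether or not the index is updated right away.
        // The library holds the data source strongly, so no extra property is needed here.
        library.dataSource = LibraryFileSystemDataSource(library: library, documentsDirectoryURL: directoryURL, documentHandler: nil)
        return true
    }
}
```

Indexing itself can still be started later, from a background queue, with library.updateIndex().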
PDFLibrary posts notifications as the index status changes. The following notifications are available:
You’ll usually observe PSPDFLibraryDidFinishIndexingDocument to perform a search as more and more documents become available:

    // Assume that `libraryDidFinishIndexing(_:)` has been registered with `NotificationCenter.default`.
    func libraryDidFinishIndexing(_ notification: Notification) {
        guard let library = PSPDFKit.SDK.shared.library else {
            // FTS feature isn't enabled in your license.
            return
        }

        if !library.isIndexing {
            // All documents have been indexed.
        }

        library.documentUIDs(matching: "PSPDFKit", options: nil) { searchString, resultSet in
            for (UID, indexSet) in resultSet {
                print("Found the following matches in document \(UID): \(indexSet)")
            }
        }
    }

    // Assume that `libraryDidFinishIndexing:` has been registered with `NSNotificationCenter.defaultCenter`.
    - (void)libraryDidFinishIndexing:(NSNotification *)notification {
        PSPDFLibrary *library = PSPDFKitGlobal.sharedInstance.library;

        if (!library.isIndexing) {
            // All documents have been indexed.
        }

        [library documentUIDsMatchingString:@"PSPDFKit" options:nil completionHandler:^(NSString *searchString, NSDictionary *resultSet) {
            for (NSString *UID in resultSet) {
                NSIndexSet *indexSet = resultSet[UID];
                NSLog(@"Found the following matches in document %@: %@", UID, indexSet);
            }
        }];
    }

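The examples above assume the handler is already registered. A possible registration is sketched below. SearchCoordinator is a hypothetical class name, and the sketch assumes the notification is exposed to Swift as Notification.Name.PSPDFLibraryDidFinishIndexingDocument; check the name against the SDK headers for your version.

```swift
import Foundation
import PSPDFKit

// Hypothetical helper object that listens for indexing progress.
final class SearchCoordinator: NSObject {
    override init() {
        super.init()
        // Assumption: the notification name is imported into Swift under this spelling.
        NotificationCenter.default.addObserver(
            self,
            selector: #selector(libraryDidFinishIndexing(_:)),
            name: .PSPDFLibraryDidFinishIndexingDocument,
            object: nil
        )
    }

    // Selector-based observers must be visible to the Objective-C runtime.
    @objc private func libraryDidFinishIndexing(_ notification: Notification) {
        // Run or refresh the search here, as in the example above.
    }
}
```

Keep the coordinator alive for as long as search results should be refreshed, since a deallocated observer no longer receives notifications.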
You can decide to only query the library if all documents have been indexed by using isIndexing. You can also check the current status of individual documents by using indexStatus(forUID:withProgress:). The results are delivered to you in a Dictionary that maps the UID of each document as a String to an IndexSet. An index is set in the IndexSet of a given document if the search string occurs on that page.
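To illustrate the shape of that result, here is a small sketch that uses only Foundation and turns such a dictionary into readable lines. The helper describeMatches(in:) and the sample UIDs are hypothetical, and the sketch assumes page indexes are zero-based, so it adds 1 for display.

```swift
import Foundation

/// Turns a result set (document UID -> page indexes) into readable lines,
/// sorted by UID so the output order is stable.
func describeMatches(in resultSet: [String: IndexSet]) -> [String] {
    resultSet.keys.sorted().map { uid in
        // Assumption: page indexes are zero-based, so add 1 for display.
        let pages = resultSet[uid, default: IndexSet()].map { String($0 + 1) }
        return "\(uid): pages \(pages.joined(separator: ", "))"
    }
}

// Hypothetical sample data; real UIDs come from the indexed documents.
let sample: [String: IndexSet] = [
    "doc-b": IndexSet([0, 4]),
    "doc-a": IndexSet(integersIn: 2...3),
]

describeMatches(in: sample).forEach { print($0) }
// Prints:
// doc-a: pages 3, 4
// doc-b: pages 1, 5
```

A helper like this could be called from the completion handler of documentUIDs(matching:options:completionHandler:) in place of the print statement.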
Indexing Priority
You can also specify the priority of the background queue used for indexing. This can only be changed on creation of the library, and it defaults to PDFLibrary.IndexingPriority.low. If you require faster indexing, you can do one of two things:
- Create your own PDFLibrary instance as described above.
- Specify a .libraryIndexingPriority in the options passed into setLicenseKey(_:options:) to change the priority used by the default library.
SQLite FTS Version
The default PDFLibrary (available via PSPDFKit.SDK.shared) uses the highest version of SQLite’s full-text search available. The version of SQLite shipping with iOS 9 and 10 doesn’t have FTS5 enabled, so the library will only use FTS4 there. FTS5 will be automatically enabled if you use a custom version of SQLite with the correct compile flags. You can also specify which version of FTS to use by creating a new instance with the PDFLibrary(path:ftsVersion:tokenizer:) initializer.
File System Data Source with Encrypted or Locked Documents
If you need your locked documents to be indexed, you can set the file system data source’s documentProvider property to an object that implements the LibraryFileSystemDataSourceDocumentProvider protocol. You can then use it as follows:

    class LibraryDocumentProvider: NSObject, LibraryFileSystemDataSourceDocumentProvider {
        public func dataSource(_ dataSource: LibraryFileSystemDataSource, documentWithUID UID: String?, at fileURL: URL) -> Document? {
            // Create the document as required, ensuring it is decrypted and unlocked and ready to index.
            let document = ...
            if document.isLocked {
                // Unlock document as required.
            }
            return document
        }
    }

    let library = PSPDFKit.SDK.shared.library! // Replace this with your custom library, if you use one.
    // Assume that you have a directory of PDF documents you want to index.
    let directoryURL = ...
    let dataSource = LibraryFileSystemDataSource(library: library, documentsDirectoryURL: directoryURL, documentHandler: nil)
    self.libraryDocumentProvider = LibraryDocumentProvider()
    dataSource.documentProvider = libraryDocumentProvider
    library.dataSource = dataSource
    library.updateIndex()

    @interface LibraryDocumentProvider : NSObject <PSPDFLibraryFileSystemDataSourceDocumentProvider>
    @end

    @implementation LibraryDocumentProvider

    - (PSPDFDocument *)dataSource:(PSPDFLibraryFileSystemDataSource *)dataSource documentWithUID:(NSString *)UID atURL:(NSURL *)fileURL {
        // Create the document as required, ensuring it is decrypted and unlocked and ready to index.
        PSPDFDocument *document = ...;
        if (document.isLocked) {
            // Unlock document as required.
        }
        return document;
    }

    @end

    self.documentProvider = [LibraryDocumentProvider new];
    PSPDFLibraryFileSystemDataSource *dataSource = ...;
    dataSource.documentProvider = self.documentProvider;
    PSPDFLibrary *library = PSPDFKitGlobal.sharedInstance.library; // Replace this with your custom library, if you use one.
    library.dataSource = dataSource;
    [library updateIndexWithCompletionHandler:nil];

File System Data Source Performance
In most cases, LibraryFileSystemDataSource is fast enough, and it automatically detects changes to the file system when requested by a PDFLibrary. However, each call to PDFLibrary.updateIndex(completionHandler:) makes the data source traverse its documents directory to detect changes. If the method is called in rapid succession and the directory contains a large number of files, this can cause a slowdown. If your app is responsible for changes in the directory, you can manually specify these changes to the LibraryFileSystemDataSource object by enabling Explicit Mode (starting with PSPDFKit 6.2.2 for iOS). This can be done as follows:

    let dataSource = ...
    dataSource.isExplicitModeEnabled = true

    // Consider a case where you know that a document has been added to or changed in the data source's documents directory and already have the location.
    let addedDocumentURL = ...
    dataSource.didAddOrModifyDocument(at: addedDocumentURL)

    // Similarly, if a document has been removed:
    let removedDocumentURL = ...
    dataSource.didRemoveDocument(at: removedDocumentURL)

    PSPDFLibraryFileSystemDataSource *dataSource = ...;
    dataSource.explicitModeEnabled = YES;

    // Consider a case where you know that a document has been added to or changed in the data source's documents directory and already have the location.
    NSURL *addedDocumentURL = ...;
    [dataSource didAddOrModifyDocumentAtURL:addedDocumentURL];

    // Similarly, if a document has been removed:
    NSURL *removedDocumentURL = ...;
    [dataSource didRemoveDocumentAtURL:removedDocumentURL];

Note that using these methods on the data source doesn’t automatically add or remove documents from the library. The data source notes the changes made, and it then specifies them to the PDFLibrary object when requested during the next call to PDFLibrary.updateIndex(completionHandler:).
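Putting these pieces together, one possible flow for an app that imports a PDF itself is sketched below. The function importPDF(from:into:dataSource:library:) is a hypothetical helper, and it assumes explicit mode has already been enabled on the data source; the PSPDFKit calls are the ones shown above.

```swift
import Foundation
import PSPDFKit

/// Hypothetical helper: copies a PDF into the indexed directory, reports the
/// change to the data source, and then asks the library to update its index.
/// Assumes `dataSource.isExplicitModeEnabled` is already `true`.
func importPDF(from sourceURL: URL, into documentsDirectoryURL: URL, dataSource: LibraryFileSystemDataSource, library: PDFLibrary) throws {
    let destinationURL = documentsDirectoryURL.appendingPathComponent(sourceURL.lastPathComponent)
    try FileManager.default.copyItem(at: sourceURL, to: destinationURL)

    // Only records the change; the library isn't updated yet.
    dataSource.didAddOrModifyDocument(at: destinationURL)

    // The recorded change is picked up during the next index update.
    DispatchQueue.global(qos: .background).async {
        library.updateIndex()
    }
}
```

When several files are imported in a row, the update call can be moved out of the helper and issued once after the last file, since the data source keeps track of the changes until then.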
Explicit mode should only be enabled in cases where you need to call PDFLibrary.updateIndex(completionHandler:) multiple times in a short period of time, and where you also know the changes being made on the file system. In all other cases, let the data source handle the change detection, and keep isExplicitModeEnabled set to false.