Indexing PDF documents on Android

Nutrient supports fast and efficient full-text search in PDF documents through PdfLibrary. This guide describes how to get started with PdfLibrary alongside the LibraryDataSource API.

This guide covers the recommended approach using LibraryDataSource and LibraryFileSystemDataSource. For the legacy approach using direct document enqueueing, refer to the legacy approach section below.

The recommended way to use PdfLibrary is through the LibraryDataSource API, which provides automatic directory monitoring and more efficient indexing. This approach uses LibraryFileSystemDataSource to automatically index all PDFs in a specified directory:

// Create the library instance.
val libraryDbPath = File(context.filesDir, "pdf_library.db").absolutePath
val library = PdfLibrary.get(libraryDbPath)
// Set up the file system data source for a directory.
val documentsDirectory = File(context.filesDir, "documents")
val dataSource = LibraryFileSystemDataSource(library, documentsDirectory)
// Configure the library to use the data source.
library.dataSource = dataSource
// Index all documents in the directory.
library.updateIndexFromDataSource()

Check out our Catalog example to see it in action.

Advanced configuration

LibraryFileSystemDataSource provides several configuration options:

val dataSource = LibraryFileSystemDataSource(library, documentsDirectory) { documentSource ->
    // Optional document handler: return `true` to index, `false` to skip.
    // You can use this to filter documents based on custom criteria.
    // Comparing against `true` also indexes documents that have no title.
    documentSource.title?.contains("draft", ignoreCase = true) != true
}
// Configure file extension filtering (defaults to "pdf" only).
// At the time of release, only PDF files have been tested.
dataSource.allowedPathExtensions = setOf("pdf")
// Enable explicit mode for manual change notifications (advanced).
dataSource.isExplicitModeEnabled = false // Default: automatic directory monitoring
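The filtering rule passed to the document handler can also live in a plain function, which makes it easy to unit test without an index. The following is a minimal sketch, assuming the title can be missing (as the safe call in the handler above suggests). `shouldIndex` and `blockedWord` are hypothetical names used for illustration and aren't part of the Nutrient API:

```kotlin
// Hypothetical helper, not part of the Nutrient API: decides whether a
// document should be indexed based on its (possibly missing) title.
fun shouldIndex(title: String?, blockedWord: String = "draft"): Boolean {
    // `?.contains(...)` yields null for a missing title, so comparing the
    // result against `true` treats untitled documents as indexable.
    return title?.contains(blockedWord, ignoreCase = true) != true
}
```

With this in place, the handler body could simply return `shouldIndex(documentSource.title)`.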

Searching with LibraryDataSource

Once your PdfLibrary is set up with a LibraryDataSource, you can search across all indexed documents. To find out whether indexing is still in progress, call isIndexing(); to check the status of an individual document, call getIndexStatusForUID().
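If you want to wait for indexing to finish before running a search, one option is to poll the indexing state. The following is a minimal sketch of such a polling loop. `awaitIdle` and its parameters are hypothetical names, and the helper only relies on a Boolean check such as isIndexing(). It blocks the calling thread, so it's assumed to run off the main thread:

```kotlin
// Hypothetical helper, not part of the Nutrient API: polls `isBusy` until it
// returns false or the timeout elapses. Returns true if the work finished in
// time, false on timeout. Blocks the calling thread.
fun awaitIdle(timeoutMs: Long, pollIntervalMs: Long = 100, isBusy: () -> Boolean): Boolean {
    val deadline = System.nanoTime() + timeoutMs * 1_000_000
    while (isBusy()) {
        if (System.nanoTime() >= deadline) return false
        Thread.sleep(pollIntervalMs)
    }
    return true
}
```

A call such as `awaitIdle(timeoutMs = 30_000) { library.isIndexing() }` would then return once indexing has stopped or 30 seconds have passed, whichever comes first.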

Search results are delivered through the QueryResultListener callbacks. Basic results (a mapping from each document UID to its matching page numbers) arrive in onSearchCompleted, while text preview snippets arrive in onSearchPreviewsGenerated when generateTextPreviews(true) is set on the query options.

Here’s an example search with LibraryDataSource:

// Set up search result options.
val options = QueryOptions.Builder()
    .generateTextPreviews(true)
    .previewRange(20, 120)
    .build()
// Run the search. The search will run on a background thread and the callbacks will be called
// from the background thread as well.
library.search("looking for this text", options, object : QueryResultListener {
    override fun onSearchCompleted(searchString: String, results: Map<String, Set<Int>>) {
        // Results contain UID → set of pages mapping.
    }
    override fun onSearchPreviewsGenerated(searchString: String, previews: Map<String, Set<QueryPreviewResult>>) {
        // Previews contain UID → set of `QueryPreviewResult` mappings.
    }
})
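The UID-to-pages map received in onSearchCompleted usually needs some shaping before it can be shown in a list. The following is a minimal sketch of one way to do that, assuming you want documents with the most matching pages first. `rankResults` is a hypothetical helper and isn't part of the Nutrient API:

```kotlin
// Hypothetical helper, not part of the Nutrient API: flattens the
// UID -> pages map into (uid, page) pairs, best-matching documents first.
fun rankResults(results: Map<String, Set<Int>>): List<Pair<String, Int>> =
    results.entries
        // Documents with more matching pages come first. Ties are broken by
        // UID so the order is deterministic.
        .sortedWith(
            compareByDescending<Map.Entry<String, Set<Int>>> { it.value.size }
                .thenBy { it.key }
        )
        // Within each document, list the matching pages in ascending order.
        .flatMap { (uid, pages) -> pages.sorted().map { page -> uid to page } }
```

Because the callbacks run on a background thread, a function like this can do its work there, and only the finished list needs to be handed over to the UI thread.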

Legacy approach

The following approach using enqueueDocuments() is legacy and isn’t recommended for new implementations. Use the LibraryDataSource approach above instead.

The legacy approach requires manually managing individual PdfDocument instances:

// Assume that you have two valid `PdfDocument`s.
val doc1: PdfDocument = ...
val doc2: PdfDocument = ...
// The library will be saved in your application's files directory.
val library = PdfLibrary.get(File(context.filesDir, "library.db").absolutePath)
library.enqueueDocuments(listOf(doc1, doc2))

Resource management

When using LibraryFileSystemDataSource, clean up its resources once you no longer need it:

// In your activity's `onDestroy()` or when finished.
dataSource.cleanup()