PDF Index Tool - Extract text from PDFs and build a searchable index

class ml_research_tools.doc.pdf_index.PDFDocument(pdf_path, file_mtime, file_size)

Bases: object

Represents a PDF document in the index.

Parameters:

pdf_path (str)

file_mtime (float)

file_size (int)
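
The three fields above map naturally onto what `os.stat` reports for a file. The following is a minimal sketch of how such a record might be populated; `PDFDocumentSketch` and `describe_file` are hypothetical stand-ins written for illustration, not the library's own class or helper.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class PDFDocumentSketch:
    """Hypothetical stand-in mirroring the three documented fields."""
    pdf_path: str
    file_mtime: float
    file_size: int


def describe_file(path: str) -> PDFDocumentSketch:
    # os.stat reports the modification time (float seconds) and the size in bytes.
    st = os.stat(path)
    return PDFDocumentSketch(pdf_path=path, file_mtime=st.st_mtime, file_size=st.st_size)


# Usage: write a small placeholder file and describe it.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as fh:
    fh.write(b"%PDF-1.4\n")
    demo_path = fh.name

demo_doc = describe_file(demo_path)
os.remove(demo_path)
```

Storing the modification time alongside the path is what lets an indexer later decide whether a file needs to be processed again.
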
class ml_research_tools.doc.pdf_index.SearchResult(pdf_path, page_num, snippet, rank)

Bases: object

Represents a search result.

Parameters:

pdf_path (str)

page_num (int)

snippet (str)

rank (float)
class ml_research_tools.doc.pdf_index.PDFIndexDB(index_path)

Bases: object

Handles SQLite FTS5 database operations.

Parameters:

index_path (Path)

connect()

Connect to database and initialize schema.

close()

Close database connection.

document_exists(pdf_path, file_mtime)

Check whether the document is already indexed with the same mtime.

Return type:

bool

Parameters:

pdf_path (str)

file_mtime (float)

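
A check of this kind can be sketched as a single lookup on path and modification time. The `documents` table, its columns, and the function name below are assumptions made for illustration; the actual schema used by `PDFIndexDB` is not shown in this reference.

```python
import sqlite3


def document_exists_sketch(conn: sqlite3.Connection, pdf_path: str, file_mtime: float) -> bool:
    # A row counts as current only if both the path and the stored mtime match.
    row = conn.execute(
        "SELECT 1 FROM documents WHERE pdf_path = ? AND file_mtime = ?",
        (pdf_path, file_mtime),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
# Hypothetical schema, not taken from the library.
conn.execute(
    "CREATE TABLE documents (pdf_path TEXT PRIMARY KEY, file_mtime REAL, file_size INTEGER)"
)
conn.execute("INSERT INTO documents VALUES (?, ?, ?)", ("paper.pdf", 1700000000.0, 1024))

unchanged = document_exists_sketch(conn, "paper.pdf", 1700000000.0)
modified = document_exists_sketch(conn, "paper.pdf", 1700000500.0)
conn.close()
```

A file whose mtime differs from the stored value is treated as not indexed, so it would be picked up for re-indexing.
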
get_indexed_count()

Get the total number of indexed documents.

Return type:

int

get_page_count()

Get the total number of indexed pages.

Return type:

int

remove_document(pdf_path)

Remove a document and its content from the index.

Parameters:

pdf_path (str)

add_document(doc, pages_text)

Add a document and its pages to the index.

Parameters:

doc (PDFDocument)

pages_text

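
Adding a document together with its pages is usually done in one transaction so the index never holds a document without its content. The sketch below assumes plain `documents` and `pages` tables and treats `pages_text` as a list of strings, one per page; both are illustrative assumptions, since the real schema and the exact type of `pages_text` are not given here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema, not taken from the library.
conn.execute(
    "CREATE TABLE documents (pdf_path TEXT PRIMARY KEY, file_mtime REAL, file_size INTEGER)"
)
conn.execute("CREATE TABLE pages (pdf_path TEXT, page_num INTEGER, content TEXT)")


def add_document_sketch(conn, pdf_path, file_mtime, file_size, pages_text):
    # "with conn" commits on success and rolls back if any statement raises.
    with conn:
        # Drop stale pages first so re-indexing a changed file does not duplicate rows.
        conn.execute("DELETE FROM pages WHERE pdf_path = ?", (pdf_path,))
        conn.execute(
            "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
            (pdf_path, file_mtime, file_size),
        )
        conn.executemany(
            "INSERT INTO pages VALUES (?, ?, ?)",
            [(pdf_path, n, text) for n, text in enumerate(pages_text, start=1)],
        )


add_document_sketch(conn, "paper.pdf", 1700000000.0, 2048, ["page one", "page two"])
# Re-adding the same path with a newer mtime replaces the earlier rows.
add_document_sketch(
    conn, "paper.pdf", 1700000500.0, 2048, ["page one", "page two", "page three"]
)

doc_count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
page_count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
conn.close()
```

The two `COUNT(*)` queries at the end correspond to the kind of totals that `get_indexed_count()` and `get_page_count()` report.
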
search(query, limit)

Search the index with an FTS5 query.

Return type:

List[SearchResult]

Parameters:

query

limit
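
To illustrate what an FTS5 query returning a path, page number, snippet, and rank can look like, here is a minimal sketch using Python's built-in `sqlite3` module. The `pages` virtual table, its columns, and `search_sketch` are assumptions for illustration only, and the example requires an SQLite build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one FTS5 row per page; pdf_path and page_num are stored
# but excluded from the full-text index.
conn.execute(
    "CREATE VIRTUAL TABLE pages USING fts5(pdf_path UNINDEXED, page_num UNINDEXED, content)"
)
conn.executemany(
    "INSERT INTO pages (pdf_path, page_num, content) VALUES (?, ?, ?)",
    [
        ("a.pdf", 1, "Attention mechanisms in transformer models"),
        ("a.pdf", 2, "Training details and optimizer settings"),
        ("b.pdf", 1, "Convolutional networks for image classification"),
    ],
)


def search_sketch(conn, query, limit=10):
    # snippet(table, column_index, open, close, ellipsis, max_tokens); column 2 is content.
    # Sorting by FTS5's built-in rank column puts the best matches first.
    return conn.execute(
        "SELECT pdf_path, page_num, snippet(pages, 2, '[', ']', '...', 8), rank "
        "FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()


hits = search_sketch(conn, "transformer", limit=5)
conn.close()
```

Each returned tuple carries the same four pieces of information as the `SearchResult` fields documented above.
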

Search using a regex pattern (slower, scans all content).

Return type:

List[SearchResult]

Parameters:
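
A regex search cannot use the full-text index, so it has to read every stored page and test it in Python, which is why it is slower. The sketch below shows that scan under assumed names: the `pages` table, the 20-character context window, and `regex_search_sketch` are all hypothetical.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical plain table holding one row per page.
conn.execute("CREATE TABLE pages (pdf_path TEXT, page_num INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("a.pdf", 1, "Learning rate 3e-4 with warmup"),
        ("a.pdf", 2, "No numbers here"),
        ("b.pdf", 1, "Batch size 256, lr 1e-3"),
    ],
)


def regex_search_sketch(conn, pattern, limit=10):
    compiled = re.compile(pattern, re.IGNORECASE)
    results = []
    rows = conn.execute(
        "SELECT pdf_path, page_num, content FROM pages ORDER BY pdf_path, page_num"
    )
    # Full scan: every page is fetched and matched, with no index to narrow it down.
    for pdf_path, page_num, content in rows:
        match = compiled.search(content)
        if match:
            # Keep up to 20 characters of context on each side of the match.
            start = max(match.start() - 20, 0)
            end = min(match.end() + 20, len(content))
            results.append((pdf_path, page_num, content[start:end]))
            if len(results) >= limit:
                break
    return results


regex_hits = regex_search_sketch(conn, r"\d+e-\d+")
conn.close()
```

The cost grows linearly with the number of indexed pages, so a `limit` that stops the scan early helps on large collections.
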
class ml_research_tools.doc.pdf_index.PDFIndexTool(services)

Bases: BaseTool

Initialize the PDF index tool.

name: str = 'pdf-index'

description: str = 'Build searchable index of PDF documents'

__init__(services)

Initialize the PDF index tool.

Return type:

None

classmethod add_arguments(parser)

Add tool-specific arguments to the parser.

Return type:

None

Parameters:

parser (ArgumentParser)

execute(config, args)

Execute the PDF indexing tool.

Return type:

int

Parameters:

config

args