ML Research Tools 1.1

PDF Index Tool: extract text from PDFs and build a searchable index.

class ml_research_tools.doc.pdf_index.PDFDocument(pdf_path, file_mtime, file_size)

Bases: object

Represents a PDF document in the index.

Parameters:

pdf_path (str)
file_mtime (float)
file_size (int)

pdf_path: str

file_mtime: float

file_size: int
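
As an illustration of how the three fields might be populated, the sketch below builds an equivalent record from `os.stat`. The `PDFDocumentSketch` dataclass and the `from_path` helper are hypothetical stand-ins, not part of the library; only the field names and types come from the signature above.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class PDFDocumentSketch:
    """Hypothetical stand-in mirroring the documented PDFDocument fields."""

    pdf_path: str
    file_mtime: float
    file_size: int


def from_path(path: str) -> PDFDocumentSketch:
    """Assumed helper: fill the fields from the file's stat result."""
    st = os.stat(path)
    return PDFDocumentSketch(pdf_path=path, file_mtime=st.st_mtime, file_size=st.st_size)


# Demonstrate on a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as fh:
    fh.write(b"%PDF-1.4\n")
    tmp_path = fh.name

doc = from_path(tmp_path)
os.remove(tmp_path)
```

Storing the modification time and size alongside the path is what makes a later "has this file changed?" check possible without re-reading the PDF.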

class ml_research_tools.doc.pdf_index.SearchResult(pdf_path, page_num, snippet, rank)

Bases: object

Represents a search result.

Parameters:

pdf_path (str)
page_num (int)
snippet (str)
rank (float)

pdf_path: str

page_num: int

snippet: str

rank: float

class ml_research_tools.doc.pdf_index.PDFIndexDB(index_path)

Bases: object

Handles SQLite FTS5 database operations.

Parameters:

index_path (Path)

connect()

Connect to the database and initialize the schema.

close()

Close the database connection.

document_exists(pdf_path, file_mtime)

Check whether the document is already indexed with the same modification time (mtime).

Return type:

bool

Parameters:

pdf_path (str)
file_mtime (float)

get_indexed_count()

Get the total number of indexed documents.

Return type:

int

get_page_count()

Get the total number of indexed pages.

Return type:

int

remove_document(pdf_path)

Remove a document and its content from the index.

Parameters:

pdf_path (str)

add_document(doc, pages_text)

Add a document and its pages to the index.

Parameters:

doc (PDFDocument)
pages_text (List[Tuple[int, str]])

search(query, limit)

Search the index with an FTS5 query and return at most limit results.

Return type:

List[SearchResult]

Parameters:

query (str)
limit (int)
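
To make the FTS5 search concrete, here is a minimal sketch using Python's `sqlite3` module (it needs an SQLite build with FTS5 enabled). The `pages` table layout, the `Hit` dataclass, and the snippet markers are assumptions for illustration; only the result fields (`pdf_path`, `page_num`, `snippet`, `rank`) and the `(page number, text)` input shape come from the reference above.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    """Hypothetical stand-in with the same fields as SearchResult."""

    pdf_path: str
    page_num: int
    snippet: str
    rank: float


conn = sqlite3.connect(":memory:")
# Assumed layout: one FTS5 row per page; only `content` is tokenized.
conn.execute(
    "CREATE VIRTUAL TABLE pages USING fts5(pdf_path UNINDEXED, page_num UNINDEXED, content)"
)

# Same shape as the pages_text argument: (page number, extracted text).
pages_text: List[Tuple[int, str]] = [
    (1, "We study attention mechanisms in transformers."),
    (2, "Gradient clipping stabilizes training of deep networks."),
]
conn.executemany(
    "INSERT INTO pages (pdf_path, page_num, content) VALUES (?, ?, ?)",
    [("papers/a.pdf", num, text) for num, text in pages_text],
)


def search(conn: sqlite3.Connection, query: str, limit: int) -> List[Hit]:
    """Run an FTS5 MATCH query, best matches first (FTS5 rank is lower-is-better)."""
    rows = conn.execute(
        "SELECT pdf_path, page_num, snippet(pages, 2, '[', ']', '...', 8), rank "
        "FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
    return [Hit(p, int(n), s, float(r)) for p, n, s, r in rows]


hits = search(conn, "gradient AND clipping", 10)
```

The query string uses FTS5 syntax, so operators such as `AND`, `OR`, `NOT`, quoted phrases, and prefix matches like `grad*` are interpreted by SQLite rather than treated as literal text.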

regex_search(pattern, limit)

Search using a regular expression pattern. This is slower than search() because it scans all indexed content instead of using the full-text index.

Return type:

List[SearchResult]

Parameters:

pattern (str)
limit (int)
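
A full scan of this kind might look like the sketch below, which applies Python's `re` module to every stored page. The plain `pages` table and the tuple return shape are assumptions for this example, not the library's actual storage or return type.

```python
import re
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
# Hypothetical plain table standing in for the indexed page text.
conn.execute("CREATE TABLE pages (pdf_path TEXT, page_num INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("papers/a.pdf", 1, "Learning rate 3e-4 worked best."),
        ("papers/a.pdf", 2, "No numeric settings on this page."),
        ("papers/b.pdf", 1, "We used lr = 1e-3 with warmup."),
    ],
)


def regex_search(conn: sqlite3.Connection, pattern: str, limit: int) -> List[Tuple[str, int, str]]:
    """Scan every page and return (pdf_path, page_num, matched_text) tuples."""
    compiled = re.compile(pattern)
    results: List[Tuple[str, int, str]] = []
    cursor = conn.execute(
        "SELECT pdf_path, page_num, content FROM pages ORDER BY pdf_path, page_num"
    )
    for pdf_path, page_num, content in cursor:
        if len(results) >= limit:
            break  # stop scanning once enough matches are collected
        match = compiled.search(content)
        if match:
            results.append((pdf_path, page_num, match.group(0)))
    return results


matches = regex_search(conn, r"\d+e-\d+", 10)
```

Because every row is read and tested, the cost grows with the total amount of indexed text, which is why the regex path suits patterns that a token-based FTS5 query cannot express.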

class ml_research_tools.doc.pdf_index.PDFIndexTool(services)

Bases: BaseTool

Initialize the PDF index tool.

name: str = 'pdf-index'

description: str = 'Build searchable index of PDF documents'

__init__(services)

Initialize the PDF index tool.

Return type:

None

classmethod add_arguments(parser)

Add tool-specific arguments to the parser.

Return type:

None

Parameters:

parser (ArgumentParser)

execute(config, args)

Execute the PDF indexing tool.

Return type:

int

Parameters:

config (Config)
args (Namespace)
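
The `add_arguments` / `execute` pair follows a common command-line tool shape, sketched below with `argparse`. Everything here is hypothetical: `SketchTool`, its `directory` and `--limit` arguments, the dict used for `config`, and the reading of the `int` return value as a process exit code are assumptions, not the actual `PDFIndexTool` interface.

```python
import argparse


class SketchTool:
    """Hypothetical tool following the add_arguments/execute shape above."""

    name = "pdf-index-sketch"

    @classmethod
    def add_arguments(cls, parser: argparse.ArgumentParser) -> None:
        # Both arguments are invented for illustration.
        parser.add_argument("directory", help="folder to scan for PDFs")
        parser.add_argument("--limit", type=int, default=10, help="maximum results")

    def execute(self, config: dict, args: argparse.Namespace) -> int:
        # Assumed convention: 0 signals success, non-zero signals failure.
        if args.limit <= 0:
            return 1
        return 0


parser = argparse.ArgumentParser(prog=SketchTool.name)
SketchTool.add_arguments(parser)
args = parser.parse_args(["./papers", "--limit", "5"])
exit_code = SketchTool().execute(config={}, args=args)
```

Keeping argument registration in a classmethod lets a host program build its parser before any tool instance exists, and then hand the parsed `Namespace` to `execute`.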
