PDF Index Tool - Extract text from PDFs and build searchable index
-
class ml_research_tools.doc.pdf_index.PDFDocument(pdf_path, file_mtime, file_size)
Bases: object
Represents a PDF document in the index.
- Parameters:
pdf_path (str)
file_mtime (float)
file_size (int)
-
pdf_path: str
-
file_mtime: float
-
file_size: int
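As a rough illustration of how these three fields might be populated, the sketch below reads the modification time and size from the filesystem. It does not import the real class: PDFDocumentSketch is a stand-in dataclass that only mirrors the documented fields, and describe_pdf is a hypothetical helper that is not part of this module.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class PDFDocumentSketch:
    """Stand-in mirroring the three documented fields of PDFDocument."""
    pdf_path: str
    file_mtime: float
    file_size: int


def describe_pdf(path: str) -> PDFDocumentSketch:
    """Hypothetical helper: fill the record from os.stat()."""
    st = os.stat(path)
    return PDFDocumentSketch(
        pdf_path=path,
        file_mtime=st.st_mtime,  # seconds since the epoch, as a float
        file_size=st.st_size,    # size in bytes
    )


# Demo on a temporary file standing in for a real PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as fh:
    fh.write(b"%PDF-1.4\n")
    demo_path = fh.name

demo_doc = describe_pdf(demo_path)
os.remove(demo_path)
```

Storing the mtime alongside the path is what makes a later "has this file changed?" check cheap, since no PDF content has to be re-read.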
-
class ml_research_tools.doc.pdf_index.SearchResult(pdf_path, page_num, snippet, rank)
Bases: object
Represents a search result.
- Parameters:
pdf_path (str)
page_num (int)
snippet (str)
rank (float)
-
pdf_path: str
-
page_num: int
-
snippet: str
-
rank: float
-
class ml_research_tools.doc.pdf_index.PDFIndexDB(index_path)
Bases: object
Handles SQLite FTS5 database operations.
- Parameters:
index_path (Path)
-
connect()
Connect to database and initialize schema.
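The actual schema is not documented here. As a minimal sketch of what "connect and initialize schema" could look like for an FTS5-backed index, the example below creates one ordinary table for document metadata and one FTS5 virtual table for page text. The table names (documents, pages), the column layout, and the function name connect_sketch are assumptions for illustration only.

```python
import sqlite3


def connect_sketch(index_path: str) -> sqlite3.Connection:
    """Open the database and create the (assumed) schema if it is missing."""
    conn = sqlite3.connect(index_path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS documents (
            pdf_path   TEXT PRIMARY KEY,
            file_mtime REAL NOT NULL,
            file_size  INTEGER NOT NULL
        );
        -- UNINDEXED columns are stored but not tokenized for full-text search.
        CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(
            pdf_path UNINDEXED,
            page_num UNINDEXED,
            content
        );
        """
    )
    return conn


conn = connect_sketch(":memory:")
tables = {
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
}
conn.close()
```

Using IF NOT EXISTS keeps the call idempotent, so connecting to an existing index does not disturb it. This sketch requires an SQLite build with the FTS5 extension enabled.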
-
close()
Close database connection.
-
document_exists(pdf_path, file_mtime)
Check whether the document is already indexed with the same modification time (mtime).
- Return type:
bool
- Parameters:
pdf_path (str)
file_mtime (float)
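A check of this kind can be sketched as a single lookup on path and mtime. The documents table and the function name document_exists_sketch below are assumptions, not the module's actual schema or code.

```python
import sqlite3


def document_exists_sketch(conn: sqlite3.Connection, pdf_path: str, file_mtime: float) -> bool:
    """Return True only if the path is indexed with exactly this mtime."""
    row = conn.execute(
        "SELECT 1 FROM documents WHERE pdf_path = ? AND file_mtime = ?",
        (pdf_path, file_mtime),
    ).fetchone()
    return row is not None


# Assumed metadata table, populated with one indexed file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (pdf_path TEXT PRIMARY KEY, file_mtime REAL, file_size INTEGER)"
)
conn.execute(
    "INSERT INTO documents VALUES (?, ?, ?)", ("paper.pdf", 1700000000.5, 1024)
)

unchanged = document_exists_sketch(conn, "paper.pdf", 1700000000.5)
modified = document_exists_sketch(conn, "paper.pdf", 1700000123.0)
conn.close()
```

A caller could skip files for which the check returns True and re-index the rest, which is presumably why the mtime is part of the lookup.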
-
get_indexed_count()
Get total number of indexed documents.
- Return type:
int
-
get_page_count()
Get total number of indexed pages.
- Return type:
int
-
remove_document(pdf_path)
Remove document and its content from index.
- Parameters:
pdf_path (str)
-
add_document(doc, pages_text)
Add document and its pages to index.
- Parameters:
doc
pages_text
-
search(query, limit)
Search index with FTS5 query.
- Return type:
List[SearchResult]
- Parameters:
query
limit
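To illustrate what an FTS5-backed search returning these four fields might look like, the sketch below runs a MATCH query, asks SQLite's snippet() function for a highlighted excerpt, and orders by the built-in rank column (lower values are better matches). The pages table layout, SearchResultSketch, and search_sketch are assumptions made for this example, and the real method may differ.

```python
import sqlite3
from dataclasses import dataclass
from typing import List


@dataclass
class SearchResultSketch:
    """Stand-in mirroring the documented SearchResult fields."""
    pdf_path: str
    page_num: int
    snippet: str
    rank: float


def search_sketch(conn: sqlite3.Connection, query: str, limit: int) -> List[SearchResultSketch]:
    """Run an FTS5 MATCH query and return the best-ranked pages first."""
    rows = conn.execute(
        """
        SELECT pdf_path, page_num,
               snippet(pages, 2, '[', ']', '...', 8),  -- column 2 is `content`
               rank
        FROM pages
        WHERE pages MATCH ?
        ORDER BY rank
        LIMIT ?
        """,
        (query, limit),
    ).fetchall()
    return [SearchResultSketch(p, int(n), s, float(r)) for p, n, s, r in rows]


# Assumed page table with a few sample pages.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE pages USING fts5(pdf_path UNINDEXED, page_num UNINDEXED, content)"
)
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("a.pdf", 1, "Introduction to convolutional networks"),
        ("a.pdf", 2, "Self attention layers replace recurrence"),
        ("b.pdf", 1, "Gradient descent with momentum"),
    ],
)
results = search_sketch(conn, "attention", 5)
conn.close()
```

Because the query string is passed through to FTS5, operators such as AND, OR, NEAR, and prefix matching with * follow FTS5's own query syntax.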
-
regex_search(pattern, limit)
Search using a regular-expression pattern. Slower than search() because it scans all indexed content.
- Return type:
List[SearchResult]
- Parameters:
pattern
limit
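The reason a regex search is slower is that a full-text index cannot answer arbitrary patterns, so every stored page has to be read and tested. A minimal sketch of that scan follows. It uses a plain pages table as a stand-in for the real storage, and the function name, the tuple return shape, and the 20-character context window are all assumptions.

```python
import re
import sqlite3
from typing import List, Tuple


def regex_search_sketch(
    conn: sqlite3.Connection, pattern: str, limit: int
) -> List[Tuple[str, int, str]]:
    """Scan every page and collect up to `limit` regex matches with context."""
    compiled = re.compile(pattern)
    hits: List[Tuple[str, int, str]] = []
    cursor = conn.execute(
        "SELECT pdf_path, page_num, content FROM pages ORDER BY pdf_path, page_num"
    )
    for pdf_path, page_num, content in cursor:
        match = compiled.search(content)
        if match is None:
            continue
        # Keep a little text on each side of the match as a snippet.
        start = max(match.start() - 20, 0)
        end = min(match.end() + 20, len(content))
        hits.append((pdf_path, page_num, content[start:end]))
        if len(hits) >= limit:
            break
    return hits


# Plain table standing in for the indexed page text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (pdf_path TEXT, page_num INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("a.pdf", 3, "We train with lr = 3e-4 for 10 epochs"),
        ("b.pdf", 1, "No hyperparameters are reported here"),
    ],
)
hits = regex_search_sketch(conn, r"lr\s*=\s*\d+e-\d+", 10)
conn.close()
```

The cost grows linearly with the number of indexed pages, so a pattern search like this is best reserved for queries that FTS5 syntax cannot express.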
-
class ml_research_tools.doc.pdf_index.PDFIndexTool(services)
Bases: BaseTool
Initialize the PDF index tool.
-
name: str = 'pdf-index'
-
description: str = 'Build searchable index of PDF documents'
-
__init__(services)
Initialize the PDF index tool.
- Return type:
None
-
classmethod add_arguments(parser)
Add tool-specific arguments to the parser.
- Return type:
None
- Parameters:
parser (ArgumentParser)
-
execute(config, args)
Execute the PDF indexing tool.
- Return type:
int
- Parameters:
config
args