Software Components
Reusable low-level components in the implementation layer
Software components are the core building blocks of the GAIK implementation layer. Each component encapsulates one well-defined capability — speech-to-text, document parsing, structured extraction, classification, or retrieval. They are designed to be used standalone or composed into custom pipelines, giving developers precise control over every step of a knowledge processing workflow.
See also: No-Code Assets for prompt templates and packaged skills.
Use software components when you need fine-grained control over each processing step, when a predefined module doesn't fit your requirements, or when integrating GenAI capabilities into an existing system.
Most chat-based components (Extractor, Classifier, TranscriptEnhancer, AnswerGenerator, Embedder) can run against OpenAI, Azure OpenAI, Anthropic, or Google with the same code — see Multi-Provider LLM Client for the gaik.software_components.llm package and the per-provider support matrix.
Transcriber
The Transcriber converts spoken audio or video recordings into written text. It uses OpenAI's Whisper model for accurate speech-to-text transcription and optionally applies a GPT-based post-processing step to clean up the raw output — fixing punctuation, removing filler words, and improving overall readability. It handles long recordings automatically through chunked processing, making it suitable for real-world audio captured in noisy or uncontrolled environments.
Key features:
- Whisper-based transcription with high accuracy across languages
- Optional GPT enhancement for clean, readable output
- Automatic chunking for recordings of any length
- Supports MP3, WAV, M4A, OGG, and video formats
Potential applications:
- Workplace incident and safety observation reporting
- Construction site and field service diaries
- Meeting and interview transcription
- Medical dictation and clinical notes
- Customer call recording analysis
- Lecture and training material capture
- Implementation code: software_components/transcriber
- Usage examples: examples/transcriber
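The chunked processing mentioned above can be pictured as splitting a recording into fixed-length windows before each one is sent for transcription. The sketch below is illustrative only: the function name, the 10-minute window, and the small overlap between windows are assumptions, not the Transcriber's actual settings.

```python
def chunk_spans(duration_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0) -> list[tuple[float, float]]:
    """Return (start, end) windows in seconds that cover the whole recording.

    Hypothetical helper: window length and overlap are assumed values.
    """
    duration_s = float(duration_s)
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be non-negative and smaller than chunk_s")
    if duration_s <= 0:
        return []
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        # Step back slightly so words at a window boundary appear in both chunks.
        start = end - overlap_s
    return spans


# A 25-minute recording split into 10-minute windows with 5 s of overlap.
spans = chunk_spans(1500, chunk_s=600, overlap_s=5)
```

Overlapping windows mean the transcripts need de-duplicating where they meet; a pipeline that cuts on silence instead would not need the overlap.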
Transcript Enhancer
The Transcript Enhancer improves the quality of raw speech-to-text output through a controlled two-pass LLM workflow — without rewriting or paraphrasing the content. It is designed to reduce the transcription errors that downstream extraction would otherwise inherit. Pass 1 corrects obvious spelling mistakes and normalises repeated terms to a consistent form. Pass 2 uses sentence-level context to repair remaining ASR errors: fixing split or merged compound words, correcting misheard words, and cleaning up minor grammar issues where the correction is unambiguous. The result stays faithful to the spoken content; the goal is error reduction, not prose improvement.
Key features:
- Two-pass workflow: spelling/consistency first, context-based ASR repair second
- Optional correction summary (total changes, insertions, deletions, substitutions)
- Optional diff output showing exactly which spans changed
- additional_instructions parameter to add domain-specific rules (e.g. preserve brand names)
- Tuned for Finnish; the same structure can be adapted for other languages
Potential applications:
- Improving transcription quality before structured extraction
- Construction site diary and safety observation reporting from voice
- Medical dictation cleanup before clinical data extraction
- Meeting and interview transcription polishing
- Any domain where ASR errors would propagate into downstream structured records
- Implementation code: software_components/enhance_transcript
- Usage examples: examples/enhance_transcript
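To make the correction summary and diff output concrete, here is a minimal sketch that compares a raw and an enhanced transcript word by word using Python's standard difflib module. The function name and the shape of the returned dictionary are assumptions for illustration; the component's own summary format may differ.

```python
from difflib import SequenceMatcher


def correction_summary(raw: str, enhanced: str) -> dict:
    """Count word-level insertions, deletions, and substitutions (hypothetical helper)."""
    a, b = raw.split(), enhanced.split()
    summary = {"insertions": 0, "deletions": 0, "substitutions": 0, "spans": []}
    counter_for = {"insert": "insertions", "delete": "deletions", "replace": "substitutions"}
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        summary[counter_for[tag]] += 1
        # Record the changed span as (operation, original words, corrected words).
        summary["spans"].append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    summary["total_changes"] = (
        summary["insertions"] + summary["deletions"] + summary["substitutions"]
    )
    return summary


result = correction_summary(
    "the saftey helmet was was missing",
    "the safety helmet was missing",
)
```

Each counter here counts changed spans rather than individual words, so a three-word replacement still counts as one substitution.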
Document Parser
The Document Parser converts PDF, DOCX, and other file types into clean, structured markdown text. The package ships seven parser variants (listed in the table below) so you can match accuracy, cost, and infrastructure needs: vision-based parsing (GPT, Claude, or Gemini), local parsing (PyMuPDF, python-docx, Docling), a Docling + vision hybrid that returns markdown plus metadata, and a remote Docling API client. The parsed markdown output preserves tables, headings, and multi-page structure, making it a reliable input for downstream extraction or indexing.
Parser variants:
| Parser | When to use |
|---|---|
| VisionParser | Single-provider (OpenAI / Azure) vision parsing: page images → markdown |
| MultimodalParser | Multi-provider PDF → markdown (OpenAI · Anthropic Claude · Google Gemini) with one API |
| PyMuPDFParser | Fast local PDF text extraction, no API calls, lowest cost |
| DocxParser | Local Word document (.docx / .doc) extraction via python-docx |
| DoclingParser | Advanced local parsing with OCR, table extraction, multi-format support |
| VisionPlusParser | Docling + vision hybrid that returns markdown and metadata in one pass |
| DoclingApiClientParser | Remote client for a hosted Docling parsing service |
Key features:
- Seven parsers covering local, vision, hybrid, and remote-service strategies
- Preserves tables, headings, and document structure in markdown output
- Multi-page processing with consistent formatting
- Suitable for both simple forms and complex, visually rich documents
Potential applications:
- Invoice, receipt, and purchase order digitization
- Contract and legal document preprocessing
- Technical manual and product specification extraction
- Research paper and report indexing
- Compliance document analysis
- HR form and CV processing
- Implementation code: software_components/parsers
- Usage examples: examples/parsers
Extractor
The Extractor turns any unstructured text — transcripts, parsed documents, or free-form notes — into validated, structured records. What makes it distinctive is how extraction is configured: instead of writing code or defining a database schema, you describe what you need in plain language. A Requirement Parser interprets your field definitions, a Schema Generator builds a type-safe Pydantic model, and a Data Extractor uses an LLM to populate each field. The generated schema is saved and reused for future runs, eliminating the cost and latency of regenerating it each time.
Key features:
- Plain-language requirements replace manual schema definition
- Type-safe extraction with full Pydantic validation
- Supports constrained fields, allowed values, and conditional rules
- Schema persistence for efficient batch processing
Potential applications:
- Incident and safety report field extraction from voice or text
- Invoice and purchase order data capture
- Contract clause and obligation extraction
- Survey and form response structuring
- Quality inspection data recording
- HR and recruitment data extraction from CVs
- Implementation code: software_components/extractor
- Usage examples: examples/extractor
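The idea of generating a schema once, saving it, and validating later extractions against it can be sketched without any dependencies. The example below is a stand-in, not the Extractor's implementation: it uses a plain JSON field specification and a hand-written check in place of the generated Pydantic model, and every name in it (the spec format, save_schema, load_schema, validate) is hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Mapping from type names in the (assumed) spec format to Python types.
TYPES = {"str": str, "int": int, "float": float, "bool": bool}


def save_schema(fields: list[dict], path: Path) -> None:
    """Persist the field specification so later runs can skip regeneration."""
    path.write_text(json.dumps(fields, indent=2), encoding="utf-8")


def load_schema(path: Path) -> list[dict]:
    return json.loads(path.read_text(encoding="utf-8"))


def validate(record: dict, fields: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record is valid."""
    errors = []
    for field in fields:
        name = field["name"]
        value = record.get(name)
        if value is None:
            if field.get("required", True):
                errors.append(f"{name}: missing")
            continue
        if not isinstance(value, TYPES[field["type"]]):
            errors.append(f"{name}: expected {field['type']}")
        elif "allowed" in field and value not in field["allowed"]:
            errors.append(f"{name}: not one of {field['allowed']}")
    return errors


fields = [
    {"name": "location", "type": "str"},
    {"name": "severity", "type": "str", "allowed": ["low", "medium", "high"]},
    {"name": "injured_count", "type": "int", "required": False},
]

with tempfile.TemporaryDirectory() as tmp:
    schema_path = Path(tmp) / "incident_schema.json"
    save_schema(fields, schema_path)    # first run: schema is generated and stored
    reloaded = load_schema(schema_path)  # later runs: schema is read back from disk

ok = validate({"location": "Dock 3", "severity": "high"}, reloaded)
bad = validate({"location": "Dock 3", "severity": "critical"}, reloaded)
```

A real Pydantic model adds type coercion and nested validation on top of this; the sketch only covers persistence and allowed-value checks.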
Vision Extractor
The Vision Extractor turns PDFs and images directly into structured, schema-validated data — without an intermediate Markdown parse step. One LLM call sees all uploaded files together, so the model can use visual layout cues (tables, column alignment, headers, multi-page structure) and reason across documents in a single pass. It is best suited for forms, purchase orders, bills of materials, scanned reports, and any document where parsing quality strongly affects extraction accuracy or where row-level alignment must be preserved across pages or related files.
Key features:
- Single-pass PDF/image → structured data — no parse-then-extract round-trip
- Multi-document calls: align line items in a PO with quantities in a BOM in one request
- Multi-provider: OpenAI / Azure, Anthropic Claude (Foundry), Google Gemini (Vertex AI or direct)
- Reuses generated Pydantic schemas across runs via schema_dir persistence
- Optional per-field verification: confidence score + reasoning for every scalar field
Potential applications:
- Purchase order processing with bill-of-materials cross-referencing
- Invoice and receipt extraction from scanned documents
- Form digitisation (insurance claims, medical intake, government applications)
- Layout-sensitive tables (lab reports, financial statements, technical drawings)
- Human-in-the-loop QA pipelines where each value must be justified
- Implementation code: software_components/vision_extractor
- Usage examples: examples/vision_extractor
Document Classifier
The Document Classifier assigns a predefined label to a document based on its content. It uses an LLM to evaluate the document against a user-defined set of classes and returns both the predicted class and a confidence score. It is most useful as a preprocessing step in multi-document pipelines — routing each document to the appropriate extraction schema, storage location, or downstream process before any further processing begins.
Key features:
- Define any set of classes in plain language — no training required
- Returns confidence scores for review and threshold-based routing
- Works as a standalone step or as a gate before extraction
Potential applications:
- Incoming document triage and routing in finance or legal workflows
- Email and attachment categorization
- Insurance claim and case type detection
- Archive tagging and document library organization
- Regulatory filing type identification
- Preprocessing step for multi-schema extraction pipelines
- Implementation code: software_components/doc_classifier
- Usage examples: examples/classifier
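Threshold-based routing on the classifier's confidence score might look like the sketch below. The Classification dataclass, the route names, and the 0.8 threshold are all assumptions chosen for illustration; the classifier's real return type is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    """Hypothetical stand-in for a classifier result."""
    label: str
    confidence: float  # assumed to lie in [0, 1]


def route(
    result: Classification,
    routes: dict[str, str],
    threshold: float = 0.8,
    review_queue: str = "manual_review",
) -> str:
    """Pick a downstream target, falling back to human review when unsure."""
    if result.confidence < threshold or result.label not in routes:
        return review_queue
    return routes[result.label]


routes = {"invoice": "invoice_schema", "contract": "contract_schema"}

confident = route(Classification("invoice", 0.93), routes)
uncertain = route(Classification("invoice", 0.55), routes)
unknown = route(Classification("memo", 0.99), routes)
```

Sending both low-confidence and unrecognised labels to the same review queue keeps the gate simple; a stricter pipeline could separate the two cases.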
Validator (LLM-as-Judge)
The Validator runs a second-pass LLM over the output of an upstream extractor — feeding the source page images plus the extractor's JSON to a vision-capable model and asking it to flag fields whose value doesn't match the document. It returns per-field ok / suspect / wrong flags with reasons and suggested values, plus a token / cost / duration record per call. Useful when missing or hallucinated fields are expensive downstream and you want a cross-provider sanity check before acting on extraction output.
Key features:
- Same code talks to OpenAI / Azure OpenAI / Anthropic Foundry / Google Vertex via the toolkit's existing config helpers
- Returns per-field flags (ok, suspect, wrong) plus a document-level slot for structural observations like "items missing from the extractor output"
- Domain-agnostic by default; pass an optional ValidationRubric with vendor-specific check sentences when you want targeted guidance
- Cost tracking with longest-prefix model lookup so rotated model ids still resolve to a known rate
- Text-vs-text mode (judge.judge_text_pair(extracted, expected, field_name=...)) compares two short strings for semantic equivalence without any source document; useful for scoring audio-transcription extractors against hand-annotated ground truth where exact-string match misses paraphrasing
- Schema-agnostic hallucination detector (judge.detect_hallucinations(source_text, extracted)) inspects an extractor's JSON as a whole against a source document and returns one HallucinationFlag per problem field; a drop-in replacement for hand-written keyword post-validators that only cover a single schema
Potential applications:
- Purchase-order extraction QA before sending the order downstream
- Invoice / receipt extraction sanity check
- Form-extraction review where missing or hallucinated fields are expensive
- Two-tier setups: cheap pre-screen + premium judge on flagged rows
Recommended judge model: gemini-3-flash-preview via Vertex AI — F1 = 1.000 at $0.004/PO in our Luvata judge benchmark across 8 candidates including GPT-5.4 / 5.5, Claude Haiku 4.5 / Sonnet 4.6 / Opus 4.7, and Gemini 3.1 Flash Lite.
- Implementation code: software_components/validators/llm_judge
- Usage examples: examples/validators
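The longest-prefix model lookup used for cost tracking can be illustrated with a small, self-contained sketch. The rate table below contains made-up model names and made-up prices, and the function names are hypothetical; it only shows how a dated or rotated model id can fall back to the most specific known prefix.

```python
from typing import Optional

# Made-up rates in USD per million tokens as (input, output); not real prices.
RATES = {
    "model-a": (1.00, 4.00),
    "model-a-mini": (0.20, 0.80),
    "model-b": (3.00, 15.00),
}


def resolve_rate(model_id: str, rates: dict = RATES) -> Optional[tuple[float, float]]:
    """Return the rate whose key is the longest prefix of model_id, or None."""
    matches = [prefix for prefix in rates if model_id.startswith(prefix)]
    if not matches:
        return None
    return rates[max(matches, key=len)]


def call_cost(model_id: str, input_tokens: int, output_tokens: int) -> Optional[float]:
    """Estimate the cost of one call; None when the model id is unknown."""
    rate = resolve_rate(model_id)
    if rate is None:
        return None
    in_rate, out_rate = rate
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


mini_rate = resolve_rate("model-a-mini-2025-01-01")  # matches "model-a" and "model-a-mini"
base_rate = resolve_rate("model-a-2025-06")
cost = call_cost("model-a-mini-2025-01-01", 1_000_000, 500_000)
```

Choosing the longest match matters because "model-a-mini-2025-01-01" also starts with "model-a"; picking the first match instead could silently apply the wrong rate.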
Text-to-Speech
The Text-to-Speech component converts written text into spoken audio using OpenAI or Azure OpenAI TTS models. It accepts plain text and returns an audio file in a configurable format, with control over voice, language, speed, and output format. It is useful anywhere a workflow needs to deliver information aurally — generating audio versions of extracted content, reading back structured report summaries, or producing narration for instructional content.
Key features:
- Supports OpenAI and Azure OpenAI TTS backends with the same interface
- Configurable voice, language (Finnish and English), speed, and audio format
- Returns a SpeechSynthesisResult with audio bytes, model, and metadata
- Simple result.save("output_dir") for persisting audio to disk
Potential applications:
- Audio delivery of extracted report summaries for field workers
- Accessibility features for reading back structured content aloud
- Voice notifications in automated workflow pipelines
- Generating narration for training or instructional content
- Implementation code: software_components/text_to_speech
- Usage examples: examples/text_to_speech
RAG Components
The RAG (Retrieval-Augmented Generation) components provide a modular pipeline for building document-grounded question answering systems. Rather than a single monolithic tool, the RAG pipeline is split into five composable blocks — each with a clear responsibility. You can use individual blocks where needed or assemble the full pipeline for a complete Q&A system grounded in your own documents.
Pipeline overview:
1 · RAG Parser
Extracts text from source documents and splits it into manageable chunks with preserved metadata (page number, source file, section heading). The chunks serve as the unit of indexing and retrieval throughout the rest of the pipeline.
Example: A company safety manual (150 pages) is parsed into ~600 chunks. Each chunk carries its page number and section title so answers can be traced back to the source.
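As a rough illustration of metadata-preserving chunking, the sketch below splits page text on blank lines and packs paragraphs into chunks up to a size limit, attaching the source file, page number, and section title to each chunk. The Chunk dataclass, the input shape, and the character limit are assumptions; the RAG Parser's own chunking strategy may differ.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record carrying the metadata needed for citations."""
    text: str
    source: str
    page: int
    section: str


def chunk_pages(pages: list[tuple[int, str, str]], source: str, max_chars: int = 500) -> list[Chunk]:
    """pages holds (page_number, section_title, text) tuples."""
    chunks = []
    for page_no, section, text in pages:
        buffer = ""
        for paragraph in (p.strip() for p in text.split("\n\n")):
            if not paragraph:
                continue
            # Close the current chunk when adding this paragraph would exceed the limit.
            # A single paragraph longer than max_chars is kept whole in its own chunk.
            if buffer and len(buffer) + len(paragraph) + 2 > max_chars:
                chunks.append(Chunk(buffer, source, page_no, section))
                buffer = paragraph
            else:
                buffer = f"{buffer}\n\n{paragraph}" if buffer else paragraph
        if buffer:
            chunks.append(Chunk(buffer, source, page_no, section))
    return chunks


pages = [
    (12, "4.2 PPE", "A" * 30 + "\n\n" + "B" * 30),
    (13, "4.3 Reporting", "C" * 10),
]
chunks = chunk_pages(pages, "safety_manual.pdf", max_chars=40)
```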
2 · Embedder
Converts each text chunk into a dense numerical vector using an embedding model. Semantically similar chunks produce similar vectors, enabling meaning-based search rather than keyword matching.
Example: The chunk "All incidents must be reported within 24 hours of occurrence" is embedded as a vector. A question like "What is the deadline for incident reporting?" produces a similar vector, so the two match even though the wording differs (the chunk never mentions a "deadline").
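"Similar vectors" is usually measured with cosine similarity. The sketch below computes it in plain Python on tiny made-up vectors; real embeddings have hundreds or thousands of dimensions, and the three-dimensional values here are invented purely to show the comparison.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented toy vectors standing in for real embeddings.
chunk_vec = [0.9, 0.1, 0.3]      # "All incidents must be reported within 24 hours..."
question_vec = [0.8, 0.2, 0.4]   # "What is the deadline for incident reporting?"
unrelated_vec = [0.0, 1.0, 0.0]  # some unrelated chunk

related_score = cosine_similarity(chunk_vec, question_vec)
unrelated_score = cosine_similarity(chunk_vec, unrelated_vec)
```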
3 · Vector Store
Stores and indexes all embeddings for fast similarity search at query time. The vector store persists the knowledge base between sessions, so documents only need to be processed once and can be queried repeatedly.
Example: A product documentation knowledge base is indexed once and reused across hundreds of daily queries without reprocessing the source documents.
Two backends are available: in-memory / ChromaDB for quick prototyping, and PostgreSQL + pgvector (PgVectorStore) for production use with semantic, keyword, and hybrid search (Reciprocal Rank Fusion, RRF).
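Hybrid search with RRF merges the ranked result lists from semantic and keyword search by summing 1 / (k + rank) for each document across the lists. The sketch below shows the general technique in plain Python; the function name is hypothetical, and k = 60 is the commonly used default for RRF rather than a value confirmed for PgVectorStore.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["c1", "c2", "c3"]  # ranked by vector similarity
keyword_hits = ["c3", "c1", "c4"]   # ranked by keyword match
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Because RRF works on ranks rather than raw scores, it avoids having to put vector distances and keyword scores on a common scale.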
4 · Retriever
Accepts a user question, embeds it, and searches the vector store for the most semantically relevant chunks. Optionally reranks results to improve precision before passing them to the answer generator.
Example: The question "What PPE is required in the assembly area?" retrieves the three most relevant policy chunks from a 200-page safety manual, even if the manual never uses the exact word "PPE".
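A minimal in-memory version of the retrieve-then-rerank step is sketched below. It assumes the query and chunk vectors are already unit-normalised (so a dot product equals cosine similarity) and treats the reranker as an optional callable; the function signature and store layout are illustrative assumptions, not the Retriever's API.

```python
import heapq
from typing import Callable, Optional


def retrieve(
    query_vec: list[float],
    store: list[tuple[str, list[float]]],
    k: int = 3,
    rerank: Optional[Callable[[list[str]], list[str]]] = None,
) -> list[str]:
    """Return the ids of the k most similar chunks, optionally reordered by rerank."""
    # Dot product equals cosine similarity only because vectors are assumed unit-length.
    scored = [
        (sum(q * x for q, x in zip(query_vec, vec)), chunk_id)
        for chunk_id, vec in store
    ]
    top_ids = [chunk_id for _, chunk_id in heapq.nlargest(k, scored)]
    return rerank(top_ids) if rerank else top_ids


store = [
    ("a", [1.0, 0.0]),
    ("b", [0.0, 1.0]),
    ("c", [0.6, 0.8]),
]
top_two = retrieve([1.0, 0.0], store, k=2)
# A trivial stand-in reranker that just reverses the candidate order.
reranked = retrieve([1.0, 0.0], store, k=2, rerank=lambda ids: list(reversed(ids)))
```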
5 · Answer Generator
Takes the retrieved chunks and the original question and uses an LLM to compose a coherent, factual answer. The response is grounded strictly in the retrieved content, and source citations are included so the answer can be verified.
Example: Given the retrieved policy chunks, the answer generator responds: "According to the Safety Manual (Section 4.2), employees in the assembly area must wear safety glasses, steel-toed boots, and high-visibility vests at all times."
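One common way to keep an answer grounded and citable is to number the retrieved chunks in the prompt and ask the model to cite those numbers. The sketch below builds such a prompt; the wording of the instructions, the function name, and the chunk dictionary keys are assumptions and not the Answer Generator's actual prompt.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt from chunks with 'text', 'source', and 'section' keys."""
    context_lines = [
        f"[{i}] ({chunk['source']}, {chunk['section']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered context below. "
        "Cite the context entries you used as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "What PPE is required in the assembly area?",
    [
        {
            "source": "Safety Manual",
            "section": "Section 4.2",
            "text": "Employees in the assembly area must wear safety glasses.",
        }
    ],
)
```

The resulting string would then be sent to whichever LLM is configured; the numbered entries let the cited markers in the answer be mapped back to source and section.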
Key features:
- Fully modular — use individual blocks or assemble the full pipeline
- Metadata-preserving chunking for accurate source attribution
- Optional reranking to improve retrieval precision
- Citation-aware answer generation for trustworthy, verifiable outputs
Potential applications:
- Internal knowledge base and company policy Q&A
- Technical documentation and product manual assistants
- Regulatory and compliance document lookup
- Customer support knowledge retrieval
- Contract and legal clause search
- Training material and onboarding knowledge assistants
- Implementation code: software_components/RAG
- Usage examples: examples/RAG