Software Components

Software components are the core building blocks of the GAIK implementation layer. Each component encapsulates one well-defined capability — speech-to-text, document parsing, structured extraction, classification, or retrieval. They are designed to be used standalone or composed into custom pipelines, giving developers precise control over every step of a knowledge processing workflow.

See also: No-Code Assets for prompt templates and packaged skills.

Use software components when you need fine-grained control over each processing step, when a predefined module doesn't fit your requirements, or when integrating GenAI capabilities into an existing system.

Most chat-based components (Extractor, Classifier, TranscriptEnhancer, AnswerGenerator, Embedder) can run against OpenAI, Azure OpenAI, Anthropic, or Google with the same code — see Multi-Provider LLM Client for the gaik.software_components.llm package and the per-provider support matrix.

Transcriber

The Transcriber converts spoken audio or video recordings into written text. It uses OpenAI's Whisper model for speech-to-text transcription and optionally applies a GPT-based post-processing step to clean up the raw output: fixing punctuation, removing filler words, and improving overall readability. It handles long recordings automatically through chunked processing and is aimed at real-world audio, including recordings captured in noisy or uncontrolled environments.
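The chunked processing mentioned above can be pictured as planning overlapping time windows over a recording. The following is a minimal sketch, not the Transcriber's actual implementation: `plan_chunks`, the 10-minute window, and the 5-second overlap are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int
    start_s: float  # window start, in seconds
    end_s: float    # window end, in seconds


def plan_chunks(duration_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0) -> list[Chunk]:
    """Split a recording of `duration_s` seconds into overlapping windows.

    Hypothetical helper: the window and overlap sizes are illustrative assumptions.
    """
    if overlap_s < 0 or chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than a non-negative overlap_s")
    chunks: list[Chunk] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append(Chunk(len(chunks), start, end))
        if end >= duration_s:
            break
        # Step back by the overlap so words cut at a boundary appear in both windows.
        start = end - overlap_s
    return chunks


for chunk in plan_chunks(1500.0):
    print(chunk)
```

In a design like this, each window would be transcribed separately and the overlapping seconds reconciled when the pieces are joined.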

Key features:

Whisper-based transcription with high accuracy across languages
Optional GPT enhancement for clean, readable output
Automatic chunking for recordings of any length
Supports MP3, WAV, M4A, OGG, and video formats

Potential applications:

Workplace incident and safety observation reporting
Construction site and field service diaries
Meeting and interview transcription
Medical dictation and clinical notes
Customer call recording analysis
Lecture and training material capture
Implementation code: software_components/transcriber
Usage examples: examples/transcriber

Transcript Enhancer

The Transcript Enhancer improves the quality of raw speech-to-text output through a controlled two-pass LLM workflow — without rewriting or paraphrasing the content. It is designed to reduce the transcription errors that downstream extraction would otherwise inherit. Pass 1 corrects obvious spelling mistakes and normalises repeated terms to a consistent form. Pass 2 uses sentence-level context to repair remaining ASR errors: fixing split or merged compound words, correcting misheard words, and cleaning up minor grammar issues where the correction is unambiguous. The result stays faithful to the spoken content; the goal is error reduction, not prose improvement.

Key features:

Two-pass workflow: spelling/consistency first, context-based ASR repair second
Optional correction summary (total changes, insertions, deletions, substitutions)
Optional diff output showing exactly which spans changed
additional_instructions parameter to add domain-specific rules (e.g. preserve brand names)
Tuned for Finnish; same structure can be adapted for other languages
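To illustrate what a correction summary and a span-level diff can contain, here is a minimal sketch built on Python's standard `difflib`. It is an assumption about how such a report could be computed, not the Transcript Enhancer's own code: `correction_summary` is a hypothetical name, and the comparison is done word by word.

```python
import difflib


def correction_summary(raw: str, enhanced: str) -> tuple[dict, list[tuple[str, str]]]:
    """Count word-level edits between two transcripts and list the changed spans."""
    before, after = raw.split(), enhanced.split()
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    counts = {"insertions": 0, "deletions": 0, "substitutions": 0}
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "insert":
            counts["insertions"] += 1
        elif tag == "delete":
            counts["deletions"] += 1
        else:  # "replace"
            counts["substitutions"] += 1
        # Record the (old text, new text) pair for this changed span.
        spans.append((" ".join(before[i1:i2]), " ".join(after[j1:j2])))
    counts["total_changes"] = len(spans)
    return counts, spans


summary, spans = correction_summary(
    "the crane opperator lifted the beam",
    "the crane operator lifted the beam",
)
print(summary)
print(spans)
```

A report like this makes it easy to check that an enhancement pass changed only a handful of spans rather than rewriting the transcript.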

Potential applications:

Improving transcription quality before structured extraction
Construction site diary and safety observation reporting from voice
Medical dictation cleanup before clinical data extraction
Meeting and interview transcription polishing
Any domain where ASR errors would propagate into downstream structured records
Implementation code: software_components/enhance_transcript
Usage examples: examples/enhance_transcript

Document Parser

The Document Parser converts PDF, DOCX, and other file types into clean, structured markdown text. The package ships seven parser variants so you can match accuracy, cost, and infrastructure needs: single-provider and multi-provider vision-based parsing (GPT, Claude, or Gemini), local parsing (PyMuPDF, python-docx, Docling), a Docling + vision hybrid that returns markdown plus metadata, and a remote Docling API client. The parsed markdown output preserves tables, headings, and multi-page structure, making it a reliable input for downstream extraction or indexing.

Parser variants:

| Parser | When to use |
| --- | --- |
| `VisionParser` | Single-provider (OpenAI / Azure) vision parsing: page images → markdown |
| `MultimodalParser` | Multi-provider PDF → markdown (OpenAI · Anthropic Claude · Google Gemini) with one API |
| `PyMuPDFParser` | Fast local PDF text extraction, no API calls, lowest cost |
| `DocxParser` | Local Word document (`.docx` / `.doc`) extraction via `python-docx` |
| `DoclingParser` | Advanced local parsing with OCR, table extraction, multi-format support |
| `VisionPlusParser` | Docling + vision hybrid that returns markdown and metadata in one pass |
| `DoclingApiClientParser` | Remote client for a hosted Docling parsing service |

Key features:

Seven parsers covering local, vision, hybrid, and remote-service strategies
Preserves tables, headings, and document structure in markdown output
Multi-page processing with consistent formatting
Suitable for both simple forms and complex, visually rich documents

Potential applications:

Invoice, receipt, and purchase order digitization
Contract and legal document preprocessing
Technical manual and product specification extraction
Research paper and report indexing
Compliance document analysis
HR form and CV processing
Implementation code: software_components/parsers
Usage examples: examples/parsers

Extractor

The Extractor turns any unstructured text — transcripts, parsed documents, or free-form notes — into validated, structured records. What makes it distinctive is how extraction is configured: instead of writing code or defining a database schema, you describe what you need in plain language. A Requirement Parser interprets your field definitions, a Schema Generator builds a type-safe Pydantic model, and a Data Extractor uses an LLM to populate each field. The generated schema is saved and reused for future runs, eliminating the cost and latency of regenerating it each time.
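The save-and-reuse behaviour described above can be sketched with the standard library alone. This is an illustrative stand-in, not the Extractor's implementation: the JSON field-spec format, `load_or_create_schema`, `validate_record`, and `build_incident_schema` are all assumptions, and a plain dictionary takes the place of the generated Pydantic model.

```python
import json
import tempfile
from pathlib import Path

PY_TYPES = {"str": str, "int": int, "float": float, "bool": bool}


def load_or_create_schema(path: Path, build) -> tuple[dict, bool]:
    """Return (schema, reused). `build` is called only when no saved schema exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")), True
    schema = build()  # in a real pipeline this would be the costly LLM-backed step
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema, False


def validate_record(record: dict, schema: dict) -> list[str]:
    """Check required fields, types, and allowed values; return readable errors."""
    errors = []
    for name, spec in schema.items():
        value = record.get(name)
        if value is None:
            if spec.get("required", True):
                errors.append(f"{name}: missing")
            continue
        expected = PY_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        if not isinstance(value, expected) or (isinstance(value, bool) and expected is not bool):
            errors.append(f"{name}: expected {spec['type']}")
        elif "allowed" in spec and value not in spec["allowed"]:
            errors.append(f"{name}: {value!r} not in allowed values")
    return errors


def build_incident_schema() -> dict:
    # Hypothetical result of the requirement-parsing and schema-generation steps.
    return {
        "location": {"type": "str"},
        "severity": {"type": "str", "allowed": ["low", "medium", "high"]},
        "injured_count": {"type": "int", "required": False},
    }


schema_path = Path(tempfile.mkdtemp()) / "incident_schema.json"
schema, reused = load_or_create_schema(schema_path, build_incident_schema)  # built and saved
schema, reused = load_or_create_schema(schema_path, build_incident_schema)  # loaded from disk
print(reused, validate_record({"location": "Dock 3", "severity": "severe"}, schema))
```

The second call never invokes the builder, which is the point of persisting the schema: batch runs pay the generation cost once.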

Key features:

Plain-language requirements replace manual schema definition
Type-safe extraction with full Pydantic validation
Supports constrained fields, allowed values, and conditional rules
Schema persistence for efficient batch processing

Potential applications:

Incident and safety report field extraction from voice or text
Invoice and purchase order data capture
Contract clause and obligation extraction
Survey and form response structuring
Quality inspection data recording
HR and recruitment data extraction from CVs
Implementation code: software_components/extractor
Usage examples: examples/extractor

Vision Extractor

The Vision Extractor turns PDFs and images directly into structured, schema-validated data — without an intermediate Markdown parse step. One LLM call sees all uploaded files together, so the model can use visual layout cues (tables, column alignment, headers, multi-page structure) and reason across documents in a single pass. It is best suited for forms, purchase orders, bills of materials, scanned reports, and any document where parsing quality strongly affects extraction accuracy or where row-level alignment must be preserved across pages or related files.

Key features:

Single-pass PDF/image → structured data — no parse-then-extract round-trip
Multi-document calls: align line items in a PO with quantities in a BOM in one request
Multi-provider: OpenAI / Azure, Anthropic Claude (Foundry), Google Gemini (Vertex AI or direct)
Reuses generated Pydantic schemas across runs via schema_dir persistence
Optional per-field verification: confidence score + reasoning for every scalar field

Potential applications:

Purchase order processing with bill-of-materials cross-referencing
Invoice and receipt extraction from scanned documents
Form digitisation (insurance claims, medical intake, government applications)
Layout-sensitive tables (lab reports, financial statements, technical drawings)
Human-in-the-loop QA pipelines where each value must be justified
Implementation code: software_components/vision_extractor
Usage examples: examples/vision_extractor

Document Classifier

The Document Classifier assigns a predefined label to a document based on its content. It uses an LLM to evaluate the document against a user-defined set of classes and returns both the predicted class and a confidence score. It is most useful as a preprocessing step in multi-document pipelines — routing each document to the appropriate extraction schema, storage location, or downstream process before any further processing begins.
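Threshold-based routing on the returned confidence score might look like the sketch below. The `Classification` dataclass, the `route` helper, the 0.8 threshold, and the route names are hypothetical and only illustrate the pattern; they are not the classifier's actual return type or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    label: str
    confidence: float  # assumed to lie in [0, 1]


def route(result: Classification, routes: dict[str, str],
          threshold: float = 0.8, fallback: str = "manual_review") -> str:
    """Pick a downstream target, falling back to review when the result is unsure."""
    if result.confidence < threshold or result.label not in routes:
        return fallback
    return routes[result.label]


ROUTES = {"invoice": "invoice_schema", "contract": "contract_schema"}
print(route(Classification("invoice", 0.93), ROUTES))
print(route(Classification("contract", 0.55), ROUTES))
```

Sending low-confidence or unknown labels to a review queue keeps a misclassified document from being extracted with the wrong schema.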

Key features:

Define any set of classes in plain language — no training required
Returns confidence scores for review and threshold-based routing
Works as a standalone step or as a gate before extraction

Potential applications:

Incoming document triage and routing in finance or legal workflows
Email and attachment categorization
Insurance claim and case type detection
Archive tagging and document library organization
Regulatory filing type identification
Preprocessing step for multi-schema extraction pipelines
Implementation code: software_components/doc_classifier
Usage examples: examples/classifier

Validator (LLM-as-Judge)

The Validator runs a second-pass LLM over the output of an upstream extractor: it feeds the source page images plus the extractor's JSON to a vision-capable model and asks it to flag fields whose value doesn't match the document. It returns per-field ok / suspect / wrong flags with reasons and suggested values, plus a token / cost / duration record per call. This is useful when missing or hallucinated fields are expensive downstream and you want a cross-provider sanity check before acting on extraction output.

Key features:

Same code talks to OpenAI / Azure OpenAI / Anthropic Foundry / Google Vertex via the toolkit's existing config helpers
Returns per-field flags (ok, suspect, wrong) plus a document-level slot for structural observations like "items missing from the extractor output"
Domain-agnostic by default; pass an optional ValidationRubric with vendor-specific check sentences when you want targeted guidance
Cost tracking with longest-prefix model lookup so rotated model ids still resolve to a known rate
Text-vs-text mode (judge.judge_text_pair(extracted, expected, field_name=...)) compares two short strings for semantic equivalence without any source document — useful for scoring audio-transcription extractors against hand-annotated ground truth where exact-string match misses paraphrasing
Schema-agnostic hallucination detector (judge.detect_hallucinations(source_text, extracted)) inspects an extractor's JSON as a whole against a source document and returns one HallucinationFlag per problem field — drop-in replacement for hand-written keyword post-validators that only cover a single schema
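The longest-prefix model lookup mentioned in the cost-tracking bullet can be sketched as follows. This is an assumption about the general technique rather than the Validator's code: `resolve_rate` and `call_cost` are hypothetical names, the model ids are placeholders, and the prices are made up for the example.

```python
from typing import Optional

# Placeholder prices per million input tokens; the numbers are invented.
RATES = {"model-a": 1.00, "model-a-mini": 0.20}


def resolve_rate(model_id: str, rates: dict[str, float]) -> Optional[float]:
    """Return the rate of the longest known prefix of `model_id`, or None."""
    matches = [prefix for prefix in rates if model_id.startswith(prefix)]
    if not matches:
        return None
    # The longest prefix is the most specific entry, so it wins over shorter ones.
    return rates[max(matches, key=len)]


def call_cost(model_id: str, input_tokens: int, rates: dict[str, float]) -> Optional[float]:
    """Estimate the input-token cost of one call, or None for an unknown model."""
    rate = resolve_rate(model_id, rates)
    return None if rate is None else rate * input_tokens / 1_000_000


print(resolve_rate("model-a-mini-2025-01-01", RATES))
print(resolve_rate("model-a-2025-01-01", RATES))
```

Because a dated or rotated id still starts with a known family name, it resolves to that family's rate instead of failing the lookup.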

Potential applications:

Purchase-order extraction QA before sending the order downstream
Invoice / receipt extraction sanity check
Form-extraction review where missing or hallucinated fields are expensive
Two-tier setups: cheap pre-screen + premium judge on flagged rows

Recommended judge model: gemini-3-flash-preview via Vertex AI — F1 = 1.000 at $0.004/PO in our Luvata judge benchmark across 8 candidates including GPT-5.4 / 5.5, Claude Haiku 4.5 / Sonnet 4.6 / Opus 4.7, and Gemini 3.1 Flash Lite.

Implementation code: software_components/validators/llm_judge
Usage examples: examples/validators

Text-to-Speech

The Text-to-Speech component converts written text into spoken audio using OpenAI or Azure OpenAI TTS models. It accepts plain text and returns an audio file in a configurable format, with control over voice, language, speed, and output format. It is useful anywhere a workflow needs to deliver information aurally — generating audio versions of extracted content, reading back structured report summaries, or producing narration for instructional content.

Key features:

Supports OpenAI and Azure OpenAI TTS backends with the same interface
Configurable voice, language (Finnish and English), speed, and audio format
Returns a SpeechSynthesisResult with audio bytes, model, and metadata
Simple result.save("output_dir") for persisting audio to disk

Potential applications:

Audio delivery of extracted report summaries for field workers
Accessibility features for reading back structured content aloud
Voice notifications in automated workflow pipelines
Generating narration for training or instructional content
Implementation code: software_components/text_to_speech
Usage examples: examples/text_to_speech

RAG Components

The RAG (Retrieval-Augmented Generation) components provide a modular pipeline for building document-grounded question answering systems. Rather than a single monolithic tool, the RAG pipeline is split into five composable blocks — each with a clear responsibility. You can use individual blocks where needed or assemble the full pipeline for a complete Q&A system grounded in your own documents.

Pipeline overview:

1 · RAG Parser

Extracts text from source documents and splits it into manageable chunks with preserved metadata (page number, source file, section heading). The chunks serve as the unit of indexing and retrieval throughout the rest of the pipeline.

Example: A company safety manual (150 pages) is parsed into ~600 chunks. Each chunk carries its page number and section title so answers can be traced back to the source.
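A minimal sketch of metadata-preserving chunking is shown below. It assumes the document has already been split into page texts, and it tracks only source file and page number (section headings are left out for brevity). `TextChunk`, `chunk_pages`, and the word-count limit are illustrative assumptions, not the RAG Parser's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextChunk:
    text: str
    source: str
    page: int


def chunk_pages(pages: list[str], source: str, max_words: int = 120) -> list[TextChunk]:
    """Split each page into word-bounded chunks that remember where they came from."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for i in range(0, len(words), max_words):
            chunks.append(TextChunk(" ".join(words[i:i + max_words]), source, page_number))
    return chunks


pages = ["All incidents must be reported within 24 hours of occurrence.",
         "Safety glasses are required in the assembly area."]
for chunk in chunk_pages(pages, "safety_manual.pdf", max_words=6):
    print(chunk)
```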

2 · Embedder

Converts each text chunk into a dense numerical vector using an embedding model. Semantically similar chunks produce similar vectors, enabling meaning-based search rather than keyword matching.

Example: The chunk "All incidents must be reported within 24 hours of occurrence" is embedded as a vector. A question like "What is the deadline for incident reporting?" produces a similar vector — enabling a match even though no keywords are shared.
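"Similar vectors" is commonly measured with cosine similarity. The sketch below uses tiny hand-written vectors in place of real embeddings; the values are invented purely to show the comparison, and real embedding vectors have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy 3-dimensional stand-ins for embeddings (made-up numbers).
chunk_vec = [0.9, 0.1, 0.3]      # "All incidents must be reported within 24 hours..."
question_vec = [0.8, 0.2, 0.4]   # "What is the deadline for incident reporting?"
unrelated_vec = [0.0, 1.0, 0.0]  # some unrelated chunk

print(cosine_similarity(chunk_vec, question_vec))
print(cosine_similarity(chunk_vec, unrelated_vec))
```

The question scores much closer to the matching chunk than to the unrelated one, which is the property meaning-based search relies on.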

3 · Vector Store

Stores and indexes all embeddings for fast similarity search at query time. The vector store persists the knowledge base between sessions, so documents only need to be processed once and can be queried repeatedly.

Example: A product documentation knowledge base is indexed once and reused across hundreds of daily queries without reprocessing the source documents.

Two backends are available: in-memory / ChromaDB for quick prototyping, and PostgreSQL + pgvector (PgVectorStore) for production use with semantic, keyword, and hybrid search (RRF).
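Hybrid search with RRF (Reciprocal Rank Fusion) merges a semantic ranking and a keyword ranking by summing 1 / (k + rank) for each chunk across the lists. The sketch below shows the general technique only; the function name, the chunk ids, and the constant k = 60 (a common default for RRF) are assumptions and say nothing about how PgVectorStore is configured.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["c1", "c2", "c3"]  # hypothetical result of vector search
keyword_hits = ["c3", "c1", "c4"]   # hypothetical result of keyword search
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
```

Chunks that appear high in both lists rise to the top, while a chunk found by only one method still gets a place in the fused result.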

4 · Retriever

Accepts a user question, embeds it, and searches the vector store for the most semantically relevant chunks. Optionally reranks results to improve precision before passing them to the answer generator.

Example: The question "What PPE is required in the assembly area?" retrieves the three most relevant policy chunks from a 200-page safety manual, even if the manual never uses the exact word "PPE".
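The optional reranking stage can be thought of as re-sorting the vector-search candidates with a second, more precise scoring function. In the sketch below, `rerank` is a hypothetical helper and `keyword_overlap` is a deliberately crude stand-in for a real reranking model; neither is taken from the Retriever's code.

```python
from typing import Callable


def rerank(question: str, candidates: list[str],
           score: Callable[[str, str], float], top_k: int = 3) -> list[str]:
    """Re-sort candidate chunks by `score(question, chunk)` and keep the best `top_k`."""
    # sorted() is stable, so chunks with equal scores keep their vector-search order.
    ordered = sorted(candidates, key=lambda text: score(question, text), reverse=True)
    return ordered[:top_k]


def keyword_overlap(question: str, text: str) -> float:
    """Toy scorer: fraction of question words that also appear in the chunk."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


candidates = [
    "Visitors must sign in at reception",
    "Protective equipment is mandatory in the assembly area",
    "Assembly schedules are posted weekly",
]
print(rerank("protective equipment assembly area", candidates, keyword_overlap, top_k=2))
```

In practice the scoring function would be a dedicated reranking model; the retrieve-then-rerank structure stays the same.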

5 · Answer Generator

Takes the retrieved chunks and the original question and uses an LLM to compose a coherent, factual answer. The response is grounded strictly in the retrieved content, and source citations are included so the answer can be verified.

Example: Given the retrieved policy chunks, the answer generator responds: "According to the Safety Manual (Section 4.2), employees in the assembly area must wear safety glasses, steel-toed boots, and high-visibility vests at all times."
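One common way to keep an answer grounded and citable is to number the retrieved chunks in the prompt and ask the model to cite those numbers. The sketch below only builds such a prompt; the chunk dictionary keys, the prompt wording, and `build_grounded_prompt` are assumptions for illustration and not the Answer Generator's actual prompt.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt that restricts the answer to numbered source chunks.

    Each chunk is a dict with hypothetical keys: 'text', 'source', 'section'.
    """
    context_lines = []
    for number, chunk in enumerate(chunks, start=1):
        context_lines.append(
            f"[{number}] ({chunk['source']}, section {chunk['section']}) {chunk['text']}"
        )
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources as [n]. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


chunks = [{
    "text": "Employees in the assembly area must wear safety glasses at all times.",
    "source": "Safety Manual",
    "section": "4.2",
}]
print(build_grounded_prompt("What PPE is required in the assembly area?", chunks))
```

Carrying the source and section into the prompt is what lets the generated answer point back to a specific place in the document.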

Key features:

Fully modular — use individual blocks or assemble the full pipeline
Metadata-preserving chunking for accurate source attribution
Optional reranking to improve retrieval precision
Citation-aware answer generation for trustworthy, verifiable outputs

Potential applications:

Internal knowledge base and company policy Q&A
Technical documentation and product manual assistants
Regulatory and compliance document lookup
Customer support knowledge retrieval
Contract and legal clause search
Training material and onboarding knowledge assistants
Implementation code: software_components/RAG
Usage examples: examples/RAG

Postgres Agent

The Postgres Agent is a text-to-SQL query agent: connect it to a PostgreSQL database, ask a question in plain language, and get an answer backed by a validated, read-only SQL query. It introspects the schema, asks an LLM for a query, validates it, and runs it. If the query fails, it feeds the error back to the LLM and retries in a lightweight agentic loop. The agent is read-only by design: only SELECT / WITH ... SELECT queries are allowed, and statements are parsed and rejected if they touch other schemas or attempt writes. These checks are a safety net; the real guarantee comes from connecting with a read-only database role.
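The generate, validate, run, and retry loop described above can be sketched as follows. Several things here are stand-ins so the example runs on its own: the standard-library `sqlite3` module replaces PostgreSQL, a simple prefix check replaces real SQL parsing, and `fake_llm` replaces the LLM call. `ask_with_retry` is a hypothetical name, not the agent's API.

```python
import sqlite3


def ask_with_retry(conn, question, generate_sql, max_attempts=3):
    """Ask `generate_sql` for a query, feeding back the last error until one runs."""
    error = None
    for attempt in range(1, max_attempts + 1):
        sql = generate_sql(question, error)
        # Crude read-only gate; a real validator would parse the statement instead.
        if not sql.lstrip().lower().startswith(("select", "with")):
            error = "only SELECT / WITH queries are allowed"
            continue
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)  # the database error becomes context for the next attempt
            continue
        return sql, rows, attempt
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {error}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (id INTEGER PRIMARY KEY, severity TEXT)")
conn.executemany("INSERT INTO incidents (severity) VALUES (?)",
                 [("high",), ("low",), ("high",)])


def fake_llm(question, previous_error):
    # First attempt uses a wrong table name; the retry "fixes" it after seeing the error.
    if previous_error is None:
        return "SELECT COUNT(*) FROM incident WHERE severity = 'high'"
    return "SELECT COUNT(*) FROM incidents WHERE severity = 'high'"


sql, rows, attempts = ask_with_retry(conn, "How many high-severity incidents?", fake_llm)
print(attempts, rows)
```

A prefix check like this is easy to bypass, which is why the text above treats query validation as a safety net and relies on a read-only database role for the actual guarantee.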

Key features:

Natural language → validated, read-only SQL → natural-language answer in one ask() call
Agentic retry loop: SQL errors (including statement timeouts) are fed back to the LLM for correction
Defense in depth: sqlglot validation rejecting non-SELECT, multi-statement, DDL/DML, and cross-schema queries, plus a read-only transaction, statement_timeout, and row cap — backed by a read-only DB role
table_allowlist narrows the agent to specific tables; queries touching anything else are rejected before running
extra_instructions injects a domain glossary or example question→SQL pairs; answer_language localises the answer (en, fi, sv, de, fr, es, no, da)
Multi-provider LLM (OpenAI / Azure / Anthropic / Google); get_schema() and run_sql() work without LLM credentials, so they double as tools for an external agent framework

Potential applications:

Self-service analytics — let non-technical staff query a database in plain language
Conversational dashboards and reporting over operational data
Ad-hoc data exploration without writing SQL by hand
A safe, read-only query tool inside a larger agent or chatbot
Implementation code: software_components/postgres_agent
Usage examples: examples/postgres_agent
