What This Methodology Is For
Our AI document analysis turns raw tender PDFs and other documents into structured, readable intelligence. Instead of manually reading dozens of pages, you get summarised sections, extracted requirements, and confidence-rated analysis, with the original source always accessible.
The Analysis Pipeline
A four-stage pipeline transforms government tender documents into structured intelligence.
1. Document Ingestion & Format Detection
Documents are fetched from government portals or uploaded directly. Format detection uses magic bytes, file extension, and MIME type analysis. Supported: PDF, DOCX, XLSX, PPTX, ODT, RTF, CSV, TXT, and more.
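As a rough illustration of combining the three signals named above, the sketch below checks magic bytes first, then the file extension, then a MIME guess. The signature table, the `detect_format` name, and the returned labels are assumptions made for this example, not the production detector.

```python
import mimetypes
from pathlib import Path

# Hypothetical signature table; a real detector would cover more formats.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-container"),  # DOCX, XLSX, PPTX and ODT are ZIP archives
    (b"{\\rtf", "rtf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy DOC / XLS container
]
ZIP_EXTENSIONS = {".docx": "docx", ".xlsx": "xlsx", ".pptx": "pptx", ".odt": "odt"}
OLE2_EXTENSIONS = {".doc": "doc", ".xls": "xls"}
TEXT_EXTENSIONS = {".csv": "csv", ".txt": "txt"}


def detect_format(filename: str, head: bytes) -> str:
    """Guess a document format from its first bytes, extension and MIME type."""
    ext = Path(filename).suffix.lower()
    # 1. Magic bytes are trusted over the file name.
    for signature, kind in MAGIC_SIGNATURES:
        if head.startswith(signature):
            if kind == "zip-container":
                # The container alone is ambiguous, so the extension decides.
                return ZIP_EXTENSIONS.get(ext, "zip")
            if kind == "ole2":
                return OLE2_EXTENSIONS.get(ext, "ole2")
            return kind
    # 2. Plain-text formats have no signature; fall back to the extension.
    if ext in TEXT_EXTENSIONS:
        return TEXT_EXTENSIONS[ext]
    # 3. Last resort: MIME type guessed from the file name.
    mime, _ = mimetypes.guess_type(filename)
    if mime and mime.startswith("text/"):
        return "txt"
    return "unknown"
```

Checking the signature before the extension means a PDF saved with a misleading name, such as `mislabelled.txt`, is still routed to the PDF extractor.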
2. Text Extraction (Deterministic)
PyMuPDF extracts text from PDFs, capped at 30 pages and 50KB of text per document. All other formats are converted to plain text. No AI is used at this stage: extraction is purely algorithmic and reproducible.
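The page and size caps can be pictured with a minimal sketch like the one below. It assumes the 50KB limit is measured in UTF-8 bytes and that truncation simply stops at the cap; `cap_text` and `extract_pdf_text` are illustrative names, and the PyMuPDF import is kept lazy so the capping logic can be read and run on its own.

```python
from typing import Iterable

MAX_PAGES = 30          # page cap stated in the text
MAX_BYTES = 50 * 1024   # 50KB cap, assumed here to mean UTF-8 bytes


def cap_text(page_texts: Iterable[str], max_pages: int = MAX_PAGES,
             max_bytes: int = MAX_BYTES) -> str:
    """Join page texts, stopping at the page cap or the byte cap."""
    parts = []
    used = 0
    for index, text in enumerate(page_texts):
        if index >= max_pages:
            break
        encoded = text.encode("utf-8")
        remaining = max_bytes - used
        if len(encoded) > remaining:
            # Cut on a byte boundary and drop any split multi-byte character.
            parts.append(encoded[:remaining].decode("utf-8", errors="ignore"))
            break
        parts.append(text)
        used += len(encoded)
    return "".join(parts)


def extract_pdf_text(path: str) -> str:
    """Extract capped text from a PDF; requires PyMuPDF to be installed."""
    import fitz  # PyMuPDF, imported lazily so cap_text works without it

    with fitz.open(path) as doc:
        # Pages are read lazily, so pages beyond the caps are never extracted.
        return cap_text(page.get_text() for page in doc)
```

Because the same input always produces the same capped string, this stage stays reproducible regardless of what happens in the later LLM passes.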
3. Section Classification (3-Layer)
Extracted text is classified into 15+ section types. Layer 1: regex header matching. Layer 2: keyword-score fallback. Layer 3: LLM re-classification for remaining unclassified content. Each layer produces a confidence score.
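The two deterministic layers can be sketched as follows. The patterns, keyword weights, section names, and the way confidence is computed are all hypothetical values chosen for illustration; only the ordering (regex first, keyword score second, everything else left for the LLM layer) comes from the description above.

```python
import re

# Hypothetical patterns and weights, for illustration only.
HEADER_PATTERNS = {
    "evaluation_criteria": re.compile(r"^\s*(?:\d+[.)]\s*)?EVALUATION CRITERIA\b", re.I),
    "bbbee": re.compile(r"^\s*(?:\d+[.)]\s*)?B-BBEE REQUIREMENTS\b", re.I),
}
KEYWORD_WEIGHTS = {
    "evaluation_criteria": {"functionality": 2.0, "points": 1.0, "scoring": 1.5},
    "pricing_schedule": {"price": 1.0, "vat": 1.5, "rate": 1.0},
}


def classify_header(line):
    """Layer 1: return the section whose header pattern matches, else None."""
    for section, pattern in HEADER_PATTERNS.items():
        if pattern.search(line):
            return section
    return None


def classify_by_keywords(text):
    """Layer 2: score text against each section's weighted keywords."""
    words = re.findall(r"[a-z\-]+", text.lower())
    scores = {
        section: sum(weights.get(word, 0.0) for word in words)
        for section, weights in KEYWORD_WEIGHTS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, 0.0
    # Assumed confidence: the winner's share of the total keyword score.
    return best, scores[best] / sum(scores.values())


def classify(header, body):
    """Try regex, then keywords; leave the rest for the LLM layer."""
    section = classify_header(header)
    if section:
        return section, 1.0, "regex"  # fixed regex confidence is an assumption
    section, confidence = classify_by_keywords(body)
    if section:
        return section, confidence, "keyword"
    return None, 0.0, "unclassified"  # would be queued for LLM re-classification
```

Returning the layer name alongside the score makes it possible to flag, in the published analysis, which path produced each classification.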
4. AI Rewrite (3-Pass LLM)
Classified sections undergo three LLM passes: section rewrite for readability, tender-level summary generation, and SEO metadata creation. A quality gate filters low-confidence output before publishing.
Our Technology Stack
PyMuPDF (fitz)
Primary PDF text extraction — deterministic, no AI dependency
Multi-Format Support
python-docx, openpyxl, python-pptx, striprtf, odfpy — 10+ formats
Regex Classification Engine
15+ pattern groups with section header matching and sub-criterion extraction
Keyword-Score Classifier
Section keyword dictionary with weighted scoring for fallback classification
Google Gemini 2.5 Flash Lite
Primary LLM for professional rewrite, re-classification, and SEO generation
Multi-Provider Fallback
NVIDIA, Bedrock, OpenRouter, Groq — automatic failover if primary is unavailable
Redis Queue
Two queues manage document analysis and enhancement/rewrite jobs with retry logic
PostgreSQL + Prisma
Analysis results, confidence scores, and AI processing logs stored relationally
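To make the multi-provider fallback listed above more concrete, here is a minimal sketch of ordered failover. It assumes each provider is wrapped in a plain callable that takes a prompt and returns text; the `complete_with_failover` function and the provider wrappers are hypothetical, and no real provider SDK is called.

```python
from typing import Callable, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (name, prompt -> completion)


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has errored."""


def complete_with_failover(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order and return (provider_name, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name with the completion lets the processing log record which model actually produced each result.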
Section Classification System
Documents are classified into 15+ section types using a three-layer approach.
Layer 1: Regex header matching
The primary classifier. 15+ regex pattern groups match section headers (e.g. 'EVALUATION CRITERIA', 'B-BBEE REQUIREMENTS'). Extracts text under each header until the next section or end-of-document marker. This is deterministic and always reproducible.
Highest accuracy when headers are present
Layer 2: Keyword-score fallback
When no regex header match is found, a keyword dictionary scores content against known section types. Each section type has weighted keywords, and the highest-scoring match wins. This handles documents with non-standard formatting.
Good accuracy for documents with standard content but non-standard headers
Layer 3: LLM re-classification
Content that remains unclassified after layers 1 and 2 is sent to Gemini 2.5 Flash Lite for re-classification. The LLM receives raw text and returns classification into the extended taxonomy. This is a tertiary fallback, not the primary path.
Lower confidence; flagged in the analysis output when used
AI Rewrite Process (3-Pass)
After classification, a three-pass LLM workflow produces the final analysis.
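The three passes (section rewrite, tender-level summary, SEO metadata) might be chained roughly as below. The prompts, the `run_three_pass` name, and the choice to build the summary from the rewritten sections are assumptions for illustration; the `llm` argument stands in for whichever provider call is actually used.

```python
from dataclasses import dataclass
from typing import Callable, Dict

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns text


@dataclass
class RewriteResult:
    sections: Dict[str, str]
    summary: str
    seo_metadata: str


def run_three_pass(sections: Dict[str, str], llm: LLM) -> RewriteResult:
    """Run the rewrite, summary and SEO passes in sequence."""
    # Pass 1: rewrite each classified section for readability.
    rewritten = {
        name: llm(f"Rewrite for readability, keeping every fact:\n{text}")
        for name, text in sections.items()
    }
    # Pass 2: one tender-level summary (assumed to use the rewritten sections).
    summary = llm("Summarise this tender:\n" + "\n\n".join(rewritten.values()))
    # Pass 3: SEO metadata derived from the summary.
    seo_metadata = llm("Write an SEO title and description for:\n" + summary)
    return RewriteResult(rewritten, summary, seo_metadata)
```

Passing the model in as a callable keeps the workflow testable with a stub and independent of any single provider.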
Quality Assurance & Confidence Scoring
Every AI rewrite pass includes a quality gate that checks the LLM output against confidence and quality thresholds. Low-confidence output is rejected and logged for review. Our team monitors these logs to continuously improve the system.
Confidence scores (0.0–1.0) are published alongside every analysis, so you can see how reliably the system classified each document. Scores above 0.7 indicate strong pattern matches; lower scores suggest the document used non-standard formatting.
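A quality gate of the kind described above could look something like this sketch. Only the 0.0–1.0 range and the 0.7 interpretation come from the text; the rejection threshold, the minimum-length check, and the function names are hypothetical placeholders.

```python
from dataclasses import dataclass

STRONG_MATCH = 0.7  # from the text: scores above this indicate strong pattern matches


@dataclass
class GateDecision:
    publish: bool
    reason: str


def quality_gate(output: str, confidence: float,
                 min_confidence: float = 0.5, min_chars: int = 40) -> GateDecision:
    """Decide whether an LLM output may be published (thresholds are assumed)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within 0.0-1.0")
    if confidence < min_confidence:
        return GateDecision(False, "low confidence")
    if len(output.strip()) < min_chars:
        return GateDecision(False, "output too short")
    return GateDecision(True, "ok")


def describe_confidence(confidence: float) -> str:
    """Map a published score to the interpretation given in the text."""
    if confidence > STRONG_MATCH:
        return "strong pattern match"
    return "possible non-standard formatting"
```

Keeping the rejection reason in the decision object gives the review log something specific to act on.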
✓ Text extraction is fully deterministic (PyMuPDF, no AI)
✓ Regex classification is reproducible given the same document
✓ Keyword scoring uses a fixed dictionary — consistent results
✓ Original source documents always available alongside analysis
✗ LLM rewrites may occasionally omit context — always verify against source
✗ Confidence scores reflect pattern match strength, not factual accuracy
✗ Heavily scanned PDFs without text layers reduce extraction quality
✗ Very large documents (50KB+ text) are truncated before processing
Human Review Disclaimer
Tenders SA document analysis is a decision-support tool. While we strive for accuracy, AI-generated analysis does not replace reading the original tender documents.
Users should always verify critical information — including deadlines, evaluation criteria, and compliance requirements — against the official tender documents. The issuing organisation's documents are the sole authoritative source.
Our team monitors analysis quality and confidence logs to identify and address issues. If you find an analysis that appears incorrect, please report it so we can investigate.
Data Privacy Note
Tender documents are publicly available government procurement records. Our analysis processes these public documents to make them more accessible.
We do not store or process private user data as part of document analysis. AI prompts sent to LLM providers contain only the public document text and section classification instructions.
See our Privacy Policy for full details on how we handle your data.
Frequently Asked Questions
Our pipeline supports PDF (primary), DOCX, XLSX, XLS, PPTX, ODT, RTF, DOC, CSV, and TXT (plain text). All documents are extracted to plain text before AI analysis.
Classification uses three layers: regex header matching against 15+ section patterns (highest accuracy), keyword-score fallback for unrecognised headers, and LLM re-classification as a tertiary fallback for content that remains unclassified.
The system can identify 15+ section types, including description, requirements, B-BBEE, evaluation criteria, returnable documents, special conditions, contact information, briefing sessions, contractual terms, quality management, health and safety, environmental, methodology, experience and qualifications, and pricing schedules.
No. The initial extraction and classification use deterministic regex and keyword algorithms. AI (LLM) is used for professional rewriting of extracted sections, tender summary generation, and SEO metadata, and as a tertiary fallback to re-classify content that the deterministic layers could not classify. The underlying facts come from the source documents.
Confidence scores (0.0–1.0) reflect how well the extraction algorithms matched known patterns. Higher confidence means the system found clear section headers and structured data. Low confidence may indicate a document format the system handles less reliably. We publish confidence alongside every analysis.
Yes. Every analysis section includes source document provenance, and the original document is always available for download alongside the AI-enhanced view.