Methodology

How AI Document Analysis Works: Transparent Approach

Understanding how our AI system extracts, classifies, and rewrites South African tender documents into actionable intelligence.

Purpose

What This Methodology Is For

Our AI document analysis turns raw tender PDFs and documents into structured, readable intelligence. Instead of manually reading dozens of pages, you get summarised sections, extracted requirements, and confidence-rated analysis — with the original source always accessible.

Pipeline

The Analysis Pipeline

A four-stage pipeline transforms government tender documents into structured intelligence.

1. Document Ingestion & Format Detection

Documents are fetched from government portals or uploaded directly. Format detection uses magic bytes, file extension, and MIME type analysis. Supported: PDF, DOCX, XLSX, PPTX, ODT, RTF, CSV, TXT, and more.
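
As a rough illustration of how the three signals could be combined, the sketch below checks magic bytes first and falls back to the extension and a MIME guess. The signature table and the `detect_format` helper are hypothetical and are not taken from the production pipeline.

```python
import mimetypes
from pathlib import Path

# Hypothetical signature table: leading bytes -> coarse format family.
MAGIC_SIGNATURES = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip-container",  # DOCX, XLSX, PPTX and ODT are ZIP archives
    b"{\\rtf": "rtf",
    b"\xd0\xcf\x11\xe0": "ole2",     # legacy DOC / XLS containers
}


def detect_format(filename: str, head: bytes) -> str:
    """Combine magic bytes, file extension, and a MIME guess into one label."""
    ext = Path(filename).suffix.lower().lstrip(".")
    mime, _ = mimetypes.guess_type(filename)
    family = next(
        (fmt for sig, fmt in MAGIC_SIGNATURES.items() if head.startswith(sig)),
        None,
    )

    if family in ("pdf", "rtf"):
        return family            # a content signature beats a wrong extension
    if family in ("zip-container", "ole2"):
        return ext or family     # the extension picks the member of the family
    # No signature hit: trust the extension only if the MIME table knows it.
    return ext if mime is not None else "unknown"
```

For example, a PDF saved with a `.txt` extension is still reported as `pdf`, because the content signature takes priority over the file name.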

2. Text Extraction (Deterministic)

PyMuPDF extracts text from PDFs, capped at 30 pages and 50KB of extracted text. All other formats are converted to plain text. No AI is used at this stage; extraction is purely algorithmic and reproducible.
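
The page and size caps can be pictured with the sketch below. It works on an iterable of per-page strings so that it stays independent of any PDF library. The helper name `cap_extracted_text`, and the reading of "50KB" as 50 * 1024 bytes of UTF-8 text, are assumptions.

```python
from typing import Iterable

MAX_PAGES = 30         # page cap stated above
MAX_BYTES = 50 * 1024  # assumption: "50KB" means 50 * 1024 bytes of UTF-8 text


def cap_extracted_text(
    page_texts: Iterable[str],
    max_pages: int = MAX_PAGES,
    max_bytes: int = MAX_BYTES,
) -> tuple[str, bool]:
    """Join page texts, stopping at the page cap or the byte cap.

    Returns (text, truncated).
    """
    chunks: list[str] = []
    used = 0
    truncated = False
    for index, text in enumerate(page_texts):
        if index >= max_pages:
            truncated = True
            break
        data = text.encode("utf-8")
        if used + len(data) > max_bytes:
            # Keep only what fits; drop a trailing character cut mid-codepoint.
            chunks.append(data[: max_bytes - used].decode("utf-8", errors="ignore"))
            truncated = True
            break
        chunks.append(text)
        used += len(data)
    return "".join(chunks), truncated
```

With PyMuPDF, the page strings could be supplied as `(page.get_text() for page in fitz.open(path))`. Because the function has no randomness or model calls, the same input always yields the same output.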

3. Section Classification (3-Layer)

Extracted text is classified into 15+ section types. Layer 1: regex header matching. Layer 2: keyword-score fallback. Layer 3: LLM re-classification for remaining unclassified content. Each layer produces a confidence score.

4. AI Rewrite (3-Pass LLM)

Classified sections undergo three LLM passes: section rewrite for readability, tender-level summary generation, and SEO metadata creation. A quality gate filters low-confidence output before publishing.

Technology

Our Technology Stack

PyMuPDF (fitz)

Primary PDF text extraction — deterministic, no AI dependency

Multi-Format Support

python-docx, openpyxl, python-pptx, striprtf, odfpy — 10+ formats

Regex Classification Engine

15+ pattern groups with section header matching and sub-criterion extraction

Keyword-Score Classifier

Section keyword dictionary with weighted scoring for fallback classification

Google Gemini 2.5 Flash Lite

Primary LLM for professional rewrite, re-classification, and SEO generation

Multi-Provider Fallback

NVIDIA, Bedrock, OpenRouter, Groq — automatic failover if primary is unavailable

Redis Queue

Two queues manage document analysis and enhancement/rewrite jobs with retry logic

PostgreSQL + Prisma

Analysis results, confidence scores, and AI processing logs stored relationally
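
The multi-provider fallback listed above can be sketched as an ordered list of callables that is tried until one succeeds. The names `complete_with_failover` and `AllProvidersFailed` are illustrative, and the provider callables are placeholders rather than real client SDK calls.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain raised an error."""


def complete_with_failover(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Return (provider_name, completion) from the first provider that succeeds."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the completion makes it possible to log which provider actually served each request.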

Classification

Section Classification System

Documents are classified into 15+ section types using a three-layer approach.

Description / Scope of Work
Requirements & Eligibility
B-BBEE Certification
Evaluation Criteria
Returnable Documents
Special Conditions
Contact Information
Briefing Sessions
Contractual Terms
Quality Management
Health & Safety
Environmental
Methodology
Experience & Qualifications
Pricing Schedule

Layer 1: Regex Header Matching

The primary classifier. 15+ regex pattern groups match section headers (e.g. 'EVALUATION CRITERIA', 'B-BBEE REQUIREMENTS'). Extracts text under each header until the next section or end-of-document marker. This is deterministic and always reproducible.

Highest accuracy when headers are present
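
A minimal sketch of header-driven splitting follows. Only two hypothetical pattern groups are shown, and the patterns assume upper-case headers on their own line, optionally numbered. The real pattern set is not published here.

```python
import re

# Hypothetical pattern groups; a header must fill the whole line.
HEADER_PATTERNS = {
    "evaluation_criteria": re.compile(
        r"^\s*(?:\d+[.)]?\s*)?EVALUATION\s+CRITERIA\s*:?\s*$"
    ),
    "bbbee": re.compile(
        r"^\s*(?:\d+[.)]?\s*)?B-?BBEE(?:\s+(?:REQUIREMENTS|CERTIFICATION))?\s*:?\s*$"
    ),
}


def split_by_headers(text: str) -> dict[str, str]:
    """Collect the lines under each recognised header until the next header."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        matched = next(
            (name for name, pattern in HEADER_PATTERNS.items() if pattern.match(line)),
            None,
        )
        if matched:
            current = matched
            sections.setdefault(current, [])
            continue
        if current:  # text before the first header is left unclassified
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}
```

Anchoring each pattern to the full line keeps a body sentence that merely starts with "B-BBEE" from being mistaken for a header.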

Layer 2: Keyword-Score Fallback

When no regex header match is found, a keyword dictionary scores content against known section types. Each section type has weighted keywords — the highest-scoring match wins. This handles documents with non-standard formatting.

Good accuracy for documents with standard content but non-standard headers
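
Weighted keyword scoring might look like the sketch below. The dictionary entries and weights are invented for illustration, and the confidence value (the winner's share of all matched weight) is one possible definition, not necessarily the one used in production.

```python
import re
from typing import Optional

# Hypothetical keyword weights for two section types.
SECTION_KEYWORDS = {
    "pricing_schedule": {"price": 2.0, "vat": 1.5, "rate": 1.0, "total": 0.5},
    "health_safety": {"safety": 2.0, "ppe": 1.5, "hazard": 1.5, "incident": 1.0},
}


def keyword_classify(text: str) -> tuple[Optional[str], float]:
    """Score text against each section's keywords; the highest score wins."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {
        section: sum(weights.get(token, 0.0) for token in tokens)
        for section, weights in SECTION_KEYWORDS.items()
    }
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # nothing matched: leave the text unclassified
    best = max(scores, key=scores.get)
    return best, scores[best] / total
```

Because the dictionary is fixed, the same text always produces the same section and score.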

Layer 3: LLM Re-Classification

Content that remains unclassified after layers 1 and 2 is sent to Gemini 2.5 Flash Lite for re-classification. The LLM receives the raw text and returns a classification from the extended taxonomy. This is a tertiary fallback, not the primary path.

Lower confidence — flagged in analysis output when used
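
The ordering of the three layers can be sketched as a simple cascade in which the first layer to return an answer wins. The layer functions below are stand-ins (no model is called), and the `flagged` field is a hypothetical way of marking results that came from the LLM layer.

```python
from typing import Callable, Optional

# A layer takes raw text and returns (section, confidence), or None to pass.
Layer = Callable[[str], Optional[tuple[str, float]]]


def classify_with_fallback(text: str, layers: list[tuple[str, Layer]]) -> dict:
    """Try each layer in order; the first one that answers wins."""
    for layer_name, classify in layers:
        result = classify(text)
        if result is not None:
            section, confidence = result
            return {
                "section": section,
                "confidence": confidence,
                "layer": layer_name,
                "flagged": layer_name == "llm",  # mark LLM-derived results
            }
    return {"section": None, "confidence": 0.0, "layer": None, "flagged": True}


# Stand-in layers for illustration only.
def regex_layer(text: str):
    return ("evaluation_criteria", 0.95) if "EVALUATION CRITERIA" in text else None


def keyword_layer(text: str):
    return None


def llm_layer(text: str):
    return ("special_conditions", 0.4)


LAYERS = [("regex", regex_layer), ("keyword", keyword_layer), ("llm", llm_layer)]
```

Calling `classify_with_fallback(text, LAYERS)` reaches `llm_layer` only when both deterministic layers return `None`, which mirrors the tertiary role described above.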

AI Processing

AI Rewrite Process (3-Pass)

After classification, a three-pass LLM workflow produces the final analysis.

Pass 1
Section Rewrite
Classified sections are rewritten into 7 core analysis fields: submission guidelines, evaluation criteria, important dates, contact information, technical specifications, financial requirements, and compliance requirements.
7 structured analysis fields
Pass 2
Tender Updates
Using the rewritten sections, the LLM generates a tender-level AI summary, key requirements list, and enriched title. This provides an at-a-glance understanding of the tender.
AI summary, key requirements, enriched title
Pass 3
SEO Generation
The final pass generates SEO metadata: title, description, keywords, and Open Graph / Twitter card fields. This makes tender pages discoverable through search engines.
6 SEO metadata fields
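
The way each pass feeds the next can be sketched as follows. The `SectionRewrite` dataclass mirrors the seven fields named in Pass 1. The prompts, the `run_three_passes` helper, and the assumption that classified text is already keyed by those field names are illustrative only, and just one of the six SEO fields is produced here.

```python
from dataclasses import dataclass, fields
from typing import Callable


@dataclass
class SectionRewrite:
    submission_guidelines: str = ""
    evaluation_criteria: str = ""
    important_dates: str = ""
    contact_information: str = ""
    technical_specifications: str = ""
    financial_requirements: str = ""
    compliance_requirements: str = ""


def run_three_passes(classified: dict[str, str], llm: Callable[[str], str]) -> dict:
    """Chain the passes: rewrite sections, summarise them, then derive SEO text."""
    # Pass 1: rewrite each available field for readability.
    rewrite = SectionRewrite(**{
        f.name: llm(f"Rewrite for readability ({f.name}):\n{classified[f.name]}")
        for f in fields(SectionRewrite)
        if f.name in classified
    })
    # Pass 2: summarise from the rewritten sections, not the raw text.
    combined = "\n\n".join(value for value in vars(rewrite).values() if value)
    summary = llm(f"Summarise this tender:\n{combined}")
    # Pass 3: derive search metadata from the summary.
    seo_description = llm(f"Write a search description:\n{summary}")
    return {"rewrite": rewrite, "summary": summary, "seo_description": seo_description}
```

Passing the model in as a plain callable keeps the sketch testable with a stub and independent of any particular provider.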

Quality

Quality Assurance & Confidence Scoring

Quality Gate

Every AI rewrite pass includes a quality gate that checks the LLM output against confidence and quality thresholds. Low-confidence output is rejected and logged for review. Our team monitors these logs to continuously improve the system.

Confidence scores (0.0–1.0) are published alongside every analysis, so you can see how reliably the system classified each document. Scores above 0.7 indicate strong pattern matches; lower scores suggest the document used non-standard formatting.
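
A quality gate of this kind could be sketched as below. The 0.7 boundary for a strong match comes from the text above. The rejection threshold of 0.5 and the `quality_gate` helper are assumptions, since the actual gate threshold is not published.

```python
STRONG_MATCH = 0.7    # "scores above 0.7 indicate strong pattern matches"
GATE_THRESHOLD = 0.5  # hypothetical rejection threshold, not a published value


def quality_gate(output: str, confidence: float, threshold: float = GATE_THRESHOLD) -> dict:
    """Accept or reject one LLM output and label its confidence band."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within 0.0-1.0")
    accepted = bool(output.strip()) and confidence >= threshold
    band = "strong" if confidence > STRONG_MATCH else "lower"
    return {"accepted": accepted, "band": band, "confidence": confidence}
```

Rejected outputs would then be logged for review rather than published, as described above.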

Deterministic Guarantees

✓ Text extraction is fully deterministic (PyMuPDF, no AI)

✓ Regex classification is reproducible given the same document

✓ Keyword scoring uses a fixed dictionary — consistent results

✓ Original source documents always available alongside analysis

AI Limitations

✗ LLM rewrites may occasionally omit context — always verify against source

✗ Confidence scores reflect pattern match strength, not factual accuracy

✗ Heavily scanned PDFs without text layers reduce extraction quality

✗ Very large documents (50KB+ text) are truncated before processing

Quality Assurance

Human Review Disclaimer

Tenders SA document analysis is a decision-support tool. While we strive for accuracy, AI-generated analysis does not replace reading the original tender documents.

Users should always verify critical information — including deadlines, evaluation criteria, and compliance requirements — against the official tender documents. The issuing organisation's documents are the sole authoritative source.

Our team monitors analysis quality and confidence logs to identify and address issues. If you find an analysis that appears incorrect, please report it so we can investigate.

Privacy

Data Privacy Note

Tender documents are publicly available government procurement records. Our analysis processes these public documents to make them more accessible.

We do not store or process private user data as part of document analysis. AI prompts sent to LLM providers contain only the public document text and section classification instructions.

See our Privacy Policy for full details on how we handle your data.

Important

Limitations

AI analysis may miss context or nuance present in the original document — always verify
Scanned PDFs without embedded text layers produce lower-quality extraction
Documents over 50KB of extracted text are truncated; very large documents may be incomplete
Section classification accuracy depends on document formatting; non-standard layouts may score lower
The system cannot guarantee that all relevant sections are identified or correctly classified
Award decisions, eligibility determinations, and compliance requirements must be verified with the issuing organisation
Language detection is limited; documents in languages other than English may produce unreliable analysis

Questions

Frequently Asked Questions

What document formats are supported?

Our pipeline supports PDF (primary), DOCX, DOC, XLSX, XLS, PPTX, ODT, RTF, CSV, and TXT (plain text). All documents are extracted to plain text before AI analysis.

How are document sections classified?

Classification uses three layers: regex header matching against 15+ section patterns (highest accuracy), keyword-score fallback for unrecognised headers, and LLM re-classification as a tertiary fallback for content that remains unclassified.

What sections can the system identify?

The system can identify up to 15 section types including description, requirements, B-BBEE, evaluation criteria, returnable documents, special conditions, contact information, briefing sessions, contractual terms, quality management, health and safety, environmental, methodology, experience and qualifications, and pricing schedules.

Is the analysis purely AI-generated?

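No. Text extraction and the first two classification layers are deterministic, using regex and keyword algorithms. The LLM is used for professional rewriting of extracted sections, tender summary generation, and SEO metadata, and as a tertiary fallback to re-classify content the deterministic layers could not place. The underlying facts come from the source documents.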

How accurate are the confidence scores?

Confidence scores (0.0–1.0) reflect how well the extraction algorithms matched known patterns. Higher confidence means the system found clear section headers and structured data. Low confidence may indicate a document format the system handles less reliably. We publish confidence alongside every analysis.

Can I see the original document alongside the analysis?

Yes. Every analysis section includes source document provenance, and the original document is always available for download alongside the AI-enhanced view.
