We Open Sourced Our Tender PDF Extractor — Here's How It Extracts Critical Data from 100+ Page Government Documents

Meet Tender Extract, our open-source Python microservice that uses deterministic regex-based extraction to parse South African government tender PDFs, and learn how the extracted data flows into our AI enrichment pipeline.

Why We Built and Open Sourced a Tender PDF Extractor

Every day, South African government departments publish hundreds of tender documents as PDFs. These PDFs are the raw source of truth for procurement opportunities worth billions of rand. But there's a problem: these documents are unstructured, inconsistently formatted, and often run 50 to 200 pages. Extracting the information you actually need — closing dates, B-BBEE requirements, CIDB gradings, evaluation criteria, returnable documents — is a slow, error-prone manual process.

Today we're announcing the open-source release of Tender Extract (github.com/Tenders-SA/tender-extract), the Python microservice that powers the first stage of document processing on Tenders-SA.org. We're releasing it under an MIT license so that other developers, researchers, and procurement professionals can use, inspect, and contribute to the extraction engine.

The Problem with Government Tender PDFs

South African government tender PDFs come from dozens of different issuing authorities — national departments, provincial treasuries, municipalities, SOEs like Eskom and Transnet, and public entities. Each has its own document template, formatting conventions, and section naming. A "Scope of Work" in one document might be called "Description" in another, or "Terms of Reference" in a third. Key information like the tender number might appear in a header, a footer, or a dedicated field on page one.

The standard approach to handling this diversity is to throw an LLM at it: feed the entire PDF to an AI model and ask it to extract the fields you need. But that approach has downsides: it's expensive, slow, and inconsistent. A 200-page tender document can cost several cents per analysis, take 30+ seconds, and produce different results on different runs. For a platform processing thousands of tenders daily, that cost and latency add up fast.

Introducing Tender Extract: Deterministic, Fast, Open Source

Tender Extract is a FastAPI microservice that uses deterministic regex-based pattern matching (not AI) to extract structured data from South African government tender PDFs. It's designed to be fast, cheap, and reliable — running in under 2 seconds per document with zero API costs. It runs as a standalone HTTP service and can be deployed anywhere Docker runs.

The Algorithm: How Extraction Works

The extraction engine, implemented in app/extractor.py, follows a structured pipeline:

  1. Text Extraction — The PDF is opened with PyMuPDF (fitz). Each page is processed sequentially up to a configurable limit (default 30 pages). Text is extracted per page with positional sorting. If a page has no text but contains images, it's flagged as scanned (the service returns a 501 for pure scanned PDFs — it does not attempt OCR).
  2. Text Normalization — The raw text is normalized: Unicode NFKC normalization, soft hyphen removal, line-break hyphenation repair (e.g. "procure-\nment" becomes "procurement"), footer/page number removal, and whitespace normalization. The result is clean, predictable text for pattern matching.
  3. Section Detection — The normalized text is scanned with multiple regex patterns designed specifically for South African tender terminology. The engine identifies sections like: Description/Scope of Work, Requirements/Eligibility, B-BBEE/Preferential Procurement, Evaluation Criteria, Returnable Documents, Special Conditions, Contact Information, and Briefing Sessions.
  4. Field Extraction — Within and outside those sections, scalar values are extracted: tender number (using multiple pattern strategies), title, closing date, closing time, publication date, issuing organization, estimated value, CIDB grading, validity period, and contract period.
  5. Confidence Scoring — Each field contributes to a weighted confidence score (0.0–1.0). A complete extraction with title, organization, description, requirements, tender number, and closing date scores above 0.85. Low-confidence results signal the main application to run AI enrichment.
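The normalization and section-detection steps above can be sketched in plain Python. This is a minimal illustration, not the code in app/extractor.py: the `SECTION_ALIASES` map, the function names, and the exact regexes are assumptions chosen for the example, and only three section types are covered.

```python
import re
import unicodedata

SOFT_HYPHEN = "\u00ad"

# Hypothetical heading aliases; the real patterns live in app/extractor.py.
SECTION_ALIASES = {
    "description": ("scope of work", "description", "terms of reference"),
    "evaluation": ("evaluation criteria",),
    "returnables": ("returnable documents",),
}


def normalize(raw: str) -> str:
    """Apply the normalization steps from the list above to raw page text."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace(SOFT_HYPHEN, "")
    # Repair words hyphenated across a line break: "procure-\nment" -> "procurement".
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    # Drop lines that hold only a page marker such as "Page 3 of 120" or "17".
    text = re.sub(r"(?im)^[ \t]*(?:page[ \t]+)?\d+(?:[ \t]+of[ \t]+\d+)?[ \t]*$\n?", "", text)
    # Collapse runs of spaces/tabs and excess blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def detect_sections(text: str) -> dict[str, str]:
    """Split text into sections keyed by canonical name, using heading lines."""
    alias_to_key = {
        alias: key for key, aliases in SECTION_ALIASES.items() for alias in aliases
    }
    # A heading is a line holding only an optional number ("2." or "3.1") and an alias.
    heading = re.compile(
        r"(?im)^[ \t]*(?:\d+(?:\.\d+)*\.?[ \t]+)?("
        + "|".join(map(re.escape, alias_to_key))
        + r")[ \t]*:?[ \t]*$"
    )
    matches = list(heading.finditer(text))
    sections: dict[str, str] = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        key = alias_to_key[match.group(1).lower()]
        # Keep the first occurrence of each section.
        sections.setdefault(key, text[match.end():end].strip())
    return sections
```

Mapping several heading spellings to one canonical key is what lets "Terms of Reference" in one document and "Scope of Work" in another land in the same output field.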

The key design decision is that this microservice does no AI processing. It's pure Python and PyMuPDF. When the confidence is low (below 0.55) or critical fields are missing, the main application decides whether to fall through to our AI analysis pipeline. This keeps the extraction layer extremely cheap — we can process hundreds of documents per minute on a single small container.
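To make the scoring idea concrete, here is a small sketch of a weighted confidence score. The weights and field names are hypothetical, picked only so that the six fields named above sum to more than 0.85; the 0.55 threshold is the one quoted in the text. The actual weighting in the repository may differ.

```python
# Hypothetical weights (they sum to 100); not taken from the repository.
WEIGHTS = {
    "title": 15,
    "organization": 15,
    "description": 20,
    "requirements": 15,
    "tender_number": 15,
    "closing_date": 10,
    "cidb_grading": 5,
    "contact": 5,
}

LOW_CONFIDENCE = 0.55  # threshold quoted in the article


def confidence(fields: dict) -> float:
    """Return the share of total weight earned by fields that are present and non-empty."""
    earned = sum(weight for name, weight in WEIGHTS.items() if fields.get(name))
    return earned / sum(WEIGHTS.values())


complete = {
    "title": "Supply of fire equipment",
    "organization": "eThekwini Metropolitan Municipality",
    "description": "Supply and delivery of fire equipment.",
    "requirements": "Valid tax clearance",
    "tender_number": "SCM/2026/001-KZN",
    "closing_date": "2026-03-01",
}
sparse = {"title": "Supply of fire equipment"}

print(confidence(complete) > 0.85)          # True: the six core fields carry 90 of 100
print(confidence(sparse) < LOW_CONFIDENCE)  # True: only 15 of 100
```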

Quick Start

```bash
# Clone the repo
git clone https://github.com/Tenders-SA/tender-extract.git
cd tender-extract

# Install dependencies
pip install -r requirements.txt

# Start the service
uvicorn app.main:app --reload

# Extract from a PDF file
curl -X POST http://localhost:8000/v1/extract -F "[email protected]"

# Extract from a URL
curl -X POST http://localhost:8000/v1/extract \
# ... (truncated)
```

Or run it with Docker:

```bash
docker build -t tender-extract .
docker run -p 8080:8080 tender-extract
```

The Two-Phase Pipeline: Extraction Then AI Enrichment

Tender Extract is stage one of a two-stage document processing pipeline on Tenders-SA.org. Here's how the full pipeline works:

Stage 1: Deterministic Extraction (Tender Extract)

When a tender is ingested from an OCDS feed, our Cloudflare Worker fetches the PDF documents and stores them in Cloudflare R2 object storage. The Tender Extraction Integration Service (src/lib/services/tender-extraction-integration.service.ts) then sends each PDF URL to the Tender Extract API. The service returns a structured JSON response that can carry more than 20 fields.

This data is stored immediately in the database and made available on the platform. For many well-structured tenders, this is sufficient — the regex extraction captures everything a business needs to make a go/no-go decision.

Stage 2: AI Enrichment (Gemini/Groq)

When the extraction confidence is low, or when critical fields are missing, or when a tender is prioritized for deep analysis, the system triggers the AI enrichment pipeline. The Gemini Analysis Service (src/lib/services/gemini-analysis.service.ts) takes the raw extracted text (passed through as full_text in the extraction response) and uses Google Gemini (with Groq as fallback) to:

  • Generate a coherent plain-English summary of the tender scope
  • Extract mandatory requirements and compliance criteria
  • Categorize the tender by industry and procurement type
  • Estimate the tender value range based on document content
  • Calculate a confidence score for each AI-extracted field

The two-stage approach gives us the best of both worlds: speed and economy for well-structured documents (stage 1), and deep understanding for complex or poorly structured documents (stage 2). In production, roughly 30% of documents pass through both stages.
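The hand-off between the two stages can be sketched as a small decision function. The production integration lives in TypeScript services; this Python version is only an illustration, and `ExtractionResult`, `CRITICAL_FIELDS`, and `enrichment_reasons` are hypothetical names. The three triggers and the 0.55 threshold come from the description above.

```python
from dataclasses import dataclass, field

LOW_CONFIDENCE = 0.55  # threshold quoted in the article
# Hypothetical choice of "critical" fields for this sketch.
CRITICAL_FIELDS = ("tender_number", "closing_date")


@dataclass
class ExtractionResult:
    """Assumed shape of a stage 1 response, reduced to what routing needs."""
    confidence: float
    fields: dict = field(default_factory=dict)
    full_text: str = ""


def enrichment_reasons(result: ExtractionResult, prioritized: bool = False) -> list[str]:
    """Return why a document should go to stage 2; an empty list means stage 1 suffices."""
    reasons = []
    if result.confidence < LOW_CONFIDENCE:
        reasons.append("low_confidence")
    missing = [name for name in CRITICAL_FIELDS if not result.fields.get(name)]
    if missing:
        reasons.append("missing:" + ",".join(missing))
    if prioritized:
        reasons.append("prioritized")
    return reasons


good = ExtractionResult(0.9, {"tender_number": "SCM/2026/001-KZN", "closing_date": "2026-03-01"})
weak = ExtractionResult(0.4, {"tender_number": "SCM/2026/001-KZN"})

print(enrichment_reasons(good))                    # []
print(enrichment_reasons(weak))                    # ['low_confidence', 'missing:closing_date']
print(enrichment_reasons(good, prioritized=True))  # ['prioritized']
```

Returning the reasons rather than a bare boolean makes it easy to log why a given document incurred AI cost.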

| Stage | Technology | Speed | Cost | When Used |
| --- | --- | --- | --- | --- |
| 1. Extraction | PyMuPDF + Regex | < 2 seconds | ~$0.0001 | Every document |
| 2. AI Enrichment | Gemini / Groq | 5–30 seconds | ~$0.01–$0.10 | Low confidence or priority |
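A quick back-of-the-envelope calculation shows what these figures imply. The per-document costs and the roughly 30% stage 2 share are the ones quoted in this article; the volume of 1,000 documents per day is an illustrative assumption, and the costs are treated as per-document figures.

```python
def daily_cost(docs: int, stage2_share: float, stage1_cost: float, stage2_cost: float) -> float:
    """Every document pays for stage 1; only a share of them also pays for stage 2."""
    return docs * stage1_cost + docs * stage2_share * stage2_cost


DOCS_PER_DAY = 1_000              # illustrative volume, not a figure from the article
STAGE2_SHARE = 0.30               # "roughly 30%" pass through both stages
STAGE1_COST = 0.0001              # USD per document
STAGE2_COST_RANGE = (0.01, 0.10)  # USD per document, low and high estimates

two_stage = [
    daily_cost(DOCS_PER_DAY, STAGE2_SHARE, STAGE1_COST, cost) for cost in STAGE2_COST_RANGE
]
ai_only = [DOCS_PER_DAY * cost for cost in STAGE2_COST_RANGE]

print(f"two-stage: ${two_stage[0]:.2f} to ${two_stage[1]:.2f} per day")
print(f"AI-only:   ${ai_only[0]:.2f} to ${ai_only[1]:.2f} per day")
```

Under these assumptions the two-stage pipeline costs about $3.10 to $30.10 per 1,000 documents, against $10 to $100 if every document went straight to the AI stage.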

The raw extraction results are always preserved in the database alongside the AI-enriched fields. This means the system can re-run AI enrichment at any time without losing the original extraction, and users can see both the deterministic extraction confidence and the AI enrichment confidence.

What the Extractor Finds

Here are the fields the extractor can pull from a typical SA government tender PDF, along with examples of the SA-specific patterns it recognizes:

| Field | Example | SA-Specific Pattern |
| --- | --- | --- |
| Tender Number | SCM/2026/001-KZN | Handles reference numbers, RFQ/RFP/BID prefixes, OCDS identifiers |
| Issuing Organization | eThekwini Metropolitan Municipality | Recognizes SA municipalities, departments, SOEs, and public entities |
| CIDB Grading | 7CE or higher | CIDB category + grade pattern with 'or higher' variants |
| B-BBEE Level | Level 1 | B-BBEE contributor levels, preferential procurement thresholds |
| Preference System | 80/20 or 90/10 | PPPFA preference point systems specific to SA procurement law |
| Returnable Documents | SBD 1, SBD 4, MBD 6.1 | Standard Bidding Document (SBD) and Municipal Bidding Document (MBD) form numbers |
| Briefing Session | Compulsory site visit | Detects compulsory vs optional briefings, venue, date, time |
| Evaluation Criteria | 80/20 preference point system, functionality threshold 70% | SA procurement evaluation with stage 1/stage 2, functionality scoring |
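A few of the patterns in this table can be approximated with short regular expressions. The sketch below is an assumption about what such patterns might look like, not a copy of the ones in the repository: the names are hypothetical, and the CIDB pattern covers only a subset of class codes (CE, GB, ME, EP, EB).

```python
import re

# Grade 1-9 followed by a class-of-works code, optionally "or higher".
CIDB_RE = re.compile(r"\b([1-9])\s?(CE|GB|ME|EP|EB)\b((?i:\s+or\s+higher))?")
# PPPFA preference point systems.
PREFERENCE_RE = re.compile(r"\b(80/20|90/10)\b")
# Standard / Municipal Bidding Document form numbers, e.g. "SBD 4", "MBD 6.1".
FORM_RE = re.compile(r"\b(SBD|MBD)\s?(\d+(?:\.\d+)?)")
# A B-BBEE mention followed, within the same sentence, by "level N".
BBBEE_LEVEL_RE = re.compile(r"(?i)\bB-?BBEE\b[^.\n]*?\blevel\s+([1-8])\b")


def extract_sa_fields(text: str) -> dict:
    """Pull a handful of SA-specific scalar fields out of normalized tender text."""
    fields: dict = {}
    if m := CIDB_RE.search(text):
        fields["cidb_grading"] = m.group(1) + m.group(2)
        fields["cidb_or_higher"] = m.group(3) is not None
    if m := PREFERENCE_RE.search(text):
        fields["preference_system"] = m.group(1)
    if m := BBBEE_LEVEL_RE.search(text):
        fields["bbbee_level"] = int(m.group(1))
    forms = [f"{kind} {number}" for kind, number in FORM_RE.findall(text)]
    if forms:
        fields["returnable_forms"] = forms
    return fields


sample = (
    "Bidders must hold a CIDB grading of 7CE or higher. "
    "The 80/20 preference point system applies; B-BBEE Level 1 contributors "
    "score maximum points. Returnable documents: SBD 1, SBD 4, MBD 6.1."
)
print(extract_sa_fields(sample))
```

Because each pattern is a plain regex, the same input always yields the same output, which is the property that makes the extraction stage deterministic.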

Why We Open Sourced This

South Africa's public procurement system affects every business in the country, but the technical infrastructure around it remains fragmented. By open sourcing Tender Extract, we're aiming to:

  • Enable community contributions — SA procurement language evolves. New tender formats, new SOE templates, and new regulations mean the regex patterns need continuous maintenance. Community contributions help keep the extractor accurate.
  • Provide a reference implementation — Other platforms, researchers, and government agencies can use this as a starting point for their own document processing pipelines.
  • Improve transparency — Anyone can inspect exactly how extraction works, what patterns are used, and how confidence is calculated.
  • Reduce duplication — Every SA procurement tech platform solves this same problem independently. A shared, open-source foundation benefits the entire ecosystem.

Get Started

The Tender Extract repository is live on GitHub: github.com/Tenders-SA/tender-extract. It includes the full source code, Dockerfile, deployment configurations for Cloud Run and Render.com, API documentation, and integration guides.

If you're building on top of SA procurement data, you might also find our other open-source SDKs useful:

We welcome issues, feature requests, and pull requests. If you're working with SA government tender data and want to contribute patterns for a procurement category we haven't covered, open an issue or submit a PR.


https://www.tenders-sa.org/blog/open-source-tender-pdf-extractor