We Open Sourced Our Tender PDF Extractor — Here's How It Extracts Critical Data from 100+ Page Government Documents
Meet Tender Extract, our open-source Python microservice that uses deterministic regex-based extraction to parse South African government tender PDFs, and learn how the extracted data flows into our AI enrichment pipeline.
Why We Built and Open Sourced a Tender PDF Extractor
Every day, South African government departments publish hundreds of tender documents as PDFs. These PDFs are the raw source of truth for procurement opportunities worth billions of rand. But there's a problem: these documents are unstructured, inconsistently formatted, and often run 50 to 200 pages. Extracting the information you actually need — closing dates, B-BBEE requirements, CIDB gradings, evaluation criteria, returnable documents — is a slow, error-prone manual process.
Today we're announcing the open-source release of Tender Extract (github.com/Tenders-SA/tender-extract), the Python microservice that powers the first stage of document processing on Tenders-SA.org. We're releasing it under the MIT license so that other developers, researchers, and procurement professionals can use, inspect, and contribute to the extraction engine.
The Problem with Government Tender PDFs
South African government tender PDFs come from dozens of different issuing authorities — national departments, provincial treasuries, municipalities, SOEs like Eskom and Transnet, and public entities. Each has its own document template, formatting conventions, and section naming. A "Scope of Work" in one document might be called "Description" in another, or "Terms of Reference" in a third. Key information like the tender number might appear in a header, a footer, or a dedicated field on page one.
The standard approach to handling this diversity is to throw an LLM at it, feeding the entire PDF to an AI model and asking it to extract the fields you need. But that approach has downsides: it's expensive, slow, and inconsistent. A 200-page tender document can cost several cents per analysis, take 30+ seconds, and produce different results on different runs. For a platform processing thousands of tenders daily, that cost and latency add up fast.
Introducing Tender Extract: Deterministic, Fast, Open Source
Tender Extract is a FastAPI microservice that uses deterministic regex-based pattern matching (not AI) to extract structured data from South African government tender PDFs. It's designed to be fast, cheap, and reliable — running in under 2 seconds per document with zero API costs. It runs as a standalone HTTP service and can be deployed anywhere Docker runs.
The Algorithm: How Extraction Works
The extraction engine, implemented in app/extractor.py, follows a structured pipeline:
- Text Extraction — The PDF is opened with PyMuPDF (fitz). Each page is processed sequentially up to a configurable limit (default 30 pages). Text is extracted per page with positional sorting. If a page has no text but contains images, it's flagged as scanned (the service returns a 501 for pure scanned PDFs — it does not attempt OCR).
- Text Normalization — The raw text is normalized: Unicode NFKC normalization, soft hyphen removal, line-break hyphenation repair (e.g. "procure-\nment" becomes "procurement"), footer/page number removal, and whitespace normalization. The result is clean, predictable text for pattern matching.
- Section Detection — The normalized text is scanned with multiple regex patterns designed specifically for South African tender terminology. The engine identifies sections like: Description/Scope of Work, Requirements/Eligibility, B-BBEE/Preferential Procurement, Evaluation Criteria, Returnable Documents, Special Conditions, Contact Information, and Briefing Sessions.
- Field Extraction — Within and outside those sections, scalar values are extracted: tender number (using multiple pattern strategies), title, closing date, closing time, publication date, issuing organization, estimated value, CIDB grading, validity period, and contract period.
- Confidence Scoring — Each field contributes to a weighted confidence score (0.0–1.0). A complete extraction with title, organization, description, requirements, tender number, and closing date scores above 0.85. Low-confidence results signal the main application to run AI enrichment.
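As a rough illustration of the normalization and section-detection steps above, here is a minimal sketch that uses only the Python standard library. The function names, the `SECTION_ALIASES` table, and the regexes are assumptions made for this example; the patterns that actually ship live in app/extractor.py and are likely more extensive.

```python
import re
import unicodedata

# Hypothetical heading aliases, based on the naming variations described
# earlier in this article. Not the list used in app/extractor.py.
SECTION_ALIASES = {
    "description": ["scope of work", "description", "terms of reference"],
    "evaluation": ["evaluation criteria"],
    "returnables": ["returnable documents"],
}

# A line holding only a page marker such as "12" or "Page 3 of 40".
PAGE_LINE = re.compile(
    r"^[ \t]*(?:page[ \t]+)?\d+(?:[ \t]+of[ \t]+\d+)?[ \t]*$", re.I | re.M
)


def normalize(raw: str) -> str:
    """Apply the normalization steps in the order listed above."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace("\u00ad", "")  # soft hyphens
    # Repair words split across lines. Lowercase only, so that a term
    # like "B-BBEE" broken after the hyphen is left untouched.
    text = re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)
    text = PAGE_LINE.sub("", text)          # drop page-number lines
    text = re.sub(r"[ \t]+", " ", text)     # collapse spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # cap consecutive blank lines
    return text.strip()


def detect_sections(text: str) -> dict[str, str]:
    """Split normalized text at lines that look like known headings."""
    alias_to_key = {a: k for k, names in SECTION_ALIASES.items() for a in names}
    alternatives = "|".join(
        re.escape(a) for a in sorted(alias_to_key, key=len, reverse=True)
    )
    # Optional outline number ("1.", "2.3"), then a known heading, alone on a line.
    heading = re.compile(
        rf"^[ \t]*(?:\d+(?:\.\d+)*\.?[ \t]+)?({alternatives})[ \t]*:?[ \t]*$",
        re.I | re.M,
    )
    matches = list(heading.finditer(text))
    sections: dict[str, str] = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        key = alias_to_key[m.group(1).lower()]
        sections.setdefault(key, text[m.end():end].strip())  # keep first hit
    return sections


RAW = (
    "1. SCOPE OF WORK\n"
    "Supply and deliv-\nery of water me\u00adters.\n"
    "\nPage 3 of 40\n\n"
    "2. Evaluation Criteria\n"
    "80/20 preference  point system\n"
)
for name, body in detect_sections(normalize(RAW)).items():
    print(f"{name}: {body}")
```

Mapping several heading spellings to one key is what lets "Scope of Work", "Description", and "Terms of Reference" all land in the same output field.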
The key design decision is that this microservice does no AI processing. It's pure Python and PyMuPDF. When the confidence is low (below 0.55) or critical fields are missing, the main application decides whether to fall through to our AI analysis pipeline. This keeps the extraction layer extremely cheap — we can process hundreds of documents per minute on a single small container.
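To make the scoring idea concrete, here is a small sketch of a weighted confidence score. The individual weights and field names are invented for illustration; the only figures taken from this article are the 0.0 to 1.0 range, the "above 0.85" score for a complete core extraction, and the 0.55 low-confidence threshold.

```python
# Hypothetical weights chosen so that the six core fields named above
# score above 0.85 together. The real weights are in the repository.
FIELD_WEIGHTS = {
    "title": 0.15,
    "organization": 0.15,
    "description": 0.15,
    "requirements": 0.15,
    "tender_number": 0.15,
    "closing_date": 0.15,
    "closing_time": 0.05,
    "cidb_grading": 0.05,
}

LOW_CONFIDENCE = 0.55  # threshold quoted in the text above


def confidence(fields: dict) -> float:
    """Return the weighted share of fields that were extracted (0.0-1.0)."""
    found = sum(w for name, w in FIELD_WEIGHTS.items() if fields.get(name))
    return round(found / sum(FIELD_WEIGHTS.values()), 4)


partial = {"title": "Supply of water meters", "organization": "eThekwini"}
score = confidence(partial)
print(score, "low" if score < LOW_CONFIDENCE else "ok")
```

Because the score is a plain weighted sum, the same PDF always produces the same number, which keeps the fall-through decision reproducible.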
Quick Start
```bash
# Clone the repo
git clone https://github.com/Tenders-SA/tender-extract.git
cd tender-extract

# Install dependencies
pip install -r requirements.txt

# Start the service
uvicorn app.main:app --reload

# Extract from a PDF file
curl -X POST http://localhost:8000/v1/extract -F "[email protected]"

# Extract from a URL
curl -X POST http://localhost:8000/v1/extract \
# ... (truncated)
```
Or run it with Docker:
```bash
docker build -t tender-extract .
docker run -p 8080:8080 tender-extract
```
The Two-Phase Pipeline: Extraction Then AI Enrichment
Tender Extract is stage one of a two-stage document processing pipeline on Tenders-SA.org. Here's how the full pipeline works:
Stage 1: Deterministic Extraction (Tender Extract)
When a tender is ingested from an OCDS feed, our Cloudflare Worker fetches the PDF documents and stores them in Cloudflare R2 object storage. The Tender Extraction Integration Service (src/lib/services/tender-extraction-integration.service.ts) then sends each PDF URL to the Tender Extract API. The service returns a structured JSON response that can contain more than 20 fields.
This data is stored immediately in the database and made available on the platform. For many well-structured tenders, this is sufficient — the regex extraction captures everything a business needs to make a go/no-go decision.
Stage 2: AI Enrichment (Gemini/Groq)
When the extraction confidence is low, or when critical fields are missing, or when a tender is prioritized for deep analysis, the system triggers the AI enrichment pipeline. The Gemini Analysis Service (src/lib/services/gemini-analysis.service.ts) takes the raw extracted text (passed through as full_text in the extraction response) and uses Google Gemini (with Groq as fallback) to:
- Generate a coherent plain-English summary of the tender scope
- Extract mandatory requirements and compliance criteria
- Categorize the tender by industry and procurement type
- Estimate the tender value range based on document content
- Calculate a confidence score for each AI-extracted field
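The three triggers for stage 2 (low confidence, missing critical fields, or a priority flag) can be sketched as a single decision function. This is an illustration in Python rather than the TypeScript used by the main application. The response keys (`confidence`, `tender_number`, `closing_date`) and the choice of critical fields are assumptions; only `full_text` is named in this article.

```python
LOW_CONFIDENCE = 0.55  # threshold quoted earlier in this article

# Hypothetical choice of fields treated as critical.
CRITICAL_FIELDS = ("tender_number", "closing_date")


def enrichment_reasons(result: dict, prioritized: bool = False) -> list[str]:
    """Return why a document should go to AI enrichment (empty list = skip)."""
    reasons = []
    if result.get("confidence", 0.0) < LOW_CONFIDENCE:
        reasons.append("low_confidence")
    missing = [name for name in CRITICAL_FIELDS if not result.get(name)]
    if missing:
        reasons.append("missing:" + ",".join(missing))
    if prioritized:
        reasons.append("prioritized")
    return reasons


extraction = {"confidence": 0.9, "tender_number": "SCM/2026/001-KZN"}
print(enrichment_reasons(extraction))
```

Returning the reasons instead of a bare boolean makes it easy to log why a given tender incurred the AI cost.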
The two-stage approach gives us the best of both worlds: speed and economy for well-structured documents (stage 1), and deep understanding for complex or poorly structured documents (stage 2). In production, roughly 30% of documents pass through both stages.
| Stage | Technology | Speed | Cost | When Used |
|---|---|---|---|---|
| 1. Extraction | PyMuPDF + Regex | < 2 seconds | ~$0.0001 | Every document |
| 2. AI Enrichment | Gemini / Groq | 5–30 seconds | ~$0.01–$0.10 | Low confidence or priority |
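
As a back-of-the-envelope check, the approximate figures in the table can be combined into a blended per-document cost. The calculation below assumes the rough 30% stage-2 share mentioned above and treats the table's costs as per-document values; it is an estimate, not a measured result.

```python
STAGE1_COST = 0.0001              # USD, every document
STAGE2_COST_RANGE = (0.01, 0.10)  # USD, only documents that reach stage 2
STAGE2_SHARE = 0.30               # rough share of documents reaching stage 2


def blended_cost(stage2_cost: float, share: float = STAGE2_SHARE) -> float:
    """Expected cost per document when only a share reaches stage 2."""
    return STAGE1_COST + share * stage2_cost


low, high = (blended_cost(cost) for cost in STAGE2_COST_RANGE)
print(f"two-stage: ${low:.4f} to ${high:.4f} per document")
print(f"AI-only:   ${STAGE2_COST_RANGE[0]:.4f} to ${STAGE2_COST_RANGE[1]:.4f} per document")
```

Under these assumptions the blended cost works out to roughly $0.003 to $0.03 per document, or about $3 to $30 per 1,000 documents, compared with $10 to $100 if every document went straight to the AI stage.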
The raw extraction results are always preserved in the database alongside the AI-enriched fields. This means the system can re-run AI enrichment at any time without losing the original extraction, and users can see both the deterministic extraction confidence and the AI enrichment confidence.
What the Extractor Finds
Here are the fields the extractor can pull from a typical SA government tender PDF, along with examples of the SA-specific patterns it recognizes:
| Field | Example | SA-Specific Pattern |
|---|---|---|
| Tender Number | SCM/2026/001-KZN | Handles reference numbers, RFQ/RFP/BID prefixes, OCDS identifiers |
| Issuing Organization | eThekwini Metropolitan Municipality | Recognizes SA municipalities, departments, SOEs, and public entities |
| CIDB Grading | 7CE or higher | CIDB category + grade pattern with 'or higher' variants |
| B-BBEE Level | Level 1 | B-BBEE contributor levels, preferential procurement thresholds |
| Preference System | 80/20 or 90/10 | PPPFA preference point systems specific to SA procurement law |
| Returnable Documents | SBD 1, SBD 4, MBD 6.1 | Standard Bidding Document (SBD) and Municipal Bidding Document (MBD) form numbers |
| Briefing Session | Compulsory site visit | Detects compulsory vs optional briefings, venue, date, time |
| Evaluation Criteria | 80/20 preference point system, functionality threshold 70% | SA procurement evaluation with stage 1/stage 2, functionality scoring |
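
To give a flavour of what such patterns can look like, here is a small sketch covering three rows of the table: CIDB grading, the preference point system, and SBD/MBD form numbers. The regexes are written for this example and are not copied from the repository. The CIDB class codes listed are only a subset, and the output keys are hypothetical.

```python
import re

# Illustrative patterns only. Class codes are a subset (civil, building,
# mechanical, electrical); the shipped patterns may cover more.
CIDB_RE = re.compile(r"\b([1-9])\s?(CE|GB|ME|EP|EB)\b((?i:\s+or\s+higher))?")
PREFERENCE_RE = re.compile(r"\b(80/20|90/10)\b")
FORM_RE = re.compile(r"\b(SBD|MBD)\s?(\d+(?:\.\d+)?)\b")


def extract_sa_fields(text: str) -> dict:
    """Pull a few SA-specific values out of normalized tender text."""
    out: dict = {}
    cidb = CIDB_RE.search(text)
    if cidb:
        out["cidb_grading"] = cidb.group(1) + cidb.group(2)
        out["cidb_or_higher"] = cidb.group(3) is not None
    out["preference_systems"] = sorted(set(PREFERENCE_RE.findall(text)))
    out["returnable_forms"] = [f"{kind} {num}" for kind, num in FORM_RE.findall(text)]
    return out


sample = (
    "Bidders must hold a CIDB grading of 7CE or higher. "
    "The 80/20 preference point system applies. "
    "Return SBD 1, SBD 4 and MBD 6.1 with the bid."
)
print(extract_sa_fields(sample))
```

Anchoring each pattern on word boundaries and a closed set of codes keeps false positives down, at the price of having to extend the sets when a new template appears.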
Why We Open Sourced This
South Africa's public procurement system affects every business in the country, but the technical infrastructure around it remains fragmented. By open sourcing Tender Extract, we're aiming to:
- Enable community contributions — SA procurement language evolves. New tender formats, new SOE templates, and new regulations mean the regex patterns need continuous maintenance. Community contributions help keep the extractor accurate.
- Provide a reference implementation — Other platforms, researchers, and government agencies can use this as a starting point for their own document processing pipelines.
- Improve transparency — Anyone can inspect exactly how extraction works, what patterns are used, and how confidence is calculated.
- Reduce duplication — Every SA procurement tech platform solves this same problem independently. A shared, open-source foundation benefits the entire ecosystem.
Get Started
The Tender Extract repository is live on GitHub: github.com/Tenders-SA/tender-extract. It includes the full source code, Dockerfile, deployment configurations for Cloud Run and Render.com, API documentation, and integration guides.
If you're building on top of SA procurement data, you might also find our other open-source SDKs useful:
- JavaScript SDK: github.com/Tenders-SA/js (@tenders-sa-org/sdk-js on npm)
- Python SDK: github.com/Tenders-SA/python (tendersa-sdk on PyPI)
- CLI Tool: github.com/Tenders-SA/cli (@tenders-sa-org/cli on npm)
We welcome issues, feature requests, and pull requests. If you're working with SA government tender data and want to contribute patterns for a procurement category we haven't covered, open an issue or submit a PR.