Why We Built and Open Sourced a Tender PDF Extractor

Every day, South African government departments publish hundreds of tender documents as PDFs. These PDFs are the raw source of truth for procurement opportunities worth billions of rand. But there's a problem: these documents are unstructured, inconsistently formatted, and often run 50 to 200 pages. Extracting the information you actually need — closing dates, B-BBEE requirements, CIDB gradings, evaluation criteria, returnable documents — is a slow, error-prone manual process.

Today we're announcing the open source release of Tender Extract (github.com/Tenders-SA/tender-extract), the Python microservice that powers the first stage of document processing on Tenders-SA.org. We're releasing it under an MIT license so that other developers, researchers, and procurement professionals can use, inspect, and contribute to the extraction engine.

The Problem with Government Tender PDFs

South African government tender PDFs come from dozens of different issuing authorities — national departments, provincial treasuries, municipalities, SOEs like Eskom and Transnet, and public entities. Each has its own document template, formatting conventions, and section naming. A "Scope of Work" in one document might be called "Description" in another, or "Terms of Reference" in a third. Key information like the tender number might appear in a header, a footer, or a dedicated field on page one.

The standard approach to handling this diversity is to throw an LLM at it: feed the entire PDF to an AI model and ask it to extract the fields you need. But that approach is expensive, slow, and inconsistent. A 200-page tender document can cost several cents per analysis, take 30+ seconds, and produce different results on different runs. For a platform processing thousands of tenders daily, that cost and latency add up fast.

Introducing Tender Extract: Deterministic, Fast, Open Source

Tender Extract is a FastAPI microservice that uses deterministic regex-based pattern matching (not AI) to extract structured data from South African government tender PDFs. It's designed to be fast, cheap, and reliable — running in under 2 seconds per document with zero API costs. It runs as a standalone HTTP service and can be deployed anywhere Docker runs.

The Algorithm: How Extraction Works

The extraction engine, implemented in app/extractor.py, follows a structured pipeline:

Text Extraction — The PDF is opened with PyMuPDF (fitz). Each page is processed sequentially up to a configurable limit (default 30 pages). Text is extracted per page with positional sorting. If a page has no text but contains images, it's flagged as scanned (the service returns a 501 for pure scanned PDFs — it does not attempt OCR).
Text Normalization — The raw text is normalized: Unicode NFKC normalization, soft hyphen removal, line-break hyphenation repair (e.g. "procure-\nment" becomes "procurement"), footer/page number removal, and whitespace normalization. The result is clean, predictable text for pattern matching.
Section Detection — The normalized text is scanned with multiple regex patterns designed specifically for South African tender terminology. The engine identifies sections like: Description/Scope of Work, Requirements/Eligibility, B-BBEE/Preferential Procurement, Evaluation Criteria, Returnable Documents, Special Conditions, Contact Information, and Briefing Sessions.
Field Extraction — Within and outside those sections, scalar values are extracted: tender number (using multiple pattern strategies), title, closing date, closing time, publication date, issuing organization, estimated value, CIDB grading, validity period, and contract period.
Confidence Scoring — Each field contributes to a weighted confidence score (0.0–1.0). A complete extraction with title, organization, description, requirements, tender number, and closing date scores above 0.85. Low-confidence results signal the main application to run AI enrichment.
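
To make the normalization and section detection steps concrete, here is a minimal sketch using only the Python standard library. It is not the code in app/extractor.py: the function names (normalize_text, detect_sections), the SECTION_ALIASES table, and every regex below are assumptions chosen for illustration. Note that this naive hyphenation repair would also join genuine hyphens that happen to fall at a line break.

```python
import re
import unicodedata

# Hypothetical patterns for illustration only; the real ones live in app/extractor.py.
PAGE_NUMBER_LINE = re.compile(r"^\s*page\s+\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE)
LINE_BREAK_HYPHEN = re.compile(r"(\w)-\n(\w)")

# Hypothetical alias table: canonical section name -> heading variants.
SECTION_ALIASES = {
    "description": ("scope of work", "description", "terms of reference"),
    "evaluation": ("evaluation criteria",),
    "returnables": ("returnable documents",),
}


def normalize_text(raw: str) -> str:
    """Apply the normalization steps listed above, in order."""
    text = unicodedata.normalize("NFKC", raw)      # fold ligatures, NBSP, etc.
    text = text.replace("\u00ad", "")              # drop soft hyphens
    text = LINE_BREAK_HYPHEN.sub(r"\1\2", text)    # "procure-\nment" -> "procurement"
    lines = [ln for ln in text.split("\n") if not PAGE_NUMBER_LINE.match(ln)]
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in lines]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def _build_heading_pattern() -> re.Pattern:
    """One named group per canonical section, matching a whole heading line."""
    groups = []
    for name, aliases in SECTION_ALIASES.items():
        variants = "|".join(re.escape(alias) for alias in aliases)
        groups.append(f"(?P<{name}>{variants})")
    alternatives = "|".join(groups)
    return re.compile(
        rf"^[ \t]*(?:\d+(?:\.\d+)*\.?[ \t]+)?(?:{alternatives})[ \t]*:?[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )


HEADING = _build_heading_pattern()


def detect_sections(text: str) -> dict[str, str]:
    """Return the body text found under the first heading of each known section."""
    matches = list(HEADING.finditer(text))
    sections: dict[str, str] = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # lastgroup is the name of the alias group that matched this heading.
        sections.setdefault(match.lastgroup, text[match.end():end].strip())
    return sections
```

Keeping the aliases in a plain table is one way to let "Scope of Work", "Description", and "Terms of Reference" all resolve to the same canonical section without touching the matching logic.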

The key design decision is that this microservice does no AI processing. It's pure Python and PyMuPDF. When the confidence is low (below 0.55) or critical fields are missing, the main application decides whether to fall through to our AI analysis pipeline. This keeps the extraction layer extremely cheap — we can process hundreds of documents per minute on a single small container.
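
A weighted confidence score of this kind can be sketched in a few lines. The weights and field names below are hypothetical: the only constraints taken from the description above are that the score runs from 0.0 to 1.0 and that the six core fields together score above 0.85.

```python
# Hypothetical weights for illustration; the real weighting is defined in the
# Tender Extract source. They are chosen so the six core fields sum to 0.90.
FIELD_WEIGHTS = {
    "title": 0.20,
    "organization": 0.15,
    "description": 0.15,
    "requirements": 0.10,
    "tender_number": 0.15,
    "closing_date": 0.15,
    "cidb_grading": 0.05,
    "estimated_value": 0.05,
}


def confidence_score(fields: dict) -> float:
    """Sum the weights of all non-empty fields, scaled to the 0.0-1.0 range."""
    total = sum(FIELD_WEIGHTS.values())
    present = sum(
        weight for name, weight in FIELD_WEIGHTS.items() if fields.get(name)
    )
    return round(present / total, 4)
```

With these assumed weights, a document where only the title and closing date were found would land well below the 0.55 threshold, while one with all six core fields would clear 0.85.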

Quick Start

```bash
# Clone the repo
git clone https://github.com/Tenders-SA/tender-extract.git
cd tender-extract

# Install dependencies
pip install -r requirements.txt

# Start the service
uvicorn app.main:app --reload

# Extract from a PDF file
curl -X POST http://localhost:8000/v1/extract -F "[email protected]"

# Extract from a URL
curl -X POST http://localhost:8000/v1/extract \
# ... (truncated)
```

Or run it with Docker:

```bash
docker build -t tender-extract .
docker run -p 8080:8080 tender-extract
```

The Two-Phase Pipeline: Extraction Then AI Enrichment

Tender Extract is stage one of a two-stage document processing pipeline on Tenders-SA.org. Here's how the full pipeline works:

Stage 1: Deterministic Extraction (Tender Extract)

When a tender is ingested from an OCDS feed, our Cloudflare Worker fetches the PDF documents and stores them in Cloudflare R2 object storage. The Tender Extraction Integration Service (src/lib/services/tender-extraction-integration.service.ts) then sends each PDF URL to the Tender Extract API. The service returns a structured JSON response that can contain more than 20 fields.

This data is stored immediately in the database and made available on the platform. For many well-structured tenders, this is sufficient — the regex extraction captures everything a business needs to make a go/no-go decision.

Stage 2: AI Enrichment (Gemini/Groq)

When the extraction confidence is low, or when critical fields are missing, or when a tender is prioritized for deep analysis, the system triggers the AI enrichment pipeline. The Gemini Analysis Service (src/lib/services/gemini-analysis.service.ts) takes the raw extracted text (passed through as full_text in the extraction response) and uses Google Gemini (with Groq as fallback) to:

Generate a coherent plain-English summary of the tender scope
Extract mandatory requirements and compliance criteria
Categorize the tender by industry and procurement type
Estimate the tender value range based on document content
Calculate a confidence score for each AI-extracted field

The two-stage approach gives us the best of both worlds: speed and economy for well-structured documents (stage 1), and deep understanding for complex or poorly structured documents (stage 2). In production, roughly 30% of documents pass through both stages.
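
The three triggers for stage 2 (low confidence, missing critical fields, or a prioritized tender) amount to a small routing rule. The sketch below is written in Python for consistency with the extractor, although the real decision is made in the main application; ExtractionResult, needs_ai_enrichment, and the choice of critical fields are all assumptions for illustration. Only the 0.55 threshold comes from the description above.

```python
from dataclasses import dataclass, field

LOW_CONFIDENCE_THRESHOLD = 0.55  # threshold quoted in this article
CRITICAL_FIELDS = ("tender_number", "closing_date")  # hypothetical choice


@dataclass
class ExtractionResult:
    """Hypothetical container for a stage 1 response."""
    confidence: float
    fields: dict = field(default_factory=dict)


def needs_ai_enrichment(result: ExtractionResult, prioritized: bool = False) -> bool:
    """Return True if the document should also go through stage 2."""
    if prioritized:
        return True
    if result.confidence < LOW_CONFIDENCE_THRESHOLD:
        return True
    # Any empty or absent critical field also triggers enrichment.
    return any(not result.fields.get(name) for name in CRITICAL_FIELDS)
```

Keeping this rule outside the extractor means the thresholds can change without redeploying the microservice.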

Stage	Technology	Speed	Cost	When Used
1. Extraction	PyMuPDF + Regex	< 2 seconds	~$0.0001	Every document
2. AI Enrichment	Gemini / Groq	5–30 seconds	~$0.01–$0.10	Low confidence or priority
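
As a back-of-the-envelope check, the approximate figures in the table can be combined with the roughly 30% stage 2 share to estimate a blended cost per document. This is only arithmetic on the quoted numbers, not a measured result, and the function name is an assumption.

```python
STAGE1_COST = 0.0001              # approx. USD per document, from the table
STAGE2_COST_RANGE = (0.01, 0.10)  # approx. USD per document, from the table
STAGE2_SHARE = 0.30               # rough share of documents reaching stage 2


def blended_cost_range(
    stage1: float = STAGE1_COST,
    stage2_range: tuple = STAGE2_COST_RANGE,
    share: float = STAGE2_SHARE,
) -> tuple:
    """Expected cost per document: every document pays stage 1,
    and only `share` of them also pay stage 2."""
    low, high = stage2_range
    return (stage1 + share * low, stage1 + share * high)
```

Under these assumptions the blended cost works out to roughly $0.0031 to $0.0301 per document, or about $3.10 to $30.10 per 1,000 documents, compared with $10 to $100 if every document were sent to the AI stage at the same per-document price.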

The raw extraction results are always preserved in the database alongside the AI-enriched fields. This means the system can re-run AI enrichment at any time without losing the original extraction, and users can see both the deterministic extraction confidence and the AI enrichment confidence.

What the Extractor Finds

Here are the fields the extractor can pull from a typical SA government tender PDF, along with examples of the SA-specific patterns it recognizes:

Field	Example	SA-Specific Pattern
Tender Number	SCM/2026/001-KZN	Handles reference numbers, RFQ/RFP/BID prefixes, OCDS identifiers
Issuing Organization	eThekwini Metropolitan Municipality	Recognizes SA municipalities, departments, SOEs, and public entities
CIDB Grading	7CE or higher	CIDB category + grade pattern with 'or higher' variants
B-BBEE Level	Level 1	B-BBEE contributor levels, preferential procurement thresholds
Preference System	80/20 or 90/10	PPPFA preference point systems specific to SA procurement law
Returnable Documents	SBD 1, SBD 4, MBD 6.1	Standard Bidding Document (SBD) and Municipal Bidding Document (MBD) form numbers
Briefing Session	Compulsory site visit	Detects compulsory vs optional briefings, venue, date, time
Evaluation Criteria	80/20 preference point system, functionality threshold 70%	SA procurement evaluation with stage 1/stage 2, functionality scoring
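
To give a feel for what SA-specific patterns look like, here is a simplified sketch for four of the fields in the table. These regexes are hypothetical and far less thorough than a production pattern set: the list of CIDB class codes in particular is an approximation, and the function and constant names are assumptions.

```python
import re

# Hypothetical, simplified patterns; not the ones shipped in Tender Extract.
# CIDB grading: grade 1-9 followed by a class code (assumed list), e.g. "7CE".
CIDB_GRADING = re.compile(r"\b([1-9])\s?(CE|GB|EB|EP|ME|S[B-Q])\b(\s+or\s+higher)?")
# PPPFA preference point systems.
PREFERENCE_SYSTEM = re.compile(r"\b(80/20|90/10)\b")
# Standard / Municipal Bidding Document form numbers, e.g. "SBD 4", "MBD 6.1".
BIDDING_FORMS = re.compile(r"\b(SBD|MBD)\s?(\d+(?:\.\d+)?)")
# B-BBEE contributor level mentioned shortly after the term itself.
BBBEE_LEVEL = re.compile(r"B-?BBEE[^.\n]{0,60}?level\s+([1-8])\b", re.IGNORECASE)


def extract_sa_fields(text: str) -> dict:
    """Pull a handful of SA procurement fields out of normalized text."""
    cidb = CIDB_GRADING.search(text)
    preference = PREFERENCE_SYSTEM.search(text)
    level = BBBEE_LEVEL.search(text)
    return {
        "cidb_grading": f"{cidb.group(1)}{cidb.group(2)}" if cidb else None,
        "cidb_or_higher": bool(cidb and cidb.group(3)),
        "preference_system": preference.group(1) if preference else None,
        "bbbee_level": int(level.group(1)) if level else None,
        "returnable_forms": [
            f"{kind} {number}" for kind, number in BIDDING_FORMS.findall(text)
        ],
    }
```

Each field returns None (or an empty list) when nothing matches, which is what lets a confidence score downstream distinguish "not found" from "found".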

Why We Open Sourced This

South Africa's public procurement system affects every business in the country, but the technical infrastructure around it remains fragmented. By open sourcing Tender Extract, we're aiming to:

Enable community contributions — SA procurement language evolves. New tender formats, new SOE templates, and new regulations mean the regex patterns need continuous maintenance. Community contributions help keep the extractor accurate.
Provide a reference implementation — Other platforms, researchers, and government agencies can use this as a starting point for their own document processing pipelines.
Improve transparency — Anyone can inspect exactly how extraction works, what patterns are used, and how confidence is calculated.
Reduce duplication — Every SA procurement tech platform solves this same problem independently. A shared, open-source foundation benefits the entire ecosystem.

Get Started

The Tender Extract repository is live on GitHub: github.com/Tenders-SA/tender-extract. It includes the full source code, Dockerfile, deployment configurations for Cloud Run and Render.com, API documentation, and integration guides.

If you're building on top of SA procurement data, you might also find our other open-source SDKs useful:

JavaScript SDK: github.com/Tenders-SA/js (@tenders-sa-org/sdk-js on npm)
Python SDK: github.com/Tenders-SA/python (tendersa-sdk on PyPI)
CLI Tool: github.com/Tenders-SA/cli (@tenders-sa-org/cli on npm)

We welcome issues, feature requests, and pull requests. If you're working with SA government tender data and want to contribute patterns for a procurement category we haven't covered, open an issue or submit a PR.

"
