University POLITEHNICA of Bucharest
Faculty of Entrepreneurship, Business Engineering and Management
Management of Digital Enterprises
Faculty FAIMA
Master’s dissertation · May 2026

SAGABridge

Invoice Digitalization and Integration with SAGA

An open-source, privacy-first pipeline that turns PDF invoices into SAGA-ready XML in under 30 seconds — while running entirely on your own machine. Hybrid local extraction, cross-checked against ANAF and VIES, scored for risk, and packaged for direct accounting import.


§1 What it does

Every small-business accountant spends 6–10 hours a month keying invoice data into accounting software. That work is slow, error-prone, and lacks any verification step — if the supplier on the invoice doesn’t actually exist, the cost still hits the books and the discrepancy surfaces months later.

SAGABridge automates that entire flow. Upload an invoice PDF and the application extracts every field, verifies both supplier and customer against the Romanian ANAF registry and the EU VIES database, computes a heuristic risk score, and produces an XML compatible with SAGA’s import format.

What separates this project from commercial OCR products is the architectural choice to run nothing in the cloud: a local LLM via Ollama, local OCR via Tesseract, and verification via public registries. There is no OpenAI, no Google, and no commercial API in the production pipeline. The invoice itself never leaves the user’s machine; only the tax identifiers and company names used for registry lookups and web searches are sent out.

For a Romanian SME processing 200 invoices/month, the manual flow takes 6–10 hours. SAGABridge handles the same volume in under 2 hours of unattended runtime (200 invoices at under 30 seconds each is at most 100 minutes), with every counterparty verified along the way.

§2 What’s inside

Six pipeline stages, each chosen deliberately to balance speed, accuracy, and privacy. Every component is open-source.

01

Hybrid extraction

PyMuPDF first for digital PDFs — instant and free. Tesseract OCR fallback for scanned documents. Local LLM (Qwen 2.5) for semantic structuring of the resulting text.
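
The fallback logic above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the 50-character threshold, and the `ron+eng` Tesseract language setting are all assumptions. The heavy libraries are imported lazily so the dispatch logic can be exercised without them.

```python
MIN_CHARS = 50  # assumed threshold: less text than this suggests a scanned PDF


def extract_digital(pdf_path):
    """Read the embedded text layer with PyMuPDF (no OCR involved)."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_ocr(pdf_path, lang="ron+eng"):
    """Rasterize each page and run Tesseract on the resulting image."""
    import io

    import fitz
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=300)  # render the page as a bitmap
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            pages.append(pytesseract.image_to_string(image, lang=lang))
    return "\n".join(pages)


def extract_text(pdf_path, digital=extract_digital, ocr=extract_ocr,
                 min_chars=MIN_CHARS):
    """Return (text, engine): try the text layer first, OCR only if needed."""
    text = digital(pdf_path)
    if len(text.strip()) >= min_chars:
        return text, "pymupdf"
    return ocr(pdf_path), "tesseract"
```

Passing the two extractors as parameters keeps the cheap path and the OCR path independently replaceable; the returned engine name could be recorded alongside the extracted text.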

02

Dual-party verification

Both supplier and customer cross-checked in parallel against ANAF (for RO companies) and VIES (for any EU VAT number). Auto-fallback between providers.
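
One way to picture the parallel check with provider fallback is the sketch below. The provider callables are hypothetical stand-ins for the real ANAF and VIES clients, and routing on an `RO` prefix is an assumption about how Romanian identifiers are recognized.

```python
from concurrent.futures import ThreadPoolExecutor


def provider_order(vat_id):
    """Assumed routing: RO identifiers try ANAF, then VIES; others use VIES."""
    return ["anaf", "vies"] if vat_id.upper().startswith("RO") else ["vies"]


def verify_party(vat_id, providers):
    """Try each provider in order; fall through on errors or empty answers.

    `providers` maps a name to a callable that takes a VAT identifier and
    returns a record (dict), or None when the identifier is unknown.
    """
    errors = []
    for name in provider_order(vat_id):
        try:
            record = providers[name](vat_id)
        except Exception as exc:  # provider down, timeout, malformed reply
            errors.append(f"{name}: {exc}")
            continue
        if record is not None:
            return {"vat_id": vat_id, "verified": True,
                    "source": name, "record": record}
    return {"vat_id": vat_id, "verified": False, "source": None,
            "errors": errors}


def verify_both(supplier_vat, customer_vat, providers):
    """Run the supplier and customer lookups concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        supplier = pool.submit(verify_party, supplier_vat, providers)
        customer = pool.submit(verify_party, customer_vat, providers)
        return {"supplier": supplier.result(), "customer": customer.result()}
```

Threads suit this step because both lookups are network-bound, so the total wait is close to the slower of the two calls rather than their sum.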

03

Online presence audit

DuckDuckGo search plus direct domain probing finds the company’s official website, registry pages, social profiles, and press mentions — classified into four badges.
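
As an illustration of the classification step only (the search itself is left out), the sketch below sorts result URLs into four badges. It assumes the badges correspond to the four result types named above; the badge names and domain lists are hypothetical, and treating every unrecognized domain as a press mention is a deliberate simplification.

```python
from urllib.parse import urlparse

# Hypothetical domain lists, for illustration only.
REGISTRY_DOMAINS = {"anaf.ro", "onrc.ro", "ec.europa.eu"}
SOCIAL_DOMAINS = {"facebook.com", "linkedin.com", "instagram.com"}
BADGES = ("official_website", "registry_page", "social_profile", "press_mention")


def _host(url):
    """Lower-cased hostname without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def _matches(host, domains):
    """True if host equals one of the domains or is a subdomain of it."""
    return any(host == d or host.endswith("." + d) for d in domains)


def classify(url, official_domain=None):
    """Map one search-result URL to a badge name."""
    host = _host(url)
    if official_domain and _matches(host, {official_domain.lower()}):
        return "official_website"
    if _matches(host, REGISTRY_DOMAINS):
        return "registry_page"
    if _matches(host, SOCIAL_DOMAINS):
        return "social_profile"
    return "press_mention"  # simplification: anything else counts as a mention


def badges(urls, official_domain=None):
    """Return {badge: bool} telling which badges the result set earns."""
    found = {classify(u, official_domain) for u in urls}
    return {badge: badge in found for badge in BADGES}
```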

04

Heuristic risk score

0–100 additive score from verification status, tax-ID mismatch, company status, and negative news keywords. Bucketed into low/medium/high with explicit warnings.
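
A minimal sketch of such an additive score is shown below. The weights, bucket thresholds, and warning texts are illustrative assumptions; this page does not state the values the project actually uses.

```python
from dataclasses import dataclass

# Illustrative weights (they sum to 100); not the project's real values.
WEIGHTS = {
    "unverified": 40,
    "tax_id_mismatch": 25,
    "inactive_company": 25,
    "negative_news": 10,
}


@dataclass
class Signals:
    verified: bool            # found in ANAF or VIES
    tax_id_matches: bool      # invoice tax ID equals the registry record
    company_active: bool      # registry reports the company as active
    negative_news_hits: int = 0


def risk_score(signals):
    """Return (score, bucket, warnings) for one counterparty."""
    checks = [
        (not signals.verified, "unverified",
         "Counterparty could not be verified"),
        (not signals.tax_id_matches, "tax_id_mismatch",
         "Tax ID differs from the registry record"),
        (not signals.company_active, "inactive_company",
         "Company is not active in the registry"),
        (signals.negative_news_hits > 0, "negative_news",
         "Negative news keywords found"),
    ]
    score, warnings = 0, []
    for triggered, key, message in checks:
        if triggered:
            score += WEIGHTS[key]
            warnings.append(message)
    score = min(score, 100)  # keep the result on the 0-100 scale
    # Assumed thresholds: below 30 is low, below 60 is medium, else high.
    bucket = "low" if score < 30 else "medium" if score < 60 else "high"
    return score, bucket, warnings
```

Because the score is a plain sum, each warning maps to exactly one contribution, which keeps the result easy to explain to an accountant reviewing it.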

05

SAGA-ready XML

Deterministic ElementTree-based serialization. Includes the full verification dossier for both parties as an audit trail embedded in the document.
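
The sketch below shows what deterministic serialization with ElementTree can look like. The element and attribute names are placeholders, not SAGA's real import schema, which is not reproduced on this page. Determinism here comes from emitting fields in a fixed order and sorting the dossier keys.

```python
import xml.etree.ElementTree as ET

INVOICE_FIELDS = ("number", "date", "currency", "total")  # fixed output order


def build_invoice_xml(invoice, dossier):
    """Serialize an invoice plus its verification dossier to an XML string.

    Element names are hypothetical placeholders, not the SAGA schema.
    """
    root = ET.Element("Invoice")
    for key in INVOICE_FIELDS:
        ET.SubElement(root, key.capitalize()).text = str(invoice[key])

    audit = ET.SubElement(root, "Verification")
    for role in ("supplier", "customer"):  # fixed party order
        party = ET.SubElement(audit, "Party", role=role)
        for field in sorted(dossier[role]):  # sorted keys: stable output
            check = ET.SubElement(party, "Check", name=field)
            check.text = str(dossier[role][field])

    ET.indent(root)  # pretty-print in place (Python 3.9+)
    return ET.tostring(root, encoding="unicode")
```

With this approach the same inputs always produce byte-identical output, regardless of the order in which the dictionaries were filled, which makes the exported files easy to diff and to audit.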

06

100% on-device AI

Stack: Ollama with Qwen 2.5 3B, Tesseract OCR, ANAF/VIES APIs, DuckDuckGo. No cloud LLM is involved, and no invoice content leaves the user’s server. GDPR compliance by architecture, not by promise.
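
To make the on-device claim concrete, here is a hedged sketch of asking a local Ollama instance to structure invoice text. It uses Ollama's `/api/generate` endpoint on its default port; the prompt wording, the list of field names, and the `qwen2.5:3b` model tag are assumptions rather than the project's actual configuration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port
MODEL = "qwen2.5:3b"  # assumed model tag for Qwen 2.5 3B


def build_payload(invoice_text, model=MODEL):
    """Build the request body; 'format': 'json' asks for a JSON-only reply."""
    prompt = (
        "Extract the invoice fields as JSON with keys supplier_name, "
        "supplier_vat, customer_name, customer_vat, number, date, total.\n\n"
        + invoice_text
    )
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}


def parse_reply(body):
    """The non-streaming reply is JSON whose 'response' field holds the text."""
    return json.loads(json.loads(body)["response"])


def structure_invoice(invoice_text, url=OLLAMA_URL, timeout=120):
    """Send the text to the local model and return the extracted fields."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(invoice_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return parse_reply(reply.read())
```

The only network address involved is `localhost`, so the invoice text stays on the machine that runs Ollama.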

§3 Technology stack

Twelve components, all open-source, composed into a single pipeline. Each addresses a specific concern from the original problem statement.

§4 Try it live

The full application is hosted at the URL below. Drop an invoice PDF, watch the pipeline run, and download a SAGA-compatible XML with both parties verified.

Open the application  →
https://app.sagabridge.live

§5 Credits

Author: Ing. David-Adrian Băbţan
Scientific advisor: Conf. dr. ing. Silviu Răileanu
Place & date: Bucharest
Programme: Management of Digital Enterprises