Invoice Digitalization and Integration with SAGA
An open-source, privacy-first pipeline that turns PDF invoices into SAGA-ready XML in under 30 seconds — while running entirely on your own machine. Hybrid local extraction, cross-checked against ANAF and VIES, scored for risk, and packaged for direct accounting import.
Every small-business accountant spends 6–10 hours a month keying invoice data into accounting software. That work is slow, error-prone, and lacks any verification step: if the supplier on the invoice doesn’t actually exist, the cost still hits the books and the discrepancy surfaces months later.
SAGABridge automates that entire flow. Upload an invoice PDF and the application extracts every field, verifies both supplier and customer against the Romanian ANAF registry and the EU VIES database, computes a heuristic risk score, and produces an XML compatible with SAGA’s import format.
What separates this project from commercial OCR products is the architectural choice to run nothing in the cloud: a local LLM via Ollama, local OCR via Tesseract, and verification via public registries. No OpenAI, no Google, no commercial APIs in the production pipeline. The invoice never leaves the user’s machine.
Six pipeline stages, each chosen deliberately to balance speed, accuracy, and privacy. Every component is open-source.
PyMuPDF first for digital PDFs — instant and free. Tesseract OCR fallback for scanned documents. Local LLM (Qwen 2.5) for semantic structuring of the resulting text.
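As a rough illustration of the "text layer first, OCR as fallback" idea, here is a minimal Python sketch. The function names, the `min_chars` threshold, and the 300 dpi rendering are assumptions made for this example, not details taken from the project; the LLM structuring step is left out.

```python
from typing import Callable, Tuple


def pymupdf_text(pdf_path: str) -> str:
    """Read the embedded text layer with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def tesseract_text(pdf_path: str) -> str:
    """Rasterize each page and OCR it with Tesseract via pytesseract."""
    import io

    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    parts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            png = page.get_pixmap(dpi=300).tobytes("png")  # dpi is an assumption
            parts.append(pytesseract.image_to_string(Image.open(io.BytesIO(png))))
    return "\n".join(parts)


def extract_text(
    pdf_path: str,
    digital: Callable[[str], str] = pymupdf_text,
    ocr: Callable[[str], str] = tesseract_text,
    min_chars: int = 40,  # hypothetical cut-off for "this PDF has no usable text layer"
) -> Tuple[str, str]:
    """Return (text, method), falling back to OCR when the text layer is too thin."""
    text = digital(pdf_path).strip()
    if len(text) >= min_chars:
        return text, "pymupdf"
    return ocr(pdf_path).strip(), "tesseract"
```

Passing the two extractors in as parameters keeps the fallback decision testable without either library installed.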
Both supplier and customer cross-checked in parallel against ANAF (for RO companies) and VIES (for any EU VAT number). Auto-fallback between providers.
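The parallel check with provider fallback might look roughly like the sketch below. The provider callables are hypothetical stand-ins (no real ANAF or VIES request is made here), and the rule that a bare numeric ID is treated as Romanian is an assumption of this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

# A provider takes a tax ID and returns a registry record, or None if not found.
Provider = Callable[[str], Optional[dict]]


def provider_order(tax_id: str) -> List[str]:
    """RO tax IDs try ANAF first, then VIES; other EU IDs go to VIES only."""
    normalized = tax_id.strip().upper()
    if normalized.startswith("RO") or normalized.isdigit():  # assumption
        return ["anaf", "vies"]
    return ["vies"]


def verify_party(tax_id: str, providers: Dict[str, Provider]) -> dict:
    """Try each provider in order; the first one that returns a record wins."""
    for name in provider_order(tax_id):
        try:
            record = providers[name](tax_id)
        except Exception:
            continue  # provider unavailable: fall through to the next one
        if record is not None:
            return {"tax_id": tax_id, "verified": True, "source": name, "record": record}
    return {"tax_id": tax_id, "verified": False, "source": None, "record": None}


def verify_both(supplier_id: str, customer_id: str, providers: Dict[str, Provider]) -> dict:
    """Verify supplier and customer concurrently on two worker threads."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        supplier = pool.submit(verify_party, supplier_id, providers)
        customer = pool.submit(verify_party, customer_id, providers)
        return {"supplier": supplier.result(), "customer": customer.result()}
```

Threads fit here because both lookups are network-bound; each party still resolves its own fallback chain independently.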
DuckDuckGo search plus direct domain probing finds the company’s official website, registry pages, social profiles, and press mentions — classified into four badges.
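One plausible way to sort discovered URLs into the four badges is a simple host-based classifier. The domain lists, the name-to-slug heuristic, and the use of "press" as the catch-all category are all assumptions for illustration; the project's actual rules are not described here.

```python
import re
from urllib.parse import urlparse

# Hypothetical domain lists; a real deployment would curate these.
REGISTRY_DOMAINS = {"listafirme.ro", "termene.ro", "mfinante.gov.ro"}
SOCIAL_DOMAINS = {"facebook.com", "linkedin.com", "instagram.com"}
LEGAL_FORMS = {"sc", "srl", "sa", "pfa"}


def _host(url: str) -> str:
    """Lower-cased hostname without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def _matches(host: str, domains: set) -> bool:
    """True if host is one of the domains or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def _slug(company_name: str) -> str:
    """'SC Acme Trading SRL' -> 'acmetrading' (legal-form words dropped)."""
    words = re.findall(r"[a-z0-9]+", company_name.lower())
    return "".join(w for w in words if w not in LEGAL_FORMS)


def classify(url: str, company_name: str) -> str:
    """Return one of: 'registry', 'social', 'official', 'press'."""
    host = _host(url)
    if _matches(host, REGISTRY_DOMAINS):
        return "registry"
    if _matches(host, SOCIAL_DOMAINS):
        return "social"
    slug = _slug(company_name)
    if slug and slug in host.replace("-", ""):
        return "official"
    return "press"  # assumption: anything else counts as a press mention
```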
0–100 additive score from verification status, tax-ID mismatch, company status, and negative news keywords. Bucketed into low/medium/high with explicit warnings.
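An additive score of this shape can be sketched in a few lines. The point values and the bucket boundaries below are invented for the example; only the four inputs, the 0–100 range, and the low/medium/high buckets come from the description above.

```python
from typing import List, Tuple

# Hypothetical weights, chosen so the maximum adds up to exactly 100.
POINTS_UNVERIFIED = 40
POINTS_TAX_ID_MISMATCH = 30
POINTS_NOT_ACTIVE = 20
POINTS_PER_NEGATIVE_HIT = 5
MAX_NEWS_POINTS = 10


def risk_score(
    verified: bool,
    tax_id_matches: bool,
    company_active: bool,
    negative_hits: int,
) -> Tuple[int, str, List[str]]:
    """Return (score, bucket, warnings); each failed check adds points."""
    score = 0
    warnings: List[str] = []
    if not verified:
        score += POINTS_UNVERIFIED
        warnings.append("Company could not be verified in any registry")
    if not tax_id_matches:
        score += POINTS_TAX_ID_MISMATCH
        warnings.append("Tax ID on the invoice does not match the registry record")
    if not company_active:
        score += POINTS_NOT_ACTIVE
        warnings.append("Company is not listed as active")
    if negative_hits > 0:
        score += min(negative_hits * POINTS_PER_NEGATIVE_HIT, MAX_NEWS_POINTS)
        warnings.append(f"{negative_hits} search result(s) contain negative keywords")
    score = min(score, 100)
    # Hypothetical thresholds: below 30 is low, below 60 is medium, otherwise high.
    bucket = "low" if score < 30 else "medium" if score < 60 else "high"
    return score, bucket, warnings
```

For instance, a verified, active company whose tax ID does not match and which has one negative hit scores 30 + 5 = 35 under these weights, which lands in the medium bucket.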
Deterministic ElementTree-based serialization. Includes the full verification dossier for both parties as an audit trail embedded in the document.
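The sketch below shows one way to get byte-identical output from ElementTree: emit fields in sorted key order and serialize with fixed settings. The element names (`Invoice`, `Header`, `VerificationDossier`, `Party`) are placeholders for this example and are not SAGA's actual import schema; dictionary keys are assumed to be valid XML tag names.

```python
import xml.etree.ElementTree as ET


def invoice_to_xml(invoice: dict, dossier: dict) -> bytes:
    """Serialize invoice fields plus a per-party verification dossier."""
    root = ET.Element("Invoice")  # placeholder tag names throughout

    header = ET.SubElement(root, "Header")
    for key in sorted(invoice):  # sorted keys keep the output deterministic
        ET.SubElement(header, key).text = str(invoice[key])

    audit = ET.SubElement(root, "VerificationDossier")
    for role in ("supplier", "customer"):  # fixed party order
        party = ET.SubElement(audit, "Party", {"role": role})
        fields = dossier.get(role, {})
        for key in sorted(fields):
            ET.SubElement(party, key).text = str(fields[key])

    ET.indent(root)  # pretty-printing; requires Python 3.9+
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

Because nothing depends on dictionary insertion order or timestamps, the same inputs always produce the same bytes, which makes the file easy to diff and to hash for audit purposes.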
Stack: Ollama with Qwen 2.5 3B, Tesseract OCR, ANAF/VIES APIs, DuckDuckGo. Zero cloud LLM. Zero data leaving the user’s server. GDPR by architecture, not by promise.
Twelve components, all open-source, composed into a single pipeline. Each addresses a specific concern from the original problem statement.
The full application is hosted at the URL below. Drop an invoice PDF, watch the four-stage pipeline run, and download a SAGA-compatible XML with both parties verified.
Open the application →