AI OCR Invoice Data Extraction with a Simple Python Script

Learn how to automate invoice data extraction using AI-based OCR and Python with FlowHunt’s API, enabling fast, accurate, and scalable document processing.

What is AI-Based OCR?

AI-driven OCR goes beyond traditional OCR by using artificial intelligence to understand context, handle a wide variety of layouts, and extract high-quality structured data from even the most complex documents. While traditional OCR is designed to pick up text from a fixed format, AI OCR can handle the many layouts and configurations common in invoices and other business documents.

Key Features of AI-Based OCR

  • Contextual Understanding: AI OCR uses NLP to understand a document’s context. It identifies fields like “Total Amount,” “Invoice Date,” and “Client Name” even when those fields appear in different places.
  • Flexibility: Traditional OCR tends to struggle with irregular layouts; AI OCR is flexible and can extract information from many different invoice formats.
  • Data Structuring: AI OCR often provides structured output directly, which is easier to post-process than traditional OCR’s raw text output.

Why Use AI OCR for Invoices?

Invoices have to be processed efficiently and accurately, whether they pass through accounting, logistics, or procurement. AI OCR automates data extraction, streamlines workflows, and improves data accuracy.

Benefits of AI OCR for Invoices

  • Speed and Efficiency: With AI OCR, a large number of invoices can be processed in minimal time, freeing up resources and manpower.
  • Improved Accuracy: AI models, trained on a wide variety of document formats, reduce errors associated with manual data entry.
  • Smoother Data Management: Since the data is already structured, it easily merges into databases, analytics systems, and even ERP systems.
  • Scalability: AI OCR can process high volumes of documents without requiring more personnel, making it ideal for large organizations or those rapidly expanding.

The ROI of Using FlowHunt’s AI OCR Tool

Most conventional companies still have employees extract invoice data by hand. This is time-consuming and costly, and it can be automated in many fields, including tax, legal, and finance.

With FlowHunt, this process takes 5 to 15 seconds and costs 0.01 – 0.02 credits, whereas an employee doing the same task typically costs $15 – $30 per hour.

OCR Cost Comparison
Processor             | Cost per Year | Invoices Processed per Year | Cost per Invoice
Human                 | $30,000       | 12,000                      | $2.50
FlowHunt              | $162          | 12,000                      | $0.013
FlowHunt (at $30,000) | $30,000       | 2,250,000                   | $0.0133

In this comparison, FlowHunt costs roughly $0.013 per invoice against $2.50 for manual processing, which makes it more efficient by a huge margin.
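
The per-invoice figures in the comparison can be reproduced with a few lines of arithmetic. This is only a sketch that uses the yearly costs and volumes from the table; the helper name cost_per_invoice is made up for illustration.

```python
def cost_per_invoice(cost_per_year: float, invoices_per_year: int) -> float:
    """Return the average cost of processing a single invoice."""
    return cost_per_year / invoices_per_year


# Figures taken from the cost comparison table.
human = cost_per_invoice(30_000, 12_000)
flowhunt = cost_per_invoice(162, 12_000)
flowhunt_scaled = cost_per_invoice(30_000, 2_250_000)

print(f"Human:                 ${human:.2f} per invoice")
print(f"FlowHunt:              ${flowhunt:.4f} per invoice")
print(f"FlowHunt (at $30,000): ${flowhunt_scaled:.4f} per invoice")
print(f"Manual processing costs about {human / flowhunt:.0f}x more per invoice")
```

The unrounded FlowHunt figure is $0.0135 per invoice, which the table shows as $0.013.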

Issues in Implementing OCR

While OCR is highly beneficial, it comes with some challenges:

  1. Image Quality: OCR accuracy depends significantly on image quality. Blurry or low-resolution images yield poor results.
  2. Complex Formatting: Documents with complex layouts, mixed fonts, or tables may require advanced OCR processing.
  3. Language and Character Set: OCR software may have limited language support, requiring specialized models for non-Latin characters.
  4. Error Rate: No OCR software is 100% accurate, especially with cursive or irregular fonts, which can introduce errors in the output.

To tackle these challenges, it’s essential to use a powerful and flexible OCR tool. FlowHunt’s API provides a robust OCR solution capable of handling complex document structures, making it ideal for large-scale OCR projects.

Setting Up the Python OCR Script

To automate the process, you’ll need to install the following Python libraries:

pip install requests pdf2image git+https://github.com/QualityUnit/flowhunt-python-sdk.git

This installs:

  • requests: For downloading the OCR output file over HTTP.
  • pdf2image: To convert PDF pages to images. It relies on the Poppler utilities, which must be installed separately on your system.
  • flowhunt: FlowHunt’s Python SDK, which simplifies interaction with the OCR API.
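
Before running the main script, it can help to confirm that the dependencies are actually available. The check below is a minimal sketch using only the standard library. The helper name missing_modules is hypothetical, and the pdftoppm lookup assumes Poppler is expected on your PATH, which may not match every setup.

```python
import importlib.util
import shutil

# Module names imported by the OCR script.
REQUIRED_MODULES = ["requests", "pdf2image", "flowhunt"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    # pdf2image calls Poppler's pdftoppm tool to render PDF pages.
    if shutil.which("pdftoppm") is None:
        print("Poppler (pdftoppm) was not found on PATH.")
```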

Step-by-Step Breakdown of the Code

This code will take a PDF, convert it into images, send each image to FlowHunt for OCR processing, and save the output in CSV format.

Import Libraries

import json
import os
import re
import time
import requests
import flowhunt
from flowhunt.rest import ApiException
from pprint import pprint
from pdf2image import convert_from_path
  • json, os, re, and time help with JSON handling, file management, regular expressions, and time intervals.
  • requests: Used to handle HTTP requests, like downloading the OCR results.
  • flowhunt: FlowHunt’s SDK handles authentication and communication with the OCR API.
  • pdf2image: Converts PDF pages to images, enabling individual page OCR.

Function to Convert PDF Pages to Images

def convert_pdf_to_image(path: str) -> None:
    """
    Convert a PDF file to images, storing each page as a JPEG.
    """
    # Make sure the output folder exists before saving pages into it.
    os.makedirs('data/images', exist_ok=True)
    images = convert_from_path(path)
    for i, image in enumerate(images):
        image.save('data/images/page' + str(i) + '.jpg', 'JPEG')
  • convert_from_path: Converts each PDF page to an image.
  • os.makedirs: Creates the data/images folder if it does not exist yet.
  • image.save: Saves each page as an individual JPEG for OCR processing.
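
One detail to watch with file names such as page2.jpg and page10.jpg is that they do not sort in page order as plain strings. If page order matters later in the pipeline, zero-padding the index is one option. The helper below is a hypothetical sketch, not part of the original script.

```python
import os


def page_image_path(out_dir: str, index: int, width: int = 4) -> str:
    """Build a zero-padded file path such as data/images/page0002.jpg."""
    return os.path.join(out_dir, f"page{index:0{width}d}.jpg")


# Unpadded names sort as text, so page10 comes before page2.
print(sorted(["page2.jpg", "page10.jpg"]))

# Padded names sort in the same order as the pages.
names = [os.path.basename(page_image_path("data/images", i)) for i in (10, 0, 2)]
print(sorted(names))
```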

Extracting the Output Attachment URL

def extract_attachment_url(data_string):
    pattern = r'```flowhunt\n({.*})\n```'
    match = re.search(pattern, data_string, re.DOTALL)
    if match:
        json_string = match.group(1)
        try:
            json_data = json.loads(json_string)
            return json_data.get('download_link', None)
        except json.JSONDecodeError:
            print("Error: Failed to decode JSON.")
            return None
    return None
  • The function retrieves the URL for downloading the OCR output.
  • Uses regex to find the JSON object with the download link.
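
To see what the function expects, here is a self-contained sketch that restates it and runs it on a made-up response string. The sample text and the example.com link are assumptions inferred from the regex; the real response format may differ. The fence marker is built from three backticks at runtime only to keep the example easy to embed.

```python
import json
import re

FENCE = "`" * 3  # three backticks
PATTERN = re.compile(FENCE + r"flowhunt\n(\{.*\})\n" + FENCE, re.DOTALL)


def extract_attachment_url(data_string):
    """Return the download_link from a flowhunt block, or None if absent."""
    match = PATTERN.search(data_string)
    if not match:
        return None
    try:
        return json.loads(match.group(1)).get("download_link")
    except json.JSONDecodeError:
        return None


# Hypothetical response text, shaped to match the pattern above.
sample = (
    "OCR finished.\n"
    + FENCE + "flowhunt\n"
    + '{"download_link": "https://example.com/result.csv"}\n'
    + FENCE
)

print(extract_attachment_url(sample))
print(extract_attachment_url("no flowhunt block here"))
```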

API Configuration and Authentication

convert_pdf_to_image("data/test.pdf")
FLOW_ID = "<FLOW_ID_HERE>"

configuration = flowhunt.Configuration(
    host="https://api.flowhunt.io",
    api_key={"APIKeyHeader": "<API_KEY_HERE>"}
)
  • Converts the PDF into images.
  • Sets up API access with FlowHunt credentials.
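
Hardcoding credentials in a script makes them easy to leak. One alternative is to read them from environment variables. The sketch below assumes two variable names, FLOWHUNT_API_KEY and FLOWHUNT_FLOW_ID, which are chosen for illustration and are not defined by FlowHunt.

```python
import os


def load_flowhunt_settings(env=os.environ):
    """Read the API key and flow ID from environment variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    api_key = env.get("FLOWHUNT_API_KEY")
    flow_id = env.get("FLOWHUNT_FLOW_ID")
    if not api_key or not flow_id:
        raise RuntimeError("Set FLOWHUNT_API_KEY and FLOWHUNT_FLOW_ID first.")
    return {"api_key": api_key, "flow_id": flow_id}


# Usage with the configuration shown above:
#   settings = load_flowhunt_settings()
#   FLOW_ID = settings["flow_id"]
#   configuration = flowhunt.Configuration(
#       host="https://api.flowhunt.io",
#       api_key={"APIKeyHeader": settings["api_key"]},
#   )
```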

Initializing the API Client

with flowhunt.ApiClient(configuration) as api_client:
    auth_api = flowhunt.AuthApi(api_client)
    api_response = auth_api.get_user()
    workspace_id = api_response.api_key_workspace_id
  • Authenticates and retrieves the workspace_id for subsequent API calls.

Starting a Flow Session

flows_api = flowhunt.FlowsApi(api_client)
from_flow_create_session_req = flowhunt.FlowSessionCreateFromFlowRequest(flow_id=FLOW_ID)
create_session_rsp = flows_api.create_flow_session(workspace_id, from_flow_create_session_req)
  • Sets up a session for uploading images and processing OCR.

Uploading Images for OCR Processing

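for image in os.listdir("data/images"):
    image_name, image_extension = os.path.splitext(image)
    with open("data/images/" + image, "rb") as file:
        try:
            flow_sess_attachment = flows_api.upload_attachments(
                create_session_rsp.session_id,
                file.read()
            )
        except ApiException as e:
            # Report the failing file and continue with the next one.
            print("error for file ", image)
            print(e)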
  • Uploads each image in the session for OCR processing.
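
os.listdir returns every entry in the folder in no guaranteed order, including stray files such as .DS_Store. A small filter keeps the upload loop limited to page images. This is a sketch with a hypothetical helper name, list_page_images, demonstrated on a temporary folder.

```python
import os
import tempfile
from typing import List


def list_page_images(folder: str) -> List[str]:
    """Return the JPEG file names in folder, sorted by name."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.lower().endswith((".jpg", ".jpeg")) and not name.startswith(".")
    )


# Demo on a throwaway folder so nothing in data/images is touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("page1.jpg", "page0.jpg", ".DS_Store", "notes.txt"):
        open(os.path.join(tmp, name), "wb").close()
    found = list_page_images(tmp)

print(found)
```

In the upload loop, `for image in list_page_images("data/images"):` would then replace the plain os.listdir call.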

Invoking OCR Processing and Polling for Results

invoke_rsp = flows_api.invoke_flow_response(
    create_session_rsp.session_id, 
    flowhunt.FlowSessionInvokeRequest(message="")
)
while True:
    get_flow_rsp = flows_api.poll_flow_response(
        create_session_rsp.session_id, invoke_rsp.message_id
    )
    print("Flow response: ", get_flow_rsp)
    if get_flow_rsp.response_status == "S":
        print("done OCR")
        break
    time.sleep(3)
  • Triggers OCR processing and polls every 3 seconds until completion.
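
The loop above keeps polling until the status becomes "S", so it never exits if the flow fails or stalls. A bounded version is sketched below as a generic helper. poll_until and its parameters are hypothetical, and the "P" status in the demo is a placeholder rather than a documented FlowHunt value.

```python
import time


def poll_until(fetch, is_done, interval=3.0, timeout=120.0):
    """Call fetch() repeatedly until is_done(result) is true or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("Gave up waiting for the flow to finish.")
        time.sleep(interval)


# Demo with fake statuses standing in for real API responses.
fake_statuses = iter(["P", "P", "S"])
final = poll_until(lambda: next(fake_statuses), lambda s: s == "S", interval=0)
print(final)

# With the SDK calls from the script, usage could look like:
#   get_flow_rsp = poll_until(
#       lambda: flows_api.poll_flow_response(
#           create_session_rsp.session_id, invoke_rsp.message_id
#       ),
#       lambda rsp: rsp.response_status == "S",
#   )
```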

Downloading and Saving OCR Output

attachment_url = extract_attachment_url(get_flow_rsp.final_response[0])
if attachment_url:
    response = requests.get(attachment_url)
    with open("data/results/" + image_name + ".csv", "wb") as file:
        file.write(response.content)
  • Downloads the CSV output and saves it locally.
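
Once a CSV is saved, a quick read-back with the standard csv module is a cheap way to confirm it parsed as expected. The column names and values below are invented for the demo; the actual columns depend on how your flow is configured.

```python
import csv
import io


def read_invoice_rows(csv_text: str):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Hypothetical output; real column names may differ.
sample_csv = "field,value\nInvoice Date,2024-01-31\nTotal Amount,1250.00\n"
rows = read_invoice_rows(sample_csv)

for row in rows:
    print(row["field"], "->", row["value"])
```

For a file on disk, pass the result of `open(path, newline="").read()` to the same function.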

Running the Script and Testing Output

To execute this script:

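  1. Place your PDF in the data/ folder as test.pdf (the path the script reads), and make sure the data/images/ and data/results/ folders exist.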
  2. Update <FLOW_ID_HERE> and <API_KEY_HERE> with your FlowHunt credentials.
  3. Run the script to convert the PDF, upload images for OCR, and download the structured CSV results.
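
The script reads from and writes to fixed folders, so creating them up front avoids file-not-found errors on the first run. A minimal sketch, with a hypothetical helper name and a temporary directory used for the demo:

```python
import os
import tempfile


def prepare_workspace(base_dir: str = "data"):
    """Create the images/ and results/ folders under base_dir if needed."""
    paths = [os.path.join(base_dir, sub) for sub in ("images", "results")]
    for path in paths:
        os.makedirs(path, exist_ok=True)
    return paths


# Demo in a throwaway location; call prepare_workspace() for the real run.
with tempfile.TemporaryDirectory() as tmp:
    created = prepare_workspace(os.path.join(tmp, "data"))
    all_exist = all(os.path.isdir(path) for path in created)

print(all_exist)
```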

Conclusion

This Python script offers an efficient solution for scaling OCR processes, ideal for industries with high document processing demands. With FlowHunt’s API, this solution handles document-to-CSV conversion, streamlining workflows and boosting productivity.

Full Code Overview

import json
import os
import re
import time
import requests
import flowhunt
from flowhunt.rest import ApiException
from pprint import pprint
from pdf2image import convert_from_path

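def convert_pdf_to_image(path: str) -> None:
    """
    Convert a PDF file to images, storing each page as a JPEG.
    """
    # Make sure the output folder exists before saving pages into it.
    os.makedirs('data/images', exist_ok=True)
    images = convert_from_path(path)
    for i, image in enumerate(images):
        image.save('data/images/page' + str(i) + '.jpg', 'JPEG')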

def extract_attachment_url(data_string):
    pattern = r'```flowhunt\n({.*})\n```'
    match = re.search(pattern, data_string, re.DOTALL)
    if match:
        json_string = match.group(1)
        try:
            json_data = json.loads(json_string)
            return json_data.get('download_link', None)
        except json.JSONDecodeError:
            print("Error: Failed to decode JSON.")
            return None
    return None

convert_pdf_to_image("data/test.pdf")
FLOW_ID = "<FLOW_ID_HERE>"

configuration = flowhunt.Configuration(host = "https://api.flowhunt.io",
                                       api_key = {"APIKeyHeader": "<API_KEY_HERE>"})

with flowhunt.ApiClient(configuration) as api_client:
    auth_api = flowhunt.AuthApi(api_client)
    api_response = auth_api.get_user()
    workspace_id = api_response.api_key_workspace_id

    flows_api = flowhunt.FlowsApi(api_client)
    from_flow_create_session_req = flowhunt.FlowSessionCreateFromFlowRequest(
        flow_id=FLOW_ID
    )
    create_session_rsp = flows_api.create_flow_session(workspace_id, from_flow_create_session_req)

    for image in os.listdir("data/images"):
        image_name, image_extension = os.path.splitext(image)
        with open("data/images/" + image, "rb") as file:
            try:
                flow_sess_attachment = flows_api.upload_attachments(
                    create_session_rsp.session_id,
                    file.read()
                )
                invoke_rsp = flows_api.invoke_flow_response(create_session_rsp.session_id, flowhunt.FlowSessionInvokeRequest(
                    message="",
                ))
                while True:
                    get_flow_rsp = flows_api.poll_flow_response(create_session_rsp.session_id, invoke_rsp.message_id)
                    print("Flow response: ", get_flow_rsp)
                    if get_flow_rsp.response_status == "S":
                        print("done OCR")
                        attachment_url = extract_attachment_url(get_flow_rsp.final_response[0])
                        if attachment_url:
                            print("Attachment URL: ", attachment_url, "\n Downloading the file...")
                            response = requests.get(attachment_url)
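                            # Create the results folder if needed; use a separate name
                            # so the input file handle is not shadowed.
                            os.makedirs("data/results", exist_ok=True)
                            with open("data/results/" + image_name + ".csv", "wb") as out_file:
                                out_file.write(response.content)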
                        break
                    time.sleep(3)
            except ApiException as e:
                print("error for file ", image)
                print(e)

Frequently asked questions

What is AI-based OCR and how does it differ from traditional OCR?

AI-based OCR leverages machine learning and NLP to understand document context, handle complex layouts, and extract structured data from invoices, unlike traditional OCR which relies on fixed-format text recognition.

What are the main benefits of using AI OCR for invoices?

AI OCR delivers speed, accuracy, scalability, and structured outputs, reducing manual work, minimizing errors, and enabling seamless integration with business systems.

How can I implement invoice OCR automation with Python and FlowHunt?

By using FlowHunt’s Python SDK, you can convert PDFs to images, send them to FlowHunt’s API for OCR, and retrieve structured data in CSV format, automating the entire extraction process.

What challenges exist in OCR processing and how does FlowHunt address them?

Common challenges include poor image quality, complex document layouts, and varied languages. FlowHunt’s API is designed to handle these with advanced AI models and flexible processing capabilities.

What is the ROI of automating invoice data extraction with FlowHunt?

FlowHunt’s AI OCR can process invoices in seconds at a fraction of human cost, delivering massive efficiency gains and scalability for growing businesses.

Try FlowHunt's AI Invoice OCR Tool

Automate invoice data extraction with FlowHunt’s robust AI OCR. Save time, reduce errors, and streamline your workflows by converting PDFs to structured data in seconds.
