AI OCR Invoice Data Extraction with a Simple Python Script

AI OCR with Python enhances invoice data extraction using machine learning and NLP, offering accuracy, speed, and scalability. FlowHunt's API processes complex layouts, converting PDFs to structured CSVs efficiently, ideal for high-volume document workflows.

Last modified on November 11, 2024 at 6:20 pm


Organizations today handle a huge volume of invoices. Extracting data from them manually is time-consuming and error-prone. Conventional OCR solutions, which were developed mainly for static text recognition, perform poorly on invoices because of their complex and variable structures. This is where AI-based OCR comes in. Unlike traditional OCR, AI-driven OCR uses machine learning and NLP techniques to extract structured data from invoices, such as invoice numbers, dates, itemized costs, and totals, even when these details appear in different places or formats across documents.

In this post, I will show you a scalable approach to AI-driven, automated invoice data extraction using the FlowHunt API. You will see the main benefits of AI OCR, how to implement a workflow for large-scale OCR tasks, and how each part of the Python script works.

If you are interested in a fast and easy to use Invoice data extraction OCR tool go HERE.

What is AI-Based OCR?


AI-driven OCR goes beyond traditional OCR by using artificial intelligence to understand context, handle many layout variations, and extract high-quality structured data from even complex documents. While traditional OCR is designed mainly to pick up text from a fixed format, AI OCR can handle the many layouts and configurations common in invoices and other business documents.

Key Features of AI-Based OCR


Contextual Understanding: AI OCR uses NLP to understand context in documents. It identifies fields like “Total Amount,” “Invoice Date,” and “Client Name” even when those fields appear in different places.
Flexibility: Traditional OCR tends to fail on irregular layouts, whereas AI OCR can extract information from many different invoice formats.
Data Structuring: AI OCR often provides structured output directly, which is easier to post-process than the raw text output of traditional OCR.

Why Use AI OCR for Invoices?


Invoices have to be processed efficiently and accurately, whether in the accounting, logistics, or procurement department. AI OCR automates the data extraction step, which streamlines workflows and improves data accuracy.

Benefits of AI OCR for Invoices


Speed and Efficiency: AI OCR can process a large number of invoices in a short time, freeing up staff for other work.
Improved Accuracy: AI models, trained on a wide variety of document formats, reduce errors associated with manual data entry.
Smoother Data Management: Since the data is already structured, it can be loaded easily into databases, analytics systems, and ERP systems.
Scalability: AI OCR can process high volumes of documents without requiring more personnel, which makes it suitable for large or rapidly growing organizations.

The ROI of using FlowHunt’s AI OCR tool

Most conventional companies extract data from invoices manually, assigning employees to the task. This is a time-consuming and costly operation that can be automated in many fields, such as tax, legal, and finance companies.
Processing an invoice with FlowHunt takes 5 to 15 seconds and costs 0.01 – 0.02 credits, whereas an employee doing the same task would normally cost $15 – $30 per hour.


Processor | Cost per Year | Invoices Processed per Year | Cost per Invoice
Human | $30,000 | 12,000 | $2.50
FlowHunt | $162 | 12,000 | $0.013
FlowHunt (at $30,000) | $30,000 | 2,250,000 | $0.0133
At these figures, FlowHunt's cost per invoice is more than 180 times lower than manual processing.
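The per-invoice figures in the table can be checked with a few lines of arithmetic. The sketch below simply divides the yearly cost by the yearly invoice count, using the numbers from the table; the function name is only illustrative.

```python
# Figures copied from the comparison table above.
HUMAN_COST_PER_YEAR = 30_000.0    # USD
FLOWHUNT_COST_PER_YEAR = 162.0    # USD
INVOICES_PER_YEAR = 12_000


def cost_per_invoice(total_cost: float, invoices: int) -> float:
    """Return the average cost of processing one invoice."""
    if invoices <= 0:
        raise ValueError("invoices must be positive")
    return total_cost / invoices


human_unit_cost = cost_per_invoice(HUMAN_COST_PER_YEAR, INVOICES_PER_YEAR)
flowhunt_unit_cost = cost_per_invoice(FLOWHUNT_COST_PER_YEAR, INVOICES_PER_YEAR)

print(f"Human:    ${human_unit_cost:.2f} per invoice")
print(f"FlowHunt: ${flowhunt_unit_cost:.4f} per invoice")
print(f"Ratio:    {human_unit_cost / flowhunt_unit_cost:.0f}x")
```

Swapping in your own yearly cost and invoice volume gives a quick estimate for your situation.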


Issues in Implementing OCR:



While OCR is highly beneficial, it comes with some challenges:

  1. Image Quality: OCR accuracy depends significantly on image quality. Blurry or low-resolution images yield poor results.
  2. Complex Formatting: Documents with complex layouts, mixed fonts, or tables may require advanced OCR processing.
  3. Language and Character Set: OCR software may have limited language support, requiring specialized models for non-Latin characters.
  4. Error Rate: No OCR software is 100% accurate, especially with cursive or irregular fonts, which can introduce errors in the output.

To tackle these challenges, it’s essential to use a powerful and flexible OCR tool. FlowHunt’s API, for instance, provides a robust OCR solution capable of handling complex document structures, which makes it ideal for large-scale OCR projects.

Setting Up the Python OCR Script

To automate the process, you’ll need to install the following Python libraries:

pip install requests pdf2image git+https://github.com/QualityUnit/flowhunt-python-sdk.git

This installs:

  • requests: For sending HTTP requests to FlowHunt’s API and downloading OCR outputs.
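  • pdf2image: To convert PDF pages to images. It relies on the Poppler command-line utilities, which must be installed on your system separately.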
  • flowhunt: FlowHunt’s Python SDK, which simplifies interaction with the OCR API.
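Before running the script, it can help to confirm that everything is in place. The following is a minimal sketch, not part of the original script: missing_requirements is a hypothetical helper that looks for the three Python packages listed above and for pdftoppm, one of the Poppler binaries that pdf2image calls.

```python
import importlib.util
import shutil


def missing_requirements(modules=("requests", "pdf2image", "flowhunt"),
                         binaries=("pdftoppm",)):
    """Return the names of Python modules and system binaries that cannot be found."""
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    missing += [name for name in binaries if shutil.which(name) is None]
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All requirements found.")
```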

Step-by-Step Breakdown of the Code

Let’s walk through each part of the Python script. This code will take a PDF, convert it into images, send each image to FlowHunt for OCR processing, and save the output in CSV format.

Import Libraries

import json
import os
import re
import time
import requests
import flowhunt
from flowhunt.rest import ApiException
from pprint import pprint
from pdf2image import convert_from_path
  • Standard Libraries: json, os, re, and time help with JSON handling, file management, regular expressions, and time intervals.
  • requests: Used to handle HTTP requests, like downloading the OCR results.
  • flowhunt: FlowHunt’s SDK handles authentication and communication with the OCR API.
  • pdf2image: Converts PDF pages to images, enabling individual page OCR.

Function to Convert PDF Pages to Images

def convert_pdf_to_image(path: str) -> None:
    """
    Convert a PDF file to images, storing each page as a JPEG.
    """
    # Make sure the output folder exists before saving pages into it
    os.makedirs('data/images', exist_ok=True)
    images = convert_from_path(path)
    for i in range(len(images)):
        images[i].save('data/images/' + 'page' + str(i) + '.jpg', 'JPEG')

This function takes a PDF and splits it into images:

  • convert_from_path: Takes the PDF path and returns a list with one image per page.
  • images[i].save: Called for each page in the loop, saving it to data/images/ as an individual JPEG.

By saving each page as an image, we prepare them for OCR processing. Each image represents a single page that will be processed independently.
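One detail worth considering is file naming. With names like page2.jpg and page10.jpg, an alphabetical sort puts page10 before page2, so the files no longer follow page order once a PDF has more than ten pages. A possible way around this, sketched below with a hypothetical page_filename helper, is to zero-pad the page index.

```python
def page_filename(index: int, width: int = 4, extension: str = ".jpg") -> str:
    """Build a zero-padded file name such as 'page0007.jpg'."""
    return f"page{index:0{width}d}{extension}"


padded = sorted(page_filename(i) for i in (10, 2, 0))
unpadded = sorted(f"page{i}.jpg" for i in (10, 2, 0))

print(padded)    # alphabetical order matches page order
print(unpadded)  # 'page10.jpg' sorts before 'page2.jpg'
```

The width of 4 digits is an arbitrary choice for this sketch; any width that covers your longest document works.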

Extracting the Output Attachment URL

def extract_attachment_url(data_string):
    pattern = r'```flowhunt\n({.*})\n```'
    match = re.search(pattern, data_string, re.DOTALL)

    if match:
        json_string = match.group(1)
        try:
            json_data = json.loads(json_string)
            return json_data.get('download_link', None)
        except json.JSONDecodeError:
            print("Error: Failed to decode JSON.")
            return None

    return None

The extract_attachment_url function helps retrieve the URL for downloading the OCR output:

  • Regex Pattern: Looks for a JSON object embedded in a block that is fenced by three backticks and tagged flowhunt.
  • json.loads: Converts the extracted JSON string to a Python dictionary, enabling us to access the download_link key.
  • Error Handling: If the JSON fails to decode, it prints an error message and returns None.
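To see the function in action, it can be run against a sample string. The sketch below repeats the same logic in compact form and feeds it a made-up response: the surrounding text and the example.com URL are assumptions for illustration, and only the fenced flowhunt block with a download_link key comes from the pattern above.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built here to keep the example readable


def extract_attachment_url(data_string):
    """Same logic as the function above: pull 'download_link' out of a flowhunt block."""
    pattern = FENCE + r"flowhunt\n(\{.*\})\n" + FENCE
    match = re.search(pattern, data_string, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(1)).get("download_link")
    except json.JSONDecodeError:
        return None


# Hypothetical response text; the URL and file name are made up for illustration.
sample = (
    "Here is your file:\n"
    + FENCE + "flowhunt\n"
    + '{"download_link": "https://example.com/page0.csv"}\n'
    + FENCE
)

print(extract_attachment_url(sample))           # https://example.com/page0.csv
print(extract_attachment_url("no block here"))  # None
```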

API Configuration and Authentication

convert_pdf_to_image("data/test.pdf")
FLOW_ID = "<FLOW_ID_HERE>"

configuration = flowhunt.Configuration(
    host="https://api.flowhunt.io",
    api_key={"APIKeyHeader": "<API_KEY_HERE>"}
)

This section sets up API access with your FlowHunt credentials:

  • convert_pdf_to_image: Converts the specified PDF into images.
  • FlowHunt Configuration: The Configuration object takes the API host and your FlowHunt API key, which authenticates your requests. The Flow ID is stored separately in FLOW_ID and identifies which flow to run.
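Hard-coding credentials in a script makes them easy to leak through version control. One common alternative, sketched here, is to read them from environment variables. The variable names FLOWHUNT_API_KEY and FLOWHUNT_FLOW_ID are hypothetical names chosen for this example, not names defined by FlowHunt.

```python
import os


def load_flowhunt_settings(env=os.environ):
    """Read the API key and Flow ID from environment variables.

    FLOWHUNT_API_KEY and FLOWHUNT_FLOW_ID are hypothetical variable names
    chosen for this sketch, not names defined by FlowHunt.
    """
    settings = {
        "api_key": env.get("FLOWHUNT_API_KEY", ""),
        "flow_id": env.get("FLOWHUNT_FLOW_ID", ""),
    }
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return settings
```

The returned values can then be used in place of the "<API_KEY_HERE>" and "<FLOW_ID_HERE>" placeholders.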

Initializing the API Client

with flowhunt.ApiClient(configuration) as api_client:
    auth_api = flowhunt.AuthApi(api_client)
    api_response = auth_api.get_user()
    workspace_id = api_response.api_key_workspace_id

This code uses the ApiClient to authenticate and retrieve the workspace_id, which is required for subsequent API calls.

Starting a Flow Session

flows_api = flowhunt.FlowsApi(api_client)
from_flow_create_session_req = flowhunt.FlowSessionCreateFromFlowRequest(flow_id=FLOW_ID)
create_session_rsp = flows_api.create_flow_session(workspace_id, from_flow_create_session_req)

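The session setup involves creating an OCR session within the FlowHunt workflow. In the full script, this code and the steps that follow run inside the with flowhunt.ApiClient(configuration) block, because they reuse api_client: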

  • FlowSessionCreateFromFlowRequest: Sets up a session with the provided flow_id.
  • create_flow_session: This method starts the session for uploading images and processing OCR.

Uploading Images for OCR Processing

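for image in os.listdir("data/images"):
    image_name, image_extension = os.path.splitext(image)
    with open("data/images/" + image, "rb") as file:
        try:
            flow_sess_attachment = flows_api.upload_attachments(
                create_session_rsp.session_id,
                file.read()
            )
            # The invoke, polling, and download steps shown below
            # also run inside this try block in the full script.
        except ApiException as e:
            print("error for file ", image)
            print(e)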

In this loop:

  • os.listdir: Lists all images in data/images.
  • upload_attachments: Uploads each image in the session, preparing it for OCR processing.
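For large batches, it may be useful to skip pages that already have a result, so an interrupted run can be resumed. The sketch below is one possible approach and is not part of the original script: pending_images is a hypothetical helper that lists only JPEG files and leaves out those with a matching CSV in the results folder.

```python
import os


def pending_images(images_dir, results_dir, extensions=(".jpg", ".jpeg")):
    """Return image file names that have no matching CSV in results_dir, sorted by name."""
    done = set()
    if os.path.isdir(results_dir):
        done = {os.path.splitext(name)[0] for name in os.listdir(results_dir)
                if name.lower().endswith(".csv")}
    pending = []
    for name in sorted(os.listdir(images_dir)):
        stem, ext = os.path.splitext(name)
        # Keep only image files whose result has not been saved yet
        if ext.lower() in extensions and stem not in done:
            pending.append(name)
    return pending
```

In the loop above, os.listdir("data/images") could be replaced with pending_images("data/images", "data/results").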

Invoking OCR Processing and Polling for Results

invoke_rsp = flows_api.invoke_flow_response(
    create_session_rsp.session_id, 
    flowhunt.FlowSessionInvokeRequest(message="")
)

This line triggers OCR processing, and then a polling loop checks OCR completion:

while True:
    get_flow_rsp = flows_api.poll_flow_response(
        create_session_rsp.session_id, invoke_rsp.message_id
    )
    print("Flow response: ", get_flow_rsp)
    if get_flow_rsp.response_status == "S":
        print("done OCR")
        # The result is downloaded here (next section) before leaving the loop
        break
    time.sleep(3)

This loop checks the OCR status every 3 seconds until completion. Once the status is "S" (success), it proceeds to download the output. Note that the loop has no timeout, so it keeps polling if the flow never reports success.
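A polling loop is usually safer with an upper time limit. The sketch below shows a generic, bounded version of the same idea. poll_until is a hypothetical helper, and the "P" status in the demo is a made-up stand-in for "not finished yet"; only the "S" status comes from the script.

```python
import time


def poll_until(fetch, is_done, interval=3.0, timeout=120.0, sleep=time.sleep):
    """Call fetch() every `interval` seconds until is_done(result) is true.

    Raises TimeoutError if no finished result arrives within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() + interval > deadline:
            raise TimeoutError("polling timed out")
        sleep(interval)


# Stand-in for the API call: reports "P" twice, then "S".
statuses = iter(["P", "P", "S"])
final = poll_until(lambda: next(statuses), lambda s: s == "S", interval=0.01)
print(final)
```

In the script, fetch would wrap the flows_api.poll_flow_response call and is_done would check whether response_status equals "S".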

Downloading and Saving OCR Output

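attachment_url = extract_attachment_url(get_flow_rsp.final_response[0])
if attachment_url:
    response = requests.get(attachment_url)
    # Make sure the results folder exists before writing the CSV
    os.makedirs("data/results", exist_ok=True)
    with open("data/results/" + image_name + ".csv", "wb") as out_file:
        out_file.write(response.content)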

The output is saved as a CSV:

  • extract_attachment_url: Retrieves the download link from the response.
  • requests.get: Downloads the CSV, saving it to data/results/.
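Since the script writes one CSV per page, a follow-up step may be to combine them. The sketch below is a hypothetical merge_csv_files helper that concatenates every CSV in a folder and prefixes each row with its source file name. It assumes UTF-8 files and makes no assumption about the columns FlowHunt returns.

```python
import csv
import glob
import os


def merge_csv_files(results_dir, output_path):
    """Concatenate all CSV files in results_dir into one file, adding a source column."""
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in sorted(glob.glob(os.path.join(results_dir, "*.csv"))):
            # Skip the merged file itself if it lives in the same folder
            if os.path.abspath(path) == os.path.abspath(output_path):
                continue
            with open(path, newline="", encoding="utf-8") as in_file:
                for row in csv.reader(in_file):
                    writer.writerow([os.path.basename(path)] + row)
                    rows_written += 1
    return rows_written
```

If each page CSV starts with a header row, the merged file will contain that header once per page, so you may want to drop the repeats afterwards.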

Running the Script and Testing Output

To execute this script:

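  1. Place your PDF in the data/ folder as test.pdf (the path the script passes to convert_pdf_to_image), and make sure the data/images/ and data/results/ folders exist.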
  2. Update <FLOW_ID_HERE> and <API_KEY_HERE> with your FlowHunt credentials.
  3. Run the script to convert the PDF, upload images for OCR, and download the structured CSV results.
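The folder layout the script expects can be created with a few lines of Python. This is a minimal sketch: prepare_workspace is a hypothetical helper, and the folder names follow the paths used in the script (data/images and data/results).

```python
from pathlib import Path


def prepare_workspace(base="data"):
    """Create the images/ and results/ folders under `base` if they are missing."""
    base_path = Path(base)
    created = []
    for sub in ("images", "results"):
        folder = base_path / sub
        # parents=True also creates `base` itself when needed
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling prepare_workspace() once before the first run is enough, and calling it again is harmless because existing folders are left untouched.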

Conclusion

This Python script offers an efficient solution for scaling OCR processes, ideal for industries with high document processing demands. With FlowHunt’s API, this solution handles document-to-CSV conversion, streamlining workflows and boosting productivity.

Full Code Overview

Click HERE to get the Gist

import json
import os
import re
import time

import requests
import flowhunt
from flowhunt.rest import ApiException
from pprint import pprint
from pdf2image import convert_from_path


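def convert_pdf_to_image(path: str) -> None:
    """
    Convert each page of a PDF file to a JPEG image in data/images/
    """
    # Make sure the output folder exists before saving pages into it
    os.makedirs('data/images', exist_ok=True)
    # Load the PDF pages as images with the convert_from_path function
    images = convert_from_path(path)
    for i in range(len(images)):
        # Save each page of the PDF as an image
        images[i].save('data/images/' + 'page'+ str(i) +'.jpg', 'JPEG')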

def extract_attachment_url(data_string):
    # Define a regular expression pattern to find the JSON object in the string
    pattern = r'```flowhunt\n({.*})\n```'
    match = re.search(pattern, data_string, re.DOTALL)

    if match:
        # Extract the JSON object from the matched pattern
        json_string = match.group(1)

        try:
            # Load the JSON data
            json_data = json.loads(json_string)

            # Return the 'download_link' value from the JSON data
            return json_data.get('download_link', None)

        except json.JSONDecodeError:
            # Handle JSON decoding error if the extracted string is not valid JSON
            print("Error: Failed to decode JSON.")
            return None

    return None



convert_pdf_to_image("data/test.pdf")
FLOW_ID = "<FLOW_ID_HERE>"

# Assuming all images are in the data/images folder

# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure API key authorization: APIKeyHeader
configuration = flowhunt.Configuration(host = "https://api.flowhunt.io",
                                       api_key = {"APIKeyHeader": "<API_KEY_HERE>"})


# Enter a context with an instance of the API client
with flowhunt.ApiClient(configuration) as api_client:
    # get workspace_id
    auth_api = flowhunt.AuthApi(api_client)
    api_response = auth_api.get_user()
    workspace_id = api_response.api_key_workspace_id



    # Create an instance of the API class
    flows_api = flowhunt.FlowsApi(api_client)
    from_flow_create_session_req = flowhunt.FlowSessionCreateFromFlowRequest(
        flow_id=FLOW_ID
    )
    create_session_rsp = flows_api.create_flow_session(workspace_id, from_flow_create_session_req)

    # Looping through the images and attaching the images to flow
    for image in os.listdir("data/images"):
        image_name, image_extension = os.path.splitext(image)
        with open("data/images/" + image, "rb") as file:
            try:
                flow_sess_attachment = flows_api.upload_attachments(
                    create_session_rsp.session_id,
                    file.read()
                )

                # invoking the flow and getting the response in csv format
                invoke_rsp = flows_api.invoke_flow_response(create_session_rsp.session_id, flowhunt.FlowSessionInvokeRequest(
                    message="",
                ))

                # polling to get the result in loop with 3 seconds interval
                while True:
                    get_flow_rsp = flows_api.poll_flow_response(create_session_rsp.session_id, invoke_rsp.message_id)
                    print("Flow response: ", get_flow_rsp)
                    if get_flow_rsp.response_status == "S":
                        print("done OCR")

                        ### Extracting the url path of the output attachment
                        attachment_url = extract_attachment_url(get_flow_rsp.final_response[0])

                        if attachment_url:
                            print("Attachment URL: ", attachment_url, "\n Downloading the file...")
                            # Downloading the file with the requests library
                            response = requests.get(attachment_url)
                            # Save the CSV file in data/results (created if missing)
                            os.makedirs("data/results", exist_ok=True)
                            with open("data/results/" + image_name + ".csv", "wb") as out_file:
                                out_file.write(response.content)

                        break
                    time.sleep(3)



            except ApiException as e:
                print("error for file ", image)
                print(e)