Organizations today handle huge volumes of invoices. Extracting data from them manually is time-consuming and error-prone. Conventional OCR solutions, which were developed mainly for static text recognition, perform poorly on such documents because of their complex and variable structures. This is where AI-based OCR comes in. Unlike traditional OCR, AI-driven OCR uses machine learning and natural language processing (NLP) to extract structured data from invoices, such as invoice numbers, dates, itemized costs, and totals, even when these details appear in different places or formats across documents.
In this blog post, I will show you a scalable approach to AI-driven automated data extraction from invoices, using Optical Character Recognition (OCR) through the FlowHunt API. You will learn the main benefits of AI OCR, how to implement a workflow for large-scale OCR tasks, and how each part of the Python script works in invoice processing.
If you are interested in a fast and easy-to-use invoice data extraction OCR tool, go HERE.
What is AI-Based OCR?
AI-driven OCR goes beyond traditional OCR by using artificial intelligence to understand context, handle many layout variations, and extract high-quality structured data from even the most complex documents. While traditional OCR is designed mainly to pick up text from a fixed format, AI OCR can handle the many layouts and configurations common in invoices and other business documents.
Key Features of AI-Based OCR
Contextual Understanding: AI OCR uses NLP to understand context in documents. It identifies fields like “Total Amount,” “Invoice Date,” and “Client Name” even when those fields appear in different places.
Flexibility: Traditional OCR tends to break down on irregular layouts, whereas AI OCR is flexible and can extract information from many different invoice formats.
Data Structuring: AI OCR often provides structured output directly, which is easier to post-process than the raw text output of traditional OCR.
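To make the difference between raw text and structured output concrete, here is a minimal sketch. The sample text, the field names, and the values are all hypothetical; they are not FlowHunt's actual output schema.

```python
import re

# Raw text as a traditional OCR engine might return it (hypothetical sample).
raw_text = "ACME Corp\nInvoice No: INV-001\nDate 2024-01-15\nTotal Amount: 120.50"

# Structured record as an AI OCR tool might return it (hypothetical field names).
structured = {
    "invoice_number": "INV-001",
    "invoice_date": "2024-01-15",
    "total_amount": 120.50,
}

# With raw text, every field needs its own layout-specific pattern,
# and the pattern breaks as soon as the label or position changes.
match = re.search(r"Total Amount:\s*([\d.]+)", raw_text)
total_from_raw = float(match.group(1)) if match else None

# With structured output, the same value is read directly by key.
total_from_structured = structured["total_amount"]
```

Both variables end up holding the same number, but only the first approach depends on the exact wording and position of the label in the document.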
Why Use AI OCR for Invoices?
Invoices have to be processed efficiently and accurately, whether in accounting, logistics, or procurement. AI OCR automates data extraction, which streamlines workflows and improves data accuracy.
Benefits of AI OCR for Invoices
Speed and Efficiency: With AI OCR, a large number of invoices can be processed in minimal time, freeing up resources and staff.
Improved Accuracy: AI models, trained on a wide variety of document formats, reduce errors associated with manual data entry.
Smoother Data Management: Since the data is already structured, it merges easily into databases, analytics systems, and ERP systems.
Scalability: AI OCR can process high volumes of documents without requiring more personnel, which makes it well suited to large or rapidly growing organizations.
The ROI of Using FlowHunt’s AI OCR Tool
Most conventional companies extract data from invoices manually, assigning employees to the task. This is a time-consuming and costly operation that can be automated in many fields, such as tax, legal, and finance.
The automated process takes 5 to 15 seconds and costs 0.01 to 0.02 credits, whereas you would normally pay an employee $15 to $30 per hour to do the same task.
| Processor | Cost per Year | Invoices Processed per Year | Cost per Invoice |
|---|---|---|---|
| Human | $30,000 | 12,000 | $2.50 |
| FlowHunt | $162 | 12,000 | $0.013 |
| FlowHunt (at $30,000) | $30,000 | 2,250,000 | $0.0133 |
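The cost-per-invoice column is simply the yearly cost divided by the yearly invoice count. The short sketch below recomputes it from the figures in the table; the function name is only illustrative, and the table shows these values rounded.

```python
def cost_per_invoice(cost_per_year: float, invoices_per_year: int) -> float:
    """Yearly cost divided by the number of invoices processed in that year."""
    return cost_per_year / invoices_per_year


# (cost per year in USD, invoices processed per year), taken from the table
rows = {
    "Human": (30_000, 12_000),
    "FlowHunt": (162, 12_000),
    "FlowHunt (at $30,000)": (30_000, 2_250_000),
}

for name, (cost, invoices) in rows.items():
    print(f"{name}: ${cost_per_invoice(cost, invoices):.4f} per invoice")
```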
Issues in Implementing OCR:
While OCR is highly beneficial, it comes with some challenges:
- Image Quality: OCR accuracy depends significantly on image quality. Blurry or low-resolution images yield poor results.
- Complex Formatting: Documents with complex layouts, mixed fonts, or tables may require advanced OCR processing.
- Language and Character Set: OCR software may have limited language support, requiring specialized models for non-Latin characters.
- Error Rate: No OCR software is 100% accurate, especially with cursive or irregular fonts, which can introduce errors in the output.
To tackle these challenges, it’s essential to use a powerful and flexible OCR tool. FlowHunt’s API, for instance, provides a robust OCR solution capable of handling complex document structures, which makes it ideal for large-scale OCR projects.
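Image quality is the one challenge you can check before uploading anything. The sketch below estimates the effective resolution of a page image from its pixel width and the physical page width. The default page width (A4, about 8.27 inches) and the 200 DPI threshold are assumptions chosen for illustration, not FlowHunt requirements.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Approximate resolution: pixels across the page divided by page width."""
    if page_width_inches <= 0:
        raise ValueError("page_width_inches must be positive")
    return pixel_width / page_width_inches


def meets_min_dpi(pixel_width: int,
                  page_width_inches: float = 8.27,  # assumed A4 width
                  min_dpi: float = 200.0) -> bool:   # assumed threshold
    """Return True if the image is at least min_dpi across the page width."""
    return effective_dpi(pixel_width, page_width_inches) >= min_dpi
```

If a page falls below the threshold you pick, one option is to re-render it at a higher resolution; pdf2image's convert_from_path accepts a dpi argument for this.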
Setting Up the Python OCR Script
To automate the process, you’ll need to install the following Python libraries:
pip install requests pdf2image git+https://github.com/QualityUnit/flowhunt-python-sdk.git
This installs:
- requests: For sending HTTP requests to FlowHunt’s API and downloading OCR outputs.
- pdf2image: To convert PDF pages to images. It relies on the Poppler utilities, which must be installed on your system separately.
- flowhunt: FlowHunt’s Python SDK, which simplifies interaction with the OCR API.
Step-by-Step Breakdown of the Code
Let’s walk through each part of the Python script. This code will take a PDF, convert it into images, send each image to FlowHunt for OCR processing, and save the output in CSV format.
Import Libraries
import json
import os
import re
import time
import requests
import flowhunt
from flowhunt.rest import ApiException
from pprint import pprint
from pdf2image import convert_from_path
- Standard libraries: json, os, re, and time help with JSON handling, file management, regular expressions, and time intervals.
- requests: Used to handle HTTP requests, like downloading the OCR results.
- flowhunt: FlowHunt’s SDK handles authentication and communication with the OCR API.
- pdf2image: Converts PDF pages to images, enabling individual page OCR.
Function to Convert PDF Pages to Images
def convert_pdf_to_image(path: str) -> None:
    """
    Convert a PDF file to images, storing each page as a JPEG.
    """
    images = convert_from_path(path)
    for i in range(len(images)):
        images[i].save('data/images/' + 'page' + str(i) + '.jpg', 'JPEG')
This function takes a PDF and splits it into images:
- convert_from_path: Takes the PDF path and converts each page to an image.
- images[i].save: Called once per page inside the loop, saving each page to data/images/ as an individual JPEG. The data/images/ folder must already exist.
By saving each page as an image, we prepare them for OCR processing. Each image represents a single page that will be processed independently.
Extracting the Output Attachment URL
def extract_attachment_url(data_string):
    pattern = r'```flowhunt\n({.*})\n```'
    match = re.search(pattern, data_string, re.DOTALL)
    if match:
        json_string = match.group(1)
        try:
            json_data = json.loads(json_string)
            return json_data.get('download_link', None)
        except json.JSONDecodeError:
            print("Error: Failed to decode JSON.")
            return None
    return None
The extract_attachment_url function helps retrieve the URL for downloading the OCR output:
- Regex Pattern: Looks for a JSON object embedded in a code fence labeled flowhunt.
- json.loads: Converts the extracted JSON string to a Python dictionary, enabling us to access the download_link key.
- Error Handling: If the JSON fails to decode, it prints an error message and returns None.
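To see the function in action without calling the API, you can feed it a hand-written message. The sample string below is an assumption reconstructed from the regular expression, not a captured FlowHunt response, and the URL is a placeholder. The fence marker is built with string multiplication purely so that this example does not contain a literal triple backtick.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, the delimiter the pattern looks for


def extract_attachment_url(data_string: str) -> Optional[str]:
    """Return the 'download_link' from a flowhunt-fenced JSON payload, if any."""
    pattern = FENCE + r'flowhunt\n({.*})\n' + FENCE
    match = re.search(pattern, data_string, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1)).get('download_link', None)
        except json.JSONDecodeError:
            print("Error: Failed to decode JSON.")
            return None
    return None


# Hypothetical response text with an embedded payload.
sample = (
    "Your file is ready.\n"
    + FENCE + 'flowhunt\n{"download_link": "https://example.com/invoice.csv"}\n' + FENCE
)

print(extract_attachment_url(sample))
print(extract_attachment_url("no payload here"))
```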
API Configuration and Authentication
convert_pdf_to_image("data/test.pdf")
FLOW_ID = "<FLOW_ID_HERE>"
configuration = flowhunt.Configuration(
    host="https://api.flowhunt.io",
    api_key={"APIKeyHeader": "<API_KEY_HERE>"}
)
This section sets up API access with your FlowHunt credentials:
- convert_pdf_to_image: Converts the specified PDF into images.
- FLOW_ID: Identifies the flow to run. It is used later, when the session is created.
- FlowHunt Configuration: The Configuration object takes the API host and your FlowHunt API key, which provide secure access to the OCR API.
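Hard-coding credentials in the script makes them easy to leak. One common alternative is to read them from environment variables, as in this minimal sketch. The variable names FLOWHUNT_API_KEY and FLOWHUNT_FLOW_ID and the helper itself are assumptions for illustration, not something the FlowHunt SDK defines.

```python
import os
from typing import Mapping


def load_flowhunt_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the API key and flow ID from environment variables (hypothetical names)."""
    required = ("FLOWHUNT_API_KEY", "FLOWHUNT_FLOW_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": "https://api.flowhunt.io",
        # Same dictionary shape the script passes as api_key
        "api_key": {"APIKeyHeader": env["FLOWHUNT_API_KEY"]},
        "flow_id": env["FLOWHUNT_FLOW_ID"],
    }
```

The host and api_key entries can then be passed to flowhunt.Configuration in place of the literals, and flow_id can replace the FLOW_ID constant.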
Initializing the API Client
with flowhunt.ApiClient(configuration) as api_client:
    auth_api = flowhunt.AuthApi(api_client)
    api_response = auth_api.get_user()
    workspace_id = api_response.api_key_workspace_id
This code uses the ApiClient to authenticate and retrieve the workspace_id, which is required for subsequent API calls.
Starting a Flow Session
flows_api = flowhunt.FlowsApi(api_client)
from_flow_create_session_req = flowhunt.FlowSessionCreateFromFlowRequest(flow_id=FLOW_ID)
create_session_rsp = flows_api.create_flow_session(workspace_id, from_flow_create_session_req)
The session setup involves creating an OCR session within the FlowHunt workflow:
- FlowSessionCreateFromFlowRequest: Sets up a session with the provided flow_id.
- create_flow_session: This method starts the session for uploading images and processing OCR.
Uploading Images for OCR Processing
for image in os.listdir("data/images"):
    image_name, image_extension = os.path.splitext(image)
    with open("data/images/" + image, "rb") as file:
        try:
            flow_sess_attachment = flows_api.upload_attachments(
                create_session_rsp.session_id,
                file.read()
            )
            # the try block continues with the invoke and polling steps below
In this loop:
- os.listdir: Lists all images in data/images.
- upload_attachments: Uploads each image in the session, preparing it for OCR processing.
Invoking OCR Processing and Polling for Results
invoke_rsp = flows_api.invoke_flow_response(
    create_session_rsp.session_id,
    flowhunt.FlowSessionInvokeRequest(message="")
)
This line triggers OCR processing, and then a polling loop checks OCR completion:
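while True:
    get_flow_rsp = flows_api.poll_flow_response(
        create_session_rsp.session_id, invoke_rsp.message_id
    )
    print("Flow response: ", get_flow_rsp)
    if get_flow_rsp.response_status == "S":
        print("done OCR")
        # the download step from the next section runs here
        break
    # wait 3 seconds before polling again
    time.sleep(3)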
This loop checks the OCR status every 3 seconds until completion. Once the status is "S" (success), it proceeds to download the output.
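A while True loop never exits if the flow never reaches the success status. A minimal sketch of the same polling idea with a timeout is shown below. The helper poll_until, its parameters, and the 120-second default are assumptions for illustration, not part of the FlowHunt SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(fetch: Callable[[], T],
               is_done: Callable[[T], bool],
               interval: float = 3.0,
               timeout: float = 120.0) -> T:
    """Call fetch() every `interval` seconds until is_done(result) is true.

    Raises TimeoutError if the next wait would go past `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"no successful response within {timeout} seconds")
        time.sleep(interval)
```

In the script, fetch would wrap the flows_api.poll_flow_response call and is_done would check response_status == "S".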
Downloading and Saving OCR Output
attachment_url = extract_attachment_url(get_flow_rsp.final_response[0])
if attachment_url:
    response = requests.get(attachment_url)
    with open("data/results/" + image_name + ".csv", "wb") as file:
        file.write(response.content)
The output is saved as a CSV:
- extract_attachment_url: Retrieves the download link from the response.
- requests.get: Downloads the CSV, saving it to data/results/.
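Once the CSV bytes are downloaded, the standard csv module can turn them into rows for further processing. This is a minimal sketch: the column names in the sample are hypothetical, since the actual columns depend on how your flow is configured, and UTF-8 encoding is assumed.

```python
import csv
import io
from typing import Dict, List


def read_csv_rows(content: bytes, encoding: str = "utf-8") -> List[Dict[str, str]]:
    """Parse downloaded CSV bytes into a list of dictionaries keyed by header."""
    text = content.decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical CSV content, standing in for response.content
sample_content = b"invoice_number,total\nINV-001,120.50\n"
for row in read_csv_rows(sample_content):
    print(row["invoice_number"], row["total"])
```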
Running the Script and Testing Output
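To execute this script:
- Place your PDF in the data/ folder (the script reads data/test.pdf).
- Make sure the data/images/ and data/results/ folders exist, since the script writes to both.
- Update <FLOW_ID_HERE> and <API_KEY_HERE> with your FlowHunt credentials.
- Run the script to convert the PDF, upload images for OCR, and download the structured CSV results.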
Conclusion
This Python script offers an efficient solution for scaling OCR processes, ideal for industries with high document processing demands. With FlowHunt’s API, this solution handles document-to-CSV conversion, streamlining workflows and boosting productivity.
Full Code Overview
Click HERE to get the Gist
import json
import os
import re
import time
import requests
import flowhunt
from flowhunt.rest import ApiException
from pprint import pprint
from pdf2image import convert_from_path


def convert_pdf_to_image(path: str) -> None:
    """
    Convert each page of a PDF file to a JPEG image in data/images/.
    """
    images = convert_from_path(path)
    for i in range(len(images)):
        # Save each page of the PDF as its own image
        images[i].save('data/images/' + 'page' + str(i) + '.jpg', 'JPEG')


def extract_attachment_url(data_string):
    # Regular expression that finds the JSON object in the response string
    pattern = r'```flowhunt\n({.*})\n```'
    match = re.search(pattern, data_string, re.DOTALL)
    if match:
        # Extract the JSON object from the matched pattern
        json_string = match.group(1)
        try:
            # Load the JSON data
            json_data = json.loads(json_string)
            # Return the 'download_link' value from the JSON data
            return json_data.get('download_link', None)
        except json.JSONDecodeError:
            # The extracted string is not valid JSON
            print("Error: Failed to decode JSON.")
            return None
    return None


convert_pdf_to_image("data/test.pdf")
FLOW_ID = "<FLOW_ID_HERE>"

# All page images are expected in the data/images folder.
# Configure API key authorization via the APIKeyHeader entry.
configuration = flowhunt.Configuration(
    host="https://api.flowhunt.io",
    api_key={"APIKeyHeader": "<API_KEY_HERE>"}
)

# Enter a context with an instance of the API client
with flowhunt.ApiClient(configuration) as api_client:
    # Get the workspace_id
    auth_api = flowhunt.AuthApi(api_client)
    api_response = auth_api.get_user()
    workspace_id = api_response.api_key_workspace_id

    # Create an instance of the API class and start a session from the flow
    flows_api = flowhunt.FlowsApi(api_client)
    from_flow_create_session_req = flowhunt.FlowSessionCreateFromFlowRequest(
        flow_id=FLOW_ID
    )
    create_session_rsp = flows_api.create_flow_session(workspace_id, from_flow_create_session_req)

    # Loop through the images and attach each one to the flow session
    for image in os.listdir("data/images"):
        image_name, image_extension = os.path.splitext(image)
        with open("data/images/" + image, "rb") as file:
            try:
                flow_sess_attachment = flows_api.upload_attachments(
                    create_session_rsp.session_id,
                    file.read()
                )
                # Invoke the flow; the response will point to a CSV file
                invoke_rsp = flows_api.invoke_flow_response(
                    create_session_rsp.session_id,
                    flowhunt.FlowSessionInvokeRequest(message="")
                )
                # Poll for the result in a loop with a 3-second interval
                while True:
                    get_flow_rsp = flows_api.poll_flow_response(
                        create_session_rsp.session_id, invoke_rsp.message_id
                    )
                    print("Flow response: ", get_flow_rsp)
                    if get_flow_rsp.response_status == "S":
                        print("done OCR")
                        # Extract the URL of the output attachment
                        attachment_url = extract_attachment_url(get_flow_rsp.final_response[0])
                        if attachment_url:
                            print("Attachment URL: ", attachment_url, "\n Downloading the file...")
                            # Download the file
                            response = requests.get(attachment_url)
                            # Save the CSV file in data/results
                            with open("data/results/" + image_name + ".csv", "wb") as out_file:
                                out_file.write(response.content)
                        break
                    time.sleep(3)
            except ApiException as e:
                print("error for file ", image)
                print(e)