How Do You Extract Calculation Data from a Page Using Python?


Invoices, receipts, physical accounting records, and other documents can be a headache to convert to digital form. Tools such as OCR can detect text and numbers in images, but what about calculations?

For example, you might have an invoice that already shows the subtotal, the tax, and the total.

Now, if you scan and convert such a document into digital format, you must manually enter calculation data. This can be a tiring task for the data entry operator and there can be errors as well.

To fully automate the process, you need a smart solution to extract calculation data from the page. Python is a language that can handle such complex procedures.

Python is a high-level, general-purpose language with a large ecosystem of libraries, which is one reason it is so widely used for artificial intelligence work.

Extracting calculation data is one such complex task, and here we will discuss how Python can help with it.

Data Calculation Process from a Webpage Using Python

First, we are going to talk about how you can extract data from a webpage using Python:

Analyze the Webpage Structure:

The first step is to analyze the webpage structure. To do this, inspect the page's HTML source to understand how the calculation data is organized. If you don’t know how to do that, right-click on the page and choose the “Inspect” option.

Webpage Structure

The browser's developer tools will then open, usually docked to the side or bottom of the page. Next, identify the specific HTML elements (tags, classes, IDs) that contain the data you need.

Choose Appropriate Libraries:

Python has several libraries that make this kind of work easier. For extracting data from webpages, the most useful ones are:

  • Beautiful Soup: This is an ideal choice for static HTML parsing. It allows easy navigation and extraction of data from HTML structures.
  • Requests: A library used to fetch the raw HTML content from the webpage. It is often used in conjunction with Beautiful Soup.
  • Selenium: This library can handle dynamic webpages that rely heavily on JavaScript. It drives a real browser, so JavaScript is executed and the content is rendered before you read it.

Fetch the Webpage Content:

This is the main step, where you fetch the webpage content by sending a request to its URL. The steps are:

  • First, import the requests library.
  • Then, specify the URL of the webpage you want to extract data from.
  • After that, send the GET request with this code: “response = requests.get(url)”.

This is how you will be able to fetch the webpage content. 
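The steps above can be sketched as a small helper. This is only a minimal sketch under a few assumptions: the function name fetch_html, the 10-second timeout, and the example URL are illustrative and not part of the steps above. The call to raise_for_status() is added so that an HTTP error page is not mistaken for real content.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML as text (hypothetical helper)."""
    # Send the GET request; the timeout stops the script from waiting forever.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    return response.text


# Example usage (placeholder URL, not a real invoice page):
# html_content = fetch_html("https://example.com/calculations")
```

Returning response.text gives a decoded string, which can be passed straight to Beautiful Soup in the next step.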

Parse the HTML:

Parsing turns the raw HTML into a structure you can search, so you can keep only the specific data you are looking for. This is especially handy when the page contains a lot of information but you only need one part of it.

Here is how you can do that:

Create a BeautifulSoup object:

soup = BeautifulSoup(html_content, 'html.parser')

Use Beautiful Soup’s methods to locate and extract the desired data:

  • find(), find_all(): Find elements by tag, class, ID, or other attributes.
  • text: Extract text content from elements.
  • attrs: Access attributes of elements.
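To make these methods concrete, here is a short sketch that runs on an inline HTML string instead of a live page. The invoice markup, the class names "price" and "result", and the data-currency attribute are all made up for illustration; a real page will use its own tags and classes.

```python
from bs4 import BeautifulSoup

# Hypothetical invoice markup; a real page will use different tags and classes.
html_content = """
<table id="invoice">
  <tr class="item"><td>Widget</td><td class="price" data-currency="USD">10.00</td></tr>
  <tr class="item"><td>Gadget</td><td class="price" data-currency="USD">5.50</td></tr>
  <tr><td>Total</td><td><span class="result">15.50</span></td></tr>
</table>
"""

soup = BeautifulSoup(html_content, "html.parser")

# find() returns the first matching element (or None if nothing matches).
total_text = soup.find("span", class_="result").text

# find_all() returns every matching element as a list.
price_cells = soup.find_all("td", class_="price")
prices = [cell.text for cell in price_cells]

# attrs is a dict of the element's HTML attributes.
currency = price_cells[0].attrs["data-currency"]

print(prices, total_text, currency)
```

Note that everything extracted this way is still a string; converting it to numbers is a separate step.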

Extract the Calculation Data and Store it:

Target the specific elements containing the calculation results. Extract the text content or attributes using Beautiful Soup’s methods. Optionally, perform any cleaning or formatting as needed.

Save the data to a file (CSV, JSON, etc.), or process it further within your Python code for calculations or analysis.
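As a rough sketch of the cleaning and saving steps, the example below converts extracted strings to Decimal values, re-checks the arithmetic, and defines a helper that writes the results to CSV and JSON. The sample values, the field names, the helper names to_decimal and save_amounts, and the output file names are assumptions chosen for illustration.

```python
import csv
import json
from decimal import Decimal

# Hypothetical values as they might come out of the parsing step.
raw = {"subtotal": "$1,200.00", "tax": "$96.00", "total": "$1,296.00"}


def to_decimal(text):
    """Strip the dollar sign, thousands separators, and spaces, then convert."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)


def save_amounts(amounts, csv_path, json_path):
    """Write the amounts to a CSV file and a JSON file (hypothetical helper)."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["field", "amount"])
        for name, amount in amounts.items():
            writer.writerow([name, str(amount)])
    # Decimal is not JSON-serializable, so the values are stored as strings.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump({name: str(amount) for name, amount in amounts.items()}, f, indent=2)


amounts = {name: to_decimal(value) for name, value in raw.items()}

# Re-check the arithmetic instead of trusting the extracted total.
is_consistent = amounts["subtotal"] + amounts["tax"] == amounts["total"]
print("Totals consistent:", is_consistent)

# Example usage (file names are placeholders):
# save_amounts(amounts, "calculations.csv", "calculations.json")
```

Decimal is used here rather than float so that currency amounts compare exactly.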

Code Example – Static Page with Beautiful Soup:

Moving on, here’s an example of how you can use the Beautiful Soup library in Python for this:

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/calculations'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Assuming the calculation result is in a <span> with class "result"
result = soup.find('span', class_='result').text
print("Calculation result:", result)

Extract Calculation Data from a Paper Page

Extracting calculation data from a paper page is entirely different from extracting it from a webpage. Here, the source is a physical document rather than HTML.

In that case, you can use Tesseract, an open-source OCR engine long sponsored by Google, or Pytesseract, a Python library that wraps it. These tools read text from digital images, such as scanned documents and photos, and turn printed or written text into a format that machines can read.

Tesseract

Tesseract is a free, open-source OCR engine that many people use. It works with many languages, and its accuracy depends largely on the quality of the image. It was first created at Hewlett-Packard, and Google later sponsored its development. It can read the words in a picture and turn them into text you can edit.

To use Tesseract to extract data from a paper page, you need the path to an image of that page. If you only have the physical page, photograph or scan it and save the image on your computer. Then use this code, which calls Tesseract through the pytesseract library, to extract the text from it.

from PIL import Image
import pytesseract

# Load an image from your local file system
image = Image.open('path_to_your_image.png')

# Use Tesseract to do OCR on the image
text = pytesseract.image_to_string(image)

# Print the text
print(text)

In this code, replace 'path_to_your_image.png' with the actual path to the image file you want to process.

Pytesseract

Pytesseract is a Python wrapper for Tesseract that makes it easier to integrate OCR functionality into Python scripts. It does not replace Tesseract: it calls the Tesseract engine, which must be installed separately on your system.

Check out the example below to get an idea of how Pytesseract can help you extract data from a paper page.

import pytesseract
from PIL import Image

# Load the image
img = Image.open('path/to/image.jpg')

# Extract text
text = pytesseract.image_to_string(img)

# Print the extracted text
print(text)
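OCR only returns plain text, so the calculation values still have to be picked out of it. The sketch below shows one possible way to do that with a regular expression. The sample OCR text, the labels (Subtotal, Tax, Total), and the number format are assumptions; real Tesseract output depends on the scan quality and the document layout, so the pattern would need adjusting.

```python
import re
from decimal import Decimal

# Hypothetical OCR output standing in for the result of image_to_string().
ocr_text = """
INVOICE #1042
Subtotal: 1,200.00
Tax (8%): 96.00
Total: 1,296.00
"""

# A known label at the start of a line, then the last amount on that line
# (digits with optional thousands separators and two decimal places).
AMOUNT_LINE = re.compile(
    r"^(subtotal|tax|total)\b.*?([\d,]+\.\d{2})[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

amounts = {
    label.lower(): Decimal(number.replace(",", ""))
    for label, number in AMOUNT_LINE.findall(ocr_text)
}

# Re-check the arithmetic, since OCR can misread individual digits.
is_consistent = amounts["subtotal"] + amounts["tax"] == amounts["total"]
print(amounts, is_consistent)
```

If the check fails, that is a useful signal that a digit was misread and the document needs a manual look.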

Online Image-to-Text Converter

For those who don’t have programming experience or prefer a simpler approach, an online OCR-based image-to-text converter can be a convenient alternative. Such a tool recognizes the text in the image you give it and returns it as editable output.

Some tools also accept a link to an image, so if the image you want to extract text from is on a webpage, you can paste its link into the tool instead of uploading a file.

Extracting text from an image with an OCR tool is simple, and most image-to-text converters follow the same procedure. You just have to:

  1. Open any tool you like.
  2. Upload the image of the paper page.
  3. Select the language and formatting options, if the tool offers them.
  4. Initiate the conversion process.
  5. Download the extracted text in the desired output format (TXT or DOC).

Final Words

In simple words, Python helps you get data from both webpages and physical documents. For websites, use libraries like Beautiful Soup, Requests, and Selenium to fetch and parse the page.

For paper pages, use OCR tools like Tesseract or Pytesseract to read the text, or choose an online image-to-text converter if you don’t want to write any code. Pick the method that best matches your data sources and coding skills.