Introduction
Suppose you need to extract text from various sources for a project you are working on. Where can that text come from? For websites, we can use web scraping tools like Beautiful Soup; for documents such as PDFs, Python has libraries like PyPDF and PDFMiner. But what if you could extract text from images? This is known as OCR (Optical Character Recognition), and it is used in modern software such as Google Lens and Microsoft Math Calculator. Can we use it for our own purposes? Yes, we can: Python together with Tesseract-OCR can recognize text in images and convert it to strings. Here we are going to implement OCR using Python.
What is tesseract-OCR?
Tesseract-OCR is one of the most powerful Optical Character Recognition engines, and its development was sponsored by Google for many years. It is a free, open-source tool that runs on the major operating systems, including Windows, Linux, and macOS. It is built using techniques from Computer Vision, Machine Learning, and Natural Language Processing. Tesseract-OCR supports many common image formats, including JPG, PNG, JPEG, BMP, and JFIF, which makes it one of the best OCR tools.
Where can we use OCR?
OCR can be useful in many cases like:
- Extracting textual data from images helps to create data for machine learning and data science.
- Converting handwritten words into digital text.
- Retrieving information from passports and documents.
Installing tesseract-OCR and pytesseract
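For this purpose, we require a module named "pytesseract", which is a Python wrapper for Tesseract-OCR. The Tesseract-OCR engine itself is a separate program that has to be installed through your operating system's package manager or installer; pip installs only the wrapper:
$ pip install pytesseract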
If you are using a Windows system, you'll need to install the tesseract.exe file separately. Install it from this link.
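Before moving on, it can help to confirm that the Tesseract executable is actually reachable from Python. The following is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are hypothetical, and it assumes the executable is called `tesseract` and has been added to your PATH.

```python
import shutil
import subprocess


def find_tesseract(command="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(command)


def tesseract_version(executable):
    """Run `<executable> --version` and return the first line of its banner."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
    )
    # Some builds print the banner to stderr instead of stdout.
    banner = result.stdout or result.stderr
    return banner.splitlines()[0] if banner else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH; set tesseract_cmd manually.")
    else:
        print("Found Tesseract at:", path)
        print(tesseract_version(path))
```

If the executable is not found, you can still point pytesseract at it explicitly by setting `pytesseract.pytesseract.tesseract_cmd` to its full path.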
After successful installation, it's time to get into the coding part.
Here, we have a simple image file:
greetings.png
Now we are going to pass this image to our code and extract the text written in it as a string.
import pytesseract

# Path where the Tesseract executable is installed
pytesseract.pytesseract.tesseract_cmd = 'C:/Users/Tesseract-OCR/Tesseract'

# Convert the image to a string
image_to_text = pytesseract.image_to_string(r'C:/users/91759/Desktop/greeting.png')
print(image_to_text)

Output:
Welcome to PyCodeMates
First, we import the "pytesseract" module and set the path where the Tesseract executable is installed (note that Tesseract might be installed at a different path on your system). Then we specify the name and location of our image so that the program can find it. Finally, we get the correct output, "Welcome to PyCodeMates". You may be surprised how easy it is to implement your own Optical Character Recognition in about five lines of Python code.
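As a possible variation on the script above, the sketch below opens the image with Pillow before handing it to pytesseract and tidies up the result. The helper names `clean_ocr_text` and `extract_text` are hypothetical. It assumes Pillow is installed alongside pytesseract and that the raw output may contain blank lines or a trailing form feed character, which can vary with the Tesseract version.

```python
def clean_ocr_text(raw_text):
    """Drop blank lines and surrounding whitespace from raw OCR output."""
    # str.splitlines() also treats the form feed ("\x0c") that Tesseract
    # may append at the end of a page as a line boundary, so it is
    # discarded together with the empty lines.
    lines = [line.strip() for line in raw_text.splitlines()]
    return "\n".join(line for line in lines if line)


def extract_text(image_path, lang="eng"):
    """Open an image with Pillow and return its cleaned-up text."""
    # Imported here so clean_ocr_text works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        raw_text = pytesseract.image_to_string(image, lang=lang)
    return clean_ocr_text(raw_text)
```

Calling `extract_text` with the path to your own image returns the recognized text without stray blank lines. The `lang` argument selects the Tesseract language data to use; "eng" is assumed here because the sample image contains English text.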