AI has helped us solve problems that were once considered either unsolvable or too computationally expensive. Recruitment, or Talent Acquisition, is currently being disrupted by AI. One such hard-to-crack problem in this domain is Resume Parsing, which, if solved with precision, could save recruiters considerable time on the repetitive and tedious task of manually screening resumes.
We at Skillate are building Smart Recruitment technology to help identify, engage, and hire the best candidates using AI. Resume Parsing can be considered the first step towards achieving this goal.
It took us about a year and a half to develop a state-of-the-art (SoTA) Resume Parser that achieves more than 90% accuracy even on the most complex resumes (tested on thousands of resumes). As you can guess by now, solving this problem with such high accuracy required us to leverage cutting-edge Deep Learning technologies. In this post, we would like to share the knowledge and experience we gained while building this Artificially Intelligent piece of software.
Resume Parsing, formally speaking, is the conversion of a free-form CV/resume document into structured information suitable for storage, reporting, and manipulation by a computer.
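As a concrete illustration, a parsed resume might be represented as structured JSON like the sketch below. Note that the field names and values here are hypothetical examples for illustration, not Skillate's actual schema.

```python
import json

# Hypothetical structured output of a resume parser; any real
# parser's schema will differ.
parsed_resume = {
    "name": "Jane Doe",
    "email": "jane.doe@example.com",
    "skills": ["Python", "Machine Learning", "SQL"],
    "experience": [
        {"company": "Acme Corp", "title": "Data Scientist", "years": 2.5},
    ],
}

# Once the free-form text is structured like this, it is trivial
# to store, query, and report on.
print(json.dumps(parsed_resume, indent=2))
```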
The problem of Resume Parsing can be broken into two major subproblems: 1. Text Extraction, and 2. Information Extraction. To build a SoTA resume parser, both problems need to be solved with the highest possible accuracy. In this post, we will be talking about Text Extraction; Information Extraction will be discussed in upcoming articles.
Almost everyone tries to use a unique template to present the information on their CV. Even templates that might seem indistinguishable to the human eye are processed differently by the computer. This creates the possibility of hundreds of thousands of templates in which resumes are written worldwide. Not all templates are straightforward to read from. For example, one can find tables, graphics, and columns in a resume, and each such entity needs to be read in a different manner. It is therefore easy to conclude that rule-based parsers do not stand a chance, and that an intelligent algorithm is required to extract text in a meaningful manner from raw documents (pdf, doc, docx, etc.).
We explored several libraries to extract text from pdf, doc, and docx documents, but none of them could deliver the quality of results we were aiming for. It became evident that text extraction could not be solved by a single type of algorithm alone.
So we first created an entirely new classification system to segregate resumes into different types based on their template, and then tackled each type differently. Some of the types were straightforward, but most of them (like those containing tables, partitions, etc.) required higher-order intelligence from the software. For such complex types, we decided to use Optical Character Recognition (OCR), with Deep NLP algorithms on top, to extract text.
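To illustrate the routing idea, here is a toy heuristic classifier. The real system uses a learned classification model; this stand-in, including its feature choices and thresholds, is our own invention purely to show how a template class can decide which extraction strategy a resume is sent to.

```python
def classify_template(raw_text: str) -> str:
    """Toy heuristic that routes a resume to an extraction strategy.

    A production system would train a classifier on layout features;
    this stand-in keys off crude textual markers only.
    """
    # Runs of tabs or pipe characters often betray tables.
    has_table_markers = "\t\t" in raw_text or "|" in raw_text
    # Many wide gaps on a page can hint at a multi-column layout.
    multi_column = raw_text.count("  ") > 50
    if has_table_markers or multi_column:
        return "complex"  # route to the OCR + Deep NLP pipeline
    return "simple"       # route to plain text extraction

print(classify_template("Jane Doe\nData Scientist\nSkills: Python"))  # simple
print(classify_template("Company\t\tRole\t\tYears"))                  # complex
```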
For every problem, there is a hard way and a smart way to solve it, and we decided to go with the latter. OCR is a very generic problem that has been researched and solved by the biggest tech companies in the world. The best part is that this technology has been open-sourced as well! In this context, the hard way would have been to build a deep learning model from scratch for OCR and NLP; the smart way was to use the power of open source and deploy an off-the-shelf model for the task.
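In code, "the smart way" amounts to composing existing tools rather than building them. The sketch below shows one common composition pattern: try fast native text extraction first, and fall back to OCR when it yields nothing. The stub functions are placeholders of our own; a real pipeline would plug in an actual extraction library (e.g. a pdfminer-style tool) and an off-the-shelf OCR engine (e.g. Tesseract).

```python
def native_extract(doc: str) -> str:
    """Stub for a native text extractor (e.g. a pdfminer-style library)."""
    # Pretend native extraction returns nothing for scanned documents.
    if doc.startswith("scanned:"):
        return ""
    return doc

def ocr_extract(doc: str) -> str:
    """Stub for an off-the-shelf OCR engine (e.g. Tesseract)."""
    return doc.removeprefix("scanned:")

def extract_text(doc: str) -> str:
    """Try cheap native extraction first; fall back to OCR if it fails."""
    text = native_extract(doc)
    return text if text.strip() else ocr_extract(doc)

print(extract_text("Jane Doe, Data Scientist"))          # native path
print(extract_text("scanned:Jane Doe, Data Scientist"))  # OCR fallback path
```

The design choice mirrored here is that OCR is slower and only invoked when needed, so simple resumes stay on the fast path.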
With the help of our classification algorithm to segregate resumes, we were able to combine different technologies and obtain the best of each, building a highly accurate and fast text extraction method. Currently, we are able to extract text accurately from about 98% of simple resumes and 90% of complex ones.
In the next article, we will talk about the deep learning technology we built from scratch for the Information Extraction task. Stay tuned!
Link to the second part: https://bit.ly/2ZbjdWT