Objective of this Experiment
The primary objective of this experiment is to develop algorithms for both Closed-Domain and Open-Domain tabular extraction from business documents such as annual reports. For the Closed-Domain problem, we analysed and reviewed various algorithms in order to build a model that fuses machine learning with business logic to extract custom tables from the provided datasets. For the Open-Domain problem, we used algorithms that extract the table from a user-provided page number.
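To make the Open-Domain interface concrete, the following is a minimal sketch of extracting table-like rows from a user-provided page number. It is not the algorithm used in the experiment. It assumes the pages are already available as plain text with column spacing preserved, and it uses a simple heuristic (cells separated by two or more spaces) as a stand-in for the actual extraction step. The function name `extract_table_from_page` and the `min_columns` parameter are hypothetical.

```python
import re

# Assumption: columns in the extracted page text are separated by 2+ spaces.
COLUMN_GAP = re.compile(r"\s{2,}")


def extract_table_from_page(pages, page_number, min_columns=2):
    """Return table-like rows from the 1-based page `page_number`.

    `pages` is a list of plain-text strings, one per page (assumed to come
    from an upstream PDF-to-text step that preserves column spacing).
    """
    if not 1 <= page_number <= len(pages):
        raise ValueError(f"page_number must be between 1 and {len(pages)}")

    rows = []
    for line in pages[page_number - 1].splitlines():
        # Split the line into cells and drop empty fragments.
        cells = [c for c in COLUMN_GAP.split(line.strip()) if c]
        # Keep only lines that split into enough cells to look like a table row.
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


if __name__ == "__main__":
    sample_pages = [
        "Chairman's statement\nThis was a strong year for the group.",
        "Consolidated balance sheet\n"
        "Item          2019     2018\n"
        "Total assets  1,200    1,100\n"
        "Total equity    450      400",
    ]
    for row in extract_table_from_page(sample_pages, 2):
        print(row)
```

Narrative lines and single-column headings are dropped because they do not split into at least `min_columns` cells. A real report would need a layout-aware extractor rather than this whitespace heuristic.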
Business Use Cases and Applications
This experiment has multiple direct and indirect applications, including:
- Annual Report Analysis: All publicly listed companies publish quarterly and annual reports. These documents contain a great deal of information about the company’s financial health, such as the balance sheet, P&L report and shareholding, and most of it is reported in tabular form. Closed-Domain and Open-Domain extraction can both be used to fetch the relevant tabular data as required, saving a significant amount of manual effort.
- Invoice Automation: Many small-scale and large-scale industries still generate their invoices in tabular formats. These do not provide properly secured tax statements. To overcome such hurdles, table extraction can be used to convert all invoices into an editable format so that they can be upgraded to a newer version.
Dataset
Corporate Filings: Annual reports from 30+ companies with UK and US filings over the past ten years were gathered. An annual report is a corporate document disseminated to shareholders that spells out the company’s financial condition and operations over the reporting period. These reports are usually encrypted. The information is supported by a combination of graphics, photos and an accompanying narrative that chronicles the company’s activities, which makes extracting relevant information in the form of tables and text extremely difficult. The training data consisted of around 250 annual reports spanning ten years of both US and UK filings.
Environment Setup
- Python: used for documentation, exploratory data analysis and pre-processing, via the reticulate package and Python regular expressions.
- Training: AWS Deep Learning AMI (instance type: t2.xlarge), used for parallel processing and training models.
- Inference: AWS Deep Learning AMI (instance type: t2.xlarge), used for loading the trained model and running tabular extraction.
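As an illustration of the kind of regular-expression pre-processing mentioned above, the sketch below normalises a reported figure into a number. The patterns and the helper name `parse_amount` are assumptions made for this example, not the expressions used in the experiment. It handles thousands separators, a few currency symbols, and the accounting convention of writing negatives in parentheses.

```python
import re

# Currency symbols, thousands separators and whitespace to strip (assumed set).
_NOISE = re.compile(r"[£$€,\s]")
# After cleaning, a numeric cell is digits with an optional decimal part.
_NUMBER = re.compile(r"\d+(\.\d+)?")


def parse_amount(cell):
    """Convert a figure such as '(1,234.5)' or '£2,000' to a float.

    Returns None for cells that are not numeric (labels, dashes, blanks).
    Parentheses are treated as a negative sign.
    """
    text = cell.strip()
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    elif text.startswith("-"):
        negative = True
        text = text[1:]
    text = _NOISE.sub("", text)
    if not _NUMBER.fullmatch(text):
        return None
    value = float(text)
    return -value if negative else value


if __name__ == "__main__":
    for raw in ["(1,234.5)", "£2,000", "Total assets", "-"]:
        print(repr(raw), "->", parse_amount(raw))
```

Returning `None` for non-numeric cells keeps row labels and placeholder dashes distinguishable from genuine zero values in later steps.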