To help advance machine learning research on table extraction from images in the financial domain, we present SynFinTabs, a dataset of 100,000 synthetic financial tables. The table images come with HTML, JSON, and CSV representations. Because we generate the table images ourselves, we know the ground-truth structure and contents of each table at the time of creation and can therefore annotate each word, cell, and row with its corresponding bounding box in the image. Unlike other datasets, our cell bounding boxes represent the full, spatially meaningful cell rather than the minimum pixel region containing the cell's text.
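As a quick illustration of how the dataset might be accessed, the minimal sketch below uses the Hugging Face datasets library; the repository identifier and the example's field names are assumptions and may differ from the published dataset.

```python
# A minimal sketch of loading SynFinTabs with the Hugging Face datasets
# library. The repository identifier and the field names printed below
# are assumptions, not confirmed details of the published dataset.
from datasets import load_dataset

dataset = load_dataset("ethanbradley/synfintabs")  # hypothetical repo id

example = dataset["train"][0]
print(example.keys())  # e.g. image, rows, words, question-answer pairs
```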
The 100,000 tables of SynFinTabs are split across six themes. The first theme makes up 40% of the dataset and aims to represent tables found in financial statements filed with Companies House. The remaining 60% is evenly split across five themes that aim to represent financial tables found in spreadsheets or stylised company reports. The dataset is divided into train, validation, and test splits of 80%, 10%, and 10%, respectively, with each theme represented proportionally in each split.
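A proportional split of this kind amounts to shuffling and splitting each theme independently. The sketch below shows one way to do this; it is not the authors' code, and the `theme` field name is an assumption.

```python
# A sketch of an 80/10/10 split stratified by theme, so each theme is
# represented proportionally in every split. Not the authors' code; the
# 'theme' key on each record is an assumption.
import random
from collections import defaultdict

def stratified_split(records, seed=42):
    by_theme = defaultdict(list)
    for record in records:
        by_theme[record["theme"]].append(record)

    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for theme_records in by_theme.values():
        rng.shuffle(theme_records)
        n = len(theme_records)
        n_train, n_val = int(n * 0.8), int(n * 0.1)
        splits["train"] += theme_records[:n_train]
        splits["validation"] += theme_records[n_train:n_train + n_val]
        splits["test"] += theme_records[n_train + n_val:]
    return splits
```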
To design the Companies House–style theme, we downloaded thousands of financial statements filed with Companies House in March 2023 and extracted tables from them using an off-the-shelf table detection model, Table Transformer. From a random sample of these tables, we observed common structural and stylistic properties and developed a CSS template to mimic them. The company report–style theme is designed to mimic tables found in stylised company reports of the kind distributed to shareholders; we examined one company's annual report and created a CSS template that styles some of our tables to look similar. To design the remaining four themes, we searched online for images of financial spreadsheets and based our CSS templates on some of the results. Within each theme, many table properties vary, including the typeface, font size, bold or regular headers, numbering of table sections, date format, number of columns, and use of a note column. The themes add diversity to the dataset by varying the visual properties of the tables, which helps when training vision models on various information and table extraction tasks.
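To illustrate how a theme might be expressed as a parameterised CSS template, the sketch below renders a tiny stylesheet from a table specification. The actual SynFinTabs templates are not reproduced here, so every property name and value is an illustrative assumption.

```python
# Illustrative only: a tiny CSS template driven by a table specification.
# The real SynFinTabs templates are not shown in this description, so all
# selectors, properties, and values here are assumptions.
def render_css(spec: dict) -> str:
    return f"""
    table {{
        font-family: {spec['typeface']};
        font-size: {spec['font_size']}pt;
        border-collapse: collapse;
    }}
    th {{ font-weight: {'bold' if spec['bold_headers'] else 'normal'}; }}
    td, th {{ padding: 4px 8px; }}
    """

print(render_css({"typeface": "Arial", "font_size": 10, "bold_headers": True}))
```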
To create a synthetic financial table, we first generate a table specification: a blueprint of the table to be created, covering the number of sections and columns, the theme, the typeface and font size, the format of the date headers, and the table's stylistic properties. From this specification, we create a table object containing the rows of the table; each row contains cells, and each cell contains words or a number. Each section title is randomly selected from a list of titles commonly seen in real-world financial tables. Textual cells are populated with random words drawn from a vocabulary of 10,000 English words, and numerical cells are populated with a random number.

We then create an HTML document containing an HTML table element that represents the table object. During the conversion, the HTML element for each row, cell, and word is given a unique ID corresponding to its location in the table. The document is opened in a headless browser with a window size equivalent to an A4 page. Each row, cell, and word element is then located by its ID, and its bounding box is retrieved and saved to an annotations file. We also save every word and number that appears in the table, along with the bounding box of the full table. A screenshot of the browser window gives us the final document image.

Finally, a question-answer pair is generated for each non-empty cell in the table. We build a natural language question from the cell's row and column headers and store, alongside each pair, the headers themselves and the start and end positions of the target answer span in the flattened list of table words. One of the table's question-answer pairs is randomly selected as the competition pair, which can be used for training, validation, or testing, depending on the dataset split. The generation process is repeated until a dataset of the desired size has been created.
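The bounding-box step can be sketched with a browser automation library. The description above does not name the specific headless browser or tooling, so Playwright, the element ID scheme, and the file path below are all assumptions.

```python
# A sketch of retrieving element bounding boxes from a rendered HTML table
# using Playwright (assumed; the original tooling is not specified). IDs
# such as "cell-0-1" illustrate the described per-element ID scheme.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    # Roughly A4 at 96 DPI.
    page = browser.new_page(viewport={"width": 794, "height": 1123})
    page.goto("file:///path/to/table.html")  # hypothetical path

    cell_box = page.locator("#cell-0-1").bounding_box()
    # -> {'x': ..., 'y': ..., 'width': ..., 'height': ...}

    page.screenshot(path="table.png")  # the final document image
    browser.close()
```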
Our rich labelling makes the dataset well suited to training machine learning models on a range of table extraction tasks. The structure annotations enable training on table structure recognition, and the word-level annotations allow for training on natural language processing tasks such as table visual question-answering. Additionally, SynFinTabs could be used to create synthetic financial documents in which the positions of tables are accurately known; with such a dataset, table detection models could be trained to detect financial tables within financial documents. Extracting features from table images with OCR is a crucial step in any training, testing, or inference pipeline where the words and their positions are not already known, and some OCR solutions perform poorly when text is presented in tabular format, impacting downstream steps that rely on the OCR output. Our annotations include every word in each table, together with its position, giving the dataset the potential to be used to train OCR systems to extract text from tables.
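For example, the word-level annotations can be passed to a layout-aware model's processor in place of OCR output. The sketch below uses LayoutLMv3 from the transformers library as one possible choice; the example words and boxes are assumptions standing in for the dataset's annotations.

```python
# A sketch of encoding a table image for a layout-aware model, bypassing
# OCR by supplying the dataset's own word annotations. LayoutLMv3 is one
# possible model choice; the words and boxes below are placeholder values.
from transformers import LayoutLMv3Processor
from PIL import Image

processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False  # we supply words/boxes
)

image = Image.open("table.png").convert("RGB")
words = ["Turnover", "2023", "1,000"]  # from the annotations (illustrative)
boxes = [[50, 40, 160, 60], [400, 40, 460, 60], [400, 90, 460, 110]]

# LayoutLMv3 expects boxes normalised to a 0-1000 coordinate space.
w, h = image.size
norm = [[int(1000 * x0 / w), int(1000 * y0 / h),
         int(1000 * x1 / w), int(1000 * y1 / h)] for x0, y0, x1, y1 in boxes]

encoding = processor(image, words, boxes=norm, return_tensors="pt")
```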
| Date made available | 05 Dec 2024 |
| --- | --- |
| Publisher | Hugging Face |
| Date of data production | Oct 2023 |