{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hybrid Parser step-by-step\n", "\n", "This notebook describes the algorithms behind the hybrid parser, which blends the results of the network parser (text based) and the lattice parser (image based).\n", "\n", "You can modify the section below to point to a pdf or your choice to visualize how the algorithm analyzes it. By default, it points to one of the test .pdfs included with camelot.\n", "\n", "You can also use the `parser-comparison-notebook` notebook to compare the parsers results side-by-side." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Bootstrap and common imports\n", "import os, sys, time\n", "sys.path.insert(0, os.path.abspath('')) # Prefer the local version of camelot if available\n", "import camelot\n", "\n", "print(f\"Using Camelot v{camelot.__version__} from file {camelot.__file__}.\")\n", "\n", "# Select a pdf to analyze.\n", "kwargs = {}\n", "data = None\n", "# pdf_file = \"vertical_header.pdf\" # test_network_vertical_header\n", "# pdf_file, kwargs = \"background_lines_1.pdf\", {} # {\"process_background\": True} # test_lattice_process_background\n", "\n", "# pdf_file, kwargs, data = \"superscript.pdf\", {\"flag_size\": True}, data_stream_flag_size # test_network_flag_size\n", "# pdf_file = \"health.pdf\" # test_network\n", "# pdf_file = \"clockwise_table_2.pdf\"\n", "# pdf_file = \"tabula/12s0324.pdf\" # interesting because contains two separate tables\n", "# pdf_file, kwargs = \"tabula/us-007.pdf\", {\"table_regions\": [\"320,335,573,505\"]} # test_network_table_regions\n", "# pdf_file, kwargs = \"tabula/us-007.pdf\", {\"table_areas\": [\"320,500,573,335\"]} # test_network_table_areas\n", "# pdf_file, kwargs = \"detect_vertical_false.pdf\", {\"strip_text\": \" ,\\n\"} # data_stream_strip_text\n", "# pdf_file, kwargs, data = \"tabula/m27.pdf\", {\"columns\": [\"72,95,209,327,442,529,566,606,683\"], \"split_text\": True, }, data_stream_split_text # data_stream_split_text\n", "# pdf_file = \"clockwise_table_2.pdf\" # test_network_table_rotated / test_stream_table_rotated\n", "pdf_file = \"vertical_header.pdf\"\n", "\n", "# pdf_file = \"twotables_2.pdf\"\n", "# pdf_file = \"camelot-issue-132-multiple-tables.pdf\"\n", "# pdf_file, kwargs, data = \"edge_tol.pdf\", {\"edge_tol\": 500}, data_stream_edge_tol\n", "# pdf_file, kwargs, data = \"edge_tol.pdf\", {}, data_stream_edge_tol\n", "\n", "filename = os.path.join(\n", " os.path.dirname(os.path.abspath('.')),\n", " \"camelot/tests/files\",\n", " pdf_file\n", ")\n", "\n", "# Set up plotting options\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "PLOT_HEIGHT = 12\n", "def init_figure_and_axis(title):\n", " fig = plt.figure(figsize=(PLOT_HEIGHT * 2.5, PLOT_HEIGHT))\n", " ax = fig.add_subplot(111)\n", " ax.set_title(title)\n", " return fig, ax\n", "\n", "# Utility function to display tables\n", "def display_parse_results(tables, parse_time, flavor):\n", " if not tables:\n", " return\n", " tables_dims = \", \".join(\n", " map(\n", " lambda table: \"{rows}x{cols}\".format(\n", " rows=table.shape[0],\n", " cols=table.shape[1],\n", " ), tables\n", " )\n", " )\n", " print(f\"The {flavor} parser found {len(tables)} table(s) ({tables_dims}) in {parse_time:.2f}s\")\n", " for table in tables:\n", " display(table.df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overall Algorithm\n", "\n", "The hybrid parser combines results from the network parser and the lattice parser to get the \"best of both worlds.\" Before we look at the combination itself, let's see how each of the two parsers work.\n", "\n", "### Network parser\n", "\n", "The network parser is text-based: it relies on the bounding boxes of the text elements encoded in the .pdf document to identify patterns indicative of a table.\n", "\n", "The plot belows shows the bounding boxes of all the text elements on the parsed document, in light blue for horizontal elements, light red for vertical elements (rare in most documents)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Parse file\n", "flavor = \"network\"\n", "timer_before_parse = time.perf_counter()\n", "tables = camelot.read_pdf(filename, flavor=flavor, debug=True, **kwargs)\n", "timer_after_parse = time.perf_counter()\n", "\n", "if tables:\n", " fig, ax = init_figure_and_axis(f\"Text elements in PDF\\n{pdf_file}\")\n", " camelot.plot(tables[0], kind=\"text\", ax=ax)\n", "else:\n", " print(\"No table found for this document.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Network parser - step 1: Identify a network of connected alignments\n", "\n", "The network parser starts by identifying common horizontal (shown in green on the plot below) or vertical (in blue) coordinate alignments across these text elements. In other words it looks for bounding box rectangles which either share the same top, center, or bottom coordinates (horizontal axis), or the same left, right, or middle coordinates (vertical axis). See the `generate` method.\n", "\n", "Once the parser found these alignments, it performs some pruning to only keep text elements that are part of a network - they have connections along both axis The idea is that it's not enough for two elements to be aligned to belong to a table, for instance the lines of text in this paragraph are all left-aligned, but they do not form a network. The pruning is done iteratively, see `remove_unconnected_edges` method.\n", "\n", "Once the network is pruned, the parser keeps track of how many alignments each text element belongs to: that's the number on top (vertical alignments) or to the left of each alignment in the plot below. The text element with the most connections (in red on the plot) is the starting point -the *seed*- of the next step. Finally, the parser measures how far the alignments are from one another, to determine a plausible search zone around each cell for the next stage of growing the table. See `compute_plausible_gaps` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if tables:\n", " fig, ax = init_figure_and_axis(f\"Text edges in PDF\\n{pdf_file}\")\n", " camelot.plot(tables[0], kind=\"textedge\", ax=ax)\n", "else:\n", " print(f\"No table found for document {pdf_file}.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Network parser - step 2: Detect table body iteratively from seed\n", "\n", "In the next step, the parser iteratively \"grows\" a table, starting from the seed identified in the previous step. The bounding box is initialized with the bounding box of the seed, then it iteratively searches for text elements that are close to the bounding box, then grows the table to ingest them, until there are no more text elements to ingest. The two steps are:\n", "* Search: create a search bounding box by expanding the current table bounding box in all directions, based on the plausible gap numbers determined above. Search bounding boxes are shown in orange on the graph below. \n", "* Grow: if a networked text element is found in this search area, expand the table bounding box so that it includes this new element. Each successive table bounding box is shown in red in the plot below.\n", "\n", "Notice in the plot below how the search area and the table bounding box grow starting from the seed. See method `search_table_body`.\n", "\n", "#### Network parser - step 3: Search for a header section\n", "\n", "Headers are often aligned differently from the rest of the table. To account for this, the network parser searches for text elements that are good candidates for a header section: these text elements are just above the bounding box of the body of the table, and they fit within the rows identified in the table body. See the method `search_header_from_body_bbox`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if tables:\n", " fig, ax = init_figure_and_axis(f\"Growth steps for table in PDF\\n{pdf_file}\")\n", " camelot.plot(tables[0], kind=\"network_table_search\", ax=ax)\n", "else:\n", " print(\"No table found for this document.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Network parser - step 4: Repeat\n", "\n", "There are sometimes multiple tables on one page. So once a first table is identified, all the text edges it contains are removed, and the algorithm is repeated until no new network is identified.\n", "\n", "The final parse for this .pdf is as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "display_parse_results(tables, timer_after_parse - timer_before_parse, flavor)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lattice parser\n", "\n", "The lattice parser is based on an analyzis of the image from the .pdf, rather than its text content. It relies on the borders of the tables to be solid vertical lines.\n", "\n", "#### Lattice parser - step 1: Identify solid lines within the document.\n", "\n", "The lattice parser relies on the OpenCV library (`getStructuringElement` function) to detect all solid vertical and horizontal lines within the document." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Parse file\n", "flavor = \"lattice\"\n", "timer_before_parse = time.perf_counter()\n", "tables = camelot.read_pdf(filename, flavor=flavor, debug=True, **kwargs)\n", "timer_after_parse = time.perf_counter()\n", "\n", "if tables:\n", " fig, ax = init_figure_and_axis(f\"Line structure in PDF\\n{pdf_file}\")\n", " camelot.plot(tables[0], kind=\"line\", ax=ax)\n", "else:\n", " print(\"No table found for this document.\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Lattice parser - step 2: Find the contours of the table(s) based on the solid lines.\n", "\n", "The lattice parser then uses OpenCV's `findContours` function to detect the overall bounding box of the table(s), since the solid lines might draw more than one table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for table in tables:\n", " fig, ax = init_figure_and_axis(f\"Contour structure in PDF\\n{pdf_file}\")\n", " camelot.plot(table, kind=\"contour\", ax=ax)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Lattice parser - step 3: Identify joints\n", "\n", "For each table bounding box (contour), the lattice parser then makes a list of all the intersections between vertical and horizontal lines: the joints." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for table in tables:\n", " fig, ax = init_figure_and_axis(f\"Joint structure in PDF\\n{pdf_file}\")\n", " camelot.plot(table, kind=\"joint\", ax=ax)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Lattice parser - step 4: Identify rows and columns\n", "\n", "In the final step, the algorithm sorts all the x coordinates of the joints to identify the position of the table's columns, and the y coordinates for the table's rows. See method `_generate_columns_and_rows`.\n", "\n", "The resulting lattice parse for the .pdf is as follows." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "display_parse_results(tables, timer_after_parse - timer_before_parse, flavor)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining results of Network and Lattice with the Hybrid parser\n", "\n", "The hybrid parser aims to combine the strengths of the Network parser (identifying cells based on text alignments) and of the Lattice parser (relying on solid lines to determine tables rows and columns boundaries).\n", "\n", "#### Hybrid parser - step 1: Apply both parsers table bounding box detection techniques to the document\n", "\n", "In this step, hybrid calls both parsers, to get a) the standard table parse, b) the coordinates of the rows and columns boundaries, and c) the table boundaries (or contour).\n", "\n", "#### Hybrid parser - step 2: Merge the results\n", "\n", "If there are areas in the document where both lattice and network found a table, the hybrid parser uses the results from network, but enhances them based on the rows/columns boundaries identified by lattice in the area. Because lattice uses the solid lines detected on the document, the coordinates for b) and c) detected by Lattice are generally more precise. See the `_merge_bbox_analysis` method.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "flavor = \"hybrid\"\n", "timer_before_parse = time.perf_counter()\n", "tables = camelot.read_pdf(filename, flavor=flavor, debug=True, **kwargs)\n", "timer_after_parse = time.perf_counter()\n", "\n", "display_parse_results(tables, timer_after_parse - timer_before_parse, flavor)" ] } ], "metadata": { "language_info": { "name": "python", "codemirror_mode": { "name": "ipython", "version": 3 }, "version": "3.7.7-final" }, "orig_nbformat": 2, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "npconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 3, "kernelspec": { "name": "python37764bit8418972e58f441528b05b4b21a1f095d", "display_name": "Python 3.7.7 64-bit" } }, "nbformat": 4, "nbformat_minor": 2 }