Merge pull request #92 from socialcopsdev/refactor

Refactor
2018-09-09 10:06:29 +05:30 · 2018-09-09 10:06:29 +05:30 · 9c71d87c68
parent 9753889ea2 9a6ed555c8
commit 9c71d87c68
391 changed files with 2267 additions and 4010 deletions
--- a/.gitignore
+++ b/.gitignore
@ -6,3 +6,7 @@ build/
 dist/
 *.egg-info/
 .coverage
 .pytest_cache/
 _build/
 _static/
--- a/31
+++ b/31
@ -1,31 +0,0 @@
 PYTHON ?= python
 NOSETESTS ?= nosetests
 help:
 	@echo "Please use \`make <target>' where <target> is one of"
 	@echo "  clean"
 	@echo "  dev            to install in develop mode"
 	@echo "  undev          to uninstall develop mode"
 	@echo "  install        to install for all users"
 	@echo "  test           to run tests"
 	@echo "  test-coverage  to run tests with coverage report"
 clean:
 	$(PYTHON) setup.py clean
 	rm -rf dist
 dev:
 	$(PYTHON) setup.py develop
 undev:
 	$(PYTHON) setup.py develop --uninstall
 install:
 	$(PYTHON) setup.py install
 test:
 	$(NOSETESTS) -s -v
 test-coverage:
 	rm -rf coverage .coverage
 	$(NOSETESTS) -s -v --with-coverage
--- a/README.md
+++ b/README.md
@ -1,67 +1,31 @@
-# camelot
+# Camelot: PDF Table Parsing for Humans
-Camelot is a Python 2.7 library and command-line tool for getting tables out of PDF files.
+Camelot is a Python 2.7 library and command-line tool for extracting tabular data from PDF files.
 ## Usage
 <pre>
-from camelot.pdf import Pdf
+>>> import camelot
-from camelot.lattice import Lattice
+>>> tables = camelot.read_pdf("foo.pdf")
-
+>>> tables
-manager = Pdf(Lattice(), "/path/to/pdf")
+&lt;TableList n=2&gt;
-tables = manager.extract()
+>>> tables.export("foo.csv", f="csv", compress=True) # json, excel, html
-</pre>
+>>> tables[0]
-
+&lt;Table shape=(3,4)&gt;
-Camelot comes with a CLI where you can specify page numbers, output format, output directory etc. By default, the output files are placed in the same directory as the PDF.
+>>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
-
+>>> tables[0].parsing_report
-<pre>
+{
-Camelot: PDF parsing made simpler!
+    "accuracy": 96,
-
+    "whitespace": 80,
-usage:
+    "order": 1,
- camelot [options] &lt;method&gt; [&lt;args&gt;...]
+    "page": 1
-
+}
-options:
+>>> df = tables[0].df
 -h, --help                Show this screen.
 -v, --version             Show version.
 -V, --verbose             Verbose.
 -p, --pages &lt;pageno&gt;      Comma-separated list of page numbers.
                           Example: -p 1,3-6,10  [default: 1]
 -P, --parallel            Parallelize the parsing process.
 -f, --format &lt;format&gt;     Output format. (csv,tsv,html,json,xlsx) [default: csv]
 -l, --log                 Log to file.
 -o, --output &lt;directory&gt;  Output directory.
 -M, --cmargin &lt;cmargin&gt;   Char margin. Chars closer than cmargin are
                           grouped together to form a word. [default: 2.0]
 -L, --lmargin &lt;lmargin&gt;   Line margin. Lines closer than lmargin are
                           grouped together to form a textbox. [default: 0.5]
 -W, --wmargin &lt;wmargin&gt;   Word margin. Insert blank spaces between chars
                           if distance between words is greater than word
                           margin. [default: 0.1]
 -J, --split_text          Split text lines if they span across multiple cells.
 -K, --flag_size           Flag substring if its size differs from the whole string.
                           Useful for super and subscripts.
 -X, --print-stats         List stats on the parsing process.
 -Y, --save-stats          Save stats to a file.
 -Z, --plot &lt;dist&gt;         Plot distributions. (page,all,rc)
 camelot methods:
 lattice  Looks for lines between data.
 stream   Looks for spaces between data.
 ocrl     Lattice, but for images.
 ocrs     Stream, but for images.
 See 'camelot &lt;method&gt; -h' for more information on a specific method.
 </pre>
 ## Dependencies
-Currently, camelot works under Python 2.7.
+The dependencies include [tk](https://wiki.tcl.tk/3743) and [ghostscript](https://www.ghostscript.com/).
 The required dependencies include [numpy](http://www.numpy.org/), [OpenCV](http://opencv.org/) and [ImageMagick](http://www.imagemagick.org/script/index.php).
 ### Optional
 You'll need to install [Tesseract](https://github.com/tesseract-ocr/tesseract) if you want to extract tables from image based pdfs. Also, you'll need a tesseract language pack if your pdf isn't in english.
 ## Installation
@ -73,32 +37,32 @@ pip install -U pip setuptools
 ### Installing dependencies
-numpy can be install using `pip`. OpenCV and imagemagick can be installed using your system's default package manager.
+tk and ghostscript can be installed using your system's default package manager.
 #### Linux
 * Arch Linux
 <pre>
 sudo pacman -S opencv imagemagick
 </pre>
 * Ubuntu
 <pre>
-sudo apt-get install libopencv-dev python-opencv imagemagick
+sudo apt-get install python-opencv python-tk ghostscript
 </pre>
 * Arch Linux
 <pre>
 sudo pacman -S opencv tk ghostscript
 </pre>
 #### OS X
 <pre>
-brew install homebrew/science/opencv imagemagick
+brew install homebrew/science/opencv ghostscript
 </pre>
 Finally, `cd` into the project directory and install by
 <pre>
-make install
+python setup.py install
 </pre>
 ## Development
@ -113,12 +77,12 @@ git clone https://github.com/socialcopsdev/camelot.git
 ### Contributing
-See [Contributing doc]().
+See [Contributing guidelines]().
 ### Testing
 <pre>
-make test
+python setup.py test
 </pre>
 ## License
--- a/camelot/__init__.py
+++ b/camelot/__init__.py
@ -1,3 +1,4 @@
-__version__ = '1.2.0'
+from .__version__ import __version__
-__all__ = ['pdf', 'lattice', 'stream', 'ocr']
+from .io import read_pdf
 from .plotting import plot_geometry
--- a/camelot/__version__.py
+++ b/camelot/__version__.py
@ -0,0 +1 @@
 __version__ = '0.1.0'
--- a/camelot/cell.py
+++ b/camelot/cell.py
@ -1,128 +0,0 @@
 class Cell:
    """Cell.
    Defines a cell object with coordinates relative to a left-bottom
    origin, which is also PDFMiner's coordinate space.
    Parameters
    ----------
    x1 : float
        x-coordinate of left-bottom point.
    y1 : float
        y-coordinate of left-bottom point.
    x2 : float
        x-coordinate of right-top point.
    y2 : float
        y-coordinate of right-top point.
    Attributes
    ----------
    lb : tuple
        Tuple representing left-bottom coordinates.
    lt : tuple
        Tuple representing left-top coordinates.
    rb : tuple
        Tuple representing right-bottom coordinates.
    rt : tuple
        Tuple representing right-top coordinates.
    bbox : tuple
        Tuple representing the cell's bounding box using the
        left-bottom and right-top coordinates.
    left : bool
        Whether or not cell is bounded on the left.
    right : bool
        Whether or not cell is bounded on the right.
    top : bool
        Whether or not cell is bounded on the top.
    bottom : bool
        Whether or not cell is bounded on the bottom.
    text_objects : list
        List of text objects assigned to cell.
    text : string
        Text assigned to cell.
    spanning_h : bool
        Whether or not cell spans/extends horizontally.
    spanning_v : bool
        Whether or not cell spans/extends vertically.
    """
    def __init__(self, x1, y1, x2, y2):
        self.x1 = x1
        self.y1 = y1
        self.x2 = x2
        self.y2 = y2
        self.lb = (x1, y1)
        self.lt = (x1, y2)
        self.rb = (x2, y1)
        self.rt = (x2, y2)
        self.bbox = (x1, y1, x2, y2)
        self.left = False
        self.right = False
        self.top = False
        self.bottom = False
        self.text_objects = []
        self.text = ''
        self.spanning_h = False
        self.spanning_v = False
        self.image = None
    def add_text(self, text):
        """Adds text to cell.
        Parameters
        ----------
        text : string
        """
        self.text = ''.join([self.text, text])
    def get_text(self):
        """Returns text assigned to cell.
        Returns
        -------
        text : string
        """
        return self.text
    def add_object(self, t_object):
        """Adds PDFMiner text object to cell.
        Parameters
        ----------
        t_object : object
        """
        self.text_objects.append(t_object)
    def get_objects(self):
        """Returns list of text objects assigned to cell.
        Returns
        -------
        text_objects : list
        """
        return self.text_objects
    def get_bounded_edges(self):
        """Returns the number of edges by which a cell is bounded.
        Returns
        -------
        bounded_edges : int
        """
        self.bounded_edges = self.top + self.bottom + self.left + self.right
        return self.bounded_edges
--- a/camelot/cli.py
+++ b/camelot/cli.py
@ -0,0 +1 @@
 import click
--- a/camelot/core.py
+++ b/camelot/core.py
@ -0,0 +1,491 @@
 import os
 import json
 import zipfile
 import tempfile
 import numpy as np
 import pandas as pd
 class Cell(object):
    """Defines a cell in a table with coordinates relative to a
    left-bottom origin. (pdf coordinate space)
    Parameters
    ----------
    x1 : float
        x-coordinate of left-bottom point.
    y1 : float
        y-coordinate of left-bottom point.
    x2 : float
        x-coordinate of right-top point.
    y2 : float
        y-coordinate of right-top point.
    Attributes
    ----------
    lb : tuple
        Tuple representing left-bottom coordinates.
    lt : tuple
        Tuple representing left-top coordinates.
    rb : tuple
        Tuple representing right-bottom coordinates.
    rt : tuple
        Tuple representing right-top coordinates.
    left : bool
        Whether or not cell is bounded on the left.
    right : bool
        Whether or not cell is bounded on the right.
    top : bool
        Whether or not cell is bounded on the top.
    bottom : bool
        Whether or not cell is bounded on the bottom.
    hspan : bool
        Whether or not cell spans horizontally.
    vspan : bool
        Whether or not cell spans vertically.
    text : string
        Text assigned to cell.
    bound
    """
    def __init__(self, x1, y1, x2, y2):
        self.x1 = x1
        self.y1 = y1
        self.x2 = x2
        self.y2 = y2
        self.lb = (x1, y1)
        self.lt = (x1, y2)
        self.rb = (x2, y1)
        self.rt = (x2, y2)
        self.left = False
        self.right = False
        self.top = False
        self.bottom = False
        self.hspan = False
        self.vspan = False
        self._text = ''
    def __repr__(self):
        return '<Cell x1={} y1={} x2={} y2={}>'.format(
            self.x1, self.y1, self.x2, self.y2)
    @property
    def text(self):
        return self._text
    @text.setter
    def text(self, t):
        self._text = ''.join([self._text, t])
    @property
    def bound(self):
        """The number of sides on which the cell is bounded.
        """
        return self.top + self.bottom + self.left + self.right
 class Table(object):
    """Defines a table with coordinates relative to a left-bottom
    origin. (pdf coordinate space)
    Parameters
    ----------
    cols : list
        List of tuples representing column x-coordinates in increasing
        order.
    rows : list
        List of tuples representing row y-coordinates in decreasing
        order.
    Attributes
    ----------
    df : object
        pandas.DataFrame
    shape : tuple
        Shape of the table.
    accuracy : float
        Accuracy with which text was assigned to the cell.
    whitespace : float
        Percentage of whitespace in the table.
    order : int
        Table number on pdf page.
    page : int
        Pdf page number.
    data
    parsing_report
    """
    def __init__(self, cols, rows):
        self.cols = cols
        self.rows = rows
        self.cells = [[Cell(c[0], r[1], c[1], r[0])
                       for c in cols] for r in rows]
        self.df = None
        self.shape = (0, 0)
        self.accuracy = 0
        self.whitespace = 0
        self.order = None
        self.page = None
    def __repr__(self):
        return '<{} shape={}>'.format(self.__class__.__name__, self.shape)
    @property
    def data(self):
        """Returns two-dimensional list of strings in table.
        """
        d = []
        for row in self.cells:
            d.append([cell.text.strip() for cell in row])
        return d
    @property
    def parsing_report(self):
        """Returns a parsing report with accuracy, %whitespace,
        table number on page and page number.
        """
        # pretty?
        report = {
            'accuracy': self.accuracy,
            'whitespace': self.whitespace,
            'order': self.order,
            'page': self.page
        }
        return report
    def set_all_edges(self):
        """Sets all table edges to True.
        """
        for row in self.cells:
            for cell in row:
                cell.left = cell.right = cell.top = cell.bottom = True
        return self
    def set_edges(self, vertical, horizontal, joint_close_tol=2):
        """Sets a cell's edges to True depending on whether the cell's
        coordinates overlap with the line's coordinates within a
        tolerance.
        Parameters
        ----------
        vertical : list
            List of detected vertical lines.
        horizontal : list
            List of detected horizontal lines.
        """
        for v in vertical:
            # find closest x coord
            # iterate over y coords and find closest start and end points
            i = [i for i, t in enumerate(self.cols)
                 if np.isclose(v[0], t[0], atol=joint_close_tol)]
            j = [j for j, t in enumerate(self.rows)
                 if np.isclose(v[3], t[0], atol=joint_close_tol)]
            k = [k for k, t in enumerate(self.rows)
                 if np.isclose(v[1], t[0], atol=joint_close_tol)]
            if not j:
                continue
            J = j[0]
            if i == [0]:  # only left edge
                L = i[0]
                if k:
                    K = k[0]
                    while J < K:
                        self.cells[J][L].left = True
                        J += 1
                else:
                    K = len(self.rows)
                    while J < K:
                        self.cells[J][L].left = True
                        J += 1
            elif i == []:  # only right edge
                L = len(self.cols) - 1
                if k:
                    K = k[0]
                    while J < K:
                        self.cells[J][L].right = True
                        J += 1
                else:
                    K = len(self.rows)
                    while J < K:
                        self.cells[J][L].right = True
                        J += 1
            else:  # both left and right edges
                L = i[0]
                if k:
                    K = k[0]
                    while J < K:
                        self.cells[J][L].left = True
                        self.cells[J][L - 1].right = True
                        J += 1
                else:
                    K = len(self.rows)
                    while J < K:
                        self.cells[J][L].left = True
                        self.cells[J][L - 1].right = True
                        J += 1
        for h in horizontal:
            # find closest y coord
            # iterate over x coords and find closest start and end points
            i = [i for i, t in enumerate(self.rows)
                 if np.isclose(h[1], t[0], atol=joint_close_tol)]
            j = [j for j, t in enumerate(self.cols)
                 if np.isclose(h[0], t[0], atol=joint_close_tol)]
            k = [k for k, t in enumerate(self.cols)
                 if np.isclose(h[2], t[0], atol=joint_close_tol)]
            if not j:
                continue
            J = j[0]
            if i == [0]:  # only top edge
                L = i[0]
                if k:
                    K = k[0]
                    while J < K:
                        self.cells[L][J].top = True
                        J += 1
                else:
                    K = len(self.cols)
                    while J < K:
                        self.cells[L][J].top = True
                        J += 1
            elif i == []:  # only bottom edge
                L = len(self.rows) - 1
                if k:
                    K = k[0]
                    while J < K:
                        self.cells[L][J].bottom = True
                        J += 1
                else:
                    K = len(self.cols)
                    while J < K:
                        self.cells[L][J].bottom = True
                        J += 1
            else:  # both top and bottom edges
                L = i[0]
                if k:
                    K = k[0]
                    while J < K:
                        self.cells[L][J].top = True
                        self.cells[L - 1][J].bottom = True
                        J += 1
                else:
                    K = len(self.cols)
                    while J < K:
                        self.cells[L][J].top = True
                        self.cells[L - 1][J].bottom = True
                        J += 1
        return self
    def set_border(self):
        """Sets table border edges to True.
        """
        for r in range(len(self.rows)):
            self.cells[r][0].left = True
            self.cells[r][len(self.cols) - 1].right = True
        for c in range(len(self.cols)):
            self.cells[0][c].top = True
            self.cells[len(self.rows) - 1][c].bottom = True
        return self
    def set_span(self):
        """Sets a cell's hspan or vspan attribute to True depending
        on whether the cell spans horizontally or vertically.
        """
        for row in self.cells:
            for cell in row:
                left = cell.left
                right = cell.right
                top = cell.top
                bottom = cell.bottom
                if cell.bound == 4:
                    continue
                elif cell.bound == 3:
                    if not left and (right and top and bottom):
                        cell.hspan = True
                    elif not right and (left and top and bottom):
                        cell.hspan = True
                    elif not top and (left and right and bottom):
                        cell.vspan = True
                    elif not bottom and (left and right and top):
                        cell.vspan = True
                elif cell.bound == 2:
                    if left and right and (not top and not bottom):
                        cell.vspan = True
                    elif top and bottom and (not left and not right):
                        cell.hspan = True
        return self
    def to_csv(self, path, **kwargs):
        """Write Table to a comma-separated values (csv) file.
        """
        kw = {
            'encoding': 'utf-8',
            'index': False,
            'quoting': 1
        }
        kw.update(kwargs)
        self.df.to_csv(path, **kw)
    def to_json(self, path, **kwargs):
        """Write Table to a JSON file.
        """
        kw = {
            'orient': 'records'
        }
        kw.update(kwargs)
        json_string = self.df.to_json(**kw)
        with open(path, 'w') as f:
            f.write(json_string)
    def to_excel(self, path, **kwargs):
        """Write Table to an Excel file.
        """
        kw = {
            'sheet_name': 'page-{}-table-{}'.format(self.page, self.order),
            'encoding': 'utf-8'
        }
        kw.update(kwargs)
        writer = pd.ExcelWriter(path)
        self.df.to_excel(writer, **kw)
        writer.save()
    def to_html(self, path, **kwargs):
        """Write Table to an HTML file.
        """
        html_string = self.df.to_html(**kwargs)
        with open(path, 'w') as f:
            f.write(html_string)
 class TableList(object):
    """Defines a list of camelot.core.Table objects. Each table can
    be accessed using its index.
    Attributes
    ----------
    n : int
        Number of tables in the list.
    """
    def __init__(self, tables):
        self._tables = tables
    def __repr__(self):
        return '<{} tables={}>'.format(
            self.__class__.__name__, len(self._tables))
    def __len__(self):
        return len(self._tables)
    def __getitem__(self, idx):
        return self._tables[idx]
    @staticmethod
    def _format_func(table, f):
        return getattr(table, 'to_{}'.format(f))
    @property
    def n(self):
        return len(self._tables)
    def _write_file(self, f=None, **kwargs):
        dirname = kwargs.get('dirname')
        root = kwargs.get('root')
        ext = kwargs.get('ext')
        for table in self._tables:
            filename = os.path.join('{}-page-{}-table-{}{}'.format(
                                    root, table.page, table.order, ext))
            filepath = os.path.join(dirname, filename)
            to_format = self._format_func(table, f)
            to_format(filepath)
    def _compress_dir(self, **kwargs):
        path = kwargs.get('path')
        dirname = kwargs.get('dirname')
        root = kwargs.get('root')
        ext = kwargs.get('ext')
        zipname = os.path.join(os.path.dirname(path), root) + '.zip'
        with zipfile.ZipFile(zipname, 'w', allowZip64=True) as z:
            for table in self._tables:
                filename = os.path.join('{}-page-{}-table-{}{}'.format(
                                        root, table.page, table.order, ext))
                filepath = os.path.join(dirname, filename)
                z.write(filepath, os.path.basename(filepath))
    def export(self, path, f='csv', compress=False):
        """Exports the list of tables to specified file format.
        Parameters
        ----------
        path : str
            Filepath
        f : str
            File format. Can be csv, json, excel and html.
        compress : bool
            Whether or not to add files to a ZIP archive.
        """
        dirname = os.path.dirname(path)
        basename = os.path.basename(path)
        root, ext = os.path.splitext(basename)
        if compress:
            dirname = tempfile.mkdtemp()
        kwargs = {
            'path': path,
            'dirname': dirname,
            'root': root,
            'ext': ext
        }
        if f in ['csv', 'json', 'html']:
            self._write_file(f=f, **kwargs)
            if compress:
                self._compress_dir(**kwargs)
        elif f == 'excel':
            filepath = os.path.join(dirname, basename)
            writer = pd.ExcelWriter(filepath)
            for table in self._tables:
                sheet_name = 'page-{}-table-{}'.format(table.page, table.order)
                table.df.to_excel(writer, sheet_name=sheet_name, encoding='utf-8')
            writer.save()
            if compress:
                zipname = os.path.join(os.path.dirname(path), root) + '.zip'
                with zipfile.ZipFile(zipname, 'w', allowZip64=True) as z:
                    z.write(filepath, os.path.basename(filepath))
 class Geometry(object):
    def __init__(self):
        self.text = []
        self.images = ()
        self.segments = ()
        self.tables = []
    def __repr__(self):
        return '<{} text={} images={} segments={} tables={}>'.format(
            self.__class__.__name__,
            len(self.text),
            len(self.images),
            len(self.segments),
            len(self.tables))
 class GeometryList(object):
    def __init__(self, geometry):
        self.text = [g.text for g in geometry]
        self.images = [g.images for g in geometry]
        self.segments = [g.segments for g in geometry]
        self.tables = [g.tables for g in geometry]
    def __repr__(self):
        return '<{} text={} images={} segments={} tables={}>'.format(
            self.__class__.__name__,
            len(self.text),
            len(self.images),
            len(self.segments),
            len(self.tables))
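
Note on `Table.set_span` in `camelot/core.py` above: its rules reduce to a small truth table over the four edge flags. A minimal standalone sketch of that table follows. The `classify_span` helper is hypothetical and not part of this diff; it only mirrors the branches shown in `set_span`.

```python
def classify_span(left, right, top, bottom):
    """Return (hspan, vspan) for a cell, given which of its sides are bounded.

    Hypothetical helper mirroring the branches of Table.set_span in the diff.
    """
    bound = left + right + top + bottom  # bools sum as ints, like Cell.bound
    hspan = vspan = False
    if bound == 3:
        # Exactly one side is open: an open left/right side means the cell
        # extends horizontally, an open top/bottom side means vertically.
        if not left or not right:
            hspan = True
        else:
            vspan = True
    elif bound == 2:
        # Only the two "tunnel" shapes count; corner shapes are ignored.
        if left and right:
            vspan = True
        elif top and bottom:
            hspan = True
    return hspan, vspan


print(classify_span(left=False, right=True, top=True, bottom=True))
print(classify_span(left=True, right=True, top=False, bottom=False))
```

Cells bounded on all four sides, or on two adjacent sides only, come back as `(False, False)`, which matches the diff leaving `hspan` and `vspan` untouched in those cases.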
--- a/camelot/handlers.py
+++ b/camelot/handlers.py
@ -0,0 +1,144 @@
 import os
 import tempfile
 from PyPDF2 import PdfFileReader, PdfFileWriter
 from .core import TableList, GeometryList
 from .parsers import Stream, Lattice
 from .utils import get_page_layout, get_text_objects, get_rotation
 class PDFHandler(object):
    """Handles all operations like temp directory creation, splitting
    file into single page pdfs, parsing each pdf and then removing the
    temp directory.
    Parameters
    ----------
    filename : str
        Path to pdf file.
    pages : str
        Comma-separated page numbers to parse.
        Example: 1,3,4 or 1,4-end
    """
    def __init__(self, filename, pages='1'):
        self.filename = filename
        if not self.filename.endswith('.pdf'):
            raise TypeError("File format not supported.")
        self.pages = self._get_pages(self.filename, pages)
        self.tempdir = tempfile.mkdtemp()
    def _get_pages(self, filename, pages):
        """Converts pages string to list of ints.
        Parameters
        ----------
        filename : str
            Path to pdf file.
        pages : str
            Comma-separated page numbers to parse.
            Example: 1,3,4 or 1,4-end
        Returns
        -------
        P : list
            List of int page numbers.
        """
        page_numbers = []
        if pages == '1':
            page_numbers.append({'start': 1, 'end': 1})
        else:
            infile = PdfFileReader(open(filename, 'rb'), strict=False)
            if pages == 'all':
                page_numbers.append({'start': 1, 'end': infile.getNumPages()})
            else:
                for r in pages.split(','):
                    if '-' in r:
                        a, b = r.split('-')
                        if b == 'end':
                            b = infile.getNumPages()
                        page_numbers.append({'start': int(a), 'end': int(b)})
                    else:
                        page_numbers.append({'start': int(r), 'end': int(r)})
        P = []
        for p in page_numbers:
            P.extend(range(p['start'], p['end'] + 1))
        return sorted(set(P))
    def _save_page(self, filename, page, temp):
        """Saves specified page from pdf into a temporary directory.
        Parameters
        ----------
        filename : str
            Path to pdf file.
        page : int
            Page number
        temp : str
            Tmp directory
        """
        with open(filename, 'rb') as fileobj:
            infile = PdfFileReader(fileobj, strict=False)
            fpath = os.path.join(temp, 'page-{0}.pdf'.format(page))
            froot, fext = os.path.splitext(fpath)
            p = infile.getPage(page - 1)
            outfile = PdfFileWriter()
            outfile.addPage(p)
            with open(fpath, 'wb') as f:
                outfile.write(f)
            layout, dim = get_page_layout(fpath)
            # fix rotated pdf
            lttextlh = get_text_objects(layout, ltype="lh")
            lttextlv = get_text_objects(layout, ltype="lv")
            ltchar = get_text_objects(layout, ltype="char")
            rotation = get_rotation(lttextlh, lttextlv, ltchar)
            if rotation != '':
                fpath_new = ''.join([froot.replace('page', 'p'), '_rotated', fext])
                os.rename(fpath, fpath_new)
                infile = PdfFileReader(open(fpath_new, 'rb'), strict=False)
                outfile = PdfFileWriter()
                p = infile.getPage(0)
                if rotation == 'anticlockwise':
                    p.rotateClockwise(90)
                elif rotation == 'clockwise':
                    p.rotateCounterClockwise(90)
                outfile.addPage(p)
                with open(fpath, 'wb') as f:
                    outfile.write(f)
    def parse(self, mesh=False, **kwargs):
        """Extracts tables by calling parser.get_tables on all single
        page pdfs.
        Parameters
        ----------
        mesh : bool (default: False)
            Whether or not to use Lattice method of parsing. Stream
            is used by default.
        kwargs : dict
            See camelot.read_pdf kwargs.
        Returns
        -------
        tables : camelot.core.TableList
            List of tables found in pdf.
        geometry : camelot.core.GeometryList
            List of geometry objects (contours, lines, joints)
            found in pdf.
        """
        for p in self.pages:
            self._save_page(self.filename, p, self.tempdir)
        pages = [os.path.join(self.tempdir, 'page-{0}.pdf'.format(p))
                 for p in self.pages]
        tables = []
        geometry = []
        parser = Stream(**kwargs) if not mesh else Lattice(**kwargs)
        for p in pages:
            t, g = parser.extract_tables(p)
            tables.extend(t)
            geometry.append(g)
        return TableList(tables), GeometryList(geometry)
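
Note on `PDFHandler._get_pages` in `camelot/handlers.py` above: the page-spec expansion can be exercised without a PDF. This is a minimal sketch, assuming the page count is passed in directly instead of being read through PyPDF2. The `expand_pages` name and its `num_pages` parameter are hypothetical and not part of this diff.

```python
def expand_pages(spec, num_pages):
    """Expand a page spec such as '1,3-5,7-end' into a sorted list of ints.

    Hypothetical stand-in for PDFHandler._get_pages; `num_pages` replaces
    the PdfFileReader.getNumPages() lookup used in the diff.
    """
    if spec == 'all':
        return list(range(1, num_pages + 1))
    pages = set()
    for part in spec.split(','):
        if '-' in part:
            start, end = part.split('-')
            # 'end' is shorthand for the last page of the document
            end = num_pages if end == 'end' else int(end)
            pages.update(range(int(start), end + 1))
        else:
            pages.add(int(part))
    # duplicates collapse and order is normalised, as in sorted(set(P))
    return sorted(pages)


print(expand_pages('1,3-5,7-end', num_pages=9))  # [1, 3, 4, 5, 7, 8, 9]
```

Like the code in the diff, the sketch does not validate that page numbers fall within the document; out-of-range values would only surface later when the page is read.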
--- a/camelot/image_processing.py
+++ b/camelot/image_processing.py
@ -1,3 +1,4 @@
 from __future__ import division
 from itertools import groupby
 from operator import itemgetter
@ -7,40 +8,38 @@ import numpy as np
 from .utils import merge_tuples
-def adaptive_threshold(imagename, invert=False, blocksize=15, c=-2):
+def adaptive_threshold(imagename, process_background=False, blocksize=15, c=-2):
    """Thresholds an image using OpenCV's adaptiveThreshold.
    Parameters
    ----------
    imagename : string
        Path to image file.
-
+    process_background : bool, optional (default: False)
-    invert : bool
+        Whether or not to process lines that are in background.
-        Whether or not to invert the image. Useful when pdfs have
+    blocksize : int, optional (default: 15)
        tables with lines in background.
        (optional, default: False)
    blocksize: int
        Size of a pixel neighborhood that is used to calculate a
        threshold value for the pixel: 3, 5, 7, and so on.
-    c: float
+        For more information, refer to `OpenCV's adaptiveThreshold <https://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold>`_.
-        Constant subtracted from the mean or weighted mean
+    c : int, optional (default: -2)
-        (see the details below). Normally, it is positive but may be
+        Constant subtracted from the mean or weighted mean.
-        zero or negative as well.
+        Normally, it is positive but may be zero or negative as well.
        For more information, refer to `OpenCV's adaptiveThreshold <https://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold>`_.
    Returns
    -------
    img : object
        numpy.ndarray representing the original image.
    threshold : object
        numpy.ndarray representing the thresholded image.
    """
    img = cv2.imread(imagename)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
-    if invert:
+    if process_background:
        threshold = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY, blocksize, c)
    else:
@ -49,7 +48,7 @@ def adaptive_threshold(imagename, invert=False, blocksize=15, c=-2):
    return img, threshold
-def find_lines(threshold, direction='horizontal', scale=15, iterations=0):
+def find_lines(threshold, direction='horizontal', line_size_scaling=15, iterations=0):
    """Finds horizontal and vertical lines by applying morphological
    transformations on an image.
@ -57,38 +56,37 @@ def find_lines(threshold, direction='horizontal', scale=15, iterations=0):
    ----------
    threshold : object
        numpy.ndarray representing the thresholded image.
-
+    direction : string, optional (default: 'horizontal')
    direction : string
        Specifies whether to find vertical or horizontal lines.
-        (default: 'horizontal')
+    line_size_scaling : int, optional (default: 15)
        Factor by which the page dimensions will be divided to get
        smallest length of lines that should be detected.
-    scale : int
-        Used to divide the height/width to get a structuring element
-        for morph transform.
-        (optional, default: 15)
-    iterations : int
+        The larger this value, the smaller the detected lines. Making it
+        too large will lead to text being detected as lines.
+    iterations : int, optional (default: 0)
+        Number of times erosion/dilation is applied.
+        For more information, refer to `OpenCV's dilate <https://docs.opencv.org/2.4/modules/imgproc/doc/filtering.html#dilate>`_.
        Number of iterations for dilation.
        (optional, default: 2)
    Returns
    -------
    dmask : object
        numpy.ndarray representing pixels where vertical/horizontal
        lines lie.
    lines : list
        List of tuples representing vertical/horizontal lines with
        coordinates relative to a left-top origin in
-        OpenCV's coordinate space.
+        image coordinate space.
    """
    lines = []
    if direction == 'vertical':
-        size = threshold.shape[0] // scale
+        size = threshold.shape[0] // line_size_scaling
        el = cv2.getStructuringElement(cv2.MORPH_RECT, (1, size))
    elif direction == 'horizontal':
-        size = threshold.shape[1] // scale
+        size = threshold.shape[1] // line_size_scaling
        el = cv2.getStructuringElement(cv2.MORPH_RECT, (size, 1))
    elif direction is None:
        raise ValueError("Specify direction as either 'vertical' or"
@ -110,9 +108,9 @@ def find_lines(threshold, direction='horizontal', scale=15, iterations=0):
        x1, x2 = x, x + w
        y1, y2 = y, y + h
        if direction == 'vertical':
-            lines.append(((x1 + x2) / 2, y2, (x1 + x2) / 2, y1))
+            lines.append(((x1 + x2) // 2, y2, (x1 + x2) // 2, y1))
        elif direction == 'horizontal':
-            lines.append((x1, (y1 + y2) / 2, x2, (y1 + y2) / 2))
+            lines.append((x1, (y1 + y2) // 2, x2, (y1 + y2) // 2))
    return dmask, lines
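As a rough illustration of how `line_size_scaling` controls the smallest detectable line in `find_lines`, the sketch below reproduces only the `shape // line_size_scaling` arithmetic from the hunk above. `min_line_length` is a hypothetical helper (not part of Camelot), the page size is an assumed example, and no OpenCV call is made.

```python
def min_line_length(image_shape, direction, line_size_scaling=15):
    """Return the structuring-element length used to detect lines.

    image_shape is (height, width), as in numpy.ndarray.shape.
    Hypothetical helper that mirrors the arithmetic in find_lines.
    """
    if direction == 'vertical':
        return image_shape[0] // line_size_scaling
    elif direction == 'horizontal':
        return image_shape[1] // line_size_scaling
    raise ValueError("direction must be 'vertical' or 'horizontal'")


# Assumed example: a US Letter page (8.5in x 11in) rendered at 600 dpi.
shape = (6600, 5100)

# A larger scaling factor gives a shorter minimum line length, so short
# strokes (for example, parts of characters) start to qualify as lines.
lengths = {s: min_line_length(shape, 'horizontal', s) for s in (15, 40, 200)}
```

With these assumed dimensions, the default of 15 only keeps horizontal runs at least 340 pixels long, while 200 would accept runs of 25 pixels.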
@ -124,7 +122,6 @@ def find_table_contours(vertical, horizontal):
    ----------
    vertical : object
        numpy.ndarray representing pixels where vertical lines lie.
    horizontal : object
        numpy.ndarray representing pixels where horizontal lines lie.
@ -133,7 +130,8 @@ def find_table_contours(vertical, horizontal):
    cont : list
        List of tuples representing table boundaries. Each tuple is of
        the form (x, y, w, h) where (x, y) -> left-top, w -> width and
-        h -> height in OpenCV's coordinate space.
+        h -> height in image coordinate space.
    """
    mask = vertical + horizontal
@ -161,11 +159,9 @@ def find_table_joints(contours, vertical, horizontal):
    contours : list
        List of tuples representing table boundaries. Each tuple is of
        the form (x, y, w, h) where (x, y) -> left-top, w -> width and
-        h -> height in OpenCV's coordinate space.
+        h -> height in image coordinate space.
    vertical : object
        numpy.ndarray representing pixels where vertical lines lie.
    horizontal : object
        numpy.ndarray representing pixels where horizontal lines lie.
@ -174,9 +170,9 @@ def find_table_joints(contours, vertical, horizontal):
    tables : dict
        Dict with table boundaries as keys and list of intersections
        in that boundary as their value.
        Keys are of the form (x1, y1, x2, y2) where (x1, y1) -> lb
-        and (x2, y2) -> rt in OpenCV's coordinate space.
+        and (x2, y2) -> rt in image coordinate space.
    """
    joints = np.bitwise_and(vertical, horizontal)
    tables = {}
@ -194,32 +190,35 @@ def find_table_joints(contours, vertical, horizontal):
        joint_coords = []
        for j in jc:
            jx, jy, jw, jh = cv2.boundingRect(j)
-            c1, c2 = x + (2 * jx + jw) / 2, y + (2 * jy + jh) / 2
+            c1, c2 = x + (2 * jx + jw) // 2, y + (2 * jy + jh) // 2
            joint_coords.append((c1, c2))
        tables[(x, y + h, x + w, y)] = joint_coords
    return tables
-def remove_lines(threshold, line_scale=15):
+def remove_lines(threshold, line_size_scaling=15):
    """Removes lines from a thresholded image.
    Parameters
    ----------
    threshold : object
        numpy.ndarray representing the thresholded image.
    line_size_scaling : int, optional (default: 15)
        Factor by which the page dimensions will be divided to get
        smallest length of lines that should be detected.
-    line_scale : int
-        Line scaling factor.
+        The larger this value, the smaller the detected lines. Making it
+        too large will lead to text being detected as lines.
        (optional, default: 15)
    Returns
    -------
    threshold : object
        numpy.ndarray representing the thresholded image
        with horizontal and vertical lines removed.
    """
-    size = threshold.shape[0] // line_scale
+    size = threshold.shape[0] // line_size_scaling
    vertical_erode_el = cv2.getStructuringElement(cv2.MORPH_RECT, (1, size))
    horizontal_erode_el = cv2.getStructuringElement(cv2.MORPH_RECT, (size, 1))
    dilate_el = cv2.getStructuringElement(cv2.MORPH_RECT, (10, 10))
@ -235,24 +234,26 @@ def remove_lines(threshold, line_scale=15):
    return threshold
-def find_cuts(threshold, char_scale=200):
+def find_cuts(threshold, char_size_scaling=200):
    """Finds cuts made by text projections on y-axis.
    Parameters
    ----------
    threshold : object
        numpy.ndarray representing the thresholded image.
    char_size_scaling : int, optional (default: 200)
        Factor by which the page dimensions will be divided to get
        smallest length of lines that should be detected.
-    char_scale : int
-        Char scaling factor.
+        The larger this value, the smaller the detected lines. Making it
+        too large will lead to text being detected as lines.
        (optional, default: 200)
    Returns
    -------
    y_cuts : list
        List of cuts on y-axis.
    """
-    size = threshold.shape[0] // char_scale
+    size = threshold.shape[0] // char_size_scaling
    char_el = cv2.getStructuringElement(cv2.MORPH_RECT, (1, size))
    threshold = cv2.erode(threshold, char_el)
@ -268,5 +269,5 @@ def find_cuts(threshold, char_scale=200):
    contours = [cv2.boundingRect(c) for c in contours]
    y_cuts = [(c[1], c[1] + c[3]) for c in contours]
    y_cuts = list(merge_tuples(sorted(y_cuts)))
-    y_cuts = [(y_cuts[i][0] + y_cuts[i - 1][1]) / 2 for i in range(1, len(y_cuts))]
+    y_cuts = [(y_cuts[i][0] + y_cuts[i - 1][1]) // 2 for i in range(1, len(y_cuts))]
    return sorted(y_cuts, reverse=True)
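The switch from `/` to `//` in `find_cuts` matters because, under Python 3 or `from __future__ import division`, `/` returns floats, and cut positions are pixel coordinates. The sketch below isolates that midpoint step; `cut_midpoints` and the sample spans are hypothetical and stand in for the merged bounding-box extents computed in the function.

```python
def cut_midpoints(spans):
    """Return integer midpoints of the gaps between consecutive spans.

    spans is a list of merged (start, end) tuples sorted by start.
    The result is sorted largest first, as in find_cuts.
    """
    cuts = [(spans[i][0] + spans[i - 1][1]) // 2
            for i in range(1, len(spans))]
    return sorted(cuts, reverse=True)


# Assumed sample data: three text bands along the y-axis, in pixels.
spans = [(10, 40), (55, 90), (101, 130)]
cuts = cut_midpoints(spans)
```

Floor division keeps every cut an `int`, so the values can be used directly as array indices.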
--- a/camelot/io.py
+++ b/camelot/io.py
@ -0,0 +1,94 @@
 from .handlers import PDFHandler
 def read_pdf(filepath, pages='1', mesh=False, **kwargs):
    """Read PDF and return parsed data tables.
    Note: kwargs annotated with ^ can only be used with mesh=False
    and kwargs annotated with * can only be used with mesh=True.
    Parameters
    ----------
    filepath : str
        Path to pdf file.
    pages : str
        Comma-separated page numbers to parse.
        Example: 1,3,4 or 1,4-end
    mesh : bool (default: False)
        Whether or not to use Lattice method of parsing. Stream
        is used by default.
    table_area : list, optional (default: None)
        List of table areas to analyze as strings of the form
        x1,y1,x2,y2 where (x1, y1) -> left-top and
        (x2, y2) -> right-bottom in pdf coordinate space.
    columns^ : list, optional (default: None)
        List of column x-coordinates as strings where the coordinates
        are comma-separated.
    split_text : bool, optional (default: False)
        Whether or not to split a text line if it spans across
        multiple cells.
    flag_size : bool, optional (default: False)
        Whether or not to highlight a substring using <s></s>
        if its size is different from rest of the string, useful for
        super and subscripts.
    row_close_tol^ : int, optional (default: 2)
        Rows will be formed by combining text vertically
        within this tolerance.
    col_close_tol^ : int, optional (default: 0)
        Columns will be formed by combining text horizontally
        within this tolerance.
    process_background* : bool, optional (default: False)
        Whether or not to process lines that are in background.
    line_size_scaling* : int, optional (default: 15)
        Factor by which the page dimensions will be divided to get
        smallest length of lines that should be detected.
        The larger this value, the smaller the detected lines. Making it
        too large will lead to text being detected as lines.
    copy_text* : list, optional (default: None)
        {'h', 'v'}
        Select one or more strings from above and pass them as a list
        to specify the direction in which text should be copied over
        when a cell spans multiple rows or columns.
    shift_text* : list, optional (default: ['l', 't'])
        {'l', 'r', 't', 'b'}
        Select one or more strings from above and pass them as a list
        to specify where the text in a spanning cell should flow.
    line_close_tol* : int, optional (default: 2)
        Tolerance parameter used to merge vertical and horizontal
        detected lines which lie close to each other.
    joint_close_tol* : int, optional (default: 2)
        Tolerance parameter used to decide whether the detected lines
        and points lie close to each other.
    threshold_blocksize : int, optional (default: 15)
        Size of a pixel neighborhood that is used to calculate a
        threshold value for the pixel: 3, 5, 7, and so on.
        For more information, refer to `OpenCV's adaptiveThreshold <https://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold>`_.
    threshold_constant : int, optional (default: -2)
        Constant subtracted from the mean or weighted mean.
        Normally, it is positive but may be zero or negative as well.
        For more information, refer to `OpenCV's adaptiveThreshold <https://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold>`_.
    iterations : int, optional (default: 0)
        Number of times erosion/dilation is applied.
        For more information, refer to `OpenCV's dilate <https://docs.opencv.org/2.4/modules/imgproc/doc/filtering.html#dilate>`_.
    margins : tuple
        PDFMiner margins. (char_margin, line_margin, word_margin)
        For more information, refer to `PDFMiner docs <https://euske.github.io/pdfminer/>`_.
    debug : bool, optional (default: False)
        Whether or not to return all text objects on the page
        which can be used to generate a matplotlib plot, to get
        values for table_area(s) and debugging.
    Returns
    -------
    tables : camelot.core.TableList
    """
    # validate kwargs?
    p = PDFHandler(filepath, pages)
    tables, __ = p.parse(mesh=mesh, **kwargs)
    return tables
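The `# validate kwargs?` comment in `read_pdf` leaves open how the `^` (Stream-only) and `*` (Lattice-only) annotations would be enforced. One possible shape for such a check is sketched below. `validate_kwargs` and the two constant names are hypothetical and do not exist in this commit; only the keyword names are taken from the docstring above.

```python
# Keyword names copied from the read_pdf docstring annotations.
STREAM_ONLY = {'columns', 'row_close_tol', 'col_close_tol'}  # marked ^
LATTICE_ONLY = {'process_background', 'line_size_scaling', 'copy_text',
                'shift_text', 'line_close_tol', 'joint_close_tol'}  # marked *


def validate_kwargs(mesh, **kwargs):
    """Raise ValueError if a kwarg does not apply to the chosen method.

    Hypothetical helper: mesh=True selects Lattice, mesh=False Stream.
    """
    disallowed = STREAM_ONLY if mesh else LATTICE_ONLY
    bad = sorted(disallowed.intersection(kwargs))
    if bad:
        raise ValueError("{0} cannot be used with mesh={1}".format(
            ', '.join(bad), mesh))
    return kwargs
```

For example, `validate_kwargs(True, line_size_scaling=40)` passes the kwargs through, while `validate_kwargs(False, copy_text=['v'])` raises `ValueError`.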
--- a/camelot/lattice.py
+++ b/camelot/lattice.py
@ -1,382 +0,0 @@
 from __future__ import division
 import os
 import sys
 import copy
 import types
 import logging
 import copy_reg
 import warnings
 import subprocess
 from .imgproc import (adaptive_threshold, find_lines, find_table_contours,
                      find_table_joints)
 from .table import Table
 from .utils import (scale_to_pdf, scale_to_image, segments_bbox, text_in_bbox,
                    merge_close_values, get_table_index, get_score, count_empty,
                    encode_list, get_text_objects, get_page_layout)
 __all__ = ['Lattice']
 logger = logging.getLogger('app_logger')
 def _reduce_method(m):
    if m.im_self is None:
        return getattr, (m.im_class, m.im_func.func_name)
    else:
        return getattr, (m.im_self, m.im_func.func_name)
 copy_reg.pickle(types.MethodType, _reduce_method)
 def _reduce_index(t, idx, shift_text):
    """Reduces index of a text object if it lies within a spanning
    cell.
    Parameters
    ----------
    t : object
        camelot.table.Table
    idx : list
        List of tuples of the form (r_idx, c_idx, text).
    shift_text : list
        {'l', 'r', 't', 'b'}
        Select one or more from above and pass them as a list to
        specify where the text in a spanning cell should flow.
    Returns
    -------
    indices : list
        List of tuples of the form (r_idx, c_idx, text) where r_idx and
        c_idx are the reduced row/column indices and text is an
        lttextline substring.
    """
    indices = []
    for r_idx, c_idx, text in idx:
        for d in shift_text:
            if d == 'l':
                if t.cells[r_idx][c_idx].spanning_h:
                    while not t.cells[r_idx][c_idx].left:
                        c_idx -= 1
            if d == 'r':
                if t.cells[r_idx][c_idx].spanning_h:
                    while not t.cells[r_idx][c_idx].right:
                        c_idx += 1
            if d == 't':
                if t.cells[r_idx][c_idx].spanning_v:
                    while not t.cells[r_idx][c_idx].top:
                        r_idx -= 1
            if d == 'b':
                if t.cells[r_idx][c_idx].spanning_v:
                    while not t.cells[r_idx][c_idx].bottom:
                        r_idx += 1
        indices.append((r_idx, c_idx, text))
    return indices
 def _fill_spanning(t, fill=None):
    """Fills spanning cells.
    Parameters
    ----------
    t : object
        camelot.table.Table
    fill : list
        {'h', 'v'}
        Specify to fill spanning cells in horizontal or vertical
        direction.
        (optional, default: None)
    Returns
    -------
    t : object
        camelot.table.Table
    """
    for f in fill:
        if f == "h":
            for i in range(len(t.cells)):
                for j in range(len(t.cells[i])):
                    if t.cells[i][j].get_text().strip() == '':
                        if t.cells[i][j].spanning_h and not t.cells[i][j].left:
                            t.cells[i][j].add_text(t.cells[i][j - 1].get_text())
        elif f == "v":
            for i in range(len(t.cells)):
                for j in range(len(t.cells[i])):
                    if t.cells[i][j].get_text().strip() == '':
                        if t.cells[i][j].spanning_v and not t.cells[i][j].top:
                            t.cells[i][j].add_text(t.cells[i - 1][j].get_text())
    return t
 class Lattice:
    """Lattice looks for lines in the pdf to form a table.
    If you want to give fill and mtol for each table when specifying
    multiple table areas, make sure that the length of fill and mtol
    is equal to the length of table_area. Mapping between them is based
    on index.
    Parameters
    ----------
    table_area : list
        List of strings of the form x1,y1,x2,y2 where
        (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDFMiner's
        coordinate space, denoting table areas to analyze.
        (optional, default: None)
    fill : list
        List of strings specifying directions to fill spanning cells.
        {'h', 'v'} to fill spanning cells in horizontal or vertical
        direction.
        (optional, default: None)
    mtol : list
        List of ints specifying m-tolerance parameters.
        (optional, default: [2])
    jtol : list
        List of ints specifying j-tolerance parameters.
        (optional, default: [2])
    blocksize : int
        Size of a pixel neighborhood that is used to calculate a
        threshold value for the pixel: 3, 5, 7, and so on.
        (optional, default: 15)
    threshold_constant : float
        Constant subtracted from the mean or weighted mean
        (see the details below). Normally, it is positive but may be
        zero or negative as well.
        (optional, default: -2)
    scale : int
        Used to divide the height/width of a pdf to get a structuring
        element for image processing.
        (optional, default: 15)
    iterations : int
        Number of iterations for dilation.
        (optional, default: 0)
    invert : bool
        Whether or not to invert the image. Useful when pdfs have
        tables with lines in background.
        (optional, default: False)
    margins : tuple
        PDFMiner margins. (char_margin, line_margin, word_margin)
        (optional, default: (1.0, 0.5, 0.1))
    split_text : bool
        Whether or not to split a text line if it spans across
        different cells.
        (optional, default: False)
    flag_size : bool
        Whether or not to highlight a substring using <s></s>
        if its size is different from rest of the string, useful for
        super and subscripts.
        (optional, default: True)
    shift_text : list
        {'l', 'r', 't', 'b'}
        Select one or more from above and pass them as a list to
        specify where the text in a spanning cell should flow.
        (optional, default: ['l', 't'])
    debug : string
        {'contour', 'line', 'joint', 'table'}
        Set to one of the above values to generate a matplotlib plot
        of detected contours, lines, joints and the table generated.
        (optional, default: None)
    """
    def __init__(self, table_area=None, fill=None, mtol=[2], jtol=[2],
                 blocksize=15, threshold_constant=-2, scale=15, iterations=0,
                 invert=False, margins=(1.0, 0.5, 0.1), split_text=False,
                 flag_size=True, shift_text=['l', 't'], debug=None):
        self.method = 'lattice'
        self.table_area = table_area
        self.fill = fill
        self.mtol = mtol
        self.jtol = jtol
        self.blocksize = blocksize
        self.threshold_constant = threshold_constant
        self.scale = scale
        self.iterations = iterations
        self.invert = invert
        self.char_margin, self.line_margin, self.word_margin = margins
        self.split_text = split_text
        self.flag_size = flag_size
        self.shift_text = shift_text
        self.debug = debug
    def get_tables(self, pdfname):
        """Expects a single page pdf as input with rotation corrected.
        Parameters
        ----------
        pdfname : string
            Path to single page pdf file.
        Returns
        -------
        page : dict
        """
        layout, dim = get_page_layout(pdfname, char_margin=self.char_margin,
            line_margin=self.line_margin, word_margin=self.word_margin)
        lttextlh = get_text_objects(layout, ltype="lh")
        lttextlv = get_text_objects(layout, ltype="lv")
        ltchar = get_text_objects(layout, ltype="char")
        width, height = dim
        bname, __ = os.path.splitext(pdfname)
        logger.info('Processing {0}.'.format(os.path.basename(bname)))
        if not ltchar:
            warnings.warn("{0}: Page contains no text.".format(
                os.path.basename(bname)))
            return {os.path.basename(bname): None}
        imagename = ''.join([bname, '.png'])
        gs_call = [
            "-q", "-sDEVICE=png16m", "-o", imagename, "-r600", pdfname
        ]
        if "ghostscript" in subprocess.check_output(["gs", "-version"]).lower():
            gs_call.insert(0, "gs")
        else:
            gs_call.insert(0, "gsc")
        subprocess.call(gs_call, stdout=open(os.devnull, 'w'),
            stderr=subprocess.STDOUT)
        img, threshold = adaptive_threshold(imagename, invert=self.invert,
            blocksize=self.blocksize, c=self.threshold_constant)
        pdf_x = width
        pdf_y = height
        img_x = img.shape[1]
        img_y = img.shape[0]
        sc_x_image = img_x / float(pdf_x)
        sc_y_image = img_y / float(pdf_y)
        sc_x_pdf = pdf_x / float(img_x)
        sc_y_pdf = pdf_y / float(img_y)
        factors_image = (sc_x_image, sc_y_image, pdf_y)
        factors_pdf = (sc_x_pdf, sc_y_pdf, img_y)
        vmask, v_segments = find_lines(threshold, direction='vertical',
            scale=self.scale, iterations=self.iterations)
        hmask, h_segments = find_lines(threshold, direction='horizontal',
            scale=self.scale, iterations=self.iterations)
        if self.table_area is not None:
            areas = []
            for area in self.table_area:
                x1, y1, x2, y2 = area.split(",")
                x1 = float(x1)
                y1 = float(y1)
                x2 = float(x2)
                y2 = float(y2)
                x1, y1, x2, y2 = scale_to_image((x1, y1, x2, y2), factors_image)
                areas.append((x1, y1, abs(x2 - x1), abs(y2 - y1)))
            table_bbox = find_table_joints(areas, vmask, hmask)
        else:
            contours = find_table_contours(vmask, hmask)
            table_bbox = find_table_joints(contours, vmask, hmask)
        if len(self.mtol) == 1 and self.mtol[0] == 2:
            mtolerance = copy.deepcopy(self.mtol) * len(table_bbox)
        else:
            mtolerance = copy.deepcopy(self.mtol)
        if len(self.jtol) == 1 and self.jtol[0] == 2:
            jtolerance = copy.deepcopy(self.jtol) * len(table_bbox)
        else:
            jtolerance = copy.deepcopy(self.jtol)
        if self.debug:
            self.debug_images = (img, table_bbox)
        table_bbox, v_segments, h_segments = scale_to_pdf(table_bbox, v_segments,
            h_segments, factors_pdf)
        if self.debug:
            self.debug_segments = (v_segments, h_segments)
            self.debug_tables = []
        page = {}
        tables = {}
        # sort tables based on y-coord
        for table_no, k in enumerate(sorted(table_bbox.keys(), key=lambda x: x[1], reverse=True)):
            # select elements which lie within table_bbox
            table_data = {}
            t_bbox = {}
            v_s, h_s = segments_bbox(k, v_segments, h_segments)
            t_bbox['horizontal'] = text_in_bbox(k, lttextlh)
            t_bbox['vertical'] = text_in_bbox(k, lttextlv)
            char_bbox = text_in_bbox(k, ltchar)
            table_data['text_p'] = 100 * (1 - (len(char_bbox) / len(ltchar)))
            for direction in t_bbox:
                t_bbox[direction].sort(key=lambda x: (-x.y0, x.x0))
            cols, rows = zip(*table_bbox[k])
            cols, rows = list(cols), list(rows)
            cols.extend([k[0], k[2]])
            rows.extend([k[1], k[3]])
            # sort horizontal and vertical segments
            cols = merge_close_values(sorted(cols), mtol=mtolerance[table_no])
            rows = merge_close_values(
                sorted(rows, reverse=True), mtol=mtolerance[table_no])
            # make grid using x and y coord of shortlisted rows and cols
            cols = [(cols[i], cols[i + 1])
                    for i in range(0, len(cols) - 1)]
            rows = [(rows[i], rows[i + 1])
                    for i in range(0, len(rows) - 1)]
            table = Table(cols, rows)
            # set table edges to True using ver+hor lines
            table = table.set_edges(v_s, h_s, jtol=jtolerance[table_no])
            nouse = table.nocont_ / (len(v_s) + len(h_s))
            table_data['line_p'] = 100 * (1 - nouse)
            # set spanning cells to True
            table = table.set_spanning()
            # set table border edges to True
            table = table.set_border_edges()
            if self.debug:
                self.debug_tables.append(table)
            assignment_errors = []
            table_data['split_text'] = []
            table_data['superscript'] = []
            for direction in ['vertical', 'horizontal']:
                for t in t_bbox[direction]:
                    indices, error = get_table_index(
                        table, t, direction, split_text=self.split_text,
                        flag_size=self.flag_size)
                    if indices[:2] != (-1, -1):
                        assignment_errors.append(error)
                        indices = _reduce_index(table, indices, shift_text=self.shift_text)
                        if len(indices) > 1:
                            table_data['split_text'].append(indices)
                        for r_idx, c_idx, text in indices:
                            if all(s in text for s in ['<s>', '</s>']):
                                table_data['superscript'].append((r_idx, c_idx, text))
                            table.cells[r_idx][c_idx].add_text(text)
            score = get_score([[100, assignment_errors]])
            table_data['score'] = score
            if self.fill is not None:
                table = _fill_spanning(table, fill=self.fill)
            ar = table.get_list()
            ar = encode_list(ar)
            table_data['data'] = ar
            empty_p, r_nempty_cells, c_nempty_cells = count_empty(ar)
            table_data['empty_p'] = empty_p
            table_data['r_nempty_cells'] = r_nempty_cells
            table_data['c_nempty_cells'] = c_nempty_cells
            table_data['nrows'] = len(ar)
            table_data['ncols'] = len(ar[0])
            tables['table-{0}'.format(table_no + 1)] = table_data
        page[os.path.basename(bname)] = tables
        if self.debug:
            return None
        return page
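The grid-building step in the removed `Lattice.get_tables` (pairing consecutive merged coordinates into column and row extents before constructing `Table`) can be sketched in isolation as below. `make_grid` and the sample coordinates are hypothetical. Rows are listed in descending order because PDF coordinates place the origin at the bottom-left of the page.

```python
def make_grid(xs, ys):
    """Pair consecutive boundary coordinates into cell extents.

    xs: column boundaries sorted ascending.
    ys: row boundaries sorted descending (top of the page first).
    Returns (cols, rows) as lists of (start, end) tuples.
    """
    cols = [(xs[i], xs[i + 1]) for i in range(len(xs) - 1)]
    rows = [(ys[i], ys[i + 1]) for i in range(len(ys) - 1)]
    return cols, rows


# Assumed sample boundaries, in PDF points.
cols, rows = make_grid([72, 200, 340], [700, 650, 600, 550])
```

Three column boundaries and four row boundaries give a 3-row by 2-column grid, which matches how `len(cols) - 1` and `len(rows) - 1` bound the list comprehensions in the original code.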
--- a/camelot/ocr.py
+++ b/camelot/ocr.py
@ -1,331 +0,0 @@
 import os
 import copy
 import logging
 import subprocess
 import pyocr
 from PIL import Image
 from .table import Table
 from .imgproc import (adaptive_threshold, find_lines, find_table_contours,
                      find_table_joints, remove_lines, find_cuts)
 from .utils import merge_close_values, encode_list
 __all__ = ['OCRLattice', 'OCRStream']
 logger = logging.getLogger('app_logger')
 class OCRLattice:
    """Lattice, but for images.
    Parameters
    ----------
    table_area : list
        List of strings of the form x1,y1,x2,y2 where
        (x1, y1) -> left-top and (x2, y2) -> right-bottom in OpenCV's
        coordinate space, denoting table areas to analyze.
        (optional, default: None)
    mtol : list
        List of ints specifying m-tolerance parameters.
        (optional, default: [2])
    blocksize : int
        Size of a pixel neighborhood that is used to calculate a
        threshold value for the pixel: 3, 5, 7, and so on.
        (optional, default: 15)
    threshold_constant : float
        Constant subtracted from the mean or weighted mean
        (see the details below). Normally, it is positive but may be
        zero or negative as well.
        (optional, default: -2)
    dpi : int
        Dots per inch.
        (optional, default: 300)
    layout : int
        Tesseract page segmentation mode.
        (optional, default: 7)
    lang : string
        Language to be used for OCR.
        (optional, default: 'eng')
    scale : int
        Used to divide the height/width of a pdf to get a structuring
        element for image processing.
        (optional, default: 15)
    iterations : int
        Number of iterations for dilation.
        (optional, default: 0)
    debug : string
        {'contour', 'line', 'joint', 'table'}
        Set to one of the above values to generate a matplotlib plot
        of detected contours, lines, joints and the table generated.
        (optional, default: None)
    """
    def __init__(self, table_area=None, mtol=[2], blocksize=15, threshold_constant=-2,
                 dpi=300, layout=7, lang="eng", scale=15, iterations=0, debug=None):
        self.method = 'ocrl'
        self.table_area = table_area
        self.mtol = mtol
        self.blocksize = blocksize
        self.threshold_constant = threshold_constant
        self.tool = pyocr.get_available_tools()[0] # fix this
        self.dpi = dpi
        self.layout = layout
        self.lang = lang
        self.scale = scale
        self.iterations = iterations
        self.debug = debug
    def get_tables(self, pdfname):
        if self.tool is None:
            return None
        bname, __ = os.path.splitext(pdfname)
        imagename = ''.join([bname, '.png'])
        logger.info('Processing {0}.'.format(os.path.basename(bname)))
        gs_call = [
            "-q", "-sDEVICE=png16m", "-o", imagename, "-r{0}".format(self.dpi),
            pdfname
        ]
        if "ghostscript" in subprocess.check_output(["gs", "-version"]).lower():
            gs_call.insert(0, "gs")
        else:
            gs_call.insert(0, "gsc")
        subprocess.call(gs_call, stdout=open(os.devnull, 'w'),
            stderr=subprocess.STDOUT)
        img, threshold = adaptive_threshold(imagename, blocksize=self.blocksize,
            c=self.threshold_constant)
        vmask, v_segments = find_lines(threshold, direction='vertical',
            scale=self.scale, iterations=self.iterations)
        hmask, h_segments = find_lines(threshold, direction='horizontal',
            scale=self.scale, iterations=self.iterations)
        if self.table_area is not None:
            areas = []
            for area in self.table_area:
                x1, y1, x2, y2 = area.split(",")
                x1 = int(float(x1))
                y1 = int(float(y1))
                x2 = int(float(x2))
                y2 = int(float(y2))
                areas.append((x1, y1, abs(x2 - x1), abs(y2 - y1)))
            table_bbox = find_table_joints(areas, vmask, hmask)
        else:
            contours = find_table_contours(vmask, hmask)
            table_bbox = find_table_joints(contours, vmask, hmask)
        if self.debug:
            self.debug_images = (img, table_bbox)
            self.debug_segments = (v_segments, h_segments)
            self.debug_tables = []
        if len(self.mtol) == 1 and self.mtol[0] == 2:
            mtolerance = copy.deepcopy(self.mtol) * len(table_bbox)
        else:
            mtolerance = copy.deepcopy(self.mtol)
        page = {}
        tables = {}
        table_no = 0
        for k in sorted(table_bbox.keys(), key=lambda x: x[1]):
            table_data = {}
            cols, rows = zip(*table_bbox[k])
            cols, rows = list(cols), list(rows)
            cols.extend([k[0], k[2]])
            rows.extend([k[1], k[3]])
            cols = merge_close_values(sorted(cols), mtol=mtolerance[table_no])
            rows = merge_close_values(sorted(rows, reverse=True), mtol=mtolerance[table_no])
            cols = [(cols[i], cols[i + 1])
                    for i in range(0, len(cols) - 1)]
            rows = [(rows[i], rows[i + 1])
                    for i in range(0, len(rows) - 1)]
            table = Table(cols, rows)
            if self.debug:
                self.debug_tables.append(table)
            table.image = img[k[3]:k[1],k[0]:k[2]]
            for i in range(len(table.cells)):
                for j in range(len(table.cells[i])):
                    x1 = int(table.cells[i][j].x1)
                    y1 = int(table.cells[i][j].y1)
                    x2 = int(table.cells[i][j].x2)
                    y2 = int(table.cells[i][j].y2)
                    table.cells[i][j].image = img[y1:y2,x1:x2]
                    text = self.tool.image_to_string(
                        Image.fromarray(table.cells[i][j].image),
                        lang=self.lang,
                        builder=pyocr.builders.TextBuilder(tesseract_layout=self.layout)
                    )
                    table.cells[i][j].add_text(text)
            ar = table.get_list()
            ar.reverse()
            ar = encode_list(ar)
            table_data['data'] = ar
            tables['table-{0}'.format(table_no + 1)] = table_data
            table_no += 1
        page[os.path.basename(bname)] = tables
        if self.debug:
            return None
        return page
 class OCRStream:
    """Stream, but for images.
    Parameters
    ----------
    table_area : list
        List of strings of the form x1,y1,x2,y2 where
        (x1, y1) -> left-top and (x2, y2) -> right-bottom in OpenCV's
        coordinate space, denoting table areas to analyze.
        (optional, default: None)
    columns : list
        List of strings where each string is comma-separated values of
        x-coordinates in OpenCV's coordinate space.
        (optional, default: None)
    blocksize : int
        Size of a pixel neighborhood that is used to calculate a
        threshold value for the pixel: 3, 5, 7, and so on.
        (optional, default: 15)
    threshold_constant : float
        Constant subtracted from the mean or weighted mean.
        Normally, it is positive but may be zero or negative as well.
        (optional, default: -2)
    dpi : int
        Dots per inch.
        (optional, default: 300)
    layout : int
        Tesseract page segmentation mode.
        (optional, default: 7)
    lang : string
        Language to be used for OCR.
        (optional, default: 'eng')
    line_scale : int
        Line scaling factor.
        (optional, default: 15)
    char_scale : int
        Char scaling factor.
        (optional, default: 200)
    """
    def __init__(self, table_area=None, columns=None, blocksize=15,
                 threshold_constant=-2, dpi=300, layout=7, lang="eng",
                 line_scale=15, char_scale=200, debug=False):
        self.method = 'ocrs'
        self.table_area = table_area
        self.columns = columns
        self.blocksize = blocksize
        self.threshold_constant = threshold_constant
        self.tool = pyocr.get_available_tools()[0] # fix this
        self.dpi = dpi
        self.layout = layout
        self.lang = lang
        self.line_scale = line_scale
        self.char_scale = char_scale
        self.debug = debug
    def get_tables(self, pdfname):
        if self.tool is None:
            return None
        bname, __ = os.path.splitext(pdfname)
        imagename = ''.join([bname, '.png'])
        logger.info('Processing {0}.'.format(os.path.basename(bname)))
        gs_call = [
            "-q", "-sDEVICE=png16m", "-o", imagename, "-r{0}".format(self.dpi),
            pdfname
        ]
        if "ghostscript" in subprocess.check_output(["gs", "-version"]).lower():
            gs_call.insert(0, "gs")
        else:
            gs_call.insert(0, "gsc")
        subprocess.call(gs_call, stdout=open(os.devnull, 'w'),
            stderr=subprocess.STDOUT)
        img, threshold = adaptive_threshold(imagename, blocksize=self.blocksize,
            c=self.threshold_constant)
        threshold = remove_lines(threshold, line_scale=self.line_scale)
        height, width = threshold.shape
        if self.debug:
            self.debug_images = img
            return None
        if self.table_area is not None:
            if self.columns is not None:
                if len(self.table_area) != len(self.columns):
                    raise ValueError("{0}: Length of table area and columns"
                                     " should be equal.".format(os.path.basename(bname)))
            table_bbox = {}
            for area in self.table_area:
                x1, y1, x2, y2 = area.split(",")
                x1 = int(float(x1))
                y1 = int(float(y1))
                x2 = int(float(x2))
                y2 = int(float(y2))
                table_bbox[(x1, y1, x2, y2)] = None
        else:
            table_bbox = {(0, 0, width, height): None}
        page = {}
        tables = {}
        table_no = 0
        for k in sorted(table_bbox.keys(), key=lambda x: x[1]):
            if self.columns is None:
                raise NotImplementedError
            else:
                table_data = {}
                table_image = threshold[k[1]:k[3],k[0]:k[2]]
                cols = self.columns[table_no].split(',')
                cols = [float(c) for c in cols]
                cols.insert(0, k[0])
                cols.append(k[2])
                cols = [(cols[i] - k[0], cols[i + 1] - k[0]) for i in range(0, len(cols) - 1)]
                y_cuts = find_cuts(table_image, char_scale=self.char_scale)
                rows = [(y_cuts[i], y_cuts[i + 1]) for i in range(0, len(y_cuts) - 1)]
                table = Table(cols, rows)
                for i in range(len(table.cells)):
                    for j in range(len(table.cells[i])):
                        x1 = int(table.cells[i][j].x1)
                        y1 = int(table.cells[i][j].y1)
                        x2 = int(table.cells[i][j].x2)
                        y2 = int(table.cells[i][j].y2)
                        table.cells[i][j].image = table_image[y1:y2,x1:x2]
                        cell_image = Image.fromarray(table.cells[i][j].image)
                        text = self.tool.image_to_string(
                            cell_image,
                            lang=self.lang,
                            builder=pyocr.builders.TextBuilder(tesseract_layout=self.layout)
                        )
                        table.cells[i][j].add_text(text)
                ar = table.get_list()
                ar.reverse()
                ar = encode_list(ar)
                table_data['data'] = ar
                tables['table-{0}'.format(table_no + 1)] = table_data
                table_no += 1
        page[os.path.basename(bname)] = tables
        return page
--- a/camelot/parsers/__init__.py
+++ b/camelot/parsers/__init__.py
@ -0,0 +1,2 @@
 from .stream import Stream
 from .lattice import Lattice
--- a/camelot/parsers/base.py
+++ b/camelot/parsers/base.py
@ -0,0 +1,21 @@
 import os
 from ..core import Geometry
 from ..utils import get_page_layout, get_text_objects
 class BaseParser(object):
    """Defines a base parser.
    """
    def _generate_layout(self, filename):
        self.filename = filename
        self.layout, self.dimensions = get_page_layout(
            self.filename,
            char_margin=self.char_margin,
            line_margin=self.line_margin,
            word_margin=self.word_margin)
        self.horizontal_text = get_text_objects(self.layout, ltype="lh")
        self.vertical_text = get_text_objects(self.layout, ltype="lv")
        self.pdf_width, self.pdf_height = self.dimensions
        self.rootname, __ = os.path.splitext(self.filename)
        self.g = Geometry()
--- a/camelot/parsers/lattice.py
+++ b/camelot/parsers/lattice.py
@ -0,0 +1,336 @@
 from __future__ import division
 import os
 import copy
 import logging
 import subprocess
 import numpy as np
 import pandas as pd
 from .base import BaseParser
 from ..core import Table
 from ..utils import (scale_image, scale_pdf, segments_in_bbox, text_in_bbox,
                     merge_close_lines, get_table_index, compute_accuracy,
                     compute_whitespace, setup_logging, encode_)
 from ..image_processing import (adaptive_threshold, find_lines,
                                find_table_contours, find_table_joints)
 logger = setup_logging(__name__)
 class Lattice(BaseParser):
    """Lattice method of parsing looks for lines between text
    to form a table.
    Parameters
    ----------
    table_area : list, optional (default: None)
        List of table areas to analyze as strings of the form
        x1,y1,x2,y2 where (x1, y1) -> left-top and
        (x2, y2) -> right-bottom in pdf coordinate space.
    process_background : bool, optional (default: False)
        Whether or not to process lines that are in the background.
    line_size_scaling : int, optional (default: 15)
        Factor by which the page dimensions will be divided to get
        smallest length of lines that should be detected.
        The larger this value, the smaller the detected lines. Making
        it too large will lead to text being detected as lines.
    copy_text : list, optional (default: None)
        {'h', 'v'}
        Select one or more strings from above and pass them as a list
        to specify the direction in which text should be copied over
        when a cell spans multiple rows or columns.
    shift_text : list, optional (default: ['l', 't'])
        {'l', 'r', 't', 'b'}
        Select one or more strings from above and pass them as a list
        to specify where the text in a spanning cell should flow.
    split_text : bool, optional (default: False)
        Whether or not to split a text line if it spans across
        multiple cells.
    flag_size : bool, optional (default: False)
        Whether or not to highlight a substring using <s></s>
        if its size is different from the rest of the string, useful
        for super and subscripts.
    line_close_tol : int, optional (default: 2)
        Tolerance parameter used to merge vertical and horizontal
        detected lines which lie close to each other.
    joint_close_tol : int, optional (default: 2)
        Tolerance parameter used to decide whether the detected lines
        and points lie close to each other.
    threshold_blocksize : int, optional (default: 15)
        Size of a pixel neighborhood that is used to calculate a
        threshold value for the pixel: 3, 5, 7, and so on.
        For more information, refer to `OpenCV's adaptiveThreshold <https://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold>`_.
    threshold_constant : int, optional (default: -2)
        Constant subtracted from the mean or weighted mean.
        Normally, it is positive but may be zero or negative as well.
        For more information, refer to `OpenCV's adaptiveThreshold <https://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold>`_.
    iterations : int, optional (default: 0)
        Number of times erosion/dilation is applied.
        For more information, refer to `OpenCV's dilate <https://docs.opencv.org/2.4/modules/imgproc/doc/filtering.html#dilate>`_.
    margins : tuple, optional (default: (1.0, 0.5, 0.1))
        PDFMiner margins. (char_margin, line_margin, word_margin)
        For more information, refer to `PDFMiner docs <https://euske.github.io/pdfminer/>`_.
    debug : bool, optional (default: False)
        Whether or not to return all text objects on the page
        which can be used to generate a matplotlib plot, to get
        values for table_area(s) and debugging.
    """
    def __init__(self, table_area=None, process_background=False,
                 line_size_scaling=15, copy_text=None, shift_text=['l', 't'],
                 split_text=False, flag_size=False, line_close_tol=2,
                 joint_close_tol=2, threshold_blocksize=15, threshold_constant=-2,
                 iterations=0, margins=(1.0, 0.5, 0.1), debug=False):
        self.table_area = table_area
        self.process_background = process_background
        self.line_size_scaling = line_size_scaling
        self.copy_text = copy_text
        self.shift_text = shift_text
        self.split_text = split_text
        self.flag_size = flag_size
        self.line_close_tol = line_close_tol
        self.joint_close_tol = joint_close_tol
        self.threshold_blocksize = threshold_blocksize
        self.threshold_constant = threshold_constant
        self.iterations = iterations
        self.char_margin, self.line_margin, self.word_margin = margins
        self.debug = debug
    @staticmethod
    def _reduce_index(t, idx, shift_text):
        """Reduces index of a text object if it lies within a spanning
        cell.
        Parameters
        ----------
        t : camelot.core.Table
        idx : list
            List of tuples of the form (r_idx, c_idx, text).
        shift_text : list
            {'l', 'r', 't', 'b'}
            Select one or more strings from above and pass them as a
            list to specify where the text in a spanning cell should
            flow.
        Returns
        -------
        indices : list
            List of tuples of the form (r_idx, c_idx, text) where
            r_idx and c_idx are new row and column indices for text.
        """
        indices = []
        for r_idx, c_idx, text in idx:
            for d in shift_text:
                if d == 'l':
                    if t.cells[r_idx][c_idx].hspan:
                        while not t.cells[r_idx][c_idx].left:
                            c_idx -= 1
                if d == 'r':
                    if t.cells[r_idx][c_idx].hspan:
                        while not t.cells[r_idx][c_idx].right:
                            c_idx += 1
                if d == 't':
                    if t.cells[r_idx][c_idx].vspan:
                        while not t.cells[r_idx][c_idx].top:
                            r_idx -= 1
                if d == 'b':
                    if t.cells[r_idx][c_idx].vspan:
                        while not t.cells[r_idx][c_idx].bottom:
                            r_idx += 1
            indices.append((r_idx, c_idx, text))
        return indices
    @staticmethod
    def _copy_spanning_text(t, copy_text=None):
        """Copies over text in empty spanning cells.
        Parameters
        ----------
        t : camelot.core.Table
        copy_text : list, optional (default: None)
            {'h', 'v'}
            Select one or more strings from above and pass them as a list
            to specify the direction in which text should be copied over
            when a cell spans multiple rows or columns.
        Returns
        -------
        t : camelot.core.Table
        """
        for f in copy_text:
            if f == "h":
                for i in range(len(t.cells)):
                    for j in range(len(t.cells[i])):
                        if t.cells[i][j].text.strip() == '':
                            if t.cells[i][j].hspan and not t.cells[i][j].left:
                                t.cells[i][j].text = t.cells[i][j - 1].text
            elif f == "v":
                for i in range(len(t.cells)):
                    for j in range(len(t.cells[i])):
                        if t.cells[i][j].text.strip() == '':
                            if t.cells[i][j].vspan and not t.cells[i][j].top:
                                t.cells[i][j].text = t.cells[i - 1][j].text
        return t
    def _generate_image(self):
        self.imagename = ''.join([self.rootname, '.png'])
        gs_call = [
            "-q", "-sDEVICE=png16m", "-o", self.imagename, "-r600", self.filename
        ]
        if "ghostscript" in subprocess.check_output(["gs", "-version"]).lower():
            gs_call.insert(0, "gs")
        else:
            gs_call.insert(0, "gsc")
        with open(os.devnull, 'w') as devnull:
            subprocess.call(gs_call, stdout=devnull, stderr=subprocess.STDOUT)
    def _generate_table_bbox(self):
        self.image, self.threshold = adaptive_threshold(self.imagename, process_background=self.process_background,
            blocksize=self.threshold_blocksize, c=self.threshold_constant)
        image_width = self.image.shape[1]
        image_height = self.image.shape[0]
        image_width_scaler = image_width / float(self.pdf_width)
        image_height_scaler = image_height / float(self.pdf_height)
        pdf_width_scaler = self.pdf_width / float(image_width)
        pdf_height_scaler = self.pdf_height / float(image_height)
        image_scalers = (image_width_scaler, image_height_scaler, self.pdf_height)
        pdf_scalers = (pdf_width_scaler, pdf_height_scaler, image_height)
        vertical_mask, vertical_segments = find_lines(
            self.threshold, direction='vertical',
            line_size_scaling=self.line_size_scaling, iterations=self.iterations)
        horizontal_mask, horizontal_segments = find_lines(
            self.threshold, direction='horizontal',
            line_size_scaling=self.line_size_scaling, iterations=self.iterations)
        if self.table_area is not None:
            areas = []
            for area in self.table_area:
                x1, y1, x2, y2 = area.split(",")
                x1 = float(x1)
                y1 = float(y1)
                x2 = float(x2)
                y2 = float(y2)
                x1, y1, x2, y2 = scale_pdf((x1, y1, x2, y2), image_scalers)
                areas.append((x1, y1, abs(x2 - x1), abs(y2 - y1)))
            table_bbox = find_table_joints(areas, vertical_mask, horizontal_mask)
        else:
            contours = find_table_contours(vertical_mask, horizontal_mask)
            table_bbox = find_table_joints(contours, vertical_mask, horizontal_mask)
        self.table_bbox_unscaled = copy.deepcopy(table_bbox)
        self.table_bbox, self.vertical_segments, self.horizontal_segments = scale_image(
            table_bbox, vertical_segments, horizontal_segments, pdf_scalers)
    def _generate_columns_and_rows(self, table_idx, tk):
        # select elements which lie within table_bbox
        t_bbox = {}
        v_s, h_s = segments_in_bbox(
            tk, self.vertical_segments, self.horizontal_segments)
        t_bbox['horizontal'] = text_in_bbox(tk, self.horizontal_text)
        t_bbox['vertical'] = text_in_bbox(tk, self.vertical_text)
        self.t_bbox = t_bbox
        for direction in t_bbox:
            t_bbox[direction].sort(key=lambda x: (-x.y0, x.x0))
        cols, rows = zip(*self.table_bbox[tk])
        cols, rows = list(cols), list(rows)
        cols.extend([tk[0], tk[2]])
        rows.extend([tk[1], tk[3]])
        # sort horizontal and vertical segments
        cols = merge_close_lines(
            sorted(cols), line_close_tol=self.line_close_tol)
        rows = merge_close_lines(
            sorted(rows, reverse=True), line_close_tol=self.line_close_tol)
        # make grid using x and y coord of shortlisted rows and cols
        cols = [(cols[i], cols[i + 1])
                for i in range(0, len(cols) - 1)]
        rows = [(rows[i], rows[i + 1])
                for i in range(0, len(rows) - 1)]
        return cols, rows, v_s, h_s
    def _generate_table(self, table_idx, cols, rows, **kwargs):
        v_s = kwargs.get('v_s')
        h_s = kwargs.get('h_s')
        if v_s is None or h_s is None:
            raise ValueError('No segments found on {}'.format(self.rootname))
        table = Table(cols, rows)
        # set table edges to True using ver+hor lines
        table = table.set_edges(v_s, h_s, joint_close_tol=self.joint_close_tol)
        # set table border edges to True
        table = table.set_border()
        # set spanning cells to True
        table = table.set_span()
        pos_errors = []
        for direction in self.t_bbox:
            for t in self.t_bbox[direction]:
                indices, error = get_table_index(
                    table, t, direction, split_text=self.split_text,
                    flag_size=self.flag_size)
                if indices[:2] != (-1, -1):
                    pos_errors.append(error)
                    indices = Lattice._reduce_index(table, indices, shift_text=self.shift_text)
                    for r_idx, c_idx, text in indices:
                        table.cells[r_idx][c_idx].text = text
        accuracy = compute_accuracy([[100, pos_errors]])
        if self.copy_text is not None:
            table = Lattice._copy_spanning_text(table, copy_text=self.copy_text)
        data = table.data
        data = encode_(data)
        table.df = pd.DataFrame(data)
        table.shape = table.df.shape
        whitespace = compute_whitespace(data)
        table.accuracy = accuracy
        table.whitespace = whitespace
        table.order = table_idx + 1
        table.page = int(os.path.basename(self.rootname).replace('page-', ''))
        return table
    def extract_tables(self, filename):
        logger.info('Processing {}'.format(os.path.basename(filename)))
        self._generate_layout(filename)
        if not self.horizontal_text:
            logger.info("No tables found on {}".format(
                os.path.basename(self.rootname)))
            return [], self.g
        self._generate_image()
        self._generate_table_bbox()
        _tables = []
        # sort tables based on y-coord
        for table_idx, tk in enumerate(sorted(self.table_bbox.keys(),
                key=lambda x: x[1], reverse=True)):
            cols, rows, v_s, h_s = self._generate_columns_and_rows(table_idx, tk)
            table = self._generate_table(table_idx, cols, rows, v_s=v_s, h_s=h_s)
            _tables.append(table)
        if self.debug:
            text = []
            text.extend([(t.x0, t.y0, t.x1, t.y1) for t in self.horizontal_text])
            text.extend([(t.x0, t.y0, t.x1, t.y1) for t in self.vertical_text])
            self.g.text = text
            self.g.images = (self.image, self.table_bbox_unscaled)
            self.g.segments = (self.vertical_segments, self.horizontal_segments)
            self.g.tables = _tables
        return _tables, self.g
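
Note on `Lattice._generate_columns_and_rows`: the grid step above sorts the joint coordinates, merges values that lie within `line_close_tol` of each other, and pairs consecutive values into cell boundaries. The sketch below illustrates that idea in isolation. `merge_close` is a hypothetical stand-in for `camelot.utils.merge_close_lines`, whose body is not part of this diff; averaging close values is an assumption, and `make_grid` is an illustrative name only.

```python
def merge_close(values, tol=2):
    """Collapse already-sorted coordinates that lie within `tol` of the
    last kept value.

    Hypothetical stand-in for camelot.utils.merge_close_lines; replacing
    two close values by their average is an assumption, not taken from
    the diff.
    """
    merged = []
    for v in values:
        if merged and abs(v - merged[-1]) <= tol:
            merged[-1] = (merged[-1] + v) / 2.0
        else:
            merged.append(v)
    return merged


def make_grid(xs, ys, tol=2):
    """Build (start, end) column and row boundaries from joint coordinates."""
    # columns run left to right, rows run top to bottom (PDF y grows upward)
    cols = merge_close(sorted(xs), tol)
    rows = merge_close(sorted(ys, reverse=True), tol)
    # pair consecutive coordinates, as in the list comprehensions above
    cols = [(cols[i], cols[i + 1]) for i in range(len(cols) - 1)]
    rows = [(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
    return cols, rows


cols, rows = make_grid([0, 1, 50, 51, 100], [200, 199, 100, 0])
print(cols)
print(rows)
```

With a tolerance of 2, the near-duplicate x values 0/1 and 50/51 each collapse into one boundary, so five raw x coordinates yield two columns rather than four.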
--- a/camelot/parsers/stream.py
+++ b/camelot/parsers/stream.py
@ -0,0 +1,370 @@
 from __future__ import division
 import os
 import logging
 import numpy as np
 import pandas as pd
 from .base import BaseParser
 from ..core import Table
 from ..utils import (text_in_bbox, get_table_index, compute_accuracy,
                     compute_whitespace, setup_logging, encode_)
 logger = setup_logging(__name__)
 class Stream(BaseParser):
    """Stream method of parsing looks for spaces between text
    to form a table.
    If you want to specify columns when specifying multiple table
    areas, make sure that the lengths of both lists are equal.
    Parameters
    ----------
    table_area : list, optional (default: None)
        List of table areas to analyze as strings of the form
        x1,y1,x2,y2 where (x1, y1) -> left-top and
        (x2, y2) -> right-bottom in pdf coordinate space.
    columns : list, optional (default: None)
        List of column x-coordinates as strings where the coordinates
        are comma-separated.
    split_text : bool, optional (default: False)
        Whether or not to split a text line if it spans across
        multiple cells.
    flag_size : bool, optional (default: False)
        Whether or not to highlight a substring using <s></s>
        if its size is different from the rest of the string, useful
        for super and subscripts.
    row_close_tol : int, optional (default: 2)
        Rows will be formed by combining text vertically
        within this tolerance.
    col_close_tol : int, optional (default: 0)
        Columns will be formed by combining text horizontally
        within this tolerance.
    margins : tuple, optional (default: (1.0, 0.5, 0.1))
        PDFMiner margins. (char_margin, line_margin, word_margin)
        For more information, refer to `PDFMiner docs <https://euske.github.io/pdfminer/>`_.
    debug : bool, optional (default: False)
        Whether or not to return all text objects on the page
        which can be used to generate a matplotlib plot, to get
        values for table_area(s), columns and debugging.
    """
    def __init__(self, table_area=None, columns=None, split_text=False,
                 flag_size=False, row_close_tol=2, col_close_tol=0,
                 margins=(1.0, 0.5, 0.1), debug=False):
        self.table_area = table_area
        self.columns = columns
        self._validate_columns()
        self.split_text = split_text
        self.flag_size = flag_size
        self.row_close_tol = row_close_tol
        self.col_close_tol = col_close_tol
        self.char_margin, self.line_margin, self.word_margin = margins
        self.debug = debug
    @staticmethod
    def _text_bbox(t_bbox):
        """Returns bounding box for the text present on a page.
        Parameters
        ----------
        t_bbox : dict
            Dict with two keys 'horizontal' and 'vertical' with lists of
            LTTextLineHorizontals and LTTextLineVerticals respectively.
        Returns
        -------
        text_bbox : tuple
            Tuple (x0, y0, x1, y1) in pdf coordinate space.
        """
        xmin = min([t.x0 for direction in t_bbox for t in t_bbox[direction]])
        ymin = min([t.y0 for direction in t_bbox for t in t_bbox[direction]])
        xmax = max([t.x1 for direction in t_bbox for t in t_bbox[direction]])
        ymax = max([t.y1 for direction in t_bbox for t in t_bbox[direction]])
        text_bbox = (xmin, ymin, xmax, ymax)
        return text_bbox
    @staticmethod
    def _group_rows(text, row_close_tol=2):
        """Groups PDFMiner text objects into rows vertically
        within a tolerance.
        Parameters
        ----------
        text : list
            List of PDFMiner text objects.
        row_close_tol : int, optional (default: 2)
        Returns
        -------
        rows : list
            Two-dimensional list of text objects grouped into rows.
        """
        row_y = 0
        rows = []
        temp = []
        for t in text:
            # is checking for upright necessary?
            # if t.get_text().strip() and all([obj.upright for obj in t._objs if
            # type(obj) is LTChar]):
            if t.get_text().strip():
                if not np.isclose(row_y, t.y0, atol=row_close_tol):
                    rows.append(sorted(temp, key=lambda t: t.x0))
                    temp = []
                    row_y = t.y0
                temp.append(t)
        rows.append(sorted(temp, key=lambda t: t.x0))
        __ = rows.pop(0) # hacky
        return rows
    @staticmethod
    def _merge_columns(l, col_close_tol=0):
        """Merges column boundaries horizontally if they overlap
        or lie within a tolerance.
        Parameters
        ----------
        l : list
            List of column x-coordinate tuples.
        col_close_tol : int, optional (default: 0)
        Returns
        -------
        merged : list
            List of merged column x-coordinate tuples.
        """
        merged = []
        for higher in l:
            if not merged:
                merged.append(higher)
            else:
                lower = merged[-1]
                if col_close_tol >= 0:
                    if (higher[0] <= lower[1] or
                            np.isclose(higher[0], lower[1], atol=col_close_tol)):
                        upper_bound = max(lower[1], higher[1])
                        lower_bound = min(lower[0], higher[0])
                        merged[-1] = (lower_bound, upper_bound)
                    else:
                        merged.append(higher)
                elif col_close_tol < 0:
                    if higher[0] <= lower[1]:
                        if np.isclose(higher[0], lower[1], atol=abs(col_close_tol)):
                            merged.append(higher)
                        else:
                            upper_bound = max(lower[1], higher[1])
                            lower_bound = min(lower[0], higher[0])
                            merged[-1] = (lower_bound, upper_bound)
                    else:
                        merged.append(higher)
        return merged
    @staticmethod
    def _join_rows(rows_grouped, text_y_max, text_y_min):
        """Makes row coordinates continuous.
        Parameters
        ----------
        rows_grouped : list
            Two-dimensional list of text objects grouped into rows.
        text_y_max : int
        text_y_min : int
        Returns
        -------
        rows : list
            List of continuous row y-coordinate tuples.
        """
        row_mids = [sum([(t.y0 + t.y1) / 2 for t in r]) / len(r)
                    if len(r) > 0 else 0 for r in rows_grouped]
        rows = [(row_mids[i] + row_mids[i - 1]) / 2 for i in range(1, len(row_mids))]
        rows.insert(0, text_y_max)
        rows.append(text_y_min)
        rows = [(rows[i], rows[i + 1])
                for i in range(0, len(rows) - 1)]
        return rows
    @staticmethod
    def _add_columns(cols, text, row_close_tol):
        """Adds columns to existing list by taking into account
        the text that lies outside the current column x-coordinates.
        Parameters
        ----------
        cols : list
            List of column x-coordinate tuples.
        text : list
            List of PDFMiner text objects.
        row_close_tol : int
        Returns
        -------
        cols : list
            Updated list of column x-coordinate tuples.
        """
        if text:
            text = Stream._group_rows(text, row_close_tol=row_close_tol)
            elements = [len(r) for r in text]
            new_cols = [(t.x0, t.x1)
                for r in text if len(r) == max(elements) for t in r]
            cols.extend(Stream._merge_columns(sorted(new_cols)))
        return cols
    @staticmethod
    def _join_columns(cols, text_x_min, text_x_max):
        """Makes column coordinates continuous.
        Parameters
        ----------
        cols : list
            List of column x-coordinate tuples.
        text_x_min : int
        text_x_max : int
        Returns
        -------
        cols : list
            Updated list of column x-coordinate tuples.
        """
        cols = sorted(cols)
        cols = [(cols[i][0] + cols[i - 1][1]) / 2 for i in range(1, len(cols))]
        cols.insert(0, text_x_min)
        cols.append(text_x_max)
        cols = [(cols[i], cols[i + 1])
                for i in range(0, len(cols) - 1)]
        return cols
    def _validate_columns(self):
        if self.table_area is not None and self.columns is not None:
            if len(self.table_area) != len(self.columns):
                raise ValueError("Length of table_area and columns"
                                 " should be equal")
    def _generate_table_bbox(self):
        if self.table_area is not None:
            table_bbox = {}
            for area in self.table_area:
                x1, y1, x2, y2 = area.split(",")
                x1 = float(x1)
                y1 = float(y1)
                x2 = float(x2)
                y2 = float(y2)
                table_bbox[(x1, y2, x2, y1)] = None
        else:
            table_bbox = {(0, 0, self.pdf_width, self.pdf_height): None}
        self.table_bbox = table_bbox
    def _generate_columns_and_rows(self, table_idx, tk):
        # select elements which lie within table_bbox
        t_bbox = {}
        t_bbox['horizontal'] = text_in_bbox(tk, self.horizontal_text)
        t_bbox['vertical'] = text_in_bbox(tk, self.vertical_text)
        self.t_bbox = t_bbox
        for direction in self.t_bbox:
            self.t_bbox[direction].sort(key=lambda x: (-x.y0, x.x0))
        text_x_min, text_y_min, text_x_max, text_y_max = self._text_bbox(self.t_bbox)
        rows_grouped = self._group_rows(self.t_bbox['horizontal'], row_close_tol=self.row_close_tol)
        rows = self._join_rows(rows_grouped, text_y_max, text_y_min)
        elements = [len(r) for r in rows_grouped]
        if self.columns is not None and self.columns[table_idx] != "":
            # user has to input boundary columns too
            # take (0, pdf_width) by default
            # similar to else condition
            # len can't be 1
            cols = self.columns[table_idx].split(',')
            cols = [float(c) for c in cols]
            cols.insert(0, text_x_min)
            cols.append(text_x_max)
            cols = [(cols[i], cols[i + 1]) for i in range(0, len(cols) - 1)]
        else:
            ncols = max(set(elements), key=elements.count)
            if ncols == 1:
                logger.info("No tables found on {}".format(
                    os.path.basename(self.rootname)))
            cols = [(t.x0, t.x1)
                for r in rows_grouped if len(r) == ncols for t in r]
            cols = self._merge_columns(sorted(cols), col_close_tol=self.col_close_tol)
            inner_text = []
            for i in range(1, len(cols)):
                left = cols[i - 1][1]
                right = cols[i][0]
                inner_text.extend([t for direction in self.t_bbox
                                     for t in self.t_bbox[direction]
                                     if t.x0 > left and t.x1 < right])
            outer_text = [t for direction in self.t_bbox
                            for t in self.t_bbox[direction]
                            if t.x0 > cols[-1][1] or t.x1 < cols[0][0]]
            inner_text.extend(outer_text)
            cols = self._add_columns(cols, inner_text, self.row_close_tol)
            cols = self._join_columns(cols, text_x_min, text_x_max)
        return cols, rows
    def _generate_table(self, table_idx, cols, rows, **kwargs):
        table = Table(cols, rows)
        table = table.set_all_edges()
        pos_errors = []
        for direction in self.t_bbox:
            for t in self.t_bbox[direction]:
                indices, error = get_table_index(
                    table, t, direction, split_text=self.split_text,
                    flag_size=self.flag_size)
                if indices[:2] != (-1, -1):
                    pos_errors.append(error)
                    for r_idx, c_idx, text in indices:
                        table.cells[r_idx][c_idx].text = text
        accuracy = compute_accuracy([[100, pos_errors]])
        data = table.data
        data = encode_(data)
        table.df = pd.DataFrame(data)
        table.shape = table.df.shape
        whitespace = compute_whitespace(data)
        table.accuracy = accuracy
        table.whitespace = whitespace
        table.order = table_idx + 1
        table.page = int(os.path.basename(self.rootname).replace('page-', ''))
        return table
    def extract_tables(self, filename):
        logger.info('Processing {}'.format(os.path.basename(filename)))
        self._generate_layout(filename)
        if not self.horizontal_text:
            logger.info("No tables found on {}".format(
                os.path.basename(self.rootname)))
            return [], self.g
        self._generate_table_bbox()
        _tables = []
        # sort tables based on y-coord
        for table_idx, tk in enumerate(sorted(self.table_bbox.keys(),
                key=lambda x: x[1], reverse=True)):
            cols, rows = self._generate_columns_and_rows(table_idx, tk)
            table = self._generate_table(table_idx, cols, rows)
            _tables.append(table)
        if self.debug:
            text = []
            text.extend([(t.x0, t.y0, t.x1, t.y1) for t in self.horizontal_text])
            text.extend([(t.x0, t.y0, t.x1, t.y1) for t in self.vertical_text])
            self.g.text = text
            self.g.tables = _tables
        return _tables, self.g
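
The column guess in `_generate_columns_and_rows` relies on grouping text lines into rows by their y-coordinate and then taking the most common row length. A minimal sketch of that idea follows. `TextLine`, `group_rows` and `most_common_row_length` are hypothetical names used only for illustration, and the plain absolute-difference comparison is a simplification that may differ from the tolerance check in the actual helper.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text line; only the fields used here.
TextLine = namedtuple("TextLine", ["x0", "y0", "text"])


def group_rows(lines, row_close_tol=2):
    """Group text lines into rows when their y0 values lie within row_close_tol."""
    rows = []
    row_y = None
    # Top of the page first (PDF y grows upwards), then left to right.
    for line in sorted(lines, key=lambda t: (-t.y0, t.x0)):
        if not line.text.strip():
            continue  # ignore whitespace-only lines
        if row_y is None or abs(row_y - line.y0) > row_close_tol:
            rows.append([])  # start a new row
            row_y = line.y0
        rows[-1].append(line)
    return [sorted(r, key=lambda t: t.x0) for r in rows]


def most_common_row_length(rows):
    """Return the mode of the row lengths, used as the column-count guess."""
    lengths = [len(r) for r in rows]
    return max(set(lengths), key=lengths.count)


lines = [
    TextLine(10, 100.0, "a"), TextLine(60, 100.5, "b"),
    TextLine(10, 80.0, "c"), TextLine(60, 79.0, "d"),
    TextLine(10, 60.0, "title"),
]
rows = group_rows(lines)
ncols = most_common_row_length(rows)
```

A guess of one column is what the code above treats as "no tables found", since a single column of text carries no tabular structure.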
--- a/camelot/pdf.py
+++ b/camelot/pdf.py
@ -1,268 +0,0 @@
 import os
 import shutil
 import tempfile
 import itertools
 import multiprocessing as mp
 from functools import partial
 import cv2
 from PyPDF2 import PdfFileReader, PdfFileWriter
 from .utils import get_page_layout, get_text_objects, get_rotation
 __all__ = ['Pdf']
 def _parse_page_numbers(pagenos):
    """Converts list of dicts to list of ints.
    Parameters
    ----------
    pagenos : list
        List of dicts representing page ranges. A dict must have only
        two keys named 'start' and 'end' having int as their value.
    Returns
    -------
    page_numbers : list
        List of int page numbers.
    """
    page_numbers = []
    for p in pagenos:
        page_numbers.extend(range(p['start'], p['end'] + 1))
    page_numbers = sorted(set(page_numbers))
    return page_numbers
 def _save_page(temp, pdfname, pageno):
    with open(pdfname, 'rb') as pdffile:
        infile = PdfFileReader(pdffile, strict=False)
        sp_path = os.path.join(temp, 'page-{0}.pdf'.format(pageno))
        sp_name, sp_ext = os.path.splitext(sp_path)
        page = infile.getPage(pageno - 1)
        outfile = PdfFileWriter()
        outfile.addPage(page)
        with open(sp_path, 'wb') as f:
            outfile.write(f)
        layout, dim = get_page_layout(sp_path)
        lttextlh = get_text_objects(layout, ltype="lh")
        lttextlv = get_text_objects(layout, ltype="lv")
        ltchar = get_text_objects(layout, ltype="char")
        rotation = get_rotation(lttextlh, lttextlv, ltchar)
        if rotation != '':
            sp_new_path = ''.join([sp_name.replace('page', 'p'), '_rotated', sp_ext])
            os.rename(sp_path, sp_new_path)
            sp_in = PdfFileReader(open(sp_new_path, 'rb'),
                strict=False)
            sp_out = PdfFileWriter()
            sp_page = sp_in.getPage(0)
            if rotation == 'left':
                sp_page.rotateClockwise(90)
            elif rotation == 'right':
                sp_page.rotateCounterClockwise(90)
            sp_out.addPage(sp_page)
            with open(sp_path, 'wb') as pdf_out:
                sp_out.write(pdf_out)
 class Pdf:
    """Pdf manager.
    Handles all operations like temp directory creation, splitting file
    into single page pdfs, running extraction using multiple processes
    and removing the temp directory.
    Parameters
    ----------
    extractor : object
        camelot.stream.Stream or camelot.lattice.Lattice extractor
        object.
    pdfname : string
        Path to pdf file.
    pagenos : list
        List of dicts representing page ranges. A dict must have only
        two keys named 'start' and 'end' having int as their value.
        (optional, default: [{'start': 1, 'end': 1}])
    parallel : bool
        Whether or not to run using multiple processes.
        (optional, default: False)
    clean : bool
        Whether or not to remove the temp directory.
        (optional, default: False)
    """
    def __init__(self, extractor, pdfname, pagenos=[{'start': 1, 'end': 1}],
                 parallel=False, clean=False):
        self.extractor = extractor
        self.pdfname = pdfname
        if not self.pdfname.endswith('.pdf'):
            raise TypeError("File format not supported.")
        self.pagenos = _parse_page_numbers(pagenos)
        self.parallel = parallel
        if self.parallel:
            self.cpu_count = mp.cpu_count()
            self.pool = mp.Pool(processes=self.cpu_count)
        self.clean = clean
        self.temp = tempfile.mkdtemp()
    def split(self):
        """Splits file into single page pdfs.
        """
        if self.parallel:
            pfunc = partial(_save_page, self.temp, self.pdfname)
            self.pool.map(pfunc, self.pagenos)
        else:
            for p in self.pagenos:
                _save_page(self.temp, self.pdfname, p)
    def extract(self):
        """Runs table extraction by calling extractor.get_tables
        on all single page pdfs.
        """
        self.split()
        pages = [os.path.join(self.temp, 'page-{0}.pdf'.format(p))
                 for p in self.pagenos]
        if self.parallel:
            tables = self.pool.map(self.extractor.get_tables, pages)
            tables = {k: v for d in tables if d is not None for k, v in d.items()}
        else:
            tables = {}
            if self.extractor.debug:
                if self.extractor.method == 'stream':
                    self.debug = self.extractor.debug
                    self.debug_text = []
                elif self.extractor.method in ['lattice', 'ocrl']:
                    self.debug = self.extractor.debug
                    self.debug_images = []
                    self.debug_segments = []
                    self.debug_tables = []
                elif self.extractor.method == 'ocrs':
                    self.debug = self.extractor.debug
                    self.debug_images = []
            for p in pages:
                table = self.extractor.get_tables(p)
                if table is not None:
                    tables.update(table)
                if self.extractor.debug:
                    if self.extractor.method == 'stream':
                        self.debug_text.append(self.extractor.debug_text)
                    elif self.extractor.method in ['lattice', 'ocrl']:
                        self.debug_images.append(self.extractor.debug_images)
                        self.debug_segments.append(self.extractor.debug_segments)
                        self.debug_tables.append(self.extractor.debug_tables)
                    elif self.extractor.method == 'ocrs':
                        self.debug_images.append(self.extractor.debug_images)
        if self.clean:
            self.remove_tempdir()
        return tables
    def remove_tempdir(self):
        """Removes temporary directory that was created to save single
        page pdfs and their images.
        """
        shutil.rmtree(self.temp)
    def debug_plot(self):
        """Generates a matplotlib plot based on the selected extractor
        debug option.
        """
        import matplotlib.pyplot as plt
        import matplotlib.patches as patches
        if self.debug is True:
            if hasattr(self, 'debug_text'):
                for text in self.debug_text:
                    fig = plt.figure()
                    ax = fig.add_subplot(111, aspect='equal')
                    xs, ys = [], []
                    for t in text:
                        xs.extend([t[0], t[2]])
                        ys.extend([t[1], t[3]])
                        ax.add_patch(
                            patches.Rectangle(
                                (t[0], t[1]),
                                t[2] - t[0],
                                t[3] - t[1]
                            )
                        )
                    ax.set_xlim(min(xs) - 10, max(xs) + 10)
                    ax.set_ylim(min(ys) - 10, max(ys) + 10)
                    plt.show()
            elif hasattr(self, 'debug_images'):
                for img in self.debug_images:
                    plt.imshow(img)
                    plt.show()
        elif self.debug == 'contour':
            try:
                for img, table_bbox in self.debug_images:
                    for t in table_bbox.keys():
                        cv2.rectangle(img, (t[0], t[1]),
                                      (t[2], t[3]), (255, 0, 0), 3)
                    plt.imshow(img)
                    plt.show()
            except AttributeError:
                raise ValueError("This option can only be used with Lattice.")
        elif self.debug == 'joint':
            try:
                for img, table_bbox in self.debug_images:
                    x_coord = []
                    y_coord = []
                    for k in table_bbox.keys():
                        for coord in table_bbox[k]:
                            x_coord.append(coord[0])
                            y_coord.append(coord[1])
                    max_x, max_y = max(x_coord), max(y_coord)
                    plt.plot(x_coord, y_coord, 'ro')
                    plt.axis([0, max_x + 100, max_y + 100, 0])
                    plt.imshow(img)
                    plt.show()
            except AttributeError:
                raise ValueError("This option can only be used with Lattice.")
        elif self.debug == 'line':
            try:
                for v_s, h_s in self.debug_segments:
                    for v in v_s:
                        plt.plot([v[0], v[2]], [v[1], v[3]])
                    for h in h_s:
                        plt.plot([h[0], h[2]], [h[1], h[3]])
                    plt.show()
            except AttributeError:
                raise ValueError("This option can only be used with Lattice.")
        elif self.debug == 'table':
            try:
                for tables in self.debug_tables:
                    for table in tables:
                        for r in range(len(table.rows)):
                            for c in range(len(table.cols)):
                                if table.cells[r][c].left:
                                    plt.plot([table.cells[r][c].lb[0],
                                              table.cells[r][c].lt[0]],
                                             [table.cells[r][c].lb[1],
                                              table.cells[r][c].lt[1]])
                                if table.cells[r][c].right:
                                    plt.plot([table.cells[r][c].rb[0],
                                              table.cells[r][c].rt[0]],
                                             [table.cells[r][c].rb[1],
                                              table.cells[r][c].rt[1]])
                                if table.cells[r][c].top:
                                    plt.plot([table.cells[r][c].lt[0],
                                              table.cells[r][c].rt[0]],
                                             [table.cells[r][c].lt[1],
                                              table.cells[r][c].rt[1]])
                                if table.cells[r][c].bottom:
                                    plt.plot([table.cells[r][c].lb[0],
                                              table.cells[r][c].rb[0]],
                                             [table.cells[r][c].lb[1],
                                              table.cells[r][c].rb[1]])
                    plt.show()
            except AttributeError:
                raise ValueError("This option can only be used with Lattice.")
        else:
            raise UserWarning("This method can only be called after"
                " debug has been specified.")
--- a/camelot/plotting.py
+++ b/camelot/plotting.py
@ -0,0 +1,174 @@
 import cv2
 import matplotlib.pyplot as plt
 import matplotlib.patches as patches
 from .handlers import PDFHandler
 def plot_geometry(filepath, pages='1', mesh=False, geometry_type='text', **kwargs):
    """Plot geometry found on pdf page based on type specified,
    useful for debugging and playing with different parameters to get
    the best output.
    Note: kwargs annotated with ^ can only be used with mesh=False
    and kwargs annotated with * can only be used with mesh=True.
    Parameters
    ----------
    filepath : str
        Path to pdf file.
    pages : str
        Comma-separated page numbers to parse.
        Example: 1,3,4 or 1,4-end
    mesh : bool (default: False)
        Whether or not to use Lattice method of parsing. Stream
        is used by default.
    geometry_type : str, optional (default: 'text')
        'text' : Plot text objects found on page, useful to get
                 table_area and columns coordinates.
        'table' : Plot parsed table.
        'contour'* : Plot detected rectangles.
        'joint'* : Plot detected line intersections.
        'line'* : Plot detected lines.
    table_area : list, optional (default: None)
        List of table areas to analyze as strings of the form
        x1,y1,x2,y2 where (x1, y1) -> left-top and
        (x2, y2) -> right-bottom in pdf coordinate space.
    columns^ : list, optional (default: None)
        List of column x-coordinates as strings where the coordinates
        are comma-separated.
    split_text : bool, optional (default: False)
        Whether or not to split a text line if it spans across
        multiple cells.
    flag_size : bool, optional (default: False)
        Whether or not to highlight a substring using <s></s>
        if its size is different from rest of the string, useful for
        super and subscripts.
    row_close_tol^ : int, optional (default: 2)
        Rows will be formed by combining text vertically
        within this tolerance.
    col_close_tol^ : int, optional (default: 0)
        Columns will be formed by combining text horizontally
        within this tolerance.
    process_background* : bool, optional (default: False)
        Whether or not to process lines that are in background.
    line_size_scaling* : int, optional (default: 15)
        Factor by which the page dimensions will be divided to get
        smallest length of lines that should be detected.
        The larger this value, the smaller the detected lines. Making it
        too large will lead to text being detected as lines.
    copy_text* : list, optional (default: None)
        {'h', 'v'}
        Select one or more strings from above and pass them as a list
        to specify the direction in which text should be copied over
        when a cell spans multiple rows or columns.
    shift_text* : list, optional (default: ['l', 't'])
        {'l', 'r', 't', 'b'}
        Select one or more strings from above and pass them as a list
        to specify where the text in a spanning cell should flow.
    line_close_tol* : int, optional (default: 2)
        Tolerance parameter used to merge vertical and horizontal
        detected lines which lie close to each other.
    joint_close_tol* : int, optional (default: 2)
        Tolerance parameter used to decide whether the detected lines
        and points lie close to each other.
    threshold_blocksize : int, optional (default: 15)
        Size of a pixel neighborhood that is used to calculate a
        threshold value for the pixel: 3, 5, 7, and so on.
        For more information, refer `OpenCV's adaptiveThreshold <https://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold>`_.
    threshold_constant : int, optional (default: -2)
        Constant subtracted from the mean or weighted mean.
        Normally, it is positive but may be zero or negative as well.
        For more information, refer `OpenCV's adaptiveThreshold <https://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold>`_.
    iterations : int, optional (default: 0)
        Number of times erosion/dilation is applied.
        For more information, refer `OpenCV's dilate <https://docs.opencv.org/2.4/modules/imgproc/doc/filtering.html#dilate>`_.
    margins : tuple
        PDFMiner margins. (char_margin, line_margin, word_margin)
        For more information, refer `PDFMiner docs <https://euske.github.io/pdfminer/>`_.
    debug : bool, optional (default: False)
        Whether or not to return all text objects on the page
        which can be used to generate a matplotlib plot, to get
        values for table_area(s) and debugging.
    """
    # validate kwargs?
    p = PDFHandler(filepath, pages)
    debug = True if geometry_type else False
    kwargs.update({'debug': debug})
    __, geometry = p.parse(mesh=mesh, **kwargs)
    if geometry_type == 'text':
        for text in geometry.text:
            fig = plt.figure()
            ax = fig.add_subplot(111, aspect='equal')
            xs, ys = [], []
            for t in text:
                xs.extend([t[0], t[2]])
                ys.extend([t[1], t[3]])
                ax.add_patch(
                    patches.Rectangle(
                        (t[0], t[1]),
                        t[2] - t[0],
                        t[3] - t[1]
                    )
                )
            ax.set_xlim(min(xs) - 10, max(xs) + 10)
            ax.set_ylim(min(ys) - 10, max(ys) + 10)
            plt.show()
    elif geometry_type == 'table':
        for tables in geometry.tables:
            for table in tables:
                for row in table.cells:
                    for cell in row:
                        if cell.left:
                            plt.plot([cell.lb[0], cell.lt[0]],
                                     [cell.lb[1], cell.lt[1]])
                        if cell.right:
                            plt.plot([cell.rb[0], cell.rt[0]],
                                     [cell.rb[1], cell.rt[1]])
                        if cell.top:
                            plt.plot([cell.lt[0], cell.rt[0]],
                                     [cell.lt[1], cell.rt[1]])
                        if cell.bottom:
                            plt.plot([cell.lb[0], cell.rb[0]],
                                     [cell.lb[1], cell.rb[1]])
            plt.show()
    elif geometry_type == 'contour':
        if not mesh:
            raise ValueError("Use mesh=True")
        for img, table_bbox in geometry.images:
            for t in table_bbox.keys():
                cv2.rectangle(img, (t[0], t[1]),
                              (t[2], t[3]), (255, 0, 0), 3)
            plt.imshow(img)
            plt.show()
    elif geometry_type == 'joint':
        if not mesh:
            raise ValueError("Use mesh=True")
        for img, table_bbox in geometry.images:
            x_coord = []
            y_coord = []
            for k in table_bbox.keys():
                for coord in table_bbox[k]:
                    x_coord.append(coord[0])
                    y_coord.append(coord[1])
            max_x, max_y = max(x_coord), max(y_coord)
            plt.plot(x_coord, y_coord, 'ro')
            plt.axis([0, max_x + 100, max_y + 100, 0])
            plt.imshow(img)
            plt.show()
    elif geometry_type == 'line':
        if not mesh:
            raise ValueError("Use mesh=True")
        for v_s, h_s in geometry.segments:
            for v in v_s:
                plt.plot([v[0], v[2]], [v[1], v[3]])
            for h in h_s:
                plt.plot([h[0], h[2]], [h[1], h[3]])
            plt.show()
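
The `'text'` branch of `plot_geometry` sets the axis limits from the bounding boxes of the text objects. A sketch of that calculation on its own, without matplotlib, may help when checking coordinates by hand. It assumes each box is an `(x0, y0, x1, y1)` tuple, as built in `extract_tables`; `bbox_limits` and its `pad` parameter are hypothetical names for illustration.

```python
def bbox_limits(boxes, pad=10):
    """Return ((xmin, xmax), (ymin, ymax)) covering all boxes, padded by pad."""
    xs, ys = [], []
    for x0, y0, x1, y1 in boxes:
        xs.extend([x0, x1])  # x-extent of the box
        ys.extend([y0, y1])  # y-extent of the box
    return (min(xs) - pad, max(xs) + pad), (min(ys) - pad, max(ys) + pad)


xlim, ylim = bbox_limits([(10, 20, 110, 40), (50, 200, 90, 230)])
```

The resulting pairs can be passed to `ax.set_xlim(*xlim)` and `ax.set_ylim(*ylim)`.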
--- a/camelot/stream.py
+++ b/camelot/stream.py
@ -1,428 +0,0 @@
 from __future__ import division
 import os
 import copy
 import types
 import logging
 import copy_reg
 import warnings
 import numpy as np
 from .table import Table
 from .utils import (text_in_bbox, get_table_index, get_score, count_empty,
                    encode_list, get_text_objects, get_page_layout)
 __all__ = ['Stream']
 logger = logging.getLogger('app_logger')
 def _reduce_method(m):
    if m.im_self is None:
        return getattr, (m.im_class, m.im_func.func_name)
    else:
        return getattr, (m.im_self, m.im_func.func_name)
 copy_reg.pickle(types.MethodType, _reduce_method)
 def _text_bbox(t_bbox):
    """Returns bounding box for the text present on a page.
    Parameters
    ----------
    t_bbox : dict
        Dict with two keys 'horizontal' and 'vertical' with lists of
        LTTextLineHorizontals and LTTextLineVerticals respectively.
    Returns
    -------
    text_bbox : tuple
        Tuple of the form (x0, y0, x1, y1) in PDFMiner's coordinate
        space.
    """
    xmin = min([t.x0 for direction in t_bbox for t in t_bbox[direction]])
    ymin = min([t.y0 for direction in t_bbox for t in t_bbox[direction]])
    xmax = max([t.x1 for direction in t_bbox for t in t_bbox[direction]])
    ymax = max([t.y1 for direction in t_bbox for t in t_bbox[direction]])
    text_bbox = (xmin, ymin, xmax, ymax)
    return text_bbox
 def _group_rows(text, ytol=2):
    """Groups PDFMiner text objects into rows using their
    y-coordinates taking into account some tolerance ytol.
    Parameters
    ----------
    text : list
        List of PDFMiner text objects.
    ytol : int
        Tolerance parameter.
        (optional, default: 2)
    Returns
    -------
    rows : list
        Two-dimensional list of text objects grouped into rows.
    """
    row_y = 0
    rows = []
    temp = []
    for t in text:
        # is checking for upright necessary?
        # if t.get_text().strip() and all([obj.upright for obj in t._objs if
        # type(obj) is LTChar]):
        if t.get_text().strip():
            if not np.isclose(row_y, t.y0, atol=ytol):
                rows.append(sorted(temp, key=lambda t: t.x0))
                temp = []
                row_y = t.y0
            temp.append(t)
    rows.append(sorted(temp, key=lambda t: t.x0))
    __ = rows.pop(0) # hacky
    return rows
 def _merge_columns(l, mtol=0):
    """Merges column boundaries if they overlap or lie within some
    tolerance mtol.
    Parameters
    ----------
    l : list
        List of column coordinate tuples.
    mtol : int
        TODO
        (optional, default: 0)
    Returns
    -------
    merged : list
        List of merged column coordinate tuples.
    """
    merged = []
    for higher in l:
        if not merged:
            merged.append(higher)
        else:
            lower = merged[-1]
            if mtol >= 0:
                if (higher[0] <= lower[1] or
                        np.isclose(higher[0], lower[1], atol=mtol)):
                    upper_bound = max(lower[1], higher[1])
                    lower_bound = min(lower[0], higher[0])
                    merged[-1] = (lower_bound, upper_bound)
                else:
                    merged.append(higher)
            elif mtol < 0:
                if higher[0] <= lower[1]:
                    if np.isclose(higher[0], lower[1], atol=abs(mtol)):
                        merged.append(higher)
                    else:
                        upper_bound = max(lower[1], higher[1])
                        lower_bound = min(lower[0], higher[0])
                        merged[-1] = (lower_bound, upper_bound)
                else:
                    merged.append(higher)
    return merged
 def _join_rows(rows_grouped, text_y_max, text_y_min):
    """Makes row coordinates continuous.
    Parameters
    ----------
    rows_grouped : list
        Two-dimensional list of text objects grouped into rows.
    text_y_max : int
    text_y_min : int
    Returns
    -------
    rows : list
        List of continuous row coordinate tuples.
    """
    row_mids = [sum([(t.y0 + t.y1) / 2 for t in r]) / len(r)
                if len(r) > 0 else 0 for r in rows_grouped]
    rows = [(row_mids[i] + row_mids[i - 1]) / 2 for i in range(1, len(row_mids))]
    rows.insert(0, text_y_max)
    rows.append(text_y_min)
    rows = [(rows[i], rows[i + 1])
            for i in range(0, len(rows) - 1)]
    return rows
 def _join_columns(cols, text_x_min, text_x_max):
    """Makes column coordinates continuous.
    Parameters
    ----------
    cols : list
        List of column coordinate tuples.
    text_x_min : int
    text_x_max : int
    Returns
    -------
    cols : list
        Updated list of column coordinate tuples.
    """
    cols = sorted(cols)
    cols = [(cols[i][0] + cols[i - 1][1]) / 2 for i in range(1, len(cols))]
    cols.insert(0, text_x_min)
    cols.append(text_x_max)
    cols = [(cols[i], cols[i + 1])
            for i in range(0, len(cols) - 1)]
    return cols
 def _add_columns(cols, text, ytol):
    """Adds columns to existing list by taking into account
    the text that lies outside the current column coordinates.
    Parameters
    ----------
    cols : list
        List of column coordinate tuples.
    text : list
        List of PDFMiner text objects.
    ytol : int
        Tolerance parameter.
    Returns
    -------
    cols : list
        Updated list of column coordinate tuples.
    """
    if text:
        text = _group_rows(text, ytol=ytol)
        elements = [len(r) for r in text]
        new_cols = [(t.x0, t.x1)
            for r in text if len(r) == max(elements) for t in r]
        cols.extend(_merge_columns(sorted(new_cols)))
    return cols
 class Stream:
    """Stream looks for spaces between text elements to form a table.
    If you want to give columns, ytol or mtol for each table
    when specifying multiple table areas, make sure that their length
    is equal to the length of table_area. Mapping between them is based
    on index.
    If you don't want to specify columns for some tables in a pdf
    page having multiple tables, pass them as empty strings.
    For example: ['', 'x1,x2,x3,x4', '']
    Parameters
    ----------
    table_area : list
        List of strings of the form x1,y1,x2,y2 where
        (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDFMiner's
        coordinate space, denoting table areas to analyze.
        (optional, default: None)
    columns : list
        List of strings where each string is comma-separated values of
        x-coordinates in PDFMiner's coordinate space.
        (optional, default: None)
    ytol : list
        List of ints specifying the y-tolerance parameters.
        (optional, default: [2])
    mtol : list
        List of ints specifying the m-tolerance parameters.
        (optional, default: [0])
    margins : tuple
        PDFMiner margins. (char_margin, line_margin, word_margin)
        (optional, default: (1.0, 0.5, 0.1))
    split_text : bool
        Whether or not to split a text line if it spans across
        different cells.
        (optional, default: False)
    flag_size : bool
        Whether or not to highlight a substring using <s></s>
        if its size is different from rest of the string, useful for
        super and subscripts.
        (optional, default: True)
    debug : bool
        Set to True to generate a matplotlib plot of
        LTTextLineHorizontals in order to select table_area, columns.
        (optional, default: False)
    """
    def __init__(self, table_area=None, columns=None, ytol=[2], mtol=[0],
                 margins=(1.0, 0.5, 0.1), split_text=False, flag_size=True,
                 debug=False):
        self.method = 'stream'
        self.table_area = table_area
        self.columns = columns
        self.ytol = ytol
        self.mtol = mtol
        self.char_margin, self.line_margin, self.word_margin = margins
        self.split_text = split_text
        self.flag_size = flag_size
        self.debug = debug
    def get_tables(self, pdfname):
        """Expects a single page pdf as input with rotation corrected.
        Parameters
        ----------
        pdfname : string
            Path to single page pdf file.
        Returns
        -------
        page : dict
        """
        layout, dim = get_page_layout(pdfname, char_margin=self.char_margin,
            line_margin=self.line_margin, word_margin=self.word_margin)
        lttextlh = get_text_objects(layout, ltype="lh")
        lttextlv = get_text_objects(layout, ltype="lv")
        ltchar = get_text_objects(layout, ltype="char")
        width, height = dim
        bname, __ = os.path.splitext(pdfname)
        logger.info('Processing {0}.'.format(os.path.basename(bname)))
        if not lttextlh:
            warnings.warn("{0}: Page contains no text.".format(
                os.path.basename(bname)))
            return {os.path.basename(bname): None}
        if self.debug:
            self.debug_text = []
            self.debug_text.extend([(t.x0, t.y0, t.x1, t.y1) for t in lttextlh])
            self.debug_text.extend([(t.x0, t.y0, t.x1, t.y1) for t in lttextlv])
            return None
        if self.table_area is not None:
            if self.columns is not None:
                if len(self.table_area) != len(self.columns):
                    raise ValueError("{0}: Length of table area and columns"
                                     " should be equal.".format(os.path.basename(bname)))
            table_bbox = {}
            for area in self.table_area:
                x1, y1, x2, y2 = area.split(",")
                x1 = float(x1)
                y1 = float(y1)
                x2 = float(x2)
                y2 = float(y2)
                table_bbox[(x1, y2, x2, y1)] = None
        else:
            table_bbox = {(0, 0, width, height): None}
        if len(self.ytol) == 1 and self.ytol[0] == 2:
            ytolerance = copy.deepcopy(self.ytol) * len(table_bbox)
        else:
            ytolerance = copy.deepcopy(self.ytol)
        if len(self.mtol) == 1 and self.mtol[0] == 0:
            mtolerance = copy.deepcopy(self.mtol) * len(table_bbox)
        else:
            mtolerance = copy.deepcopy(self.mtol)
        page = {}
        tables = {}
        # sort tables based on y-coord
        for table_no, k in enumerate(sorted(table_bbox.keys(), key=lambda x: x[1], reverse=True)):
            # select elements which lie within table_bbox
            table_data = {}
            t_bbox = {}
            t_bbox['horizontal'] = text_in_bbox(k, lttextlh)
            t_bbox['vertical'] = text_in_bbox(k, lttextlv)
            char_bbox = text_in_bbox(k, ltchar)
            table_data['text_p'] = 100 * (1 - (len(char_bbox) / len(ltchar)))
            for direction in t_bbox:
                t_bbox[direction].sort(key=lambda x: (-x.y0, x.x0))
            text_x_min, text_y_min, text_x_max, text_y_max = _text_bbox(t_bbox)
            rows_grouped = _group_rows(t_bbox['horizontal'], ytol=ytolerance[table_no])
            rows = _join_rows(rows_grouped, text_y_max, text_y_min)
            elements = [len(r) for r in rows_grouped]
            guess = False
            if self.columns is not None and self.columns[table_no] != "":
                # user has to input boundary columns too
                # take (0, width) by default
                # similar to else condition
                # len can't be 1
                cols = self.columns[table_no].split(',')
                cols = [float(c) for c in cols]
                cols.insert(0, text_x_min)
                cols.append(text_x_max)
                cols = [(cols[i], cols[i + 1]) for i in range(0, len(cols) - 1)]
            else:
                guess = True
                ncols = max(set(elements), key=elements.count)
                len_non_mode = len(filter(lambda x: x != ncols, elements))
                if ncols == 1:
                    # no tables detected
                    warnings.warn("{0}: Page contains no tables.".format(
                        os.path.basename(bname)))
                cols = [(t.x0, t.x1)
                    for r in rows_grouped if len(r) == ncols for t in r]
                cols = _merge_columns(sorted(cols), mtol=mtolerance[table_no])
                inner_text = []
                for i in range(1, len(cols)):
                    left = cols[i - 1][1]
                    right = cols[i][0]
                    inner_text.extend([t for direction in t_bbox
                                       for t in t_bbox[direction]
                                       if t.x0 > left and t.x1 < right])
                outer_text = [t for direction in t_bbox
                              for t in t_bbox[direction]
                              if t.x0 > cols[-1][1] or t.x1 < cols[0][0]]
                inner_text.extend(outer_text)
                cols = _add_columns(cols, inner_text, ytolerance[table_no])
                cols = _join_columns(cols, text_x_min, text_x_max)
            table = Table(cols, rows)
            table = table.set_all_edges()
            assignment_errors = []
            table_data['split_text'] = []
            table_data['superscript'] = []
            for direction in t_bbox:
                for t in t_bbox[direction]:
                    indices, error = get_table_index(
                        table, t, direction, split_text=self.split_text,
                        flag_size=self.flag_size)
                    assignment_errors.append(error)
                    if len(indices) > 1:
                        table_data['split_text'].append(indices)
                    for r_idx, c_idx, text in indices:
                        if all(s in text for s in ['<s>', '</s>']):
                            table_data['superscript'].append((r_idx, c_idx, text))
                        table.cells[r_idx][c_idx].add_text(text)
            if guess:
                score = get_score([[66, assignment_errors], [34, [len_non_mode / float(len(elements))]]])
            else:
                score = get_score([[100, assignment_errors]])
            table_data['score'] = score
            ar = table.get_list()
            ar = encode_list(ar)
            table_data['data'] = ar
            empty_p, r_nempty_cells, c_nempty_cells = count_empty(ar)
            table_data['empty_p'] = empty_p
            table_data['r_nempty_cells'] = r_nempty_cells
            table_data['c_nempty_cells'] = c_nempty_cells
            table_data['nrows'] = len(ar)
            table_data['ncols'] = len(ar[0])
            tables['table-{0}'.format(table_no + 1)] = table_data
        page[os.path.basename(bname)] = tables
        return page
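Note on the column guess above: when no column separators are supplied, the code takes the most common number of text lines per row as the column count and feeds the share of rows that deviate from it into the score. A minimal sketch of that idea follows; `guess_ncols` is a hypothetical helper written for illustration, not a camelot function. It uses `float()` so the share is not truncated by Python 2 integer division.

```python
def guess_ncols(row_lengths):
    """Return the most common row length and the share of rows that differ.

    Hypothetical helper, for illustration only.
    """
    # Mode of the row lengths, computed as in the diff above.
    ncols = max(set(row_lengths), key=row_lengths.count)
    n_other = sum(1 for n in row_lengths if n != ncols)
    # float() keeps the ratio fractional under Python 2 as well.
    return ncols, n_other / float(len(row_lengths))


# Five text rows holding 3, 3, 2, 3 and 1 text lines respectively.
ncols, non_mode_share = guess_ncols([3, 3, 2, 3, 1])
```

If two lengths are equally common, `max` over a set picks one of them in an unspecified order, so the guess is only well defined when there is a clear mode.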
--- a/camelot/table.py
+++ b/camelot/table.py
@ -1,236 +0,0 @@
 import numpy as np
 from .cell import Cell
 class Table:
    """Table.
    Defines a table object with coordinates relative to a left-bottom
    origin, which is also PDFMiner's coordinate space.
    Parameters
    ----------
    cols : list
        List of tuples representing column x-coordinates in increasing
        order.
    rows : list
        List of tuples representing row y-coordinates in decreasing
        order.
    Attributes
    ----------
    cells : list
        List of cell objects with row-major ordering.
    nocont_ : int
        Number of lines that did not contribute to setting cell edges.
    """
    def __init__(self, cols, rows):
        self.cols = cols
        self.rows = rows
        self.cells = [[Cell(c[0], r[1], c[1], r[0])
                       for c in cols] for r in rows]
        self.nocont_ = 0
        self.image = None
    def set_all_edges(self):
        """Sets all table edges to True.
        """
        for r in range(len(self.rows)):
            for c in range(len(self.cols)):
                self.cells[r][c].left = True
                self.cells[r][c].right = True
                self.cells[r][c].top = True
                self.cells[r][c].bottom = True
        return self
    def set_border_edges(self):
        """Sets table border edges to True.
        """
        for r in range(len(self.rows)):
            self.cells[r][0].left = True
            self.cells[r][len(self.cols) - 1].right = True
        for c in range(len(self.cols)):
            self.cells[0][c].top = True
            self.cells[len(self.rows) - 1][c].bottom = True
        return self
    def set_edges(self, vertical, horizontal, jtol=2):
        """Sets a cell's edges to True depending on whether they
        overlap with lines found by imgproc.
        Parameters
        ----------
        vertical : list
            List of vertical lines detected by imgproc. Coordinates
            scaled and translated to the PDFMiner's coordinate space.
        horizontal : list
            List of horizontal lines detected by imgproc. Coordinates
            scaled and translated to the PDFMiner's coordinate space.
        """
        for v in vertical:
            # find closest x coord
            # iterate over y coords and find closest points
            i = [i for i, t in enumerate(self.cols)
                 if np.isclose(v[0], t[0], atol=jtol)]
            j = [j for j, t in enumerate(self.rows)
                 if np.isclose(v[3], t[0], atol=jtol)]
            k = [k for k, t in enumerate(self.rows)
                 if np.isclose(v[1], t[0], atol=jtol)]
            if not j:
                self.nocont_ += 1
                continue
            J = j[0]
            if i == [0]:  # only left edge
                I = i[0]
                if k:
                    K = k[0]
                    while J < K:
                        self.cells[J][I].left = True
                        J += 1
                else:
                    K = len(self.rows)
                    while J < K:
                        self.cells[J][I].left = True
                        J += 1
            elif i == []:  # only right edge
                I = len(self.cols) - 1
                if k:
                    K = k[0]
                    while J < K:
                        self.cells[J][I].right = True
                        J += 1
                else:
                    K = len(self.rows)
                    while J < K:
                        self.cells[J][I].right = True
                        J += 1
            else:  # both left and right edges
                I = i[0]
                if k:
                    K = k[0]
                    while J < K:
                        self.cells[J][I].left = True
                        self.cells[J][I - 1].right = True
                        J += 1
                else:
                    K = len(self.rows)
                    while J < K:
                        self.cells[J][I].left = True
                        self.cells[J][I - 1].right = True
                        J += 1
        for h in horizontal:
            #  find closest y coord
            # iterate over x coords and find closest points
            i = [i for i, t in enumerate(self.rows)
                 if np.isclose(h[1], t[0], atol=jtol)]
            j = [j for j, t in enumerate(self.cols)
                 if np.isclose(h[0], t[0], atol=jtol)]
            k = [k for k, t in enumerate(self.cols)
                 if np.isclose(h[2], t[0], atol=jtol)]
            if not j:
                self.nocont_ += 1
                continue
            J = j[0]
            if i == [0]:  # only top edge
                I = i[0]
                if k:
                    K = k[0]
                    while J < K:
                        self.cells[I][J].top = True
                        J += 1
                else:
                    K = len(self.cols)
                    while J < K:
                        self.cells[I][J].top = True
                        J += 1
            elif i == []:  # only bottom edge
                I = len(self.rows) - 1
                if k:
                    K = k[0]
                    while J < K:
                        self.cells[I][J].bottom = True
                        J += 1
                else:
                    K = len(self.cols)
                    while J < K:
                        self.cells[I][J].bottom = True
                        J += 1
            else:  # both top and bottom edges
                I = i[0]
                if k:
                    K = k[0]
                    while J < K:
                        self.cells[I][J].top = True
                        self.cells[I - 1][J].bottom = True
                        J += 1
                else:
                    K = len(self.cols)
                    while J < K:
                        self.cells[I][J].top = True
                        self.cells[I - 1][J].bottom = True
                        J += 1
        return self
    def set_spanning(self):
        """Sets a cell's spanning_h or spanning_v attribute to True
        depending on whether the cell spans/extends horizontally or
        vertically.
        """
        for r in range(len(self.rows)):
            for c in range(len(self.cols)):
                bound = self.cells[r][c].get_bounded_edges()
                if bound == 4:
                    continue
                elif bound == 3:
                    if not self.cells[r][c].left:
                        if (self.cells[r][c].right and
                                self.cells[r][c].top and
                                self.cells[r][c].bottom):
                            self.cells[r][c].spanning_h = True
                    elif not self.cells[r][c].right:
                        if (self.cells[r][c].left and
                                self.cells[r][c].top and
                                self.cells[r][c].bottom):
                            self.cells[r][c].spanning_h = True
                    elif not self.cells[r][c].top:
                        if (self.cells[r][c].left and
                                self.cells[r][c].right and
                                self.cells[r][c].bottom):
                            self.cells[r][c].spanning_v = True
                    elif not self.cells[r][c].bottom:
                        if (self.cells[r][c].left and
                                self.cells[r][c].right and
                                self.cells[r][c].top):
                            self.cells[r][c].spanning_v = True
                elif bound == 2:
                    if self.cells[r][c].left and self.cells[r][c].right:
                        if (not self.cells[r][c].top and
                                not self.cells[r][c].bottom):
                            self.cells[r][c].spanning_v = True
                    elif self.cells[r][c].top and self.cells[r][c].bottom:
                        if (not self.cells[r][c].left and
                                not self.cells[r][c].right):
                            self.cells[r][c].spanning_h = True
        return self
    def get_list(self):
        """Returns a two-dimensional list of text assigned to each
        cell.
        Returns
        -------
        ar : list
        """
        ar = []
        for r in range(len(self.rows)):
            ar.append([self.cells[r][c].get_text().strip()
                       for c in range(len(self.cols))])
        return ar
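Note on `Table.__init__` above: the grid is built row-major, one cell per (row, column) pair, with each cell's box taken from the column's x-range and the row's y-range. The sketch below mirrors that comprehension with a hypothetical `Box` namedtuple standing in for `Cell`. It assumes each row tuple is (y_top, y_bottom) and that the four arguments are (x1, y1, x2, y2) with (x1, y1) at the left-bottom corner, which is how the constructor call reads but is not confirmed by the diff.

```python
from collections import namedtuple

# Hypothetical stand-in for camelot's Cell: only the bounding-box corners.
Box = namedtuple("Box", ["x1", "y1", "x2", "y2"])


def build_grid(cols, rows):
    """Build a row-major grid of boxes from column and row ranges.

    cols: list of (x_left, x_right) tuples in increasing order.
    rows: list of (y_top, y_bottom) tuples in decreasing order (assumed).
    """
    return [[Box(c[0], r[1], c[1], r[0]) for c in cols] for r in rows]


cols = [(0, 50), (50, 120)]
rows = [(200, 180), (180, 150)]
grid = build_grid(cols, rows)
top_right = grid[0][1]  # first row, second column
```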
--- a/camelot/utils.py
+++ b/camelot/utils.py
@ -18,18 +18,47 @@ from pdfminer.layout import (LAParams, LTAnno, LTChar, LTTextLineHorizontal,
                             LTTextLineVertical)
 def setup_logging(name):
    """Sets up a logger with StreamHandler.
    Parameters
    ----------
    name : str
    Returns
    -------
    logger : logging.Logger
    """
    logger = logging.getLogger(name)
    format_string = '%(asctime)s - %(levelname)s - %(funcName)s - %(message)s'
    formatter = logging.Formatter(format_string, datefmt='%Y-%m-%dT%H:%M:%S')
    handler = logging.StreamHandler()
    handler.setLevel(logging.INFO)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    return logger
 logger = setup_logging(__name__)
 def translate(x1, x2):
    """Translates x2 by x1.
    Parameters
    ----------
    x1 : float
    x2 : float
    Returns
    -------
    x2 : float
    """
    x2 += x1
    return x2
@ -41,12 +70,12 @@ def scale(x, s):
    Parameters
    ----------
    x : float
    s : float
    Returns
    -------
    x : float
    """
    x *= s
    return x
@ -58,21 +87,17 @@ def rotate(x1, y1, x2, y2, angle):
    Parameters
    ----------
    x1 : float
    y1 : float
    x2 : float
    y2 : float
    angle : float
        Angle in radians.
    Returns
    -------
    xnew : float
    ynew : float
    """
    s = np.sin(angle)
    c = np.cos(angle)
@ -85,17 +110,16 @@ def rotate(x1, y1, x2, y2, angle):
    return xnew, ynew
-def scale_to_image(k, factors):
+def scale_pdf(k, factors):
-    """Translates and scales PDFMiner coordinates to OpenCV's coordinate
+    """Translates and scales pdf coordinate space to image
-    space.
+    coordinate space.
    Parameters
    ----------
    k : tuple
        Tuple (x1, y1, x2, y2) representing table bounding box where
-        (x1, y1) -> lt and (x2, y2) -> rb in PDFMiner's coordinate
+        (x1, y1) -> lt and (x2, y2) -> rb in PDFMiner coordinate
        space.
    factors : tuple
        Tuple (scaling_factor_x, scaling_factor_y, pdf_y) where the
        first two elements are scaling factors and pdf_y is height of
@ -105,8 +129,9 @@ def scale_to_image(k, factors):
    -------
    knew : tuple
        Tuple (x1, y1, x2, y2) representing table bounding box where
-        (x1, y1) -> lt and (x2, y2) -> rb in OpenCV's coordinate
+        (x1, y1) -> lt and (x2, y2) -> rb in OpenCV coordinate
        space.
    """
    x1, y1, x2, y2 = k
    scaling_factor_x, scaling_factor_y, pdf_y = factors
@ -118,22 +143,19 @@ def scale_to_image(k, factors):
    return knew
-def scale_to_pdf(tables, v_segments, h_segments, factors):
+def scale_image(tables, v_segments, h_segments, factors):
-    """Translates and scales OpenCV coordinates to PDFMiner's coordinate
+    """Translates and scales image coordinate space to pdf
-    space.
+    coordinate space.
    Parameters
    ----------
    tables : dict
        Dict with table boundaries as keys and list of intersections
-        in that boundary as their value.
+        in that boundary as value.
    v_segments : list
        List of vertical line segments.
    h_segments : list
        List of horizontal line segments.
    factors : tuple
        Tuple (scaling_factor_x, scaling_factor_y, img_y) where the
        first two elements are scaling factors and img_y is height of
@ -142,10 +164,9 @@ def scale_to_pdf(tables, v_segments, h_segments, factors):
    Returns
    -------
    tables_new : dict
    v_segments_new : dict
    h_segments_new : dict
    """
    scaling_factor_x, scaling_factor_y, img_y = factors
    tables_new = {}
@ -178,54 +199,26 @@ def scale_to_pdf(tables, v_segments, h_segments, factors):
    return tables_new, v_segments_new, h_segments_new
 def setup_logging(log_filepath):
    """Setup logging
    Args:
        log_filepath (string): Path to log file
    Returns:
        logging.Logger: Logger object
    """
    logger = logging.getLogger("app_logger")
    logger.setLevel(logging.DEBUG)
    # Log File Handler (Associating one log file per webservice run)
    log_file_handler = logging.FileHandler(log_filepath,
                                           mode='a',
                                           encoding='utf-8')
    log_file_handler.setLevel(logging.DEBUG)
    format_string = '%(asctime)s - %(levelname)s - %(funcName)s - %(message)s'
    formatter = logging.Formatter(format_string, datefmt='%Y-%m-%dT%H:%M:%S')
    log_file_handler.setFormatter(formatter)
    logger.addHandler(log_file_handler)
    # Stream Log Handler (For console)
    stream_log_handler = logging.StreamHandler()
    stream_log_handler.setLevel(logging.INFO)
    formatter = logging.Formatter(format_string, datefmt='%Y-%m-%dT%H:%M:%S')
    stream_log_handler.setFormatter(formatter)
    logger.addHandler(stream_log_handler)
    return logger
 def get_rotation(lttextlh, lttextlv, ltchar):
-    """Detects if text in table is vertical or not using the current
+    """Detects if text in table is rotated or not using the current
    transformation matrix (CTM) and returns its orientation.
    Parameters
    ----------
    lttextlh : list
        List of PDFMiner LTTextLineHorizontal objects.
    lttextlv : list
        List of PDFMiner LTTextLineVertical objects.
    ltchar : list
        List of PDFMiner LTChar objects.
    Returns
    -------
    rotation : string
-        {'', 'left', 'right'}
+        '' if text in table is upright, 'anticlockwise' if
-        '' if text in table is upright, 'left' if rotated 90 degree
+        rotated 90 degree anticlockwise and 'clockwise' if
-        anti-clockwise and 'right' if rotated 90 degree clockwise.
+        rotated 90 degree clockwise.
    """
    rotation = ''
    hlen = len([t for t in lttextlh if t.get_text().strip()])
@ -233,23 +226,21 @@ def get_rotation(lttextlh, lttextlv, ltchar):
    if hlen < vlen:
        clockwise = sum(t.matrix[1] < 0 and t.matrix[2] > 0 for t in ltchar)
        anticlockwise = sum(t.matrix[1] > 0 and t.matrix[2] < 0 for t in ltchar)
-        rotation = 'left' if clockwise < anticlockwise else 'right'
+        rotation = 'anticlockwise' if clockwise < anticlockwise else 'clockwise'
    return rotation
-def segments_bbox(bbox, v_segments, h_segments):
+def segments_in_bbox(bbox, v_segments, h_segments):
-    """Returns all line segments present inside a
+    """Returns all line segments present inside a bounding box.
    table's bounding box.
    Parameters
    ----------
    bbox : tuple
-        Tuple (x1, y1, x2, y2) representing table bounding box where
+        Tuple (x1, y1, x2, y2) representing a bounding box where
-        (x1, y1) -> lb and (x2, y2) -> rt in PDFMiner's coordinate space.
+        (x1, y1) -> lb and (x2, y2) -> rt in PDFMiner coordinate
-
+        space.
    v_segments : list
        List of vertical line segments.
    h_segments : list
        List of horizontal line segments.
@ -257,9 +248,9 @@ def segments_bbox(bbox, v_segments, h_segments):
    -------
    v_s : list
        List of vertical line segments that lie inside table.
    h_s : list
        List of horizontal line segments that lie inside table.
    """
    lb = (bbox[0], bbox[1])
    rt = (bbox[2], bbox[3])
@ -271,45 +262,43 @@ def segments_bbox(bbox, v_segments, h_segments):
 def text_in_bbox(bbox, text):
-    """Returns all text objects present inside a
+    """Returns all text objects present inside a bounding box.
    table's bounding box.
    Parameters
    ----------
    bbox : tuple
-        Tuple (x1, y1, x2, y2) representing table bounding box where
+        Tuple (x1, y1, x2, y2) representing a bounding box where
-        (x1, y1) -> lb and (x2, y2) -> rt in PDFMiner's coordinate space.
+        (x1, y1) -> lb and (x2, y2) -> rt in PDFMiner coordinate
-
+        space.
-    text : list
+    text : List of PDFMiner text objects.
        List of PDFMiner text objects.
    Returns
    -------
    t_bbox : list
        List of PDFMiner text objects that lie inside table.
    """
    lb = (bbox[0], bbox[1])
    rt = (bbox[2], bbox[3])
    t_bbox = [t for t in text if lb[0] - 2 <= (t.x0 + t.x1) / 2.0
-                 <= rt[0] + 2 and lb[1] - 2 <= (t.y0 + t.y1) / 2.0
+                <= rt[0] + 2 and lb[1] - 2 <= (t.y0 + t.y1) / 2.0
-                 <= rt[1] + 2]
+                <= rt[1] + 2]
    return t_bbox
-def remove_close_values(ar, mtol=2):
+def remove_close_lines(ar, line_close_tol=2):
-    """Removes values which are within a tolerance of mtol of another value
+    """Removes lines which are within a tolerance, based on their x or
-    present in list.
+    y axis projections.
    Parameters
    ----------
    ar : list
-
+    line_close_tol : int, optional (default: 2)
    mtol : int
        (optional, default: 2)
    Returns
    -------
    ret : list
    """
    ret = []
    for a in ar:
@ -317,27 +306,26 @@ def remove_close_values(ar, mtol=2):
            ret.append(a)
        else:
            temp = ret[-1]
-            if np.isclose(temp, a, atol=mtol):
+            if np.isclose(temp, a, atol=line_close_tol):
                pass
            else:
                ret.append(a)
    return ret
-def merge_close_values(ar, mtol=2):
+def merge_close_lines(ar, line_close_tol=2):
-    """Merges values which are within a tolerance of mtol by calculating
+    """Merges lines which are within a tolerance by calculating a
-    a moving mean.
+    moving mean, based on their x or y axis projections.
    Parameters
    ----------
    ar : list
-
+    line_close_tol : int, optional (default: 2)
    mtol : int
        (optional, default: 2)
    Returns
    -------
    ret : list
    """
    ret = []
    for a in ar:
@ -345,7 +333,7 @@ def merge_close_values(ar, mtol=2):
            ret.append(a)
        else:
            temp = ret[-1]
-            if np.isclose(temp, a, atol=mtol):
+            if np.isclose(temp, a, atol=line_close_tol):
                temp = (temp + a) / 2.0
                ret[-1] = temp
            else:
@ -353,22 +341,21 @@ def merge_close_values(ar, mtol=2):
    return ret
-def flag_on_size(textline, direction):
+def flag_font_size(textline, direction):
-    """Flags a super/subscript by enclosing it with <s></s>. May give
+    """Flags super/subscripts in text by enclosing them with <s></s>.
-    false positives.
+    May give false positives.
    Parameters
    ----------
    textline : list
        List of PDFMiner LTChar objects.
    direction : string
        {'horizontal', 'vertical'}
        Direction of the PDFMiner LTTextLine object.
    Returns
    -------
    fstring : string
    """
    if direction == 'horizontal':
        d = [(t.get_text(), np.round(t.height, decimals=6)) for t in textline if not isinstance(t, LTAnno)]
@ -395,33 +382,28 @@ def flag_on_size(textline, direction):
    return fstring
-def split_textline(table, textline, direction, flag_size=True):
+def split_textline(table, textline, direction, flag_size=False):
    """Splits PDFMiner LTTextLine into substrings if it spans across
    multiple rows/columns.
    Parameters
    ----------
-    table : object
+    table : camelot.core.Table
        camelot.pdf.Pdf
    textline : object
        PDFMiner LTTextLine object.
    direction : string
        {'horizontal', 'vertical'}
        Direction of the PDFMiner LTTextLine object.
-
+    flag_size : bool, optional (default: False)
    flag_size : bool
        Whether or not to highlight a substring using <s></s>
        if its size is different from rest of the string, useful for
        super and subscripts.
        (optional, default: True)
    Returns
    -------
    grouped_chars : list
        List of tuples of the form (r_idx, c_idx, text) where r_idx and
        c_idx are row/column indices and text is an LTTextLine substring.
    """
    idx = 0
    cut_text = []
@ -466,46 +448,37 @@ def split_textline(table, textline, direction, flag_size=True):
    grouped_chars = []
    for key, chars in groupby(cut_text, itemgetter(0, 1)):
        if flag_size:
-            grouped_chars.append((key[0], key[1], flag_on_size([t[2] for t in chars], direction)))
+            grouped_chars.append((key[0], key[1], flag_font_size([t[2] for t in chars], direction)))
        else:
            gchars = [t[2].get_text() for t in chars]
            grouped_chars.append((key[0], key[1], ''.join(gchars).strip('\n')))
    return grouped_chars
-def get_table_index(table, t, direction, split_text=False, flag_size=True):
+def get_table_index(table, t, direction, split_text=False, flag_size=False):
-    """Gets indices of the cell where given text object lies by
+    """Gets indices of the table cell where given text object lies by
    comparing their y and x-coordinates.
    Parameters
    ----------
-    table : object
+    table : camelot.core.Table
        camelot.table.Table
    t : object
        PDFMiner LTTextLine object.
    direction : string
        {'horizontal', 'vertical'}
        Direction of the PDFMiner LTTextLine object.
-
+    split_text : bool, optional (default: False)
    split_text : bool
        Whether or not to split a text line if it spans across
        multiple cells.
-        (optional, default: False)
+    flag_size : bool, optional (default: False)
    flag_size : bool
        Whether or not to highlight a substring using <s></s>
        if its size is different from rest of the string, useful for
        super and subscripts.
        (optional, default: True)
    Returns
    -------
    indices : list
-        List of tuples of the form (idx, text) where idx is the index
+        List of tuples of the form (r_idx, c_idx, text) where r_idx
-        of row/column and text is the an lttextline substring.
+        and c_idx are row and column indices.
    error : float
        Assignment error, percentage of text area that lies outside
        a cell.
@ -514,6 +487,7 @@ def get_table_index(table, t, direction, split_text=False, flag_size=True):
        |   [Text bounding box]
        |       |
        +-------+
    """
    r_idx, c_idx = [-1] * 2
    for r in range(len(table.rows)):
@ -528,7 +502,11 @@ def get_table_index(table, t, direction, split_text=False, flag_size=True):
                else:
                    lt_col_overlap.append(-1)
            if len(filter(lambda x: x != -1, lt_col_overlap)) == 0:
-                logging.warning("Text did not fit any column.")
+                text = t.get_text().strip('\n')
                text_range = (t.x0, t.x1)
                col_range = (table.cols[0][0], table.cols[-1][1])
                logger.info("{} {} does not lie in column range {}".format(
                    text, text_range, col_range))
            r_idx = r
            c_idx = lt_col_overlap.index(max(lt_col_overlap))
            break
@ -552,14 +530,14 @@ def get_table_index(table, t, direction, split_text=False, flag_size=True):
        return split_textline(table, t, direction, flag_size=flag_size), error
    else:
        if flag_size:
-            return [(r_idx, c_idx, flag_on_size(t._objs, direction))], error
+            return [(r_idx, c_idx, flag_font_size(t._objs, direction))], error
        else:
            return [(r_idx, c_idx, t.get_text().strip('\n'))], error
-def get_score(error_weights):
+def compute_accuracy(error_weights):
-    """Calculates score based on weights assigned to various parameters,
+    """Calculates a score based on weights assigned to various
-    and their error percentages.
+    parameters and their error percentages.
    Parameters
    ----------
@ -571,6 +549,7 @@ def get_score(error_weights):
    Returns
    -------
    score : float
    """
    SCORE_VAL = 100
    try:
@ -586,6 +565,30 @@ def get_score(error_weights):
    return score
 def compute_whitespace(d):
    """Calculates the percentage of empty strings in a
    two-dimensional list.
    Parameters
    ----------
    d : list
    Returns
    -------
    whitespace : float
        Percentage of empty cells.
    """
    whitespace = 0
    r_nempty_cells, c_nempty_cells = [], []
    for i in d:
        for j in i:
            if j.strip() == '':
                whitespace += 1
    whitespace = 100 * (whitespace / float(len(d) * len(d[0])))
    return whitespace
 def remove_empty(d):
    """Removes empty rows and columns from a two-dimensional list.
@ -596,6 +599,7 @@ def remove_empty(d):
    Returns
    -------
    d : list
    """
    for i, row in enumerate(d):
        if row == [''] * len(row):
@ -606,50 +610,8 @@ def remove_empty(d):
    return d
-def count_empty(d):
+def encode_(ar):
-    """Counts empty rows and columns in a two-dimensional list.
+    """Encodes two-dimensional list into unicode.
    Parameters
    ----------
    d : list
    Returns
    -------
    n_empty_rows : list
        Number of empty rows.
    n_empty_cols : list
        Number of empty columns.
    empty_p : float
        Percentage of empty cells.
    """
    empty_p = 0
    r_nempty_cells, c_nempty_cells = [], []
    for i in d:
        for j in i:
            if j.strip() == '':
                empty_p += 1
    empty_p = 100 * (empty_p / float(len(d) * len(d[0])))
    for row in d:
        r_nempty_c = 0
        for r in row:
            if r.strip() != '':
                r_nempty_c += 1
        r_nempty_cells.append(r_nempty_c)
    d = zip(*d)
    d = [list(col) for col in d]
    for col in d:
        c_nempty_c = 0
        for c in col:
            if c.strip() != '':
                c_nempty_c += 1
        c_nempty_cells.append(c_nempty_c)
    return empty_p, r_nempty_cells, c_nempty_cells
 def encode_list(ar):
    """Encodes list of text.
    Parameters
    ----------
@ -658,52 +620,13 @@ def encode_list(ar):
    Returns
    -------
    ar : list
    """
    ar = [[r.encode('utf-8') for r in row] for row in ar]
    return ar
-def get_text_objects(layout, ltype="char", t=None):
+def get_page_layout(filename, char_margin=1.0, line_margin=0.5, word_margin=0.1,
    """Recursively parses pdf layout to get a list of
    text objects.
    Parameters
    ----------
    layout : object
        PDFMiner LTPage object.
    ltype : string
        {'char', 'lh', 'lv'}
        Specify 'char', 'lh', 'lv' to get LTChar, LTTextLineHorizontal,
        and LTTextLineVertical objects respectively.
    t : list
    Returns
    -------
    t : list
        List of PDFMiner text objects.
    """
    if ltype == "char":
        LTObject = LTChar
    elif ltype == "lh":
        LTObject = LTTextLineHorizontal
    elif ltype == "lv":
        LTObject = LTTextLineVertical
    if t is None:
        t = []
    try:
        for obj in layout._objs:
            if isinstance(obj, LTObject):
                t.append(obj)
            else:
                t += get_text_objects(obj, ltype=ltype)
    except AttributeError:
        pass
    return t
 def get_page_layout(pname, char_margin=1.0, line_margin=0.5, word_margin=0.1,
               detect_vertical=True, all_texts=True):
    """Returns a PDFMiner LTPage object and page dimension of a single
    page pdf. See https://euske.github.io/pdfminer/ to get definitions
@ -711,28 +634,23 @@ def get_page_layout(pname, char_margin=1.0, line_margin=0.5, word_margin=0.1,
    Parameters
    ----------
-    pname : string
+    filename : string
        Path to pdf file.
    char_margin : float
    line_margin : float
    word_margin : float
    detect_vertical : bool
    all_texts : bool
    Returns
    -------
    layout : object
        PDFMiner LTPage object.
    dim : tuple
-        pdf page dimension of the form (width, height).
+        Dimension of pdf page in the form (width, height).
    """
-    with open(pname, 'r') as f:
+    with open(filename, 'r') as f:
        parser = PDFParser(f)
        document = PDFDocument(parser)
        if not document.is_extractable:
@ -754,16 +672,56 @@ def get_page_layout(pname, char_margin=1.0, line_margin=0.5, word_margin=0.1,
        return layout, dim
 def get_text_objects(layout, ltype="char", t=None):
    """Recursively parses pdf layout to get a list of
    PDFMiner text objects.
    Parameters
    ----------
    layout : object
        PDFMiner LTPage object.
    ltype : string
        Specify 'char', 'lh', 'lv' to get LTChar, LTTextLineHorizontal,
        and LTTextLineVertical objects respectively.
    t : list
    Returns
    -------
    t : list
        List of PDFMiner text objects.
    """
    if ltype == "char":
        LTObject = LTChar
    elif ltype == "lh":
        LTObject = LTTextLineHorizontal
    elif ltype == "lv":
        LTObject = LTTextLineVertical
    if t is None:
        t = []
    try:
        for obj in layout._objs:
            if isinstance(obj, LTObject):
                t.append(obj)
            else:
                t += get_text_objects(obj, ltype=ltype)
    except AttributeError:
        pass
    return t
 def merge_tuples(tuples):
    """Merges a list of overlapping tuples.
    Parameters
    ----------
    tuples : list
        List of tuples where a tuple is a single axis coordinate pair.
    Yields
    ------
    tuple
    Returns
    -------
    merged : list
    """
    merged = list(tuples[0])
    for s, e in tuples:
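
The hunk above cuts off before the body of `merge_tuples`, so the merging step described in its docstring is not visible here. As a rough illustration only, overlapping single-axis coordinate pairs can be merged with a sort followed by one pass. The helper name `merge_overlapping` is hypothetical and this is not the project's implementation.

```python
def merge_overlapping(pairs):
    """Merge overlapping (start, end) pairs into disjoint pairs.

    Hypothetical helper for illustration; not camelot's merge_tuples.
    """
    merged = []
    for start, end in sorted(pairs):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous pair: extend it in place.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # No overlap: start a new pair.
            merged.append((start, end))
    return merged


print(merge_overlapping([(1, 4), (3, 6), (8, 10), (9, 12)]))
```

Sorting first means each pair only has to be compared with the most recently merged one, which keeps the pass linear after the sort.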
--- a/debug/hough_opencv.py
+++ b/debug/hough_opencv.py
@ -1,53 +0,0 @@
 """
 usage: python hough_opencv.py file.png
 finds lines present in an image using opencv's hough transform.
 """
 import sys
 import time
 import cv2
 import numpy as np
 import matplotlib.pyplot as plt
 def timeit(func):
    def timed(*args, **kw):
        start = time.time()
        result = func(*args, **kw)
        end = time.time()
        print 'Function: %r took: %2.4f seconds' % (func.__name__, end - start)
        return result
    return timed
@timeit
 def main():
    image = cv2.imread(sys.argv[1])
    print "image dimensions -> {0}".format(image.shape)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)
    print "found {0} lines".format(len(lines))
    for line in lines:
        r, theta = line[0]
        # filter horizontal and vertical lines
        if theta == 0 or np.isclose(theta, np.pi / 2):
            x0 = r * np.cos(theta)
            y0 = r * np.sin(theta)
            x1 = int(x0 + 10000 * (-np.sin(theta)))
            y1 = int(y0 + 10000 * (np.cos(theta)))
            x2 = int(x0 - 10000 * (-np.sin(theta)))
            y2 = int(y0 - 10000 * (np.cos(theta)))
            cv2.line(image, (x1, y1), (x2, y2), (0, 0, 255), 5)
    plt.imshow(image)
    plt.show()
 if __name__ == '__main__':
    if len(sys.argv) == 1:
        print __doc__
    else:
        main()
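
The removed script converts each `(r, theta)` pair returned by the Hough transform into two far-apart endpoints before drawing. A minimal sketch of that conversion, using only the standard library, might look like the following. The function name `polar_line_endpoints` and the `reach` parameter are assumptions made for this example, and it rounds rather than truncates so that tiny floating-point errors do not shift a coordinate by one pixel.

```python
import math


def polar_line_endpoints(r, theta, reach=10000):
    """Return two points on the line x*cos(theta) + y*sin(theta) = r.

    Hypothetical helper mirroring the arithmetic in the script above.
    """
    # Point on the line closest to the origin.
    x0 = r * math.cos(theta)
    y0 = r * math.sin(theta)
    # Unit vector along the line (perpendicular to the normal).
    dx, dy = -math.sin(theta), math.cos(theta)
    p1 = (int(round(x0 + reach * dx)), int(round(y0 + reach * dy)))
    p2 = (int(round(x0 - reach * dx)), int(round(y0 - reach * dy)))
    return p1, p2


# theta = 0 is a vertical line at x = r; theta = pi/2 is a horizontal line at y = r.
print(polar_line_endpoints(50, 0.0))
print(polar_line_endpoints(20, math.pi / 2))
```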
--- a/debug/hough_skimage.py
+++ b/debug/hough_skimage.py
@ -1,75 +0,0 @@
 """
 usage: python hough_skimage.py file.png
 finds lines present in an image using scikit-image's hough transform.
 """
 import sys
 import time
 import cv2
 import numpy as np
 from scipy.misc import imread
 import matplotlib.pyplot as plt
 from skimage.transform import hough_line, hough_line_peaks
 def timeit(func):
    def timed(*args, **kw):
        start = time.time()
        result = func(*args, **kw)
        end = time.time()
        print 'Function: %r took: %2.4f seconds' % (func.__name__, end - start)
        return result
    return timed
@timeit
 def main():
    image = cv2.imread(sys.argv[1])
    print "image dimensions -> {0}".format(image.shape)
    ret, binary = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    binary = np.min(binary, axis=2)
    binary = np.where(binary == 255, 0, 255)
    rows, cols = binary.shape
    pixel = np.zeros(binary.shape)
    fig, ax = plt.subplots(1, 1, figsize=(8,4))
    ax.imshow(image, cmap=plt.cm.gray)
    theta_in = np.linspace(0, np.pi / 2, 10)
    h, theta, d = hough_line(binary, theta_in)
    for _, angle, dist in zip(*hough_line_peaks(h, theta, d)):
        x0 = dist * np.cos(angle)
        y0 = dist * np.sin(angle)
        x1 = int(x0 + 1000 * (-np.sin(angle)))
        y1 = int(y0 + 1000 * (np.cos(angle)))
        x2 = int(x0 - 1000 * (-np.sin(angle)))
        y2 = int(y0 - 1000 * (np.cos(angle)))
        ax.plot((x1, x2), (y1, y2), '-r')
        a = np.cos(angle)
        b = np.sin(angle)
        x = np.arange(binary.shape[1])
        y = np.arange(binary.shape[0])
        x = a * x
        y = b * y
        R = np.round(np.add(y.reshape((binary.shape[0], 1)), x.reshape((1, binary.shape[1]))))
        pixel += np.isclose(R, np.round(dist))
    pixel = np.clip(pixel, 0, 1)
    pixel = np.where(pixel == 1, 0, 1)
    binary = np.where(binary == 0, 255, 0)
    binary *= pixel.astype(np.int64)
    ax.imshow(binary, cmap=plt.cm.gray)
    ax.axis((0, cols, rows, 0))
    ax.set_title('Detected lines')
    ax.set_axis_off()
    ax.set_adjustable('box-forced')
    plt.show()
 if __name__ == '__main__':
    if len(sys.argv) == 1:
        print __doc__
    else:
        main()
--- a/debug/houghp_skimage.py
+++ b/debug/houghp_skimage.py
@ -1,49 +0,0 @@
 """
 usage: python houghp_skimage.py file.png
 finds lines present in an image using scikit-image's probabilistic hough transform.
 """
 import sys
 import time
 from scipy.misc import imread
 import matplotlib.pyplot as plt
 from skimage.feature import canny
 from skimage.transform import probabilistic_hough_line
 def timeit(func):
    def timed(*args, **kw):
        start = time.time()
        result = func(*args, **kw)
        end = time.time()
        print 'Function: %r took: %2.4f seconds' % (func.__name__, end - start)
        return result
    return timed
@timeit
 def main():
    image = imread(sys.argv[1], mode='L')
    edges = canny(image, 2, 1, 25)
    lines = probabilistic_hough_line(edges, threshold=1000)
    fig, ax = plt.subplots(1, 1, figsize=(8,4), sharex=True, sharey=True)
    ax.imshow(edges * 0)
    for line in lines:
        p0, p1 = line
        ax.plot((p0[0], p1[0]), (p0[1], p1[1]))
    ax.set_title('Probabilistic Hough')
    ax.set_axis_off()
    ax.set_adjustable('box-forced')
    plt.show()
 if __name__ == '__main__':
    if len(sys.argv) == 1:
        print __doc__
    else:
        main()
--- a/debug/morph_transform.py
+++ b/debug/morph_transform.py
@ -1,114 +0,0 @@
 """
 usage: python morph_transform.py file.png scale={int} invert={bool}
 finds lines present in an image using opencv's morph transform.
 """
 import sys
 import time
 import cv2
 import numpy as np
 import matplotlib.pyplot as plt
 def timeit(func):
    def timed(*args, **kw):
        start = time.time()
        result = func(*args, **kw)
        end = time.time()
        print 'Function: %r took: %2.4f seconds' % (func.__name__, end - start)
        return result
    return timed
 def mt(imagename, scale=40, invert=False):
    img = cv2.imread(imagename)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    if invert:
        threshold = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 15, -2)
    else:
        threshold = cv2.adaptiveThreshold(np.invert(gray), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 15, -2)
    vertical = threshold
    horizontal = threshold
    verticalsize = vertical.shape[0] / scale
    horizontalsize = horizontal.shape[1] / scale
    ver = cv2.getStructuringElement(cv2.MORPH_RECT, (1, verticalsize))
    hor = cv2.getStructuringElement(cv2.MORPH_RECT, (horizontalsize, 1))
    vertical = cv2.erode(vertical, ver, (-1, -1))
    vertical = cv2.dilate(vertical, ver, (-1, -1))
    horizontal = cv2.erode(horizontal, hor, (-1, -1))
    horizontal = cv2.dilate(horizontal, hor, (-1, -1))
    mask = vertical + horizontal
    joints = np.bitwise_and(vertical, horizontal)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = sorted(contours, key=cv2.contourArea, reverse=True)[:10]
    tables = {}
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        x1, x2 = x, x + w
        y1, y2 = y, y + h
        # find number of non-zero values in joints using what boundingRect returns
        roi = joints[y:y+h, x:x+w]
        jc, _ = cv2.findContours(roi, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
        if len(jc) <= 4:  # skip contours with 4 or fewer joints
            continue
        joint_coords = []
        for j in jc:
            jx, jy, jw, jh = cv2.boundingRect(j)
            c1, c2 = x + (2*jx + jw) / 2, y + (2*jy + jh) / 2
            joint_coords.append((c1, c2))
        tables[(x1, y2, x2, y1)] = joint_coords
    vcontours, _ = cv2.findContours(vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for vc in vcontours:
        x, y, w, h = cv2.boundingRect(vc)
        x1, x2 = x, x + w
        y1, y2 = y, y + h
        plt.plot([(x1 + x2) / 2, (x1 + x2) / 2], [y2, y1])
    hcontours, _ = cv2.findContours(horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for hc in hcontours:
        x, y, w, h = cv2.boundingRect(hc)
        x1, x2 = x, x + w
        y1, y2 = y, y + h
        plt.plot([x1, x2], [(y1 + y2) / 2, (y1 + y2) / 2])
    x_coord = []
    y_coord = []
    for k in tables.keys():
        for coord in tables[k]:
            x_coord.append(coord[0])
            y_coord.append(coord[1])
    plt.plot(x_coord, y_coord, 'ro')
    plt.imshow(img)
    plt.show()
    return tables
@timeit
 def main():
    try:
        scale = int(sys.argv[2].split('=')[1])
    except IndexError:
        scale = 40
    try:
        # bool('False') is True, so compare the text of the argument instead
        invert = sys.argv[3].split('=')[1].lower() in ('true', '1', 'yes')
    except IndexError:
        invert = False
    t = mt(sys.argv[1], scale=scale, invert=invert)
    print 'tables found: ', len(t.keys())
 if __name__ == '__main__':
    if len(sys.argv) == 1:
        print __doc__
    else:
        main()
--- a/debug/plot_geo.py
+++ b/debug/plot_geo.py
@ -1,167 +0,0 @@
 """
 usage:  python plot_geo.py file.pdf
        python plot_geo.py file.pdf file.png
 plots lines and rectangles present in a pdf file.
 """
 import sys
 import time
 import cv2
 import numpy as np
 import matplotlib.pyplot as plt
 import matplotlib.patches as patches
 from pdfminer.pdfpage import PDFPage
 from pdfminer.pdfdevice import PDFDevice
 from pdfminer.pdfparser import PDFParser
 from pdfminer.pdfdocument import PDFDocument
 from pdfminer.converter import PDFPageAggregator
 from pdfminer.pdfinterp import PDFResourceManager
 from pdfminer.pdfinterp import PDFPageInterpreter
 from pdfminer.layout import LAParams, LTLine, LTRect
 from pdfminer.pdfpage import PDFTextExtractionNotAllowed
 MIN_LENGTH = 1
 pdf_x, pdf_y, image_x, image_y = [0] * 4
 def timeit(func):
    def timed(*args, **kw):
        start = time.time()
        result = func(*args, **kw)
        end = time.time()
        print 'Function: %r took: %2.4f seconds' % (func.__name__, end - start)
        return result
    return timed
 def remove_coords(coords):
    merged = []
    for coord in coords:
        if not merged:
            merged.append(coord)
        else:
            last = merged[-1]
            if np.isclose(last, coord, atol=2):
                pass
            else:
                merged.append(coord)
    return merged
 def parse_layout(pdfname):
    global pdf_x, pdf_y
    def is_horizontal(line):
        # line is (x0, y0, x1, y1); a horizontal line keeps the same y
        if line[1] == line[3]:
            return True
        return False
    def is_vertical(line):
        # a vertical line keeps the same x
        if line[0] == line[2]:
            return True
        return False
    vertical, horizontal = [], []
    with open(pdfname, 'rb') as f:
        parser = PDFParser(f)
        document = PDFDocument(parser)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        laparams = LAParams()
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            layout = device.get_result()
            pdf_x, pdf_y = layout.bbox[2], layout.bbox[3]
            for obj in layout._objs:
                if isinstance(obj, LTLine):
                    line = (obj.x0, obj.y0, obj.x1, obj.y1)
                    if is_vertical(line):
                        vertical.append(line)
                    elif is_horizontal(line):
                        horizontal.append(line)
                elif isinstance(obj, LTRect):
                    vertical.append((obj.x0, obj.y1, obj.x0, obj.y0))
                    vertical.append((obj.x1, obj.y1, obj.x1, obj.y0))
                    horizontal.append((obj.x0, obj.y1, obj.x1, obj.y1))
                    horizontal.append((obj.x0, obj.y0, obj.x1, obj.y0))
    return vertical, horizontal
 def hough_transform(imagename):
    global pdf_x, pdf_y, image_x, image_y
    img = cv2.imread(imagename)
    image_x, image_y = img.shape[1], img.shape[0]
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)
    lines = cv2.HoughLines(edges, 1, np.pi/180, 1000)
    x = []
    for line in lines:
        r, theta = line[0]
        x0 = r * np.cos(theta)
        x0 *= pdf_x / float(image_x)
        x.append(x0)
    y = []
    for line in lines:
        r, theta = line[0]
        y0 = r * np.sin(theta)
        y0 = abs(y0 - image_y)
        y0 *= pdf_y / float(image_y)
        y.append(y0)
    x = remove_coords(sorted(set([x0 for x0 in x if x0 > 0])))
    y = remove_coords(sorted(set(y), reverse=True))
    return x, y
 def plot_lines1(vertical, horizontal):
    fig = plt.figure()
    ax = fig.add_subplot(111, aspect='equal')
    ax.set_xlim(0, 1000)
    ax.set_ylim(0, 1000)
    vertical = filter(lambda x: abs(x[1] - x[3]) > MIN_LENGTH, vertical)
    horizontal = filter(lambda x: abs(x[0] - x[2]) > MIN_LENGTH, horizontal)
    for v in vertical:
        ax.plot([v[0], v[2]], [v[1], v[3]])
    for h in horizontal:
        ax.plot([h[0], h[2]], [h[1], h[3]])
    plt.show()
 def plot_lines2(imagename, vertical, horizontal):
    x, y = hough_transform(imagename)
    fig = plt.figure()
    ax = fig.add_subplot(111, aspect='equal')
    ax.set_xlim(0, 1000)
    ax.set_ylim(0, 1000)
    for x0 in x:
        for v in vertical:
            if np.isclose(x0, v[0], atol=2):
                ax.plot([v[0], v[2]], [v[1], v[3]])
    for y0 in y:
        for h in horizontal:
            if np.isclose(y0, h[1], atol=2):
                ax.plot([h[0], h[2]], [h[1], h[3]])
    plt.show()
@timeit
 def main():
    vertical, horizontal = parse_layout(sys.argv[1])
    if len(sys.argv) == 2:
        plot_lines1(vertical, horizontal)
    elif len(sys.argv) == 3:
        plot_lines1(vertical, horizontal)
        plot_lines2(sys.argv[2], vertical, horizontal)
 if __name__ == '__main__':
    if len(sys.argv) == 1:
        print __doc__
    else:
        main()
--- a/debug/plot_intensity.py
+++ b/debug/plot_intensity.py
@ -1,69 +0,0 @@
 """
 usage: python plot_intensity.py file.png threshold
 plots sum of pixel intensities on both axes for an image.
 """
 import sys
 import time
 from itertools import groupby
 from operator import itemgetter
 import cv2
 import numpy as np
 import matplotlib.pyplot as plt
 from pylab import barh
 def timeit(func):
    def timed(*args, **kw):
        start = time.time()
        result = func(*args, **kw)
        end = time.time()
        print 'Function: %r took: %2.4f seconds' % (func.__name__, end - start)
        return result
    return timed
 def plot_barchart(ar):
    n = len(ar)
    ind = np.arange(n)
    width = 0.35
    plt.bar(ind, ar, width, color='r', zorder=1)
    plt.show()
 def merge_lines(lines):
    ranges = []
    for k, g in groupby(enumerate(lines), lambda (i, x): i-x):
        group = map(itemgetter(1), g)
        ranges.append((group[0], group[-1]))
    merged = []
    for r in ranges:
        merged.append((r[0] + r[1]) / 2)
    return merged
 def plot_lines(image, lines):
    for y in lines:
        plt.plot([0, image.shape[1]], [y, y])
    plt.imshow(image)
    plt.show()
@timeit
 def main():
    image = cv2.imread(sys.argv[1])
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    threshold = cv2.adaptiveThreshold(np.invert(gray), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 15, -2)
    y_proj = np.sum(threshold, axis=1)
    line_threshold = int(sys.argv[2])
    lines = np.where(y_proj < line_threshold)[0]
    lines = merge_lines(lines)
    plot_lines(image, lines)
 if __name__ == '__main__':
    if len(sys.argv) == 1:
        print __doc__
    else:
        main()
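
`merge_lines` in the removed script relies on a tuple-unpacking lambda, which only works on Python 2. The same grouping idea, collapsing runs of consecutive row indices into one midpoint per run, can be sketched in a form that also runs on Python 3. The name `merge_consecutive` is hypothetical, and the integer division is an assumption chosen to match Python 2's behaviour on integers.

```python
from itertools import groupby


def merge_consecutive(indices):
    """Collapse runs of consecutive integers into their midpoints.

    Hypothetical Python 3 friendly rewrite of the grouping idea above.
    """
    midpoints = []
    # Within a run of consecutive values, (position - value) stays constant,
    # so it works as the grouping key.
    for _, run in groupby(enumerate(indices), key=lambda pair: pair[0] - pair[1]):
        values = [value for _, value in run]
        midpoints.append((values[0] + values[-1]) // 2)
    return midpoints


print(merge_consecutive([3, 4, 5, 20, 21, 40]))
```

The input is expected to be sorted in increasing order, as the output of `np.where(...)[0]` is.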
--- a/debug/print_text.py
+++ b/debug/print_text.py
@ -1,83 +0,0 @@
 """
 usage: python print_text.py file.pdf
 prints horizontal and vertical text lines present in a pdf file.
 """
 import sys
 import time
 from pprint import pprint
 from pdfminer.layout import LAParams
 from pdfminer.pdfpage import PDFPage
 from pdfminer.pdfdevice import PDFDevice
 from pdfminer.pdfparser import PDFParser
 from pdfminer.pdfdocument import PDFDocument
 from pdfminer.converter import PDFPageAggregator
 from pdfminer.pdfinterp import PDFPageInterpreter
 from pdfminer.pdfinterp import PDFResourceManager
 from pdfminer.pdfpage import PDFTextExtractionNotAllowed
 from pdfminer.layout import (LAParams, LTChar, LTAnno, LTTextBoxHorizontal,
                             LTTextLineHorizontal, LTTextLineVertical, LTLine)
 def timeit(func):
    def timed(*args, **kw):
        start = time.time()
        result = func(*args, **kw)
        end = time.time()
        print 'Function: %r took: %2.4f seconds' % (func.__name__, end - start)
        return result
    return timed
 def extract_text_objects(layout, LTObject, t=None):
    if t is None:
        t = []
    try:
        for obj in layout._objs:
            if isinstance(obj, LTObject):
                t.append(obj)
            else:
                t += extract_text_objects(obj, LTObject)
    except AttributeError:
        pass
    return t
@timeit
 def main():
    with open(sys.argv[1], 'rb') as f:
        parser = PDFParser(f)
        document = PDFDocument(parser)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        # 2.0, 0.5, 0.1
        kwargs = {
            'char_margin': 1.0,
            'line_margin': 0.5,
            'word_margin': 0.1,
            'detect_vertical': True
        }
        laparams = LAParams(**kwargs)
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            layout = device.get_result()
            lh = extract_text_objects(layout, LTTextLineHorizontal)
            lv = extract_text_objects(layout, LTTextLineVertical)
            print "number of horizontal text lines -> {0}".format(len(lh))
            print "horizontal text lines ->"
            pprint([t.get_text() for t in lh])
            print "number of vertical text lines -> {0}".format(len(lv))
            print "vertical text lines ->"
            pprint([t.get_text() for t in lv])
 if __name__ == '__main__':
    if len(sys.argv) == 1:
        print __doc__
    else:
        main()
--- a/debug/threshold.py
+++ b/debug/threshold.py
@ -1,41 +0,0 @@
 """
 usage: python threshold.py file.png blocksize threshold_constant
 shows thresholded image.
 """
 import sys
 import time
 import cv2
 import numpy as np
 import matplotlib.pyplot as plt
 def timeit(func):
    def timed(*args, **kw):
        start = time.time()
        result = func(*args, **kw)
        end = time.time()
        print 'Function: %r took: %2.4f seconds' % (func.__name__, end - start)
        return result
    return timed
@timeit
 def main():
    img = cv2.imread(sys.argv[1])
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    blocksize = int(sys.argv[2])
    threshold_constant = float(sys.argv[3])
    threshold = cv2.adaptiveThreshold(np.invert(gray), 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, blocksize, threshold_constant)
    plt.imshow(threshold, cmap='gray')
    plt.show()
 if __name__ == '__main__':
    if len(sys.argv) == 1:
        print __doc__
    else:
        main()
--- a/docs/api.rst
+++ b/docs/api.rst
@ -4,17 +4,37 @@
 API Reference
 =============
-Pdf
+camelot.read_pdf
-===
+================
-.. automodule:: camelot.pdf
+.. automodule:: camelot.read_pdf
   :members:
-Lattice
+camelot.handlers.PDFHandler
-=======
+===========================
-.. automodule:: camelot.lattice
+.. automodule:: camelot.handlers.PDFHandler
   :members:
-Stream
+camelot.parsers.Stream
-======
+======================
-.. automodule:: camelot.stream
+.. automodule:: camelot.parsers.Stream
   :members:
 camelot.parsers.Lattice
 =======================
 .. automodule:: camelot.parsers.Lattice
   :members:
 camelot.core.Cell
 =================
 .. automodule:: camelot.core.Cell
   :members:
 camelot.core.Table
 ==================
 .. automodule:: camelot.core.Table
   :members:
 camelot.core.TableList
 ======================
 .. automodule:: camelot.core.TableList
   :members:
--- a/docs/index.rst
+++ b/docs/index.rst
@ -3,11 +3,11 @@
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.
-==================================
+=====================================
-Camelot: pdf parsing made simpler!
+Camelot: PDF Table Parsing for Humans
-==================================
+=====================================
-Camelot is a Python 2.7 library and command-line tool for getting tables out of pdf files.
+Camelot is a Python 2.7 library and command-line tool for extracting tabular data from PDF files.
 Why another pdf table parsing library?
 ======================================
@ -32,12 +32,22 @@ Usage
 ::
-    >>> from camelot.pdf import Pdf
+    >>> import camelot
-    >>> from camelot.lattice import Lattice
+    >>> tables = camelot.read_pdf("foo.pdf")
-
+    >>> tables
-    >>> manager = Pdf(Lattice(), 'us-030.pdf')
+    <TableList n=2>
-    >>> tables = manager.extract()
+    >>> tables.export("foo.csv", f="csv", compress=True) # json, excel, html
-    >>> print tables['page-1']['table-1']['data']
+    >>> tables[0]
    <Table shape=(3,4)>
    >>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
    >>> tables[0].parsing_report
    {
        "accuracy": 96,
        "whitespace": 80,
        "order": 1,
        "page": 1
    }
    >>> df = tables[0].df
 .. csv-table::
   :header: "Cycle Name","KI (1/km)","Distance (mi)","Percent Fuel Savings","","",""
@ -49,45 +59,6 @@ Usage
   "2032_2","0.17","57.8","21.7%","0.3%","2.7%","1.2%"
   "4171_1","0.07","173.9","58.1%","1.6%","2.1%","0.5%"
 Camelot comes with a CLI where you can specify page numbers, output format, output directory etc. By default, the output files are placed in the same directory as the PDF.
 ::
    Camelot: PDF parsing made simpler!
    usage:
     camelot [options] <method> [<args>...]
    options:
     -h, --help                Show this screen.
     -v, --version             Show version.
     -V, --verbose             Verbose.
     -p, --pages <pageno>      Comma-separated list of page numbers.
                               Example: -p 1,3-6,10  [default: 1]
     -P, --parallel            Parallelize the parsing process.
     -f, --format <format>     Output format. (csv,tsv,html,json,xlsx) [default: csv]
     -l, --log                 Log to file.
     -o, --output <directory>  Output directory.
     -M, --cmargin <cmargin>   Char margin. Chars closer than cmargin are
                               grouped together to form a word. [default: 1.0]
     -L, --lmargin <lmargin>   Line margin. Lines closer than lmargin are
                               grouped together to form a textbox. [default: 0.5]
     -W, --wmargin <wmargin>   Word margin. Insert blank spaces between chars
                               if distance between words is greater than word
                               margin. [default: 0.1]
     -J, --split_text          Split text lines if they span across multiple cells.
     -K, --flag_size           Flag substring if its size differs from the whole string.
                               Useful for super and subscripts.
     -X, --print-stats         List stats on the parsing process.
     -Y, --save-stats          Save stats to a file.
     -Z, --plot <dist>         Plot distributions. (page,all,rc)
    camelot methods:
     lattice  Looks for lines between data.
     stream   Looks for spaces between data.
    See 'camelot <method> -h' for more information on a specific method.
 Installation
 ============
@ -95,42 +66,41 @@ Make sure you have the most updated versions for `pip` and `setuptools`. You can
    pip install -U pip setuptools
-The required dependencies include `numpy`_, `OpenCV`_ and `ImageMagick`_.
+The dependencies include `tk`_ and `ghostscript`_.
-.. _numpy: http://www.numpy.org/
+.. _tk: https://wiki.tcl.tk/3743
-.. _OpenCV: http://opencv.org/
+.. _ghostscript: https://www.ghostscript.com/
 .. _ImageMagick: http://www.imagemagick.org/script/index.php
 Installing dependencies
 -----------------------
-numpy can be install using `pip`. OpenCV and imagemagick can be installed using your system's default package manager.
+tk and ghostscript can be installed using your system's default package manager.
 Linux
 ^^^^^
 * Arch Linux
 ::
    sudo pacman -S opencv imagemagick
 * Ubuntu
 ::
-    sudo apt-get install libopencv-dev python-opencv imagemagick
+    sudo apt-get install python-opencv python-tk ghostscript
 * Arch Linux
 ::
    sudo pacman -S opencv tk ghostscript
 OS X
 ^^^^
 ::
-    brew install homebrew/science/opencv imagemagick
+    brew install homebrew/science/opencv ghostscript
 Finally, `cd` into the project directory and install by::
-    make install
+    python setup.py install
 API Reference
 =============
@ -150,14 +120,14 @@ You can check the latest sources with the command::
 Contributing
 ------------
-See :doc:`Contributing doc <contributing>`.
+See :doc:`Contributing guidelines <contributing>`.
 Testing
 -------
 ::
-    make test
+    python setup.py test
 License
 =======
--- a/examples/demo_lattice.py
+++ b/examples/demo_lattice.py
@ -1,11 +0,0 @@
 from camelot import Pdf
 from camelot import Lattice
 extractor = Lattice(Pdf("files/column_span_1.pdf", clean=True), scale=30)
 tables = extractor.get_tables()
 print tables
 extractor = Lattice(Pdf("files/column_span_2.pdf", clean=True), scale=30)
 tables = extractor.get_tables()
 print tables
--- a/examples/demo_lattice_fill.py
+++ b/examples/demo_lattice_fill.py
@ -1,13 +0,0 @@
 from camelot import Pdf
 from camelot import Lattice
 extractor = Lattice(
    Pdf("files/row_span_1.pdf", clean=True), fill='v', scale=40)
 tables = extractor.get_tables()
 print tables
 extractor = Lattice(
    Pdf("files/row_span_2.pdf", clean=True), fill='v', scale=30)
 tables = extractor.get_tables()
 print tables
--- a/examples/demo_lattice_invert.py
+++ b/examples/demo_lattice_invert.py
@ -1,13 +0,0 @@
 from camelot import Pdf
 from camelot import Lattice
 extractor = Lattice(Pdf("files/lines_in_background_1.pdf",
                        clean=True), scale=30, invert=True)
 tables = extractor.get_tables()
 print tables
 extractor = Lattice(Pdf("files/lines_in_background_2.pdf",
                        clean=True), scale=30, invert=True)
 tables = extractor.get_tables()
 print tables
--- a/examples/demo_lattice_rotation.py
+++ b/examples/demo_lattice_rotation.py
@ -1,11 +0,0 @@
 from camelot import Pdf
 from camelot import Lattice
 extractor = Lattice(Pdf("files/left_rotated_table.pdf", clean=True), scale=30)
 tables = extractor.get_tables()
 print tables
 extractor = Lattice(Pdf("files/right_rotated_table.pdf", clean=True), scale=30)
 tables = extractor.get_tables()
 print tables
--- a/examples/demo_lattice_twotables.py
+++ b/examples/demo_lattice_twotables.py
@ -1,11 +0,0 @@
 from camelot import Pdf
 from camelot import Lattice
 extractor = Lattice(Pdf("files/twotables_1.pdf", clean=True), scale=40)
 tables = extractor.get_tables()
 print tables
 extractor = Lattice(Pdf("files/twotables_2.pdf", clean=True), scale=30)
 tables = extractor.get_tables()
 print tables
--- a/examples/demo_stream.py
+++ b/examples/demo_stream.py
@ -1,8 +0,0 @@
 from camelot import Pdf
 from camelot import Stream
 extractor = Stream(Pdf("files/budget_2014-15.pdf",
                       char_margin=1.0, clean=True))
 tables = extractor.get_tables()
 print tables
--- a/examples/demo_stream_columns.py
+++ b/examples/demo_stream_columns.py
@ -1,13 +0,0 @@
 from camelot import Pdf
 from camelot import Stream
 extractor = Stream(Pdf("files/inconsistent_rows.pdf", char_margin=1.0),
                   columns="65,95,285,640,715,780", ytol=10)
 tables = extractor.get_tables()
 print tables
 extractor = Stream(Pdf("files/consistent_rows.pdf", char_margin=1.0),
                   columns="28,67,180,230,425,475,700", ytol=5)
 tables = extractor.get_tables()
 print tables
--- a/examples/files/consistent_rows.pdf
+++ b/examples/files/consistent_rows.pdf
--- a/examples/files/inconsistent_rows.pdf
+++ b/examples/files/inconsistent_rows.pdf
--- a/requirements-dev.txt
+++ b/requirements-dev.txt
@ -0,0 +1,11 @@
 click==6.7
 matplotlib==2.2.3
 numpy==1.13.3
 opencv-python==3.4.2.17
 pandas==0.23.4
 pdfminer==20140328
 Pillow==5.2.0
 PyPDF2==1.26.0
 pytest==3.8.0
 pytest-runner==4.2
 Sphinx==1.8.0b1
--- a/requirements.txt
+++ b/requirements.txt
@ -1,9 +1,8 @@
-docopt
+click==6.7
-matplotlib
+matplotlib==2.2.3
-nose
+numpy==1.13.3
-pdfminer
+opencv-python==3.4.2.17
-pyexcel-xlsx
+pandas==0.23.4
-Pillow
+pdfminer==20140328
-pyocr
+Pillow==5.2.0
-PyPDF2
+PyPDF2==1.26.0
 Sphinx
--- a/setup.cfg
+++ b/setup.cfg
@ -0,0 +1,6 @@
 [aliases]
 test=pytest
 [tool:pytest]
 addopts = --verbose
 python_files = tests/test_*.py
--- a/setup.py
+++ b/setup.py
@ -4,12 +4,12 @@ import camelot
 NAME = 'camelot'
 VERSION = camelot.__version__
-DESCRIPTION = 'camelot parses tables from PDFs!'
+DESCRIPTION = 'PDF Table Parsing for Humans'
 with open('README.md') as f:
    LONG_DESCRIPTION = f.read()
 URL = 'https://github.com/socialcopsdev/camelot'
 AUTHOR = 'Vinayak Mehta'
-AUTHOR_EMAIL = 'vinayak@socialcops.com'
+AUTHOR_EMAIL = 'vmehta94@gmail.com'
 LICENSE = 'BSD License'
 opencv_min_version = '2.4.8'
@ -48,10 +48,8 @@ def setup_package():
                    author=AUTHOR,
                    author_email=AUTHOR_EMAIL,
                    license=LICENSE,
                    keywords='parse scrape pdf table',
                    packages=['camelot'],
-                    install_requires=reqs,
+                    install_requires=reqs)
                    scripts=['tools/camelot'])
    try:
        from setuptools import setup
@ -60,18 +58,14 @@ def setup_package():
    opencv_status = get_opencv_status()
    opencv_req_str = "camelot requires OpenCV >= {0}.\n".format(opencv_min_version)
    instructions = ("Installation instructions are available in the README at "
                    "https://github.com/socialcopsdev/camelot")
    if opencv_status['up_to_date'] is False:
        if opencv_status['version']:
-            raise ImportError("Your installation of OpenCV "
+            raise ImportError("Your installation of OpenCV {} is out-of-date.\n{}"
-                              "{0} is out-of-date.\n{1}{2}"
+                              .format(opencv_status['version'], opencv_req_str))
                              .format(opencv_status['version'],
                                      opencv_req_str, instructions))
        else:
-            raise ImportError("OpenCV is not installed.\n{0}{1}"
+            raise ImportError("OpenCV is not installed.\n{}"
-                              .format(opencv_req_str, instructions))
+                              .format(opencv_req_str))
    setup(**metadata)
--- a/tests/budget_2014-15.pdf
+++ b/tests/budget_2014-15.pdf
--- a/tests/column_span_1.pdf
+++ b/tests/column_span_1.pdf
--- a/tests/column_span_2.pdf
+++ b/tests/column_span_2.pdf
--- a/tests/files/agstat.pdf
+++ b/tests/files/agstat.pdf
--- a/tests/files/anticlockwise_table_1.pdf
+++ b/tests/files/anticlockwise_table_1.pdf
--- a/tests/files/anticlockwise_table_2.pdf
+++ b/tests/files/anticlockwise_table_2.pdf
--- a/tests/files/assam.pdf
+++ b/tests/files/assam.pdf
--- a/examples/files/lines_in_background_1.pdf
+++ b/examples/files/lines_in_background_1.pdf
--- a/examples/files/lines_in_background_2.pdf
+++ b/examples/files/lines_in_background_2.pdf
--- a/examples/files/budget_2014-15.pdf
+++ b/examples/files/budget_2014-15.pdf
--- a/examples/files/right_rotated_table.pdf
+++ b/examples/files/right_rotated_table.pdf
--- a/tests/files/clockwise_table_2.pdf
+++ b/tests/files/clockwise_table_2.pdf
--- a/examples/files/column_span_1.pdf
+++ b/examples/files/column_span_1.pdf
--- a/examples/files/column_span_2.pdf
+++ b/examples/files/column_span_2.pdf
--- a/tests/files/district_health.pdf
+++ b/tests/files/district_health.pdf
--- a/tests/files/electoral_roll.pdf
+++ b/tests/files/electoral_roll.pdf
--- a/tests/files/health.pdf
+++ b/tests/files/health.pdf
--- a/tests/files/medicine.pdf
+++ b/tests/files/medicine.pdf
--- a/tests/files/mexican_towns.pdf
+++ b/tests/files/mexican_towns.pdf
--- a/examples/files/missing_values.pdf
+++ b/examples/files/missing_values.pdf
--- a/tests/files/population_growth.pdf
+++ b/tests/files/population_growth.pdf
--- a/tests/files/rainfall_distribution.pdf
+++ b/tests/files/rainfall_distribution.pdf
--- a/examples/files/row_span_1.pdf
+++ b/examples/files/row_span_1.pdf
--- a/examples/files/row_span_2.pdf
+++ b/examples/files/row_span_2.pdf
--- a/tests/files/row_span_3.pdf
+++ b/tests/files/row_span_3.pdf
--- a/tests/files/tableception.pdf
+++ b/tests/files/tableception.pdf
--- a/tests/tabula_test_pdfs/12s0324.pdf
+++ b/tests/tabula_test_pdfs/12s0324.pdf
--- a/tests/tabula_test_pdfs/20.pdf
+++ b/tests/tabula_test_pdfs/20.pdf
--- a/tests/tabula_test_pdfs/S2MNCEbirdisland.pdf
+++ b/tests/tabula_test_pdfs/S2MNCEbirdisland.pdf
--- a/tests/tabula_test_pdfs/arabic.pdf
+++ b/tests/tabula_test_pdfs/arabic.pdf
--- a/tests/tabula_test_pdfs/argentina_diputados_voting_record.pdf
+++ b/tests/tabula_test_pdfs/argentina_diputados_voting_record.pdf
--- a/tests/tabula_test_pdfs/campaign_donors.pdf
+++ b/tests/tabula_test_pdfs/campaign_donors.pdf
--- a/tests/tabula_test_pdfs/china.pdf
+++ b/tests/tabula_test_pdfs/china.pdf
--- a/tests/tabula_test_pdfs/eu-002.pdf
+++ b/tests/tabula_test_pdfs/eu-002.pdf
--- a/tests/tabula_test_pdfs/eu-017.pdf
+++ b/tests/tabula_test_pdfs/eu-017.pdf
--- a/tests/tabula_test_pdfs/failing_sort.pdf
+++ b/tests/tabula_test_pdfs/failing_sort.pdf
--- a/tests/tabula_test_pdfs/frx_2012_disclosure.pdf
+++ b/tests/tabula_test_pdfs/frx_2012_disclosure.pdf
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-001-reg.xml
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-001-reg.xml
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-001-str.xml
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-001-str.xml
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-001.json
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-001.json
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-001.pdf
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-001.pdf
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-002-reg.xml
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-002-reg.xml
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-002-str.xml
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-002-str.xml
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-002.json
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-002.json
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-002.pdf
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-002.pdf
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-003-reg.xml
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-003-reg.xml
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-003-str.xml
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-003-str.xml
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-003.json
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-003.json
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-003.pdf
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-003.pdf
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-004-reg.xml
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-004-reg.xml
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-004-str.xml
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-004-str.xml
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-004.json
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-004.json
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-004.pdf
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-004.pdf
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-005-reg.xml
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-005-reg.xml
--- a/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-005-str.xml
+++ b/tests/tabula_test_pdfs/icdar2013-dataset/competition-dataset-eu/eu-005-str.xml
--- a/Show More
+++ b/Show More
@ -0,0 +1,2 @@
+from .stream import Stream
+from .lattice import Lattice