Update docs

* Update README
* Update index.rst
* Update docstrings
* Fix typo
* Edit docs
* Add error messages
2016-10-04 17:50:48 +05:30 · 2016-10-04 17:50:48 +05:30 · 4b8e96a86a
parent d46eeeab1a
commit 4b8e96a86a
11 changed files with 715 additions and 446 deletions
--- a/README.md
+++ b/README.md
@ -8,26 +8,38 @@ Camelot is a Python 2.7 library and command-line tool for getting tables out of
 from camelot.pdf import Pdf
 from camelot.lattice import Lattice

-extractor = Lattice(Pdf("/path/to/pdf", pagenos=[{'start': 2, 'end': 4}]))
-tables = extractor.get_tables()
+manager = Pdf(Lattice(), "/path/to/pdf")
+tables = manager.extract()
 </pre>

-Camelot comes with a command-line tool in which you can specify the output format (csv, tsv, html, json, and xlsx), page numbers you want to parse and the output directory in which you want the output files to be placed. By default, the output files are placed in the same directory as the PDF.
+Camelot comes with a CLI where you can specify page numbers, output format, output directory, etc. By default, the output files are placed in the same directory as the PDF.

 <pre>
-camelot parses tables from PDFs!
+Camelot: PDF parsing made simpler!

 usage:
- camelot.py [options] <method> [<args>...]
+ camelot [options] &lt;method&gt; [&lt;args&gt;...]

 options:
 -h, --help                Show this screen.
 -v, --version             Show version.
+ -V, --verbose             Verbose.
 -p, --pages &lt;pageno&gt;      Comma-separated list of page numbers.
                           Example: -p 1,3-6,10  [default: 1]
+ -P, --parallel            Parallelize the parsing process.
 -f, --format &lt;format&gt;     Output format. (csv,tsv,html,json,xlsx) [default: csv]
- -l, --log                       Print log to file.
+ -l, --log                 Log to file.
 -o, --output &lt;directory&gt;  Output directory.
+ -M, --cmargin &lt;cmargin&gt;   Char margin. Chars closer than cmargin are
+                           grouped together to form a word. [default: 2.0]
+ -L, --lmargin &lt;lmargin&gt;   Line margin. Lines closer than lmargin are
+                           grouped together to form a textbox. [default: 0.5]
+ -W, --wmargin &lt;wmargin&gt;   Word margin. Insert blank spaces between chars
+                           if distance between words is greater than word
+                           margin. [default: 0.1]
+ -S, --print-stats         List stats on the parsing process.
+ -T, --save-stats          Save stats to a file.
+ -X, --plot &lt;dist&gt;         Plot distributions. (page,all,rc)

 camelot methods:
 lattice  Looks for lines between data.
@ -47,48 +59,12 @@ The required dependencies include [numpy](http://www.numpy.org/), [OpenCV](http:
 Make sure you have the most updated versions for `pip` and `setuptools`. You can update them by

 <pre>
-pip install -U pip, setuptools
-</pre>
-
-We strongly recommend that you use a [virtual environment](http://virtualenvwrapper.readthedocs.io/en/latest/install.html#basic-installation) to install Camelot. If you don't want to use a virtual environment, then skip the next section.
-
-### Installing virtualenvwrapper
-
-You'll need to install [virtualenvwrapper](https://virtualenvwrapper.readthedocs.io/en/latest/).
-
-<pre>
-pip install virtualenvwrapper
-</pre>
-
-or
-<pre>
-sudo pip install virtualenvwrapper
-</pre>
-
-After installing virtualenvwrapper, add the following lines to your `.bashrc` and source it.
-
-<pre>
-export WORKON_HOME=$HOME/.virtualenvs
-source /usr/bin/virtualenvwrapper.sh
-</pre>
-
-The path to `virtualenvwrapper.sh` could be different on your system.
-
-Finally make a virtual environment using
-
-<pre>
-mkvirtualenv camelot
+pip install -U pip setuptools
 </pre>

 ### Installing dependencies

-numpy can be install using pip.
-
-<pre>
-pip install numpy
-</pre>
-
-OpenCV and imagemagick can be installed using your system's default package manager.
+numpy can be installed using `pip`. OpenCV and imagemagick can be installed using your system's default package manager.

 #### Linux

@ -110,13 +86,7 @@ sudo apt-get install libopencv-dev python-opencv imagemagick
 brew install homebrew/science/opencv imagemagick
 </pre>

-If you're working in a virtualenv, you'll need to create a symbolic link for the OpenCV shared object file
-
-<pre>
-sudo ln -s /path/to/system/site-packages/cv2.so ~/path/to/virtualenv/site-packages/cv2.so
-</pre>
-
-Finally, `cd` into the project directory and install by doing
+Finally, `cd` into the project directory and install Camelot by running

 <pre>
 make install
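
The removed README call passed `pagenos=[{'start': 2, 'end': 4}]`, while the CLI help takes a spec such as `-p 1,3-6,10`. A minimal sketch of how such a spec could map onto that list-of-dicts form is given here. `parse_page_spec` is a hypothetical helper written for illustration, not a function from Camelot, and the real CLI may handle this differently.

```python
def parse_page_spec(spec):
    """Turn a spec like '1,3-6,10' into a list of {'start', 'end'} dicts.

    Hypothetical helper for illustration; not part of Camelot.
    """
    ranges = []
    for part in spec.split(','):
        part = part.strip()
        if '-' in part:
            # A range such as '3-6'.
            start, end = part.split('-', 1)
        else:
            # A single page is treated as a one-page range.
            start = end = part
        ranges.append({'start': int(start), 'end': int(end)})
    return ranges


pagenos = parse_page_spec('1,3-6,10')
```

The resulting list has the same shape as the `pagenos` default documented for `Pdf` in this commit.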
--- a/camelot/cell.py
+++ b/camelot/cell.py
@ -1,41 +1,63 @@
 class Cell:
-    """Cell
+    """Cell.
+    Defines a cell object with coordinates relative to a left-bottom
+    origin, which is also PDFMiner's coordinate space.

    Parameters
    ----------
-    x1 : int
+    x1 : float
+        x-coordinate of left-bottom point.

-    y1 : int
+    y1 : float
+        y-coordinate of left-bottom point.

-    x2 : int
+    x2 : float
+        x-coordinate of right-top point.

-    y2 : int
+    y2 : float
+        y-coordinate of right-top point.

    Attributes
    ----------
    lb : tuple
+        Tuple representing left-bottom coordinates.

    lt : tuple
+        Tuple representing left-top coordinates.

    rb : tuple
+        Tuple representing right-bottom coordinates.

    rt : tuple
+        Tuple representing right-top coordinates.

    bbox : tuple
+        Tuple representing the cell's bounding box using the
+        left-bottom and right-top coordinates.

    left : bool
+        Whether or not cell is bounded on the left.

    right : bool
+        Whether or not cell is bounded on the right.

    top : bool
+        Whether or not cell is bounded on the top.

    bottom : bool
+        Whether or not cell is bounded on the bottom.
+
+    text_objects : list
+        List of text objects assigned to cell.

    text : string
+        Text assigned to cell.

    spanning_h : bool
+        Whether or not cell spans/extends horizontally.

    spanning_v : bool
+        Whether or not cell spans/extends vertically.
    """

    def __init__(self, x1, y1, x2, y2):
@ -53,13 +75,13 @@ class Cell:
        self.right = False
        self.top = False
        self.bottom = False
-        self.text = ''
        self.text_objects = []
+        self.text = ''
        self.spanning_h = False
        self.spanning_v = False

    def add_text(self, text):
-        """Adds text to cell object.
+        """Adds text to cell.

        Parameters
        ----------
@ -68,7 +90,7 @@ class Cell:
        self.text = ''.join([self.text, text])

    def get_text(self):
-        """Returns text from cell object.
+        """Returns text assigned to cell.

        Returns
        -------
@ -77,16 +99,29 @@ class Cell:
        return self.text

    def add_object(self, t_object):
+        """Adds PDFMiner text object to cell.
+
+        Parameters
+        ----------
+        t_object : object
+        """
        self.text_objects.append(t_object)

    def get_objects(self):
+        """Returns list of text objects assigned to cell.
+
+        Returns
+        -------
+        text_objects : list
+        """
        return self.text_objects

    def get_bounded_edges(self):
-        """Returns number of edges by which a cell is bounded.
+        """Returns the number of edges by which a cell is bounded.

        Returns
        -------
        bounded_edges : int
        """
-        return self.top + self.bottom + self.left + self.right
+        self.bounded_edges = self.top + self.bottom + self.left + self.right
+        return self.bounded_edges
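
`get_bounded_edges` adds four booleans, which works because Python's `bool` is a subclass of `int`. The sketch below shows the same arithmetic on a stand-in class; `EdgeFlags` is a hypothetical name and carries only the four edge attributes, not the rest of `Cell`.

```python
class EdgeFlags(object):
    """Stand-in holding only the four edge flags (hypothetical, not Camelot's Cell)."""

    def __init__(self, top=False, bottom=False, left=False, right=False):
        self.top = top
        self.bottom = bottom
        self.left = left
        self.right = right

    def get_bounded_edges(self):
        # True counts as 1 and False as 0, so the sum is the edge count.
        return self.top + self.bottom + self.left + self.right


corner = EdgeFlags(top=True, left=True)
count = corner.get_bounded_edges()
```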
--- a/camelot/imgproc.py
+++ b/camelot/imgproc.py
@ -3,6 +3,26 @@ import numpy as np


 def adaptive_threshold(imagename, invert=False):
+    """Thresholds an image using OpenCV's adaptiveThreshold.
+
+    Parameters
+    ----------
+    imagename : string
+        Path to image file.
+
+    invert : bool
+        Whether or not to invert the image. Useful when pdfs have
+        tables with lines in background.
+        (optional, default: False)
+
+    Returns
+    -------
+    img : object
+        numpy.ndarray representing the original image.
+
+    threshold : object
+        numpy.ndarray representing the thresholded image.
+    """
    img = cv2.imread(imagename)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

@ -18,7 +38,35 @@ def adaptive_threshold(imagename, invert=False):
    return img, threshold


-def find_lines(threshold, direction=None, scale=15):
+def find_lines(threshold, direction='horizontal', scale=15):
+    """Finds horizontal and vertical lines by applying morphological
+    transformations on an image.
+
+    Parameters
+    ----------
+    threshold : object
+        numpy.ndarray representing the thresholded image.
+
+    direction : string
+        Specifies whether to find vertical or horizontal lines.
+        (default: 'horizontal')
+
+    scale : int
+        Used to divide the height/width to get a structuring element
+        for morph transform.
+        (optional, default: 15)
+
+    Returns
+    -------
+    dmask : object
+        numpy.ndarray representing pixels where vertical/horizontal
+        lines lie.
+
+    lines : list
+        List of tuples representing vertical/horizontal lines with
+        coordinates relative to a left-top origin in
+        OpenCV's coordinate space.
+    """
    lines = []

    if direction == 'vertical':
@ -56,6 +104,23 @@ def find_lines(threshold, direction=None, scale=15):


 def find_table_contours(vertical, horizontal):
+    """Finds table boundaries using OpenCV's findContours.
+
+    Parameters
+    ----------
+    vertical : object
+        numpy.ndarray representing pixels where vertical lines lie.
+
+    horizontal : object
+        numpy.ndarray representing pixels where horizontal lines lie.
+
+    Returns
+    -------
+    cont : list
+        List of tuples representing table boundaries. Each tuple is of
+        the form (x, y, w, h) where (x, y) -> left-top, w -> width and
+        h -> height in OpenCV's coordinate space.
+    """
    mask = vertical + horizontal

    try:
@ -75,6 +140,30 @@ def find_table_contours(vertical, horizontal):
        

 def find_table_joints(contours, vertical, horizontal):
+    """Finds joints/intersections present inside each table boundary.
+
+    Parameters
+    ----------
+    contours : list
+        List of tuples representing table boundaries. Each tuple is of
+        the form (x, y, w, h) where (x, y) -> left-top, w -> width and
+        h -> height in OpenCV's coordinate space.
+
+    vertical : object
+        numpy.ndarray representing pixels where vertical lines lie.
+
+    horizontal : object
+        numpy.ndarray representing pixels where horizontal lines lie.
+
+    Returns
+    -------
+    tables : dict
+        Dict with table boundaries as keys and list of intersections
+        in that boundary as their value.
+
+        Keys are of the form (x1, y1, x2, y2) where (x1, y1) -> lb
+        and (x2, y2) -> rt in OpenCV's coordinate space.
+    """
    joints = np.bitwise_and(vertical, horizontal)
    tables = {}
    for c in contours:
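
The docstrings above describe contours as `(x, y, w, h)` with `(x, y)` at the left-top corner, and table keys as `(x1, y1, x2, y2)` with left-bottom and right-top corners. A minimal sketch of that conversion follows. `contour_to_bbox` is a hypothetical helper, and since the body of `find_table_joints` is not shown in this diff, the exact arithmetic is an assumption.

```python
def contour_to_bbox(contour):
    """Convert (x, y, w, h) to (x1, y1, x2, y2) with lb and rt corners.

    Hypothetical helper; assumes a left-top origin where y grows downward.
    """
    x, y, w, h = contour
    # With y growing downward, the bottom edge has the larger y value.
    left_bottom = (x, y + h)
    right_top = (x + w, y)
    return left_bottom + right_top


bbox = contour_to_bbox((10, 20, 100, 50))
```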
--- a/camelot/lattice.py
+++ b/camelot/lattice.py
@ -8,11 +8,10 @@ import subprocess
 from .imgproc import (adaptive_threshold, find_lines, find_table_contours,
                      find_table_joints)
 from .table import Table
-from .utils import (scale_to_pdf, scale_to_image, segments_bbox, text_bbox,
-                    get_rotation, merge_close_values, get_row_index,
-                    get_column_index, get_score, reduce_index, outline,
-                    fill_spanning, count_empty, encode_list, get_page_layout,
-                    get_text_objects)
+from .utils import (scale_to_pdf, scale_to_image, get_rotation, segments_bbox,
+                    text_bbox, merge_close_values, get_row_index,
+                    get_column_index, get_score, count_empty, encode_list,
+                    get_text_objects, get_page_layout)


 __all__ = ['Lattice']
@ -26,41 +25,165 @@ def _reduce_method(m):
 copy_reg.pickle(types.MethodType, _reduce_method)


-class Lattice:
-    """Lattice algorithm
-
-    Makes use of pdf geometry by processing its image, to make a table
-    and fills text objects in table cells.
+def _fill_spanning(t, fill=None):
+    """Fills spanning cells.

    Parameters
    ----------
-    pdfobject : camelot.pdf.Pdf
+    t : object
+        camelot.table.Table

    fill : string
-        Fill data in horizontal and/or vertical spanning
-        cells. (optional, default: None) {None, 'h', 'v', 'hv'}
+        {'h', 'v', 'hv'}
+        Specify to fill spanning cells in horizontal, vertical or both
+        directions.
+        (optional, default: None)
+
+    Returns
+    -------
+    t : object
+        camelot.table.Table
+    """
+    if fill == "h":
+        for i in range(len(t.cells)):
+            for j in range(len(t.cells[i])):
+                if t.cells[i][j].get_text().strip() == '':
+                    if t.cells[i][j].spanning_h:
+                        t.cells[i][j].add_text(t.cells[i][j - 1].get_text())
+    elif fill == "v":
+        for i in range(len(t.cells)):
+            for j in range(len(t.cells[i])):
+                if t.cells[i][j].get_text().strip() == '':
+                    if t.cells[i][j].spanning_v:
+                        t.cells[i][j].add_text(t.cells[i - 1][j].get_text())
+    elif fill == "hv":
+        for i in range(len(t.cells)):
+            for j in range(len(t.cells[i])):
+                if t.cells[i][j].get_text().strip() == '':
+                    if t.cells[i][j].spanning_h:
+                        t.cells[i][j].add_text(t.cells[i][j - 1].get_text())
+                    elif t.cells[i][j].spanning_v:
+                        t.cells[i][j].add_text(t.cells[i - 1][j].get_text())
+    return t
+
+
+def _outline(t):
+    """Sets table border edges to True.
+
+    Parameters
+    ----------
+    t : object
+        camelot.table.Table
+
+    Returns
+    -------
+    t : object
+        camelot.table.Table
+    """
+    for i in range(len(t.cells)):
+        t.cells[i][0].left = True
+        t.cells[i][len(t.cells[i]) - 1].right = True
+    for i in range(len(t.cells[0])):
+        t.cells[0][i].top = True
+        t.cells[len(t.cells) - 1][i].bottom = True
+    return t
+
+
+def _reduce_index(t, rotation, r_idx, c_idx):
+    """Reduces index of a text object if it lies within a spanning
+    cell, taking into account table rotation.
+
+    Parameters
+    ----------
+    t : object
+        camelot.table.Table
+
+    rotation : string
+        {'', 'left', 'right'}
+
+    r_idx : int
+        Current row index.
+
+    c_idx : int
+        Current column index.
+
+    Returns
+    -------
+    r_idx : int
+        Reduced row index.
+
+    c_idx : int
+        Reduced column index.
+    """
+    if not rotation:
+        if t.cells[r_idx][c_idx].spanning_h:
+            while not t.cells[r_idx][c_idx].left:
+                c_idx -= 1
+        if t.cells[r_idx][c_idx].spanning_v:
+            while not t.cells[r_idx][c_idx].top:
+                r_idx -= 1
+    elif rotation == 'left':
+        if t.cells[r_idx][c_idx].spanning_h:
+            while not t.cells[r_idx][c_idx].left:
+                c_idx -= 1
+        if t.cells[r_idx][c_idx].spanning_v:
+            while not t.cells[r_idx][c_idx].bottom:
+                r_idx += 1
+    elif rotation == 'right':
+        if t.cells[r_idx][c_idx].spanning_h:
+            while not t.cells[r_idx][c_idx].right:
+                c_idx += 1
+        if t.cells[r_idx][c_idx].spanning_v:
+            while not t.cells[r_idx][c_idx].top:
+                r_idx -= 1
+    return r_idx, c_idx
+
+
+class Lattice:
+    """Lattice looks for lines in the pdf to form a table.
+
+    If you want to give fill and mtol for each table when specifying
+    multiple table areas, make sure that the length of fill and mtol
+    is equal to the length of table_area. Mapping between them is based
+    on index.
+
+    Parameters
+    ----------
+    table_area : list
+        List of tuples of the form (x1, y1, x2, y2) where
+        (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDFMiner's
+        coordinate space, denoting table areas to analyze.
+        (optional, default: None)
+
+    fill : list
+        List of strings specifying directions to fill spanning cells.
+        {'h', 'v', 'hv'} to fill spanning cells in horizontal, vertical
+        or both directions.
+        (optional, default: None)
+
+    mtol : list
+        List of ints specifying m-tolerance parameters.
+        (optional, default: [2])

    scale : int
-        Scaling factor. Large scaling factor leads to smaller lines
-        being detected. (optional, default: 15)
-
-    mtol : int
-        Tolerance to account for when merging lines which are
-        very close. (optional, default: 2)
+        Used to divide the height/width of a pdf to get a structuring
+        element for image processing.
+        (optional, default: 15)

    invert : bool
-        Invert pdf image to make sure that lines are in foreground.
+        Whether or not to invert the image. Useful when pdfs have
+        tables with lines in background.
        (optional, default: False)

-    debug : string
-        Debug by visualizing pdf geometry.
-        (optional, default: None) {'contour', 'line', 'joint', 'table'}
+    margins : tuple
+        PDFMiner margins. (char_margin, line_margin, word_margin)
+        (optional, default: (1.0, 0.5, 0.1))

-    Attributes
-    ----------
-    tables : dict
-        Dictionary with page number as key and list of tables on that
-        page as value.
+    debug : string
+        {'contour', 'line', 'joint', 'table'}
+        Set to one of the above values to generate a matplotlib plot
+        of detected contours, lines, joints and the table generated.
+        (optional, default: None)
    """
    def __init__(self, table_area=None, fill=None, mtol=[2], scale=15,
                 invert=False, margins=(1.0, 0.5, 0.1), debug=None):
@ -75,13 +198,16 @@ class Lattice:
        self.debug = debug

    def get_tables(self, pdfname):
-        """Returns all tables found in given pdf.
+        """get_tables
+
+        Parameters
+        ----------
+        pdfname : string
+            Path to single page pdf file.

        Returns
        -------
-        tables : dict
-            Dictionary with page number as key and list of tables on that
-            page as value.
+        page : dict
        """
        layout, dim = get_page_layout(pdfname, char_margin=self.char_margin,
            line_margin=self.line_margin, word_margin=self.word_margin)
@ -125,7 +251,7 @@ class Lattice:
        if self.table_area is not None:
            if self.fill is not None:
                if len(self.table_area) != len(self.fill):
-                    raise ValueError("message")
+                    raise ValueError("Length of fill should be equal to table_area.")
            areas = []
            for area in self.table_area:
                x1, y1, x2, y2 = area.split(",")
@ -187,7 +313,7 @@ class Lattice:
            # set spanning cells to True
            table = table.set_spanning()
            # set table border edges to True
-            table = outline(table)
+            table = _outline(table)

            if self.debug:
                self.debug_tables.append(table)
@ -207,7 +333,7 @@ class Lattice:
                    continue
                rerror.append(rass_error)
                cerror.append(cass_error)
-                r_idx, c_idx = reduce_index(table, table_rotation, r_idx, c_idx)
+                r_idx, c_idx = _reduce_index(table, table_rotation, r_idx, c_idx)
                table.cells[r_idx][c_idx].add_object(t)

            for i in range(len(table.cells)):
@ -232,7 +358,7 @@ class Lattice:
            table_data['score'] = score

            if self.fill is not None:
-                table = fill_spanning(table, fill=self.fill[table_no])
+                table = _fill_spanning(table, fill=self.fill[table_no])
            ar = table.get_list()
            if table_rotation == 'left':
                ar = zip(*ar[::-1])
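
To see what the `fill == "h"` branch of `_fill_spanning` does, the sketch below runs the same loop on a single row of stand-in cells. `StubCell` and `fill_horizontal` are hypothetical names; they mirror only the `add_text`, `get_text` and `spanning_h` parts of the code in this diff, not the full `Table` object.

```python
class StubCell(object):
    """Stand-in for camelot.cell.Cell with only text and spanning_h."""

    def __init__(self, text='', spanning_h=False):
        self.text = text
        self.spanning_h = spanning_h

    def add_text(self, text):
        self.text = ''.join([self.text, text])

    def get_text(self):
        return self.text


def fill_horizontal(rows):
    # Copy the left neighbour's text into empty, horizontally spanning cells.
    for i in range(len(rows)):
        for j in range(len(rows[i])):
            cell = rows[i][j]
            if cell.get_text().strip() == '' and cell.spanning_h:
                cell.add_text(rows[i][j - 1].get_text())
    return rows


rows = [[StubCell('Region', spanning_h=True),
         StubCell('', spanning_h=True),
         StubCell('Total')]]
filled = [c.get_text() for c in fill_horizontal(rows)[0]]
```

The empty middle cell picks up `'Region'` from its left neighbour, while the non-empty cells are left alone.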
--- a/camelot/pdf.py
+++ b/camelot/pdf.py
@ -12,17 +12,18 @@ __all__ = ['Pdf']


 def _parse_page_numbers(pagenos):
-    """Converts list of page ranges to a list of page numbers.
+    """Converts list of dicts to list of ints.

    Parameters
    ----------
    pagenos : list
-        List of dicts containing page ranges.
+        List of dicts representing page ranges. A dict must have only
+        two keys named 'start' and 'end' having int as their value.

    Returns
    -------
    page_numbers : list
-        List of page numbers.
+        List of int page numbers.
    """
    page_numbers = []
    for p in pagenos:
@ -32,32 +33,32 @@ def _parse_page_numbers(pagenos):


 class Pdf:
-    """Handles all pdf operations which include:
-
-        1. Split pdf into single page pdfs using given page numbers
-        2. Convert single page pdfs into images
-        3. Extract text from single page pdfs
+    """Pdf manager.
+    Handles all operations like temp directory creation, splitting file
+    into single page pdfs, running extraction using multiple processes
+    and removing the temp directory.

    Parameters
    ----------
+    extractor : object
+        camelot.stream.Stream or camelot.lattice.Lattice extractor
+        object.
+
    pdfname : string
-        Path to pdf.
+        Path to pdf file.

    pagenos : list
-        List of dicts which specify pdf page ranges.
+        List of dicts representing page ranges. A dict must have only
+        two keys named 'start' and 'end' having int as their value.
        (optional, default: [{'start': 1, 'end': 1}])

-    char_margin : float
-        Chars closer than char_margin are grouped together to form a
-        word. (optional, default: 2.0)
+    parallel : bool
+        Whether or not to run using multiple processes.
+        (optional, default: False)

-    line_margin : float
-        Lines closer than line_margin are grouped together to form a
-        textbox. (optional, default: 0.5)
-
-    word_margin : float
-        Insert blank spaces between chars if distance between words
-        is greater than word_margin. (optional, default: 0.1)
+    clean : bool
+        Whether or not to remove the temp directory.
+        (optional, default: False)
    """

    def __init__(self, extractor, pdfname, pagenos=[{'start': 1, 'end': 1}],
@ -75,7 +76,7 @@ class Pdf:
        self.temp = tempfile.mkdtemp()

    def split(self):
-        """Splits pdf into single page pdfs.
+        """Splits file into single page pdfs.
        """
        infile = PdfFileReader(open(self.pdfname, 'rb'), strict=False)
        for p in self.pagenos:
@ -85,11 +86,9 @@ class Pdf:
            with open(os.path.join(self.temp, 'page-{0}.pdf'.format(p)), 'wb') as f:
                outfile.write(f)

-    def remove_tempdir(self):
-        shutil.rmtree(self.temp)
-
    def extract(self):
-        """Extracts text objects, width, height from a pdf.
+        """Runs table extraction by calling extractor.get_tables
+        on all single page pdfs.
        """
        self.split()
        pages = [os.path.join(self.temp, 'page-{0}.pdf'.format(p))
@ -123,10 +122,15 @@ class Pdf:
            self.remove_tempdir()
        return tables

+    def remove_tempdir(self):
+        """Removes temporary directory that was created to save single
+        page pdfs and their images.
+        """
+        shutil.rmtree(self.temp)
+
    def debug_plot(self):
-        """Plots all text objects and various pdf geometries so that
-        user can choose number of columns, columns x-coordinates for
-        Stream or tweak Lattice parameters (mtol, scale).
+        """Generates a matplotlib plot based on the selected extractor
+        debug option.
        """
        import matplotlib.pyplot as plt
        import matplotlib.patches as patches
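
The `_parse_page_numbers` docstring describes converting a list of `{'start', 'end'}` dicts into a list of ints, but its body is mostly outside this diff. A minimal sketch under the assumption that `end` is inclusive is given here; `expand_page_ranges` is a hypothetical name, not the function from `camelot/pdf.py`.

```python
def expand_page_ranges(pagenos):
    """Flatten page-range dicts into a list of int page numbers.

    Hypothetical sketch; assumes 'end' is inclusive.
    """
    page_numbers = []
    for p in pagenos:
        # range() excludes its stop value, hence the + 1.
        page_numbers.extend(range(p['start'], p['end'] + 1))
    return page_numbers


pages = expand_page_ranges([{'start': 1, 'end': 1}, {'start': 3, 'end': 6}])
```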
--- a/camelot/stream.py
+++ b/camelot/stream.py
@ -7,8 +7,8 @@ import copy_reg
 import numpy as np

 from .table import Table
-from .utils import (rotate, get_row_index, get_score, count_empty, encode_list,
-                    get_page_layout, get_text_objects, text_bbox, get_rotation)
+from .utils import (rotate, get_rotation, text_bbox, get_row_index, get_score,
+                    count_empty, encode_list, get_text_objects, get_page_layout)


 __all__ = ['Stream']
@ -23,21 +23,22 @@ copy_reg.pickle(types.MethodType, _reduce_method)


 def _group_rows(text, ytol=2):
-    """Groups text objects into rows using ytol.
+    """Groups PDFMiner text objects into rows using their
+    y-coordinates taking into account some tolerance ytol.

    Parameters
    ----------
    text : list
-        List of text objects.
+        List of PDFMiner text objects.

    ytol : int
-        Tolerance to account for when grouping rows
-        together. (optional, default: 2)
+        Tolerance parameter.
+        (optional, default: 2)

    Returns
    -------
    rows : list
-        List of grouped text rows.
+        Two-dimensional list of text objects grouped into rows.
    """
    row_y = 0
    rows = []
@ -58,18 +59,22 @@ def _group_rows(text, ytol=2):


 def _merge_columns(l, mtol=0):
-    """Merges overlapping columns and returns list with updated
-    columns boundaries.
+    """Merges column boundaries if they overlap or lie within some
+    tolerance mtol.

    Parameters
    ----------
    l : list
-        List of column x-coordinates.
+        List of column coordinate tuples.
+
+    mtol : int
+        TODO
+        (optional, default: 0)

    Returns
    -------
    merged : list
-        List of merged column x-coordinates.
+        List of merged column coordinate tuples.
    """
    merged = []
    for higher in l:
@ -98,19 +103,104 @@ def _merge_columns(l, mtol=0):
    return merged


+def _join_rows(rows_grouped, text_y_max, text_y_min):
+    """Makes row coordinates continuous.
+
+    Parameters
+    ----------
+    rows_grouped : list
+        Two-dimensional list of text objects grouped into rows.
+
+    text_y_max : int
+
+    text_y_min : int
+
+    Returns
+    -------
+    rows : list
+        List of continuous row coordinate tuples.
+    """
+    row_mids = [sum([(t.y0 + t.y1) / 2 for t in r]) / len(r)
+                if len(r) > 0 else 0 for r in rows_grouped]
+    rows = [(row_mids[i] + row_mids[i - 1]) / 2 for i in range(1, len(row_mids))]
+    rows.insert(0, text_y_max)
+    rows.append(text_y_min)
+    rows = [(rows[i], rows[i + 1])
+            for i in range(0, len(rows) - 1)]
+    return rows
+
+
+def _join_columns(cols, text_x_min, text_x_max):
+    """Makes column coordinates continuous.
+
+    Parameters
+    ----------
+    cols : list
+        List of column coordinate tuples.
+
+    text_x_min : int
+
+    text_x_max : int
+
+    Returns
+    -------
+    cols : list
+        Updated list of column coordinate tuples.
+    """
+    cols = sorted(cols)
+    cols = [(cols[i][0] + cols[i - 1][1]) / 2 for i in range(1, len(cols))]
+    cols.insert(0, text_x_min)
+    cols.append(text_x_max)
+    cols = [(cols[i], cols[i + 1])
+            for i in range(0, len(cols) - 1)]
+    return cols
+
+
+def _add_columns(cols, text, ytol):
+    """Adds columns to existing list by taking into account
+    the text that lies outside the current column coordinates.
+
+    Parameters
+    ----------
+    cols : list
+        List of column coordinate tuples.
+
+    text : list
+        List of PDFMiner text objects.
+
+    ytol : int
+        Tolerance parameter.
+
+    Returns
+    -------
+    cols : list
+        Updated list of column coordinate tuples.
+    """
+    if text:
+        text = _group_rows(text, ytol=ytol)
+        elements = [len(r) for r in text]
+        new_cols = [(t.x0, t.x1)
+            for r in text if len(r) == max(elements) for t in r]
+        cols.extend(_merge_columns(sorted(new_cols)))
+    return cols
+
+
 def _get_column_index(t, columns):
-    """Gets index of the column in which the given object falls by
-    comparing their co-ordinates.
+    """Gets index of the column in which the given text object lies by
+    comparing their x-coordinates.

    Parameters
    ----------
    t : object

    columns : list
+        List of column coordinate tuples.

    Returns
    -------
-    c : int
+    c_idx : int
+
+    error : float
    """
    offset1, offset2 = 0, 0
    lt_col_overlap = []
@ -134,69 +224,51 @@ def _get_column_index(t, columns):
    return c_idx, error


-def _join_rows(rows_grouped, text_y_max, text_y_min):
-    row_mids = [sum([(t.y0 + t.y1) / 2 for t in r]) / len(r)
-                if len(r) > 0 else 0 for r in rows_grouped]
-    rows = [(row_mids[i] + row_mids[i - 1]) / 2 for i in range(1, len(row_mids))]
-    rows.insert(0, text_y_max)
-    rows.append(text_y_min)
-    rows = [(rows[i], rows[i + 1])
-            for i in range(0, len(rows) - 1)]
-    return rows
-
-
-def _add_columns(cols, text, ytolerance):
-    if text:
-        text = _group_rows(text, ytol=ytolerance)
-        elements = [len(r) for r in text]
-        new_cols = [(t.x0, t.x1)
-            for r in text if len(r) == max(elements) for t in r]
-        cols.extend(_merge_columns(sorted(new_cols)))
-    return cols
-
-
-def _join_columns(cols, text_x_min, text_x_max):
-    cols = sorted(cols)
-    cols = [(cols[i][0] + cols[i - 1][1]) / 2 for i in range(1, len(cols))]
-    cols.insert(0, text_x_min)
-    cols.append(text_x_max)
-    cols = [(cols[i], cols[i + 1])
-            for i in range(0, len(cols) - 1)]
-    return cols
-
-
 class Stream:
-    """Stream algorithm
+    """Stream looks for spaces between text elements to form a table.

-    Groups text objects into rows and guesses number of columns
-    using mode of the number of text objects in each row.
+    If you want to give columns, ncolumns, ytol or mtol for each table
+    when specifying multiple table areas, make sure that their length
+    is equal to the length of table_area. Mapping between them is based
+    on index.

-    The number of columns can be passed explicitly or specified by a
-    list of column x-coordinates.
+    Also, if you want to specify columns for the first table and
+    ncolumns for the second table in a pdf having two tables, pass
+    columns as ['x1,x2,x3,x4', ''] and ncolumns as [-1, 5].

    Parameters
    ----------
-    pdfobject : camelot.pdf.Pdf
-
-    ncolumns : int
-        Number of columns. (optional, default: 0)
-
-    columns : string
-        Comma-separated list of column x-coordinates.
+    table_area : list
+        List of tuples of the form (x1, y1, x2, y2) where
+        (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDFMiner's
+        coordinate space, denoting table areas to analyze.
        (optional, default: None)

-    ytol : int
-        Tolerance to account for when grouping rows
-        together. (optional, default: 2)
+    columns : list
+        List of strings where each string is comma-separated values of
+        x-coordinates in PDFMiner's coordinate space.
+        (optional, default: None)
+
+    ncolumns : list
+        List of ints specifying the number of columns in each table.
+        (optional, default: None)
+
+    ytol : list
+        List of ints specifying the y-tolerance parameters.
+        (optional, default: [2])
+
+    mtol : list
+        List of ints specifying the m-tolerance parameters.
+        (optional, default: [0])
+
+    margins : tuple
+        PDFMiner margins. (char_margin, line_margin, word_margin)
+        (optional, default: (1.0, 0.5, 0.1))

    debug : bool
-        Debug by visualizing textboxes. (optional, default: False)
-
-    Attributes
-    ----------
-    tables : dict
-        Dictionary with page number as key and list of tables on that
-        page as value.
+        Set to True to generate a matplotlib plot of
+        LTTextLineHorizontals in order to select table_area, columns.
+        (optional, default: False)
    """
    def __init__(self, table_area=None, columns=None, ncolumns=None, ytol=[2],
                 mtol=[0], margins=(1.0, 0.5, 0.1), debug=False):
@ -211,13 +283,16 @@ class Stream:
        self.debug = debug

    def get_tables(self, pdfname):
-        """Returns all tables found in given pdf.
+        """get_tables
+
+        Parameters
+        ----------
+        pdfname : string
+            Path to single page pdf file.

        Returns
        -------
-        tables : dict
-            Dictionary with page number as key and list of tables on that
-            page as value.
+        page : dict
        """
        layout, dim = get_page_layout(pdfname, char_margin=self.char_margin,
            line_margin=self.line_margin, word_margin=self.word_margin)
@ -237,10 +312,10 @@ class Stream:
        if self.table_area is not None:
            if self.columns is not None:
                if len(self.table_area) != len(self.columns):
-                    raise ValueError("message")
+                    raise ValueError("Length of columns should be equal to table_area.")
            if self.ncolumns is not None:
                if len(self.table_area) != len(self.ncolumns):
-                    raise ValueError("message")
+                    raise ValueError("Length of ncolumns should be equal to table_area.")
            table_bbox = {}
            for area in self.table_area:
                x1, y1, x2, y2 = area.split(",")
@ -369,7 +444,8 @@ class Stream:
                score = get_score([[50, rerror], [50, cerror]])

            table_data['score'] = score
-            ar = encode_list(table.get_list())
+            ar = table.get_list()
+            ar = encode_list(ar)
            table_data['data'] = ar
            empty_p, r_nempty_cells, c_nempty_cells = count_empty(ar)
            table_data['empty_p'] = empty_p
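The Stream docstring above says that per-table settings are matched to `table_area` by index, with `''` and `-1` as placeholders for "not given". A minimal sketch of that pairing follows. The helper name `pair_table_settings` and the returned dict layout are hypothetical, and how Camelot itself interprets the placeholders is an assumption; only the length checks and error messages mirror this diff.

```python
def pair_table_settings(table_area, columns=None, ncolumns=None):
    """Pair each table area with its columns/ncolumns entry by index.

    Hypothetical helper for illustration; not part of Camelot.
    """
    n = len(table_area)
    # Same length checks (and messages) as in Stream.get_tables.
    if columns is not None and len(columns) != n:
        raise ValueError("Length of columns should be equal to table_area.")
    if ncolumns is not None and len(ncolumns) != n:
        raise ValueError("Length of ncolumns should be equal to table_area.")
    columns = columns if columns is not None else [''] * n
    ncolumns = ncolumns if ncolumns is not None else [-1] * n

    paired = []
    for area, cols, ncols in zip(table_area, columns, ncolumns):
        # Assumption: an empty string means "no x-coordinates for this table".
        xs = [float(c) for c in cols.split(',')] if cols else None
        # Assumption: -1 means "no explicit column count for this table".
        paired.append({'area': area,
                       'columns': xs,
                       'ncolumns': ncols if ncols != -1 else None})
    return paired


# Two tables: x-coordinates for the first, a column count for the second.
pairs = pair_table_settings(
    [(0, 100, 200, 0), (0, 300, 200, 150)],
    columns=['10,20,30', ''],
    ncolumns=[-1, 5])
```

Keeping the lists the same length means every table's settings can be looked up by its position in `table_area`, which is why the mismatch is reported as an error rather than padded silently.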
--- a/camelot/table.py
+++ b/camelot/table.py
@ -4,20 +4,27 @@ from .cell import Cell


 class Table:
-    """Table
+    """Table.
+    Defines a table object with coordinates relative to a left-bottom
+    origin, which is also PDFMiner's coordinate space.

    Parameters
    ----------
    cols : list
-        List of column x-coordinates.
+        List of tuples representing column x-coordinates in increasing
+        order.

    rows : list
-        List of row y-coordinates.
+        List of tuples representing row y-coordinates in decreasing
+        order.

    Attributes
    ----------
    cells : list
-        2-D list of cell objects.
+        List of cell objects with row-major ordering.
+
+    nocont_ : int
+        Number of lines that did not contribute to setting cell edges.
    """

    def __init__(self, cols, rows):
@ -29,20 +36,18 @@ class Table:
        self.nocont_ = 0

    def set_edges(self, vertical, horizontal, jtol=2):
-        """Sets cell edges to True if corresponding line segments
-        are detected in the pdf image.
+        """Sets a cell's edges to True depending on whether they
+        overlap with lines found by imgproc.

        Parameters
        ----------
        vertical : list
-            List of vertical line segments.
+            List of vertical lines detected by imgproc. Coordinates
+            scaled and translated to PDFMiner's coordinate space.

        horizontal : list
-            List of horizontal line segments.
-
-        jtol : int
-            Tolerance to account for when comparing joint and line
-            coordinates. (optional, default: 2)
+            List of horizontal lines detected by imgproc. Coordinates
+            scaled and translated to PDFMiner's coordinate space.
        """
        for v in vertical:
            # find closest x coord
@ -151,8 +156,9 @@ class Table:
        return self

    def set_spanning(self):
-        """Sets spanning values of a cell to True if it isn't
-        bounded by four edges.
+        """Sets a cell's spanning_h or spanning_v attribute to True
+        depending on whether the cell spans/extends horizontally or
+        vertically.
        """
        for i in range(len(self.cells)):
            for j in range(len(self.cells[i])):
@ -199,7 +205,8 @@ class Table:
        return self

    def get_list(self):
-        """Returns text from all cells as list of lists.
+        """Returns a two-dimensional list of text assigned to each
+        cell.

        Returns
        -------
--- a/camelot/utils.py
+++ b/camelot/utils.py
@ -82,28 +82,58 @@ def rotate(x1, y1, x2, y2, angle):


 def scale_to_image(k, factors):
+    """Translates and scales PDFMiner coordinates to OpenCV's coordinate
+    space.
+
+    Parameters
+    ----------
+    k : tuple
+        Tuple (x1, y1, x2, y2) representing table bounding box where
+        (x1, y1) -> lt and (x2, y2) -> rb in PDFMiner's coordinate
+        space.
+
+    factors : tuple
+        Tuple (scaling_factor_x, scaling_factor_y, pdf_y) where the
+        first two elements are scaling factors and pdf_y is height of
+        pdf.
+
+    Returns
+    -------
+    knew : tuple
+        Tuple (x1, y1, x2, y2) representing table bounding box where
+        (x1, y1) -> lt and (x2, y2) -> rb in OpenCV's coordinate
+        space.
+    """
    x1, y1, x2, y2 = k
    scaling_factor_x, scaling_factor_y, pdf_y = factors
    x1 = scale(x1, scaling_factor_x)
    y1 = scale(abs(translate(-pdf_y, y1)), scaling_factor_y)
    x2 = scale(x2, scaling_factor_x)
    y2 = scale(abs(translate(-pdf_y, y2)), scaling_factor_y)
-    return int(x1), int(y1), int(x2), int(y2)
+    knew = (int(x1), int(y1), int(x2), int(y2))
+    return knew


 def scale_to_pdf(tables, v_segments, h_segments, factors):
-    """Translates and scales OpenCV coordinates to PDFMiner coordinate
+    """Translates and scales OpenCV coordinates to PDFMiner's coordinate
    space.

    Parameters
    ----------
    tables : dict
+        Dict with table boundaries as keys and list of intersections
+        in that boundary as their value.

    v_segments : list
+        List of vertical line segments.

    h_segments : list
+        List of horizontal line segments.

    factors : tuple
+        Tuple (scaling_factor_x, scaling_factor_y, img_y) where the
+        first two elements are scaling factors and img_y is height of
+        image.

    Returns
    -------
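The two docstrings above describe converting a bounding box between PDFMiner's bottom-left origin and OpenCV's top-left origin. The sketch below shows one way to do that arithmetic. It assumes that `scale` multiplies by the factor and `translate` adds an offset, since neither helper is shown in this diff; the function names `pdf_to_image` and `image_to_pdf` are hypothetical.

```python
def pdf_to_image(bbox, factors):
    """Convert (x1, y1, x2, y2) from PDF space to image space.

    factors is (scaling_factor_x, scaling_factor_y, pdf_height).
    """
    x1, y1, x2, y2 = bbox
    sx, sy, pdf_height = factors
    # Flip the y-axis (PDF origin is bottom-left, image origin is top-left),
    # then scale to pixel units.
    return (int(x1 * sx), int(abs(y1 - pdf_height) * sy),
            int(x2 * sx), int(abs(y2 - pdf_height) * sy))


def image_to_pdf(bbox, factors):
    """Convert (x1, y1, x2, y2) from image space back to PDF space.

    factors is (scaling_factor_x, scaling_factor_y, image_height).
    """
    x1, y1, x2, y2 = bbox
    sx, sy, img_height = factors
    return (x1 * sx, abs(y1 - img_height) * sy,
            x2 * sx, abs(y2 - img_height) * sy)


# A 612x792 pt page rendered as a 1224x1584 px image (factor 2 each way).
in_image = pdf_to_image((100, 700, 300, 500), (2.0, 2.0, 792))
back_in_pdf = image_to_pdf(in_image, (0.5, 0.5, 1584))
```

The scaling factors for each direction are reciprocals of one another, and the height used for the flip is always the height of the space being converted from.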
@ -145,16 +175,28 @@ def scale_to_pdf(tables, v_segments, h_segments, factors):


 def get_rotation(ltchar, lttextlh=None, lttextlv=None):
-    """Detects if text in table is vertical or not and returns
-    its orientation.
+    """Detects if text in table is vertical or not using the current
+    transformation matrix (CTM) and returns its orientation.

    Parameters
    ----------
-    text : list
+    ltchar : list
+        List of PDFMiner LTChar objects.
+
+    lttextlh : list
+        List of PDFMiner LTTextLineHorizontal objects.
+        (optional, default: None)
+
+    lttextlv : list
+        List of PDFMiner LTTextLineVertical objects.
+        (optional, default: None)

    Returns
    -------
    rotation : string
+        {'', 'left', 'right'}
+        '' if text in table is upright, 'left' if rotated 90 degrees
+        anti-clockwise and 'right' if rotated 90 degrees clockwise.
    """
    rotation = ''
    if lttextlh is not None and lttextlv is not None:
@ -173,26 +215,28 @@ def get_rotation(ltchar, lttextlh=None, lttextlv=None):


 def segments_bbox(bbox, v_segments, h_segments):
-    """Returns all text objects and line segments present inside a
+    """Returns all line segments present inside a
    table's bounding box.

    Parameters
    ----------
    bbox : tuple
-
-    text : list
+        Tuple (x1, y1, x2, y2) representing table bounding box where
+        (x1, y1) -> lb and (x2, y2) -> rt in PDFMiner's coordinate space.

    v_segments : list
+        List of vertical line segments.

    h_segments : list
+        List of horizontal line segments.

    Returns
    -------
-    text_bbox : list
-
    v_s : list
+        List of vertical line segments that lie inside table.

    h_s : list
+        List of horizontal line segments that lie inside table.
    """
    lb = (bbox[0], bbox[1])
    rt = (bbox[2], bbox[3])
@ -204,6 +248,23 @@ def segments_bbox(bbox, v_segments, h_segments):


 def text_bbox(bbox, text):
+    """Returns all text objects present inside a
+    table's bounding box.
+
+    Parameters
+    ----------
+    bbox : tuple
+        Tuple (x1, y1, x2, y2) representing table bounding box where
+        (x1, y1) -> lb and (x2, y2) -> rt in PDFMiner's coordinate space.
+
+    text : list
+        List of PDFMiner text objects.
+
+    Returns
+    -------
+    t_bbox : list
+        List of PDFMiner text objects that lie inside table.
+    """
    lb = (bbox[0], bbox[1])
    rt = (bbox[2], bbox[3])
    t_bbox = [t for t in text if lb[0] - 2 <= (t.x0 + t.x1) / 2.0
@ -270,18 +331,21 @@ def merge_close_values(ar, mtol=2):


 def get_row_index(t, rows):
-    """Gets index of the row in which the given object falls by
-    comparing their co-ordinates.
+    """Gets index of the row in which the given text object lies by
+    comparing their y-coordinates.

    Parameters
    ----------
    t : object

-    rows : list, sorted in decreasing order
+    rows : list
+        List of row coordinate tuples, sorted in decreasing order.

    Returns
    -------
    r : int
+
+    error : float
    """
    offset1, offset2 = 0, 0
    for r in range(len(rows)):
@ -298,18 +362,21 @@ def get_row_index(t, rows):


 def get_column_index(t, columns):
-    """Gets index of the column in which the given object falls by
-    comparing their co-ordinates.
+    """Gets index of the column in which the given text object lies by
+    comparing their x-coordinates.

    Parameters
    ----------
    t : object

    columns : list
+        List of column coordinate tuples.

    Returns
    -------
    c : int
+
+    error : float
    """
    offset1, offset2 = 0, 0
    for c in range(len(columns)):
@ -331,10 +398,10 @@ def get_score(error_weights):

    Parameters
    ----------
-    error_weights : dict
-        Dict with a tuple of error percentages as key and weightage
-        assigned to them as value. Sum of all values should be equal
-        to 100.
+    error_weights : list
+        Two-dimensional list of the form [[p1, e1], [p2, e2], ...]
+        where pn is the weight assigned to list of errors en.
+        Sum of pn should be equal to 100.

    Returns
    -------
@ -352,109 +419,8 @@ def get_score(error_weights):
    return score


-def reduce_index(t, rotation, r_idx, c_idx):
-    """Reduces index of a text object if it lies within a spanning
-    cell taking in account table rotation.
-
-    Parameters
-    ----------
-    t : object
-
-    rotation : string
-
-    r_idx : int
-
-    c_idx : int
-
-    Returns
-    -------
-    r_idx : int
-
-    c_idx : int
-    """
-    if not rotation:
-        if t.cells[r_idx][c_idx].spanning_h:
-            while not t.cells[r_idx][c_idx].left:
-                c_idx -= 1
-        if t.cells[r_idx][c_idx].spanning_v:
-            while not t.cells[r_idx][c_idx].top:
-                r_idx -= 1
-    elif rotation == 'left':
-        if t.cells[r_idx][c_idx].spanning_h:
-            while not t.cells[r_idx][c_idx].left:
-                c_idx -= 1
-        if t.cells[r_idx][c_idx].spanning_v:
-            while not t.cells[r_idx][c_idx].bottom:
-                r_idx += 1
-    elif rotation == 'right':
-        if t.cells[r_idx][c_idx].spanning_h:
-            while not t.cells[r_idx][c_idx].right:
-                c_idx += 1
-        if t.cells[r_idx][c_idx].spanning_v:
-            while not t.cells[r_idx][c_idx].top:
-                r_idx -= 1
-    return r_idx, c_idx
-
-
-def outline(t):
-    """Sets table border edges to True.
-
-    Parameters
-    ----------
-    t : object
-
-    Returns
-    -------
-    t : object
-    """
-    for i in range(len(t.cells)):
-        t.cells[i][0].left = True
-        t.cells[i][len(t.cells[i]) - 1].right = True
-    for i in range(len(t.cells[0])):
-        t.cells[0][i].top = True
-        t.cells[len(t.cells) - 1][i].bottom = True
-    return t
-
-
-def fill_spanning(t, fill=None):
-    """Fills spanning cells.
-
-    Parameters
-    ----------
-    t : object
-
-    f : string
-        (optional, default: None)
-
-    Returns
-    -------
-    t : object
-    """
-    if fill == "h":
-        for i in range(len(t.cells)):
-            for j in range(len(t.cells[i])):
-                if t.cells[i][j].get_text().strip() == '':
-                    if t.cells[i][j].spanning_h:
-                        t.cells[i][j].add_text(t.cells[i][j - 1].get_text())
-    elif fill == "v":
-        for i in range(len(t.cells)):
-            for j in range(len(t.cells[i])):
-                if t.cells[i][j].get_text().strip() == '':
-                    if t.cells[i][j].spanning_v:
-                        t.cells[i][j].add_text(t.cells[i - 1][j].get_text())
-    elif fill == "hv":
-        for i in range(len(t.cells)):
-            for j in range(len(t.cells[i])):
-                if t.cells[i][j].get_text().strip() == '':
-                    if t.cells[i][j].spanning_h:
-                        t.cells[i][j].add_text(t.cells[i][j - 1].get_text())
-                    elif t.cells[i][j].spanning_v:
-                        t.cells[i][j].add_text(t.cells[i - 1][j].get_text())
-    return t
-
-
 def remove_empty(d):
-    """Removes empty rows and columns from list of lists.
+    """Removes empty rows and columns from a two-dimensional list.

    Parameters
    ----------
@ -474,7 +440,7 @@ def remove_empty(d):


 def count_empty(d):
-    """Counts empty rows and columns from list of lists.
+    """Counts empty rows and columns in a two-dimensional list.

    Parameters
    ----------
@ -532,17 +498,19 @@ def get_text_objects(layout, LTType="char", t=None):
    Parameters
    ----------
    layout : object
-        Layout object.
+        PDFMiner LTPage object.

-    LTObject : object
-        Text object, either LTChar or LTTextLineHorizontal.
+    LTType : string
+        {'char', 'lh', 'lv'}
+        Specify 'char', 'lh', 'lv' to get LTChar, LTTextLineHorizontal,
+        and LTTextLineVertical objects respectively.

-    t : list (optional, default: None)
+    t : list

    Returns
    -------
    t : list
-        List of text objects.
+        List of PDFMiner text objects.
    """
    if LTType == "char":
        LTObject = LTChar
@ -565,6 +533,33 @@ def get_text_objects(layout, LTType="char", t=None):

 def get_page_layout(pname, char_margin=2.0, line_margin=0.5, word_margin=0.1,
               detect_vertical=True, all_texts=True):
+    """Returns a PDFMiner LTPage object and page dimension of a single
+    page pdf. See https://euske.github.io/pdfminer/ to get definitions
+    of kwargs.
+
+    Parameters
+    ----------
+    pname : string
+        Path to pdf file.
+
+    char_margin : float
+
+    line_margin : float
+
+    word_margin : float
+
+    detect_vertical : bool
+
+    all_texts : bool
+
+    Returns
+    -------
+    layout : object
+        PDFMiner LTPage object.
+
+    dim : tuple
+        pdf page dimension of the form (width, height).
+    """
    with open(pname, 'r') as f:
        parser = PDFParser(f)
        document = PDFDocument(parser)
--- a/docs/index.rst
+++ b/docs/index.rst
@ -4,26 +4,24 @@
   contain the root `toctree` directive.

 ==================================
-Camelot: PDF parsing made simpler!
+Camelot: pdf parsing made simpler!
 ==================================

-Camelot is a Python 2.7 library and command-line tool for getting tables out of PDF files.
+Camelot is a Python 2.7 library and command-line tool for getting tables out of pdf files.

-Why another PDF table parsing library?
+Why another pdf table parsing library?
 ======================================

-We tried a lot of tools available online to get tables out of PDFs, but each one had its limitations. `PDFTables`_ stopped its open source development in 2013. `SolidConverter`_ which powers `Smallpdf`_ is closed source. Recently, `Docparser`_ was launched, which again is closed source. `Tabula`_, though being open source, doesn't always give correct output. In most cases, we had to resort to writing custom scripts for each type of PDF.
+We tried a lot of tools available online to parse tables from pdf files. `PDFTables`_ and `SolidConverter`_ are closed-source, commercial products, and a free trial doesn't last forever. `Tabula`_, which is open source, isn't very scalable. We found nothing that gave us complete control over the parsing process. In most cases, we didn't get the correct output and had to resort to writing custom scripts for each type of pdf.

 .. _PDFTables: https://pdftables.com/
 .. _SolidConverter: http://www.soliddocuments.com/pdf/-to-word-converter/304/1
-.. _Smallpdf: smallpdf.com
-.. _Docparser: https://docparser.com/
 .. _Tabula: http://tabula.technology/

-PDFs have feelings too
-======================
+Some background
+===============

-PDF started as `The Camelot Project`_ when people wanted a cross-platform way to share documents, since a document looked different on each system. A PDF contains characters placed at specific x,y-coordinates. Spaces are simulated by placing characters relatively far apart.
+PDF started as `The Camelot Project`_ when people wanted a cross-platform way for sending and viewing documents. A pdf file contains characters placed at specific x,y-coordinates. Spaces are simulated by placing characters relatively far apart.

 Camelot uses two methods to parse tables from PDFs, :doc:`lattice <lattice>` and :doc:`stream <stream>`. The names were taken from Tabula but the implementation is somewhat different, though it follows the same philosophy. Lattice looks for lines between text elements while stream looks for whitespace between text elements.

@ -37,9 +35,9 @@ Usage
    >>> from camelot.pdf import Pdf
    >>> from camelot.lattice import Lattice

-    >>> extractor = Lattice(Pdf('us-030.pdf'))
-    >>> tables = extractor.get_tables()
-    >>> print tables['page-1'][0]
+    >>> manager = Pdf(Lattice(), 'us-030.pdf')
+    >>> tables = manager.extract()
+    >>> print tables['page-1']['table-1']['data']

 .. csv-table::
   :header: "Cycle Name","KI (1/km)","Distance (mi)","Percent Fuel Savings","","",""
@ -51,7 +49,7 @@ Usage
   "2032_2","0.17","57.8","21.7%","0.3%","2.7%","1.2%"
   "4171_1","0.07","173.9","58.1%","1.6%","2.1%","0.5%"

-Camelot comes with a command-line tool in which you can specify the output format (csv, tsv, html, json, and xlsx), page numbers you want to parse and the output directory in which you want the output files to be placed. By default, the output files are placed in the same directory as the PDF.
+Camelot comes with a CLI where you can specify page numbers, output format, output directory etc. By default, the output files are placed in the same directory as the PDF.

 ::

@ -63,11 +61,23 @@ Camelot comes with a command-line tool in which you can specify the output forma
    options:
     -h, --help                Show this screen.
     -v, --version             Show version.
+     -V, --verbose             Verbose.
     -p, --pages <pageno>      Comma-separated list of page numbers.
                               Example: -p 1,3-6,10  [default: 1]
+     -P, --parallel            Parallelize the parsing process.
     -f, --format <format>     Output format. (csv,tsv,html,json,xlsx) [default: csv]
-     -l, --log                 Print log to file.
+     -l, --log                 Log to file.
     -o, --output <directory>  Output directory.
+     -M, --cmargin <cmargin>   Char margin. Chars closer than cmargin are
+                               grouped together to form a word. [default: 2.0]
+     -L, --lmargin <lmargin>   Line margin. Lines closer than lmargin are
+                               grouped together to form a textbox. [default: 0.5]
+     -W, --wmargin <wmargin>   Word margin. Insert blank spaces between chars
+                               if distance between words is greater than word
+                               margin. [default: 0.1]
+     -S, --print-stats         List stats on the parsing process.
+     -T, --save-stats          Save stats to a file.
+     -X, --plot <dist>         Plot distributions. (page,all,rc)

    camelot methods:
     lattice  Looks for lines between data.
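The `-p` option above accepts a comma-separated list that may contain ranges, such as `1,3-6,10`. As an illustration of that format, here is a small sketch that expands such a string into individual page numbers. The function `parse_pages` is hypothetical, and Camelot's own parser may represent pages differently.

```python
def parse_pages(spec):
    """Expand a page spec like '1,3-6,10' into a list of page numbers."""
    pages = []
    for part in spec.split(','):
        part = part.strip()
        if '-' in part:
            # 'start-end' is treated as an inclusive range.
            start, end = part.split('-')
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages


pages = parse_pages('1,3-6,10')
```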
@ -80,7 +90,7 @@ Installation

 Make sure you have the most updated versions for `pip` and `setuptools`. You can update them by::

-    pip install -U pip, setuptools
+    pip install -U pip setuptools

 The required dependencies include `numpy`_, `OpenCV`_ and `ImageMagick`_.

@ -88,46 +98,10 @@ The required dependencies include `numpy`_, `OpenCV`_ and `ImageMagick`_.
 .. _OpenCV: http://opencv.org/
 .. _ImageMagick: http://www.imagemagick.org/script/index.php

-We strongly recommend that you use a `virtual environment`_ to install Camelot. If you don't want to use a virtual environment, then skip the next section.
-
-Installing virtualenvwrapper
----------------------------
-
-You'll need to install `virtualenvwrapper`_.
-
-::
-
-    pip install virtualenvwrapper
-
-or
-
-::
-
-    sudo pip install virtualenvwrapper
-
-After installing virtualenvwrapper, add the following lines to your `.bashrc` and source it.
-
-::
-
-    export WORKON_HOME=$HOME/.virtualenvs
-    source /usr/bin/virtualenvwrapper.sh
-
-.. note:: The path to `virtualenvwrapper.sh` could be different on your system.
-
-Finally make a virtual environment using::
-
-    mkvirtualenv camelot
-
 Installing dependencies
 -----------------------

-`numpy` can be install using `pip`.
-
-::
-
-    pip install numpy
-
-`OpenCV` and `imagemagick` can be installed using your system's default package manager.
+numpy can be installed using `pip`. OpenCV and imagemagick can be installed using your system's default package manager.

 Linux
 ^^^^^
@ -151,17 +125,10 @@ OS X

    brew install homebrew/science/opencv imagemagick

-If you're working in a virtualenv, you'll need to create a symbolic link for the OpenCV shared object file::
-
-    sudo ln -s /path/to/system/site-packages/cv2.so ~/path/to/virtualenv/site-packages/cv2.so
-
-Finally, `cd` into the project directory and install by doing::
+Finally, `cd` into the project directory and install by running::

    make install

-.. _virtual environment: http://virtualenvwrapper.readthedocs.io/en/latest/install.html#basic-installation
-.. _virtualenvwrapper: https://virtualenvwrapper.readthedocs.io/en/latest/
-
 API Reference
 =============

--- a/docs/lattice.rst
+++ b/docs/lattice.rst
@ -4,15 +4,15 @@
 Lattice
 =======

-Lattice method is designed to work on PDFs which have tables with well-defined grids. It looks for lines on a page to form a table representation.
+Lattice method is designed to work on pdf files which have tables with well-defined grids. It looks for lines on a page to form a table.

-Lattice uses OpenCV to apply a set of morphological transformations (erosion and dilation) to find horizontal and vertical line segments in a PDF page after converting it to an image using imagemagick.
+Lattice uses OpenCV to apply a set of morphological transformations (erosion and dilation) to find horizontal and vertical line segments in a pdf page after converting it to an image using imagemagick.

-.. note:: Currently, Lattice only works on PDFs that contain text i.e. they are not composed of an image of the text. However, we plan to add `OCR support`_ in the future.
+.. note:: Currently, Lattice only works on pdf files that contain text. However, we plan to add `OCR support`_ in the future.

 .. _OCR support: https://github.com/socialcopsdev/camelot/issues/14

-Let's see how Lattice processes this PDF, step by step.
+Let's see how Lattice processes this pdf, step by step.

 Line segments are detected in the first step.

@ -40,7 +40,7 @@ The detected line segments are overlapped again, this time by `or` ing their pix
   :scale: 50%
   :align: left

-Since dimensions of a PDF and its image vary; table contours, intersections and segments are scaled and translated to the PDF's coordinate space. A representation of the table is then created using these scaled coordinates.
+Since dimensions of a pdf and its image vary; table contours, intersections and segments are scaled and translated to the pdf's coordinate space. A representation of the table is then created using these scaled coordinates.

 .. image:: assets/table.png
   :height: 674
@ -63,9 +63,9 @@ Finally, the characters found on the page are assigned to cells based on their x
    >>> from camelot.pdf import Pdf
    >>> from camelot.lattice import Lattice

-    >>> extractor = Lattice(Pdf('us-030.pdf'))
-    >>> tables = extractor.get_tables()
-    >>> print tables['page-1'][0]
+    >>> manager = Pdf(Lattice(), 'us-030.pdf')
+    >>> tables = manager.extract()
+    >>> print tables['page-1']['table-1']['data']

 .. csv-table::
   :header: "Cycle Name","KI (1/km)","Distance (mi)","Percent Fuel Savings","","",""
@ -82,7 +82,7 @@ Scale

 The scale parameter is used to determine the length of the structuring element used for morphological transformations. The length of vertical and horizontal structuring elements are found by dividing the image's height and width respectively, by `scale`. Large `scale` will lead to a smaller structuring element, which means that smaller lines will be detected. The default value for scale is 15.

-Let's consider this PDF.
+Let's consider this pdf file.

 .. .. _this: insert link for row_span_1.pdf

@ -105,16 +105,16 @@ Voila! It detected the smaller lines.
 Fill
 ----

-In the PDF used above, you can see that some cells spanned a lot of rows, `fill` just copies the same value to all rows/columns of a spanning cell. You can apply fill horizontally, vertically or both. Let us fill the output for the PDF we used above, vertically.
+In the file used above, you can see that some cells spanned a lot of rows, `fill` just copies the same value to all rows/columns of a spanning cell. You can apply fill horizontally, vertically or both. Let us fill the output for the file we used above, vertically.

 ::

    >>> from camelot.pdf import Pdf
    >>> from camelot.lattice import Lattice

-    >>> extractor = Lattice(Pdf('row_span_1.pdf'), fill='v', scale=40)
-    >>> tables = extractor.get_tables()
-    >>> print tables['page-1'][0]
+    >>> manager = Pdf(Lattice(fill=['v'], scale=40), 'row_span_1.pdf')
+    >>> tables = manager.extract()
+    >>> print tables['page-1']['table-1']['data']

 .. csv-table::
   :header: "Plan Type","County","Plan  Name","Totals"
@ -162,7 +162,7 @@ In the PDF used above, you can see that some cells spanned a lot of rows, `fill`
 Invert
 ------

-To find line segments, Lattice needs the lines of the PDF to be in foreground. So, if you encounter a PDF like this, just set invert to True.
+To find line segments, Lattice needs the lines of the pdf file to be in foreground. So, if you encounter a file like this, just set invert to True.

 .. .. _this: insert link for lines_in_background_1.pdf

@ -171,9 +171,9 @@ To find line segments, Lattice needs the lines of the PDF to be in foreground. S
    >>> from camelot.pdf import Pdf
    >>> from camelot.lattice import Lattice

-    >>> extractor = Lattice(Pdf('lines_in_background_1.pdf'), invert=True)
-    >>> tables = extractor.get_tables()
-    >>> print tables['page-1'][0]
+    >>> manager = Pdf(Lattice(invert=True), 'lines_in_background_1.pdf')
+    >>> tables = manager.extract()
+    >>> print tables['page-1']['table-1']['data']

 .. csv-table::
   :header: "State","Date","Halt stations","Halt days","Persons directly reached(in lakh)","Persons trained","Persons counseled","Persons testedfor HIV"
@ -186,8 +186,8 @@ To find line segments, Lattice needs the lines of the PDF to be in foreground. S
   "Kerala","23.2.2010 to 11.3.2010","9","17","1.42","3,559","2,173","855"
   "Total","","47","92","11.81","22,455","19,584","10,644"

-Lattice can also parse PDFs with tables like these that are rotated clockwise/anti-clockwise by 90 degrees.
+Lattice can also parse pdf files with tables like these that are rotated clockwise/anti-clockwise by 90 degrees.

 .. .. _these: insert link for left_rotated_table.pdf

-You can call Lattice with debug={'line', 'intersection', 'contour', 'table'}, and call `plot_geometry()` which will generate an image like the ones on this page, with the help of which you can modify various parameters. See :doc:`API doc <api>` for more information.
+You can call Lattice with debug={'line', 'intersection', 'contour', 'table'}, and call `debug_plot()` which will generate an image like the ones on this page, with the help of which you can modify various parameters. See :doc:`API doc <api>` for more information.
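The Scale section above says the structuring element lengths come from dividing the image's height and width by `scale`. A rough sketch of that arithmetic follows. The helper name is hypothetical and the use of integer division is an assumption, since the actual rounding is not shown in this diff.

```python
def structuring_element_lengths(image_width, image_height, scale=15):
    """Return (vertical_length, horizontal_length) in pixels.

    Hypothetical helper; integer division is an assumption.
    """
    vertical_length = image_height // scale
    horizontal_length = image_width // scale
    return vertical_length, horizontal_length


# A larger scale gives shorter structuring elements, so shorter lines
# in the image can still be picked up.
default_lengths = structuring_element_lengths(1500, 3000)
finer_lengths = structuring_element_lengths(1500, 3000, scale=40)
```

With OpenCV, lengths like these would typically be passed as the kernel size to `cv2.getStructuringElement`, for example `(1, vertical_length)` for a vertical line kernel; whether Lattice does exactly this is an assumption.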
--- a/docs/stream.rst
+++ b/docs/stream.rst
@ -4,20 +4,20 @@
 Stream
 ======

-Stream method is the complete opposite of Lattice and works on PDFs which have text placed uniformly apart across rows to simulate a table. It looks for spaces between text to form a table representation.
+Stream method is the complete opposite of Lattice and works on pdf files which have text placed uniformly apart across rows to simulate a table. It looks for spaces between text to form a table representation.

-Stream builds on top of PDFMiner's functionality of grouping characters on a page into words and sentences. After getting these words, it groups them into rows based on their y-coordinates and tries to guess the number of columns a PDF table might have by calculating the mode of the number of words in each row. Additionally, the user can specify the number of columns or column x-coordinates.
+Stream builds on top of PDFMiner's functionality of grouping characters on a page into words and sentences. After getting these words, it groups them into rows based on their y-coordinates and tries to guess the number of columns a pdf table might have by calculating the mode of the number of words in each row. Additionally, the user can specify the number of columns or column x-coordinates.

-Let's run it on this PDF.
+Let's run it on this pdf.

 ::

    >>> from camelot.pdf import Pdf
    >>> from camelot.stream import Stream

-    >>> extractor = Stream(Pdf('eu-027.pdf'))
-    >>> tables = extractor.get_tables()
-    >>> print tables['page-1'][0]
+    >>> manager = Pdf(Stream(), 'eu-027.pdf')
+    >>> tables = manager.extract()
+    >>> print tables['page-1']['table-1']['data']

 .. .. _this: insert link for eu-027.pdf
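The column guess described above (the mode of the number of words in each row) can be sketched as follows. This is a simplified illustration under the assumption that each row is already a list of words; `guess_ncolumns` is a hypothetical name, not Stream's actual code.

```python
from collections import Counter


def guess_ncolumns(rows):
    """Guess the column count as the most common row length."""
    counts = Counter(len(row) for row in rows)
    # most_common(1) returns [(row_length, frequency)] for the mode.
    return counts.most_common(1)[0][0]


rows = [
    ['a', 'b', 'c'],
    ['d', 'e', 'f'],
    ['g'],
    ['h', 'i', 'j', 'k', 'l'],
]
ncolumns = guess_ncolumns(rows)
```

This also shows why the guess can go wrong: if most rows are missing values in some columns, the mode reflects the sparse rows rather than the table's true width.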

@ -66,9 +66,9 @@ But sometimes its guess could be incorrect, like in this case.
    >>> from camelot.pdf import Pdf
    >>> from camelot.stream import Stream

-    >>> extractor = Stream(Pdf('missing_values.pdf'))
-    >>> tables = extractor.get_tables()
-    >>> print tables['page-1'][0]
+    >>> manager = Pdf(Stream(), 'missing_values.pdf')
+    >>> tables = manager.extract()
+    >>> print tables['page-1']['table-1']['data']

 .. .. _this: insert link for missing_values.pdf

@ -118,16 +118,16 @@ But sometimes its guess could be incorrect, like in this case.
   "14...","",""
   "Chronic...","",""

-It guessed that the PDF has 3 columns, because there wasn't any data in the last 2 columns for most rows. So, let's specify the number of columns explicitly, following which, Stream will only consider rows that have 5 words, to decide on column boundaries.
+It guessed that the pdf has 3 columns, because there wasn't any data in the last 2 columns for most rows. So, let's specify the number of columns explicitly, following which, Stream will only consider rows that have 5 words, to decide on column boundaries.

 ::

    >>> from camelot.pdf import Pdf
    >>> from camelot.stream import Stream

-    >>> extractor = Stream(Pdf('missing_values.pdf'), ncolumns=5)
-    >>> tables = extractor.get_tables()
-    >>> print tables['page-1'][0]
+    >>> manager = Pdf(Stream(ncolumns=[5]), 'missing_values.pdf')
+    >>> tables = manager.extract()
+    >>> print tables['page-1']['table-1']['data']

 .. csv-table::

@ -175,15 +175,15 @@ It guessed that the PDF has 3 columns, because there wasn't any data in the last
   "14...","","","",""
   "Chronic...","","","",""

-We can also specify the column x-coordinates. We need to call Stream with debug=True and use matplotlib's interface to note down the column x-coordinates we need. Let's try it on this PDF.
+We can also specify the column x-coordinates. We need to call Stream with debug=True and use matplotlib's interface to note down the column x-coordinates we need. Let's try it on this pdf file.

 ::

    >>> from camelot.pdf import Pdf
    >>> from camelot.stream import Stream

-    >>> extractor = Stream(Pdf('mexican_towns.pdf'), debug=True)
-    >>> extractor.plot_text()
+    >>> manager = Pdf(Stream(debug=True), 'mexican_towns.pdf')
+    >>> manager.debug_plot()

 .. image:: assets/columns.png
   :height: 674
@ -198,9 +198,9 @@ After getting the x-coordinates, we just need to pass them to Stream, like this.
    >>> from camelot.pdf import Pdf
    >>> from camelot.stream import Stream

-    >>> extractor = Stream(Pdf('mexican_towns.pdf'), columns='28,67,180,230,425,475,700')
-    >>> tables = extractor.get_tables()
-    >>> print tables['page-1'][0]
+    >>> manager = Pdf(Stream(columns=['28,67,180,230,425,475,700']), 'mexican_towns.pdf')
+    >>> tables = manager.extract()
+    >>> print tables['page-1']['table-1']['data']

 .. csv-table::