Update docs

* Update README

* Update index.rst

* Update docstrings

* Fix typo

* Edit docs

* Add error messages
pull/2/head
Vinayak Mehta 2016-10-04 17:50:48 +05:30 committed by GitHub
parent d46eeeab1a
commit 4b8e96a86a
11 changed files with 715 additions and 446 deletions

@ -8,26 +8,38 @@ Camelot is a Python 2.7 library and command-line tool for getting tables out of
from camelot.pdf import Pdf
from camelot.lattice import Lattice
extractor = Lattice(Pdf("/path/to/pdf", pagenos=[{'start': 2, 'end': 4}]))
tables = extractor.get_tables()
manager = Pdf(Lattice(), "/path/to/pdf")
tables = manager.extract()
</pre>
Camelot comes with a command-line tool in which you can specify the output format (csv, tsv, html, json, and xlsx), page numbers you want to parse and the output directory in which you want the output files to be placed. By default, the output files are placed in the same directory as the PDF.
Camelot comes with a CLI where you can specify page numbers, output format, output directory, etc. By default, the output files are placed in the same directory as the PDF.
<pre>
camelot parses tables from PDFs!
Camelot: PDF parsing made simpler!
usage:
camelot.py [options] <method> [<args>...]
camelot [options] &lt;method&gt; [&lt;args&gt;...]
options:
-h, --help Show this screen.
-v, --version Show version.
-V, --verbose Verbose.
-p, --pages &lt;pageno&gt; Comma-separated list of page numbers.
Example: -p 1,3-6,10 [default: 1]
-P, --parallel Parallelize the parsing process.
-f, --format &lt;format&gt; Output format. (csv,tsv,html,json,xlsx) [default: csv]
-l, --log Print log to file.
-l, --log Log to file.
-o, --output &lt;directory&gt; Output directory.
-M, --cmargin &lt;cmargin&gt; Char margin. Chars closer than cmargin are
grouped together to form a word. [default: 2.0]
-L, --lmargin &lt;lmargin&gt; Line margin. Lines closer than lmargin are
grouped together to form a textbox. [default: 0.5]
-W, --wmargin &lt;wmargin&gt; Word margin. Insert blank spaces between chars
if distance between words is greater than word
margin. [default: 0.1]
-S, --print-stats List stats on the parsing process.
-T, --save-stats Save stats to a file.
-X, --plot &lt;dist&gt; Plot distributions. (page,all,rc)
camelot methods:
lattice Looks for lines between data.
@ -47,48 +59,12 @@ The required dependencies include [numpy](http://www.numpy.org/), [OpenCV](http:
Make sure you have the latest versions of `pip` and `setuptools`. You can update them with
<pre>
pip install -U pip setuptools
</pre>
We strongly recommend that you use a [virtual environment](http://virtualenvwrapper.readthedocs.io/en/latest/install.html#basic-installation) to install Camelot. If you don't want to use a virtual environment, then skip the next section.
### Installing virtualenvwrapper
You'll need to install [virtualenvwrapper](https://virtualenvwrapper.readthedocs.io/en/latest/).
<pre>
pip install virtualenvwrapper
</pre>
or
<pre>
sudo pip install virtualenvwrapper
</pre>
After installing virtualenvwrapper, add the following lines to your `.bashrc` and source it.
<pre>
export WORKON_HOME=$HOME/.virtualenvs
source /usr/bin/virtualenvwrapper.sh
</pre>
The path to `virtualenvwrapper.sh` could be different on your system.
Finally make a virtual environment using
<pre>
mkvirtualenv camelot
pip install -U pip setuptools
</pre>
### Installing dependencies
numpy can be installed using pip.
<pre>
pip install numpy
</pre>
OpenCV and imagemagick can be installed using your system's default package manager.
numpy can be installed using `pip`. OpenCV and imagemagick can be installed using your system's default package manager.
#### Linux
@ -110,13 +86,7 @@ sudo apt-get install libopencv-dev python-opencv imagemagick
brew install homebrew/science/opencv imagemagick
</pre>
If you're working in a virtualenv, you'll need to create a symbolic link for the OpenCV shared object file
<pre>
sudo ln -s /path/to/system/site-packages/cv2.so ~/path/to/virtualenv/site-packages/cv2.so
</pre>
Finally, `cd` into the project directory and install by doing
Finally, `cd` into the project directory and install by
<pre>
make install

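The `-p` option in the README takes a spec such as `1,3-6,10`, while the older `Pdf(...)` call shown there passes `pagenos` as a list of `{'start': ..., 'end': ...}` dicts. A minimal sketch of how one might map the first form onto the second; `parse_page_spec` is a hypothetical helper for illustration and is not part of Camelot:

```python
def parse_page_spec(spec):
    """Turn a spec like '1,3-6,10' into a list of page-range dicts.

    Hypothetical helper for illustration; not part of Camelot.
    """
    pagenos = []
    for part in spec.split(','):
        part = part.strip()
        if '-' in part:
            # A range such as '3-6'.
            start, end = part.split('-')
        else:
            # A single page is a range that starts and ends on itself.
            start = end = part
        pagenos.append({'start': int(start), 'end': int(end)})
    return pagenos


print(parse_page_spec('1,3-6,10'))
```

Single pages and ranges end up in the same dict shape, so downstream code only has to handle one case.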
@ -1,41 +1,63 @@
class Cell:
"""Cell
"""Cell.
Defines a cell object with coordinates relative to a left-bottom
origin, which is also PDFMiner's coordinate space.
Parameters
----------
x1 : int
x1 : float
x-coordinate of left-bottom point.
y1 : int
y1 : float
y-coordinate of left-bottom point.
x2 : int
x2 : float
x-coordinate of right-top point.
y2 : int
y2 : float
y-coordinate of right-top point.
Attributes
----------
lb : tuple
Tuple representing left-bottom coordinates.
lt : tuple
Tuple representing left-top coordinates.
rb : tuple
Tuple representing right-bottom coordinates.
rt : tuple
Tuple representing right-top coordinates.
bbox : tuple
Tuple representing the cell's bounding box using the
left-bottom and right-top coordinates.
left : bool
Whether or not cell is bounded on the left.
right : bool
Whether or not cell is bounded on the right.
top : bool
Whether or not cell is bounded on the top.
bottom : bool
Whether or not cell is bounded on the bottom.
text_objects : list
List of text objects assigned to cell.
text : string
Text assigned to cell.
spanning_h : bool
Whether or not cell spans/extends horizontally.
spanning_v : bool
Whether or not cell spans/extends vertically.
"""
def __init__(self, x1, y1, x2, y2):
@ -53,13 +75,13 @@ class Cell:
self.right = False
self.top = False
self.bottom = False
self.text = ''
self.text_objects = []
self.text = ''
self.spanning_h = False
self.spanning_v = False
def add_text(self, text):
"""Adds text to cell object.
"""Adds text to cell.
Parameters
----------
@ -68,7 +90,7 @@ class Cell:
self.text = ''.join([self.text, text])
def get_text(self):
"""Returns text from cell object.
"""Returns text assigned to cell.
Returns
-------
@ -77,16 +99,29 @@ class Cell:
return self.text
def add_object(self, t_object):
"""Adds PDFMiner text object to cell.
Parameters
----------
t_object : object
"""
self.text_objects.append(t_object)
def get_objects(self):
"""Returns list of text objects assigned to cell.
Returns
-------
text_objects : list
"""
return self.text_objects
def get_bounded_edges(self):
"""Returns number of edges by which a cell is bounded.
"""Returns the number of edges by which a cell is bounded.
Returns
-------
bounded_edges : int
"""
return self.top + self.bottom + self.left + self.right
self.bounded_edges = self.top + self.bottom + self.left + self.right
return self.bounded_edges

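The docstring above describes a cell by its left-bottom and right-top points and counts bounded edges by summing four booleans. A stripped-down sketch of that idea; `CellSketch` is a hypothetical stand-in, and the way the four corners are derived from `(x1, y1, x2, y2)` is an assumption based on the attribute descriptions rather than code copied from the project:

```python
class CellSketch(object):
    """Stripped-down stand-in for the Cell class documented above."""

    def __init__(self, x1, y1, x2, y2):
        # (x1, y1) is the left-bottom point, (x2, y2) the right-top point.
        self.lb = (x1, y1)
        self.lt = (x1, y2)
        self.rb = (x2, y1)
        self.rt = (x2, y2)
        self.bbox = (x1, y1, x2, y2)
        self.left = False
        self.right = False
        self.top = False
        self.bottom = False

    def get_bounded_edges(self):
        # bool is a subclass of int, so True counts as 1 and False as 0.
        return self.top + self.bottom + self.left + self.right


cell = CellSketch(0.0, 0.0, 10.0, 5.0)
cell.top = True
cell.left = True
print(cell.get_bounded_edges())
```

A fully enclosed cell would return 4. Per the `spanning_h` and `spanning_v` attributes, cells missing an edge are candidates for spanning.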
@ -3,6 +3,26 @@ import numpy as np
def adaptive_threshold(imagename, invert=False):
"""Thresholds an image using OpenCV's adaptiveThreshold.
Parameters
----------
imagename : string
Path to image file.
invert : bool
Whether or not to invert the image. Useful when pdfs have
tables with lines in background.
(optional, default: False)
Returns
-------
img : object
numpy.ndarray representing the original image.
threshold : object
numpy.ndarray representing the thresholded image.
"""
img = cv2.imread(imagename)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
@ -18,7 +38,35 @@ def adaptive_threshold(imagename, invert=False):
return img, threshold
def find_lines(threshold, direction=None, scale=15):
def find_lines(threshold, direction='horizontal', scale=15):
"""Finds horizontal and vertical lines by applying morphological
transformations on an image.
Parameters
----------
threshold : object
numpy.ndarray representing the thresholded image.
direction : string
Specifies whether to find vertical or horizontal lines.
(default: 'horizontal')
scale : int
Used to divide the height/width to get a structuring element
for morph transform.
(optional, default: 15)
Returns
-------
dmask : object
numpy.ndarray representing pixels where vertical/horizontal
lines lie.
lines : list
List of tuples representing vertical/horizontal lines with
coordinates relative to a left-top origin in
OpenCV's coordinate space.
"""
lines = []
if direction == 'vertical':
@ -56,6 +104,23 @@ def find_lines(threshold, direction=None, scale=15):
def find_table_contours(vertical, horizontal):
"""Finds table boundaries using OpenCV's findContours.
Parameters
----------
vertical : object
numpy.ndarray representing pixels where vertical lines lie.
horizontal : object
numpy.ndarray representing pixels where horizontal lines lie.
Returns
-------
cont : list
List of tuples representing table boundaries. Each tuple is of
the form (x, y, w, h) where (x, y) -> left-top, w -> width and
h -> height in OpenCV's coordinate space.
"""
mask = vertical + horizontal
try:
@ -75,6 +140,30 @@ def find_table_contours(vertical, horizontal):
def find_table_joints(contours, vertical, horizontal):
"""Finds joints/intersections present inside each table boundary.
Parameters
----------
contours : list
List of tuples representing table boundaries. Each tuple is of
the form (x, y, w, h) where (x, y) -> left-top, w -> width and
h -> height in OpenCV's coordinate space.
vertical : object
numpy.ndarray representing pixels where vertical lines lie.
horizontal : object
numpy.ndarray representing pixels where horizontal lines lie.
Returns
-------
tables : dict
Dict with table boundaries as keys and list of intersections
in that boundary as their value.
Keys are of the form (x1, y1, x2, y2) where (x1, y1) -> lb
and (x2, y2) -> rt in OpenCV's coordinate space.
"""
joints = np.bitwise_and(vertical, horizontal)
tables = {}
for c in contours:

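The docstrings above use two shapes for a table boundary: `(x, y, w, h)` with `(x, y)` at the left-top, and `(x1, y1, x2, y2)` with left-bottom and right-top points, both in OpenCV's coordinate space. A small sketch of converting one into the other, assuming the y-axis grows downwards from a left-top origin; `contour_to_bbox` is a hypothetical name, not a function from this commit:

```python
def contour_to_bbox(contour):
    """Convert (x, y, w, h) into (x1, y1, x2, y2) = (lb, rt).

    Hypothetical helper; assumes a left-top origin with y growing down.
    """
    x, y, w, h = contour
    # With y growing downwards, the bottom edge sits at y + h
    # and the top edge at y.
    return (x, y + h, x + w, y)


print(contour_to_bbox((10, 20, 100, 50)))
```

Note that in this space the "bottom" y value is numerically larger than the "top" one, which is the opposite of PDFMiner's left-bottom-origin convention.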
@ -8,11 +8,10 @@ import subprocess
from .imgproc import (adaptive_threshold, find_lines, find_table_contours,
find_table_joints)
from .table import Table
from .utils import (scale_to_pdf, scale_to_image, segments_bbox, text_bbox,
get_rotation, merge_close_values, get_row_index,
get_column_index, get_score, reduce_index, outline,
fill_spanning, count_empty, encode_list, get_page_layout,
get_text_objects)
from .utils import (scale_to_pdf, scale_to_image, get_rotation, segments_bbox,
text_bbox, merge_close_values, get_row_index,
get_column_index, get_score, count_empty, encode_list,
get_text_objects, get_page_layout)
__all__ = ['Lattice']
@ -26,41 +25,165 @@ def _reduce_method(m):
copy_reg.pickle(types.MethodType, _reduce_method)
class Lattice:
"""Lattice algorithm
Makes use of pdf geometry by processing its image, to make a table
and fills text objects in table cells.
def _fill_spanning(t, fill=None):
"""Fills spanning cells.
Parameters
----------
pdfobject : camelot.pdf.Pdf
t : object
camelot.table.Table
fill : string
Fill data in horizontal and/or vertical spanning
cells. (optional, default: None) {None, 'h', 'v', 'hv'}
{'h', 'v', 'hv'}
Specify to fill spanning cells in horizontal, vertical or both
directions.
(optional, default: None)
Returns
-------
t : object
camelot.table.Table
"""
if fill == "h":
for i in range(len(t.cells)):
for j in range(len(t.cells[i])):
if t.cells[i][j].get_text().strip() == '':
if t.cells[i][j].spanning_h:
t.cells[i][j].add_text(t.cells[i][j - 1].get_text())
elif fill == "v":
for i in range(len(t.cells)):
for j in range(len(t.cells[i])):
if t.cells[i][j].get_text().strip() == '':
if t.cells[i][j].spanning_v:
t.cells[i][j].add_text(t.cells[i - 1][j].get_text())
elif fill == "hv":
for i in range(len(t.cells)):
for j in range(len(t.cells[i])):
if t.cells[i][j].get_text().strip() == '':
if t.cells[i][j].spanning_h:
t.cells[i][j].add_text(t.cells[i][j - 1].get_text())
elif t.cells[i][j].spanning_v:
t.cells[i][j].add_text(t.cells[i - 1][j].get_text())
return t
def _outline(t):
"""Sets table border edges to True.
Parameters
----------
t : object
camelot.table.Table
Returns
-------
t : object
camelot.table.Table
"""
for i in range(len(t.cells)):
t.cells[i][0].left = True
t.cells[i][len(t.cells[i]) - 1].right = True
for i in range(len(t.cells[0])):
t.cells[0][i].top = True
t.cells[len(t.cells) - 1][i].bottom = True
return t
def _reduce_index(t, rotation, r_idx, c_idx):
"""Reduces index of a text object if it lies within a spanning
cell, taking into account table rotation.
Parameters
----------
t : object
camelot.table.Table
rotation : string
{'', 'left', 'right'}
r_idx : int
Current row index.
c_idx : int
Current column index.
Returns
-------
r_idx : int
Reduced row index.
c_idx : int
Reduced column index.
"""
if not rotation:
if t.cells[r_idx][c_idx].spanning_h:
while not t.cells[r_idx][c_idx].left:
c_idx -= 1
if t.cells[r_idx][c_idx].spanning_v:
while not t.cells[r_idx][c_idx].top:
r_idx -= 1
elif rotation == 'left':
if t.cells[r_idx][c_idx].spanning_h:
while not t.cells[r_idx][c_idx].left:
c_idx -= 1
if t.cells[r_idx][c_idx].spanning_v:
while not t.cells[r_idx][c_idx].bottom:
r_idx += 1
elif rotation == 'right':
if t.cells[r_idx][c_idx].spanning_h:
while not t.cells[r_idx][c_idx].right:
c_idx += 1
if t.cells[r_idx][c_idx].spanning_v:
while not t.cells[r_idx][c_idx].top:
r_idx -= 1
return r_idx, c_idx
class Lattice:
"""Lattice looks for lines in the pdf to form a table.
If you want to give fill and mtol for each table when specifying
multiple table areas, make sure that the length of fill and mtol
is equal to the length of table_area. Mapping between them is based
on index.
Parameters
----------
table_area : list
List of tuples of the form (x1, y1, x2, y2) where
(x1, y1) -> left-top and (x2, y2) -> right-bottom in PDFMiner's
coordinate space, denoting table areas to analyze.
(optional, default: None)
fill : list
List of strings specifying directions to fill spanning cells.
{'h', 'v', 'hv'} to fill spanning cells in horizontal, vertical
or both directions.
(optional, default: None)
mtol : list
List of ints specifying m-tolerance parameters.
(optional, default: [2])
scale : int
Scaling factor. Large scaling factor leads to smaller lines
being detected. (optional, default: 15)
mtol : int
Tolerance to account for when merging lines which are
very close. (optional, default: 2)
Used to divide the height/width of a pdf to get a structuring
element for image processing.
(optional, default: 15)
invert : bool
Invert pdf image to make sure that lines are in foreground.
Whether or not to invert the image. Useful when pdfs have
tables with lines in background.
(optional, default: False)
debug : string
Debug by visualizing pdf geometry.
(optional, default: None) {'contour', 'line', 'joint', 'table'}
margins : tuple
PDFMiner margins. (char_margin, line_margin, word_margin)
(optional, default: (1.0, 0.5, 0.1))
Attributes
----------
tables : dict
Dictionary with page number as key and list of tables on that
page as value.
debug : string
{'contour', 'line', 'joint', 'table'}
Set to one of the above values to generate a matplotlib plot
of detected contours, lines, joints and the table generated.
(optional, default: None)
"""
def __init__(self, table_area=None, fill=None, mtol=[2], scale=15,
invert=False, margins=(1.0, 0.5, 0.1), debug=None):
@ -75,13 +198,16 @@ class Lattice:
self.debug = debug
def get_tables(self, pdfname):
"""Returns all tables found in given pdf.
"""get_tables
Parameters
----------
pdfname : string
Path to single page pdf file.
Returns
-------
tables : dict
Dictionary with page number as key and list of tables on that
page as value.
page : dict
"""
layout, dim = get_page_layout(pdfname, char_margin=self.char_margin,
line_margin=self.line_margin, word_margin=self.word_margin)
@ -125,7 +251,7 @@ class Lattice:
if self.table_area is not None:
if self.fill is not None:
if len(self.table_area) != len(self.fill):
raise ValueError("message")
raise ValueError("Length of fill should be equal to table_area.")
areas = []
for area in self.table_area:
x1, y1, x2, y2 = area.split(",")
@ -187,7 +313,7 @@ class Lattice:
# set spanning cells to True
table = table.set_spanning()
# set table border edges to True
table = outline(table)
table = _outline(table)
if self.debug:
self.debug_tables.append(table)
@ -207,7 +333,7 @@ class Lattice:
continue
rerror.append(rass_error)
cerror.append(cass_error)
r_idx, c_idx = reduce_index(table, table_rotation, r_idx, c_idx)
r_idx, c_idx = _reduce_index(table, table_rotation, r_idx, c_idx)
table.cells[r_idx][c_idx].add_object(t)
for i in range(len(table.cells)):
@ -232,7 +358,7 @@ class Lattice:
table_data['score'] = score
if self.fill is not None:
table = fill_spanning(table, fill=self.fill[table_no])
table = _fill_spanning(table, fill=self.fill[table_no])
ar = table.get_list()
if table_rotation == 'left':
ar = zip(*ar[::-1])

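The three branches of `_fill_spanning` in this file differ only in which directions they consider. The following is a condensed sketch of the same behaviour on plain stub objects, not the project's code: `StubCell` and `fill_spanning` are hypothetical names, and the stub keeps text in an attribute instead of using `add_text`/`get_text`.

```python
class StubCell(object):
    """Minimal stand-in exposing only what the fill logic touches."""

    def __init__(self, text='', spanning_h=False, spanning_v=False):
        self.text = text
        self.spanning_h = spanning_h
        self.spanning_v = spanning_v


def fill_spanning(cells, fill=None):
    """Copy neighbouring text into empty spanning cells.

    cells is a row-major 2-D list; fill is None, 'h', 'v' or 'hv'.
    """
    if not fill:
        return cells
    for i, row in enumerate(cells):
        for j, cell in enumerate(row):
            if cell.text.strip() != '':
                continue
            if 'h' in fill and cell.spanning_h:
                # Take the text from the cell to the left.
                cell.text += row[j - 1].text
            elif 'v' in fill and cell.spanning_v:
                # Take the text from the cell above.
                cell.text += cells[i - 1][j].text
    return cells


grid = [[StubCell('A'), StubCell(spanning_h=True)],
        [StubCell(spanning_v=True),
         StubCell(spanning_h=True, spanning_v=True)]]
fill_spanning(grid, fill='hv')
print([[c.text for c in row] for row in grid])
```

As in the original, a spanning cell in the first column (for `'h'`) or first row (for `'v'`) would read index `-1`, which in Python wraps around to the last column or row.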
@ -12,17 +12,18 @@ __all__ = ['Pdf']
def _parse_page_numbers(pagenos):
"""Converts list of page ranges to a list of page numbers.
"""Converts list of dicts to list of ints.
Parameters
----------
pagenos : list
List of dicts containing page ranges.
List of dicts representing page ranges. A dict must have only
two keys named 'start' and 'end' having int as their value.
Returns
-------
page_numbers : list
List of page numbers.
List of int page numbers.
"""
page_numbers = []
for p in pagenos:
@ -32,32 +33,32 @@ def _parse_page_numbers(pagenos):
class Pdf:
"""Handles all pdf operations which include:
1. Split pdf into single page pdfs using given page numbers
2. Convert single page pdfs into images
3. Extract text from single page pdfs
"""Pdf manager.
Handles all operations like temp directory creation, splitting file
into single page pdfs, running extraction using multiple processes
and removing the temp directory.
Parameters
----------
extractor : object
camelot.stream.Stream or camelot.lattice.Lattice extractor
object.
pdfname : string
Path to pdf.
Path to pdf file.
pagenos : list
List of dicts which specify pdf page ranges.
List of dicts representing page ranges. A dict must have only
two keys named 'start' and 'end' having int as their value.
(optional, default: [{'start': 1, 'end': 1}])
char_margin : float
Chars closer than char_margin are grouped together to form a
word. (optional, default: 2.0)
parallel : bool
Whether or not to run using multiple processes.
(optional, default: False)
line_margin : float
Lines closer than line_margin are grouped together to form a
textbox. (optional, default: 0.5)
word_margin : float
Insert blank spaces between chars if distance between words
is greater than word_margin. (optional, default: 0.1)
clean : bool
Whether or not to remove the temp directory.
(optional, default: False)
"""
def __init__(self, extractor, pdfname, pagenos=[{'start': 1, 'end': 1}],
@ -75,7 +76,7 @@ class Pdf:
self.temp = tempfile.mkdtemp()
def split(self):
"""Splits pdf into single page pdfs.
"""Splits file into single page pdfs.
"""
infile = PdfFileReader(open(self.pdfname, 'rb'), strict=False)
for p in self.pagenos:
@ -85,11 +86,9 @@ class Pdf:
with open(os.path.join(self.temp, 'page-{0}.pdf'.format(p)), 'wb') as f:
outfile.write(f)
def remove_tempdir(self):
shutil.rmtree(self.temp)
def extract(self):
"""Extracts text objects, width, height from a pdf.
"""Runs table extraction by calling extractor.get_tables
on all single page pdfs.
"""
self.split()
pages = [os.path.join(self.temp, 'page-{0}.pdf'.format(p))
@ -123,10 +122,15 @@ class Pdf:
self.remove_tempdir()
return tables
def remove_tempdir(self):
"""Removes temporary directory that was created to save single
page pdfs and their images.
"""
shutil.rmtree(self.temp)
def debug_plot(self):
"""Plots all text objects and various pdf geometries so that
user can choose number of columns, columns x-coordinates for
Stream or tweak Lattice parameters (mtol, scale).
"""Generates a matplotlib plot based on the selected extractor
debug option.
"""
import matplotlib.pyplot as plt
import matplotlib.patches as patches

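The `_parse_page_numbers` docstring in this file describes turning a list of `{'start': ..., 'end': ...}` dicts into a list of ints, but the diff shows only the first lines of its body. A minimal sketch of that conversion, assuming ranges are inclusive and that duplicates should be dropped; `expand_page_ranges` is a hypothetical name and may differ from the actual implementation:

```python
def expand_page_ranges(pagenos):
    """Expand page-range dicts into a sorted list of page numbers.

    Hypothetical sketch; assumes 'end' is inclusive.
    """
    page_numbers = []
    for p in pagenos:
        # range() excludes its stop value, hence the + 1.
        page_numbers.extend(range(p['start'], p['end'] + 1))
    # Overlapping ranges would otherwise yield the same page twice.
    return sorted(set(page_numbers))


print(expand_page_ranges([{'start': 1, 'end': 1}, {'start': 3, 'end': 6}]))
```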
@ -7,8 +7,8 @@ import copy_reg
import numpy as np
from .table import Table
from .utils import (rotate, get_row_index, get_score, count_empty, encode_list,
get_page_layout, get_text_objects, text_bbox, get_rotation)
from .utils import (rotate, get_rotation, text_bbox, get_row_index, get_score,
count_empty, encode_list, get_text_objects, get_page_layout)
__all__ = ['Stream']
@ -23,21 +23,22 @@ copy_reg.pickle(types.MethodType, _reduce_method)
def _group_rows(text, ytol=2):
"""Groups text objects into rows using ytol.
"""Groups PDFMiner text objects into rows using their
y-coordinates taking into account some tolerance ytol.
Parameters
----------
text : list
List of text objects.
List of PDFMiner text objects.
ytol : int
Tolerance to account for when grouping rows
together. (optional, default: 2)
Tolerance parameter.
(optional, default: 2)
Returns
-------
rows : list
List of grouped text rows.
Two-dimensional list of text objects grouped into rows.
"""
row_y = 0
rows = []
@ -58,18 +59,22 @@ def _group_rows(text, ytol=2):
def _merge_columns(l, mtol=0):
"""Merges overlapping columns and returns list with updated
columns boundaries.
"""Merges column boundaries if they overlap or lie within some
tolerance mtol.
Parameters
----------
l : list
List of column x-coordinates.
List of column coordinate tuples.
mtol : int
TODO
(optional, default: 0)
Returns
-------
merged : list
List of merged column x-coordinates.
List of merged column coordinate tuples.
"""
merged = []
for higher in l:
@ -98,19 +103,104 @@ def _merge_columns(l, mtol=0):
return merged
def _join_rows(rows_grouped, text_y_max, text_y_min):
"""Makes row coordinates continuous.
Parameters
----------
rows_grouped : list
Two-dimensional list of text objects grouped into rows.
text_y_max : int
text_y_min : int
Returns
-------
rows : list
List of continuous row coordinate tuples.
"""
row_mids = [sum([(t.y0 + t.y1) / 2 for t in r]) / len(r)
if len(r) > 0 else 0 for r in rows_grouped]
rows = [(row_mids[i] + row_mids[i - 1]) / 2 for i in range(1, len(row_mids))]
rows.insert(0, text_y_max)
rows.append(text_y_min)
rows = [(rows[i], rows[i + 1])
for i in range(0, len(rows) - 1)]
return rows
def _join_columns(cols, text_x_min, text_x_max):
"""Makes column coordinates continuous.
Parameters
----------
cols : list
List of column coordinate tuples.
text_x_min : int
text_x_max : int
Returns
-------
cols : list
Updated list of column coordinate tuples.
"""
cols = sorted(cols)
cols = [(cols[i][0] + cols[i - 1][1]) / 2 for i in range(1, len(cols))]
cols.insert(0, text_x_min)
cols.append(text_x_max)
cols = [(cols[i], cols[i + 1])
for i in range(0, len(cols) - 1)]
return cols
def _add_columns(cols, text, ytol):
"""Adds columns to existing list by taking into account
the text that lies outside the current column coordinates.
Parameters
----------
cols : list
List of column coordinate tuples.
text : list
List of PDFMiner text objects.
ytol : int
Tolerance parameter.
Returns
-------
cols : list
Updated list of column coordinate tuples.
"""
if text:
text = _group_rows(text, ytol=ytol)
elements = [len(r) for r in text]
new_cols = [(t.x0, t.x1)
for r in text if len(r) == max(elements) for t in r]
cols.extend(_merge_columns(sorted(new_cols)))
return cols
def _get_column_index(t, columns):
"""Gets index of the column in which the given object falls by
comparing their co-ordinates.
"""Gets index of the column in which the given text object lies by
comparing their x-coordinates.
Parameters
----------
t : object
columns : list
List of column coordinate tuples.
Returns
-------
c : int
c_idx : int
error : float
"""
offset1, offset2 = 0, 0
lt_col_overlap = []
@ -134,69 +224,51 @@ def _get_column_index(t, columns):
return c_idx, error
def _join_rows(rows_grouped, text_y_max, text_y_min):
row_mids = [sum([(t.y0 + t.y1) / 2 for t in r]) / len(r)
if len(r) > 0 else 0 for r in rows_grouped]
rows = [(row_mids[i] + row_mids[i - 1]) / 2 for i in range(1, len(row_mids))]
rows.insert(0, text_y_max)
rows.append(text_y_min)
rows = [(rows[i], rows[i + 1])
for i in range(0, len(rows) - 1)]
return rows
def _add_columns(cols, text, ytolerance):
if text:
text = _group_rows(text, ytol=ytolerance)
elements = [len(r) for r in text]
new_cols = [(t.x0, t.x1)
for r in text if len(r) == max(elements) for t in r]
cols.extend(_merge_columns(sorted(new_cols)))
return cols
def _join_columns(cols, text_x_min, text_x_max):
cols = sorted(cols)
cols = [(cols[i][0] + cols[i - 1][1]) / 2 for i in range(1, len(cols))]
cols.insert(0, text_x_min)
cols.append(text_x_max)
cols = [(cols[i], cols[i + 1])
for i in range(0, len(cols) - 1)]
return cols
class Stream:
"""Stream algorithm
"""Stream looks for spaces between text elements to form a table.
Groups text objects into rows and guesses number of columns
using mode of the number of text objects in each row.
If you want to give columns, ncolumns, ytol or mtol for each table
when specifying multiple table areas, make sure that their length
is equal to the length of table_area. Mapping between them is based
on index.
The number of columns can be passed explicitly or specified by a
list of column x-coordinates.
Also, if you want to specify columns for the first table and
ncolumns for the second table in a pdf having two tables, pass
columns as ['x1,x2,x3,x4', ''] and ncolumns as [-1, 5].
Parameters
----------
pdfobject : camelot.pdf.Pdf
ncolumns : int
Number of columns. (optional, default: 0)
columns : string
Comma-separated list of column x-coordinates.
table_area : list
List of tuples of the form (x1, y1, x2, y2) where
(x1, y1) -> left-top and (x2, y2) -> right-bottom in PDFMiner's
coordinate space, denoting table areas to analyze.
(optional, default: None)
ytol : int
Tolerance to account for when grouping rows
together. (optional, default: 2)
columns : list
List of strings where each string is comma-separated values of
x-coordinates in PDFMiner's coordinate space.
(optional, default: None)
ncolumns : list
List of ints specifying the number of columns in each table.
(optional, default: None)
ytol : list
List of ints specifying the y-tolerance parameters.
(optional, default: [2])
mtol : list
List of ints specifying the m-tolerance parameters.
(optional, default: [0])
margins : tuple
PDFMiner margins. (char_margin, line_margin, word_margin)
(optional, default: (1.0, 0.5, 0.1))
debug : bool
Debug by visualizing textboxes. (optional, default: False)
Attributes
----------
tables : dict
Dictionary with page number as key and list of tables on that
page as value.
Set to True to generate a matplotlib plot of
LTTextLineHorizontals in order to select table_area, columns.
(optional, default: False)
"""
def __init__(self, table_area=None, columns=None, ncolumns=None, ytol=[2],
mtol=[0], margins=(1.0, 0.5, 0.1), debug=False):
@ -211,13 +283,16 @@ class Stream:
self.debug = debug
def get_tables(self, pdfname):
"""Returns all tables found in given pdf.
"""get_tables
Parameters
----------
pdfname : string
Path to single page pdf file.
Returns
-------
tables : dict
Dictionary with page number as key and list of tables on that
page as value.
page : dict
"""
layout, dim = get_page_layout(pdfname, char_margin=self.char_margin,
line_margin=self.line_margin, word_margin=self.word_margin)
@ -237,10 +312,10 @@ class Stream:
if self.table_area is not None:
if self.columns is not None:
if len(self.table_area) != len(self.columns):
raise ValueError("message")
raise ValueError("Length of columns should be equal to table_area.")
if self.ncolumns is not None:
if len(self.table_area) != len(self.ncolumns):
raise ValueError("message")
raise ValueError("Length of ncolumns should be equal to table_area.")
table_bbox = {}
for area in self.table_area:
x1, y1, x2, y2 = area.split(",")
@ -369,7 +444,8 @@ class Stream:
score = get_score([[50, rerror], [50, cerror]])
table_data['score'] = score
ar = encode_list(table.get_list())
ar = table.get_list()
ar = encode_list(ar)
table_data['data'] = ar
empty_p, r_nempty_cells, c_nempty_cells = count_empty(ar)
table_data['empty_p'] = empty_p

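To see what "makes column coordinates continuous" means in `_join_columns`, here is a standalone copy of the logic shown in the diff, run on made-up numbers. The only intentional change is dividing by `2.0` so the result is the same on Python 2 and Python 3. The name `join_columns` and the sample values are illustrative.

```python
def join_columns(cols, text_x_min, text_x_max):
    """Replace gaps between columns with shared boundaries."""
    cols = sorted(cols)
    # Midpoint between the right edge of one column and the left
    # edge of the next becomes the boundary shared by both.
    mids = [(cols[i][0] + cols[i - 1][1]) / 2.0
            for i in range(1, len(cols))]
    mids.insert(0, text_x_min)
    mids.append(text_x_max)
    return [(mids[i], mids[i + 1]) for i in range(len(mids) - 1)]


print(join_columns([(30, 40), (10, 20), (50, 60)], 5, 65))
```

Three separated columns become three touching intervals that cover the whole span from `text_x_min` to `text_x_max`, so every text object falls inside some column.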
@ -4,20 +4,27 @@ from .cell import Cell
class Table:
"""Table
"""Table.
Defines a table object with coordinates relative to a left-bottom
origin, which is also PDFMiner's coordinate space.
Parameters
----------
cols : list
List of column x-coordinates.
List of tuples representing column x-coordinates in increasing
order.
rows : list
List of row y-coordinates.
List of tuples representing row y-coordinates in decreasing
order.
Attributes
----------
cells : list
2-D list of cell objects.
List of cell objects with row-major ordering.
nocont_ : int
Number of lines that did not contribute to setting cell edges.
"""
def __init__(self, cols, rows):
@ -29,20 +36,18 @@ class Table:
self.nocont_ = 0
def set_edges(self, vertical, horizontal, jtol=2):
"""Sets cell edges to True if corresponding line segments
are detected in the pdf image.
"""Sets a cell's edges to True depending on whether they
overlap with lines found by imgproc.
Parameters
----------
vertical : list
List of vertical line segments.
List of vertical lines detected by imgproc. Coordinates
scaled and translated to PDFMiner's coordinate space.
horizontal : list
List of horizontal line segments.
jtol : int
Tolerance to account for when comparing joint and line
coordinates. (optional, default: 2)
List of horizontal lines detected by imgproc. Coordinates
scaled and translated to PDFMiner's coordinate space.
"""
for v in vertical:
# find closest x coord
@ -151,8 +156,9 @@ class Table:
return self
def set_spanning(self):
"""Sets spanning values of a cell to True if it isn't
bounded by four edges.
"""Sets a cell's spanning_h or spanning_v attribute to True
depending on whether the cell spans/extends horizontally or
vertically.
"""
for i in range(len(self.cells)):
for j in range(len(self.cells[i])):
@ -199,7 +205,8 @@ class Table:
return self
def get_list(self):
"""Returns text from all cells as list of lists.
"""Returns a two-dimensional list of text assigned to each
cell.
Returns
-------

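The `Table` docstring places coordinates in PDFMiner's left-bottom-origin space, while lines found in the page image start out in a space with a left-top origin. A rough sketch of the translate-and-scale step for a single point going from PDF space to image space, assuming `factors` is `(scaling_factor_x, scaling_factor_y, pdf_y)` with `pdf_y` the page height; `pdf_to_image` is a hypothetical name:

```python
def pdf_to_image(point, factors):
    """Map a point from a left-bottom origin to a left-top origin.

    Hypothetical sketch; factors is (scale_x, scale_y, page_height).
    """
    x, y = point
    scaling_factor_x, scaling_factor_y, pdf_y = factors
    # Flip the y-axis by measuring the distance from the top of the
    # page, then scale both axes to image pixels.
    return (int(x * scaling_factor_x),
            int(abs(y - pdf_y) * scaling_factor_y))


# A 792-point-tall page rendered at twice the size.
print(pdf_to_image((100, 700), (2.0, 2.0, 792)))
```

A point near the top of the page (large y in PDF space) lands near row 0 of the image, which is why row coordinates are listed in decreasing order on the PDF side.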
@ -82,28 +82,58 @@ def rotate(x1, y1, x2, y2, angle):
def scale_to_image(k, factors):
"""Translates and scales PDFMiner coordinates to OpenCV's coordinate
space.
Parameters
----------
k : tuple
Tuple (x1, y1, x2, y2) representing table bounding box where
(x1, y1) -> lt and (x2, y2) -> rb in PDFMiner's coordinate
space.
factors : tuple
Tuple (scaling_factor_x, scaling_factor_y, pdf_y) where the
first two elements are scaling factors and pdf_y is height of
pdf.
Returns
-------
knew : tuple
Tuple (x1, y1, x2, y2) representing table bounding box where
(x1, y1) -> lt and (x2, y2) -> rb in OpenCV's coordinate
space.
"""
x1, y1, x2, y2 = k
scaling_factor_x, scaling_factor_y, pdf_y = factors
x1 = scale(x1, scaling_factor_x)
y1 = scale(abs(translate(-pdf_y, y1)), scaling_factor_y)
x2 = scale(x2, scaling_factor_x)
y2 = scale(abs(translate(-pdf_y, y2)), scaling_factor_y)
return int(x1), int(y1), int(x2), int(y2)
knew = (int(x1), int(y1), int(x2), int(y2))
return knew
def scale_to_pdf(tables, v_segments, h_segments, factors):
"""Translates and scales OpenCV coordinates to PDFMiner coordinate
"""Translates and scales OpenCV coordinates to PDFMiner's coordinate
space.
Parameters
----------
tables : dict
Dict with table boundaries as keys and list of intersections
in that boundary as their value.
v_segments : list
List of vertical line segments.
h_segments : list
List of horizontal line segments.
factors : tuple
Tuple (scaling_factor_x, scaling_factor_y, img_y) where the
first two elements are scaling factors and img_y is height of
image.
Returns
-------
@ -145,16 +175,28 @@ def scale_to_pdf(tables, v_segments, h_segments, factors):
def get_rotation(ltchar, lttextlh=None, lttextlv=None):
"""Detects if text in table is vertical or not and returns
its orientation.
"""Detects if text in table is vertical or not using the current
transformation matrix (CTM) and returns its orientation.
Parameters
----------
text : list
ltchar : list
List of PDFMiner LTChar objects.
lttextlh : list
List of PDFMiner LTTextLineHorizontal objects.
(optional, default: None)
lttextlv : list
List of PDFMiner LTTextLineVertical objects.
(optional, default: None)
Returns
-------
rotation : string
{'', 'left', 'right'}
'' if text in table is upright, 'left' if rotated 90 degrees
anti-clockwise and 'right' if rotated 90 degrees clockwise.
"""
rotation = ''
if lttextlh is not None and lttextlv is not None:
@ -173,26 +215,28 @@ def get_rotation(ltchar, lttextlh=None, lttextlv=None):
def segments_bbox(bbox, v_segments, h_segments):
"""Returns all text objects and line segments present inside a
"""Returns all line segments present inside a
table's bounding box.
Parameters
----------
bbox : tuple
text : list
Tuple (x1, y1, x2, y2) representing table bounding box where
(x1, y1) -> lb and (x2, y2) -> rt in PDFMiner's coordinate space.
v_segments : list
List of vertical line segments.
h_segments : list
List of horizontal line segments.
Returns
-------
text_bbox : list
v_s : list
List of vertical line segments that lie inside table.
h_s : list
List of horizontal line segments that lie inside table.
"""
lb = (bbox[0], bbox[1])
rt = (bbox[2], bbox[3])
@ -204,6 +248,23 @@ def segments_bbox(bbox, v_segments, h_segments):
def text_bbox(bbox, text):
"""Returns all text objects present inside a
table's bounding box.
Parameters
----------
bbox : tuple
Tuple (x1, y1, x2, y2) representing table bounding box where
(x1, y1) -> lb and (x2, y2) -> rt in PDFMiner's coordinate space.
text : list
List of PDFMiner text objects.
Returns
-------
t_bbox : list
List of PDFMiner text objects that lie inside table.
"""
lb = (bbox[0], bbox[1])
rt = (bbox[2], bbox[3])
t_bbox = [t for t in text if lb[0] - 2 <= (t.x0 + t.x1) / 2.0
@ -270,18 +331,21 @@ def merge_close_values(ar, mtol=2):
def get_row_index(t, rows):
"""Gets index of the row in which the given object falls by
comparing their co-ordinates.
"""Gets index of the row in which the given text object lies by
comparing their y-coordinates.
Parameters
----------
t : object
rows : list, sorted in decreasing order
rows : list
List of row coordinate tuples, sorted in decreasing order.
Returns
-------
r : int
error : float
"""
offset1, offset2 = 0, 0
for r in range(len(rows)):
@ -298,18 +362,21 @@ def get_row_index(t, rows):
def get_column_index(t, columns):
"""Gets index of the column in which the given object falls by
comparing their co-ordinates.
"""Gets index of the column in which the given text object lies by
comparing their x-coordinates.
Parameters
----------
t : object
columns : list
List of column coordinate tuples.
Returns
-------
c : int
error : float
"""
offset1, offset2 = 0, 0
for c in range(len(columns)):
@ -331,10 +398,10 @@ def get_score(error_weights):
Parameters
----------
error_weights : dict
Dict with a tuple of error percentages as key and weightage
assigned to them as value. Sum of all values should be equal
to 100.
error_weights : list
Two-dimensional list of the form [[p1, e1], [p2, e2], ...]
where pn is the weight assigned to list of errors en.
Sum of pn should be equal to 100.
Returns
-------
@ -352,109 +419,8 @@ def get_score(error_weights):
return score
def reduce_index(t, rotation, r_idx, c_idx):
"""Reduces index of a text object if it lies within a spanning
cell taking in account table rotation.
Parameters
----------
t : object
rotation : string
r_idx : int
c_idx : int
Returns
-------
r_idx : int
c_idx : int
"""
if not rotation:
if t.cells[r_idx][c_idx].spanning_h:
while not t.cells[r_idx][c_idx].left:
c_idx -= 1
if t.cells[r_idx][c_idx].spanning_v:
while not t.cells[r_idx][c_idx].top:
r_idx -= 1
elif rotation == 'left':
if t.cells[r_idx][c_idx].spanning_h:
while not t.cells[r_idx][c_idx].left:
c_idx -= 1
if t.cells[r_idx][c_idx].spanning_v:
while not t.cells[r_idx][c_idx].bottom:
r_idx += 1
elif rotation == 'right':
if t.cells[r_idx][c_idx].spanning_h:
while not t.cells[r_idx][c_idx].right:
c_idx += 1
if t.cells[r_idx][c_idx].spanning_v:
while not t.cells[r_idx][c_idx].top:
r_idx -= 1
return r_idx, c_idx
def outline(t):
"""Sets table border edges to True.
Parameters
----------
t : object
Returns
-------
t : object
"""
for i in range(len(t.cells)):
t.cells[i][0].left = True
t.cells[i][len(t.cells[i]) - 1].right = True
for i in range(len(t.cells[0])):
t.cells[0][i].top = True
t.cells[len(t.cells) - 1][i].bottom = True
return t
def fill_spanning(t, fill=None):
"""Fills spanning cells.
Parameters
----------
t : object
f : string
(optional, default: None)
Returns
-------
t : object
"""
if fill == "h":
for i in range(len(t.cells)):
for j in range(len(t.cells[i])):
if t.cells[i][j].get_text().strip() == '':
if t.cells[i][j].spanning_h:
t.cells[i][j].add_text(t.cells[i][j - 1].get_text())
elif fill == "v":
for i in range(len(t.cells)):
for j in range(len(t.cells[i])):
if t.cells[i][j].get_text().strip() == '':
if t.cells[i][j].spanning_v:
t.cells[i][j].add_text(t.cells[i - 1][j].get_text())
elif fill == "hv":
for i in range(len(t.cells)):
for j in range(len(t.cells[i])):
if t.cells[i][j].get_text().strip() == '':
if t.cells[i][j].spanning_h:
t.cells[i][j].add_text(t.cells[i][j - 1].get_text())
elif t.cells[i][j].spanning_v:
t.cells[i][j].add_text(t.cells[i - 1][j].get_text())
return t
def remove_empty(d):
"""Removes empty rows and columns from list of lists.
"""Removes empty rows and columns from a two-dimensional list.
Parameters
----------
@ -474,7 +440,7 @@ def remove_empty(d):
def count_empty(d):
"""Counts empty rows and columns from list of lists.
"""Counts empty rows and columns in a two-dimensional list.
Parameters
----------
@ -532,17 +498,19 @@ def get_text_objects(layout, LTType="char", t=None):
Parameters
----------
layout : object
Layout object.
PDFMiner LTPage object.
LTObject : object
Text object, either LTChar or LTTextLineHorizontal.
LTType : string
{'char', 'lh', 'lv'}
Specify 'char', 'lh', 'lv' to get LTChar, LTTextLineHorizontal,
and LTTextLineVertical objects respectively.
t : list (optional, default: None)
t : list
Returns
-------
t : list
List of text objects.
List of PDFMiner text objects.
"""
if LTType == "char":
LTObject = LTChar
@ -565,6 +533,33 @@ def get_text_objects(layout, LTType="char", t=None):
def get_page_layout(pname, char_margin=2.0, line_margin=0.5, word_margin=0.1,
detect_vertical=True, all_texts=True):
"""Returns a PDFMiner LTPage object and page dimension of a single
page pdf. See https://euske.github.io/pdfminer/ to get definitions
of kwargs.
Parameters
----------
pname : string
Path to pdf file.
char_margin : float
line_margin : float
word_margin : float
detect_vertical : bool
all_texts : bool
Returns
-------
layout : object
PDFMiner LTPage object.
dim : tuple
pdf page dimension of the form (width, height).
"""
with open(pname, 'rb') as f:
parser = PDFParser(f)
document = PDFDocument(parser)

View File

@ -4,26 +4,24 @@
contain the root `toctree` directive.
==================================
Camelot: PDF parsing made simpler!
Camelot: pdf parsing made simpler!
==================================
Camelot is a Python 2.7 library and command-line tool for getting tables out of PDF files.
Camelot is a Python 2.7 library and command-line tool for getting tables out of pdf files.
Why another PDF table parsing library?
Why another pdf table parsing library?
======================================
We tried a lot of tools available online to get tables out of PDFs, but each one had its limitations. `PDFTables`_ stopped its open source development in 2013. `SolidConverter`_ which powers `Smallpdf`_ is closed source. Recently, `Docparser`_ was launched, which again is closed source. `Tabula`_, though being open source, doesn't always give correct output. In most cases, we had to resort to writing custom scripts for each type of PDF.
We tried a lot of tools available online to parse tables from pdf files. `PDFTables`_ and `SolidConverter`_ are closed-source commercial products, and a free trial doesn't last forever. `Tabula`_, which is open source, isn't very scalable. We found nothing that gave us complete control over the parsing process. In most cases, we didn't get the correct output and had to resort to writing custom scripts for each type of pdf.
.. _PDFTables: https://pdftables.com/
.. _SolidConverter: http://www.soliddocuments.com/pdf/-to-word-converter/304/1
.. _Smallpdf: smallpdf.com
.. _Docparser: https://docparser.com/
.. _Tabula: http://tabula.technology/
PDFs have feelings too
======================
Some background
===============
PDF started as `The Camelot Project`_ when people wanted a cross-platform way to share documents, since a document looked different on each system. A PDF contains characters placed at specific x,y-coordinates. Spaces are simulated by placing characters relatively far apart.
PDF started as `The Camelot Project`_ when people wanted a cross-platform way for sending and viewing documents. A pdf file contains characters placed at specific x,y-coordinates. Spaces are simulated by placing characters relatively far apart.
Camelot uses two methods to parse tables from PDFs, :doc:`lattice <lattice>` and :doc:`stream <stream>`. The names were taken from Tabula but the implementation is somewhat different, though it follows the same philosophy. Lattice looks for lines between text elements while stream looks for whitespace between text elements.
@ -37,9 +35,9 @@ Usage
>>> from camelot.pdf import Pdf
>>> from camelot.lattice import Lattice
>>> extractor = Lattice(Pdf('us-030.pdf'))
>>> tables = extractor.get_tables()
>>> print tables['page-1'][0]
>>> manager = Pdf(Lattice(), 'us-030.pdf')
>>> tables = manager.extract()
>>> print tables['page-1']['table-1']['data']
.. csv-table::
:header: "Cycle Name","KI (1/km)","Distance (mi)","Percent Fuel Savings","","",""
@ -51,7 +49,7 @@ Usage
"2032_2","0.17","57.8","21.7%","0.3%","2.7%","1.2%"
"4171_1","0.07","173.9","58.1%","1.6%","2.1%","0.5%"
Camelot comes with a command-line tool in which you can specify the output format (csv, tsv, html, json, and xlsx), page numbers you want to parse and the output directory in which you want the output files to be placed. By default, the output files are placed in the same directory as the PDF.
Camelot comes with a CLI where you can specify page numbers, output format, output directory, etc. By default, the output files are placed in the same directory as the PDF.
::
@ -63,11 +61,23 @@ Camelot comes with a command-line tool in which you can specify the output forma
options:
-h, --help Show this screen.
-v, --version Show version.
-V, --verbose Verbose.
-p, --pages <pageno> Comma-separated list of page numbers.
Example: -p 1,3-6,10 [default: 1]
-P, --parallel Parallelize the parsing process.
-f, --format <format> Output format. (csv,tsv,html,json,xlsx) [default: csv]
-l, --log Print log to file.
-l, --log Log to file.
-o, --output <directory> Output directory.
-M, --cmargin <cmargin> Char margin. Chars closer than cmargin are
grouped together to form a word. [default: 2.0]
-L, --lmargin <lmargin> Line margin. Lines closer than lmargin are
grouped together to form a textbox. [default: 0.5]
-W, --wmargin <wmargin> Word margin. Insert blank spaces between chars
if distance between words is greater than word
margin. [default: 0.1]
-S, --print-stats List stats on the parsing process.
-T, --save-stats Save stats to a file.
-X, --plot <dist> Plot distributions. (page,all,rc)
camelot methods:
lattice Looks for lines between data.
@ -80,7 +90,7 @@ Installation
Make sure you have the most updated versions for `pip` and `setuptools`. You can update them by::
pip install -U pip, setuptools
pip install -U pip setuptools
The required dependencies include `numpy`_, `OpenCV`_ and `ImageMagick`_.
@ -88,46 +98,10 @@ The required dependencies include `numpy`_, `OpenCV`_ and `ImageMagick`_.
.. _OpenCV: http://opencv.org/
.. _ImageMagick: http://www.imagemagick.org/script/index.php
We strongly recommend that you use a `virtual environment`_ to install Camelot. If you don't want to use a virtual environment, then skip the next section.
Installing virtualenvwrapper
----------------------------
You'll need to install `virtualenvwrapper`_.
::
pip install virtualenvwrapper
or
::
sudo pip install virtualenvwrapper
After installing virtualenvwrapper, add the following lines to your `.bashrc` and source it.
::
export WORKON_HOME=$HOME/.virtualenvs
source /usr/bin/virtualenvwrapper.sh
.. note:: The path to `virtualenvwrapper.sh` could be different on your system.
Finally make a virtual environment using::
mkvirtualenv camelot
Installing dependencies
-----------------------
`numpy` can be install using `pip`.
::
pip install numpy
`OpenCV` and `imagemagick` can be installed using your system's default package manager.
numpy can be installed using `pip`. OpenCV and imagemagick can be installed using your system's default package manager.
Linux
^^^^^
@ -151,17 +125,10 @@ OS X
brew install homebrew/science/opencv imagemagick
If you're working in a virtualenv, you'll need to create a symbolic link for the OpenCV shared object file::
sudo ln -s /path/to/system/site-packages/cv2.so ~/path/to/virtualenv/site-packages/cv2.so
Finally, `cd` into the project directory and install by doing::
Finally, `cd` into the project directory and install Camelot by running::
make install
.. _virtual environment: http://virtualenvwrapper.readthedocs.io/en/latest/install.html#basic-installation
.. _virtualenvwrapper: https://virtualenvwrapper.readthedocs.io/en/latest/
API Reference
=============

View File

@ -4,15 +4,15 @@
Lattice
=======
Lattice method is designed to work on PDFs which have tables with well-defined grids. It looks for lines on a page to form a table representation.
Lattice method is designed to work on pdf files which have tables with well-defined grids. It looks for lines on a page to form a table.
Lattice uses OpenCV to apply a set of morphological transformations (erosion and dilation) to find horizontal and vertical line segments in a PDF page after converting it to an image using imagemagick.
Lattice uses OpenCV to apply a set of morphological transformations (erosion and dilation) to find horizontal and vertical line segments in a pdf page after converting it to an image using imagemagick.
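The idea behind erosion followed by dilation can be sketched without OpenCV. The snippet below is a minimal illustration in pure Python, not Camelot's implementation: the function names `erode`, `dilate` and `horizontal_lines` are hypothetical, and the image is assumed to be a binary grid (1 for foreground, 0 for background). Eroding each row with a `k`-wide horizontal structuring element removes every run of foreground pixels shorter than `k`, and dilating the result restores the surviving runs to their original length.

```python
def erode(row, k):
    """Keep a pixel only if the k-wide window starting at it is all foreground."""
    n = len(row)
    return [1 if i + k <= n and all(row[i:i + k]) else 0 for i in range(n)]


def dilate(row, k):
    """Grow every surviving pixel back to a run of k pixels to its right."""
    n = len(row)
    return [1 if any(row[max(0, j - k + 1):j + 1]) else 0 for j in range(n)]


def horizontal_lines(image, k):
    """Erode then dilate each row, so only horizontal runs of at least k pixels remain."""
    return [dilate(erode(row, k), k) for row in image]


image = [
    [0, 1, 1, 1, 1, 1, 0],  # a run of 5 pixels: kept for k = 4
    [0, 0, 1, 0, 0, 0, 0],  # an isolated pixel (e.g. part of a character): removed
    [1, 1, 0, 1, 1, 1, 1],  # the run of 2 is removed, the run of 4 is kept
]
lines = horizontal_lines(image, 4)
```

Vertical segments can be found the same way by applying the functions to the columns of the image instead of its rows.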
.. note:: Currently, Lattice only works on PDFs that contain text i.e. they are not composed of an image of the text. However, we plan to add `OCR support`_ in the future.
.. note:: Currently, Lattice only works on pdf files that contain text. However, we plan to add `OCR support`_ in the future.
.. _OCR support: https://github.com/socialcopsdev/camelot/issues/14
Let's see how Lattice processes this PDF, step by step.
Let's see how Lattice processes this pdf, step by step.
Line segments are detected in the first step.
@ -40,7 +40,7 @@ The detected line segments are overlapped again, this time by `or` ing their pix
:scale: 50%
:align: left
Since dimensions of a PDF and its image vary; table contours, intersections and segments are scaled and translated to the PDF's coordinate space. A representation of the table is then created using these scaled coordinates.
Since the dimensions of a pdf page and its image differ, table contours, intersections and segments are scaled and translated to the pdf's coordinate space. A representation of the table is then created using these scaled coordinates.
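The scaling and translation step can be sketched as follows. This is a simplified illustration rather than Camelot's own `scale_to_pdf`: the helper name `image_to_pdf` and its arguments are hypothetical. It assumes that the image origin is at the top-left corner with y growing downwards (as in OpenCV), while the pdf origin is at the bottom-left corner with y growing upwards (as in PDFMiner).

```python
def image_to_pdf(point, img_dim, pdf_dim):
    """Map an (x, y) point from image space to pdf space.

    img_dim and pdf_dim are (width, height) tuples.
    """
    x, y = point
    img_width, img_height = img_dim
    pdf_width, pdf_height = pdf_dim
    scaling_factor_x = pdf_width / float(img_width)
    scaling_factor_y = pdf_height / float(img_height)
    # Flip the y-axis first (translate by the image height), then scale.
    pdf_x = x * scaling_factor_x
    pdf_y = abs(y - img_height) * scaling_factor_y
    return pdf_x, pdf_y


# A point 400 pixels below the top of a 1200x1600 image, on a 600x800 page.
pdf_point = image_to_pdf((300, 400), (1200, 1600), (600, 800))
```

In this example the point ends up 600 units above the bottom of the page, since the 400 pixels measured from the top of the image become 1200 pixels measured from the bottom before scaling by 0.5.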
.. image:: assets/table.png
:height: 674
@ -63,9 +63,9 @@ Finally, the characters found on the page are assigned to cells based on their x
>>> from camelot.pdf import Pdf
>>> from camelot.lattice import Lattice
>>> extractor = Lattice(Pdf('us-030.pdf'))
>>> tables = extractor.get_tables()
>>> print tables['page-1'][0]
>>> manager = Pdf(Lattice(), 'us-030.pdf')
>>> tables = manager.extract()
>>> print tables['page-1']['table-1']['data']
.. csv-table::
:header: "Cycle Name","KI (1/km)","Distance (mi)","Percent Fuel Savings","","",""
@ -82,7 +82,7 @@ Scale
The scale parameter is used to determine the length of the structuring element used for morphological transformations. The lengths of the vertical and horizontal structuring elements are found by dividing the image's height and width, respectively, by `scale`. A larger `scale` leads to a smaller structuring element, which means that smaller lines will be detected. The default value for scale is 15.
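As a rough illustration of the arithmetic described above, the sketch below computes the two structuring element lengths for a given image size. The helper name `structuring_lengths` is hypothetical, and the use of integer division is an assumption, since the text only says that the dimensions are divided by `scale`.

```python
def structuring_lengths(img_width, img_height, scale=15):
    """Return (horizontal_length, vertical_length) in pixels."""
    horizontal_length = img_width // scale  # width divided by scale
    vertical_length = img_height // scale   # height divided by scale
    return horizontal_length, vertical_length


default_lengths = structuring_lengths(1500, 3000)         # scale=15
finer_lengths = structuring_lengths(1500, 3000, scale=40)
```

With the default scale, a 1500x3000 image gives lengths of 100 and 200 pixels. With `scale=40` the lengths drop to 37 and 75 pixels, so shorter line segments survive the morphological transformations.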
Let's consider this PDF.
Let's consider this pdf file.
.. .. _this: insert link for row_span_1.pdf
@ -105,16 +105,16 @@ Voila! It detected the smaller lines.
Fill
----
In the PDF used above, you can see that some cells spanned a lot of rows, `fill` just copies the same value to all rows/columns of a spanning cell. You can apply fill horizontally, vertically or both. Let us fill the output for the PDF we used above, vertically.
In the file used above, you can see that some cells span several rows. `fill` copies the same value to all rows/columns of a spanning cell. You can apply fill horizontally, vertically or both. Let us fill the output for the file we used above, vertically.
::
>>> from camelot.pdf import Pdf
>>> from camelot.lattice import Lattice
>>> extractor = Lattice(Pdf('row_span_1.pdf'), fill='v', scale=40)
>>> tables = extractor.get_tables()
>>> print tables['page-1'][0]
>>> manager = Pdf(Lattice(fill=['v'], scale=40), 'row_span_1.pdf')
>>> tables = manager.extract()
>>> print tables['page-1']['table-1']['data']
.. csv-table::
:header: "Plan Type","County","Plan Name","Totals"
@ -162,7 +162,7 @@ In the PDF used above, you can see that some cells spanned a lot of rows, `fill`
Invert
------
To find line segments, Lattice needs the lines of the PDF to be in foreground. So, if you encounter a PDF like this, just set invert to True.
To find line segments, Lattice needs the lines of the pdf file to be in the foreground. So, if you encounter a file like this, just set invert to True.
.. .. _this: insert link for lines_in_background_1.pdf
@ -171,9 +171,9 @@ To find line segments, Lattice needs the lines of the PDF to be in foreground. S
>>> from camelot.pdf import Pdf
>>> from camelot.lattice import Lattice
>>> extractor = Lattice(Pdf('lines_in_background_1.pdf'), invert=True)
>>> tables = extractor.get_tables()
>>> print tables['page-1'][0]
>>> manager = Pdf(Lattice(invert=True), 'lines_in_background_1.pdf')
>>> tables = manager.extract()
>>> print tables['page-1']['table-1']['data']
.. csv-table::
:header: "State","Date","Halt stations","Halt days","Persons directly reached(in lakh)","Persons trained","Persons counseled","Persons testedfor HIV"
@ -186,8 +186,8 @@ To find line segments, Lattice needs the lines of the PDF to be in foreground. S
"Kerala","23.2.2010 to 11.3.2010","9","17","1.42","3,559","2,173","855"
"Total","","47","92","11.81","22,455","19,584","10,644"
Lattice can also parse PDFs with tables like these that are rotated clockwise/anti-clockwise by 90 degrees.
Lattice can also parse pdf files with tables like these that are rotated clockwise/anti-clockwise by 90 degrees.
.. .. _these: insert link for left_rotated_table.pdf
You can call Lattice with debug={'line', 'intersection', 'contour', 'table'}, and call `plot_geometry()` which will generate an image like the ones on this page, with the help of which you can modify various parameters. See :doc:`API doc <api>` for more information.
You can call Lattice with debug={'line', 'intersection', 'contour', 'table'}, and call `debug_plot()` which will generate an image like the ones on this page, with the help of which you can modify various parameters. See :doc:`API doc <api>` for more information.

View File

@ -4,20 +4,20 @@
Stream
======
Stream method is the complete opposite of Lattice and works on PDFs which have text placed uniformly apart across rows to simulate a table. It looks for spaces between text to form a table representation.
Stream method is the complete opposite of Lattice and works on pdf files which have text placed uniformly apart across rows to simulate a table. It looks for spaces between text to form a table representation.
Stream builds on top of PDFMiner's functionality of grouping characters on a page into words and sentences. After getting these words, it groups them into rows based on their y-coordinates and tries to guess the number of columns a PDF table might have by calculating the mode of the number of words in each row. Additionally, the user can specify the number of columns or column x-coordinates.
Stream builds on top of PDFMiner's functionality of grouping characters on a page into words and sentences. After getting these words, it groups them into rows based on their y-coordinates and tries to guess the number of columns a pdf table might have by calculating the mode of the number of words in each row. Additionally, the user can specify the number of columns or column x-coordinates.
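The column-guessing heuristic described above can be sketched in a few lines. This is a simplified illustration, not Stream's actual code: words are assumed to be plain `(x, y, text)` tuples instead of PDFMiner objects, and the names `group_rows`, `guess_ncolumns` and the `ytol` tolerance are hypothetical.

```python
from collections import Counter


def group_rows(words, ytol=2.0):
    """Group (x, y, text) words into rows, top to bottom, by y-coordinate."""
    rows = []
    for word in sorted(words, key=lambda w: -w[1]):
        # A word joins the current row if its y is close to the row's first word.
        if rows and abs(rows[-1][0][1] - word[1]) <= ytol:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order the words of each row from left to right.
    return [sorted(row, key=lambda w: w[0]) for row in rows]


def guess_ncolumns(rows):
    """Return the most common number of words per row (the mode)."""
    counts = Counter(len(row) for row in rows)
    return counts.most_common(1)[0][0]


words = [
    (50, 700, "State"), (200, 700, "Date"), (350, 700, "Total"),
    (50, 680, "Kerala"), (200, 680.5, "2010"), (350, 680, "855"),
    (50, 660, "Note"), (200, 660, "n/a"),
]
rows = group_rows(words)
ncolumns = guess_ncolumns(rows)
```

Here two of the three rows contain three words, so the guess is three columns. A table where most rows have missing values would pull the mode down, which is the failure case where specifying the number of columns explicitly helps.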
Let's run it on this PDF.
Let's run it on this pdf.
::
>>> from camelot.pdf import Pdf
>>> from camelot.stream import Stream
>>> extractor = Stream(Pdf('eu-027.pdf'))
>>> tables = extractor.get_tables()
>>> print tables['page-1'][0]
>>> manager = Pdf(Stream(), 'eu-027.pdf')
>>> tables = manager.extract()
>>> print tables['page-1']['table-1']['data']
.. .. _this: insert link for eu-027.pdf
@ -66,9 +66,9 @@ But sometimes its guess could be incorrect, like in this case.
>>> from camelot.pdf import Pdf
>>> from camelot.stream import Stream
>>> extractor = Stream(Pdf('missing_values.pdf'))
>>> tables = extractor.get_tables()
>>> print tables['page-1'][0]
>>> manager = Pdf(Stream(), 'missing_values.pdf')
>>> tables = manager.extract()
>>> print tables['page-1']['table-1']['data']
.. .. _this: insert link for missing_values.pdf
@ -118,16 +118,16 @@ But sometimes its guess could be incorrect, like in this case.
"14...","",""
"Chronic...","",""
It guessed that the PDF has 3 columns, because there wasn't any data in the last 2 columns for most rows. So, let's specify the number of columns explicitly, following which, Stream will only consider rows that have 5 words, to decide on column boundaries.
It guessed that the pdf has 3 columns, because there wasn't any data in the last 2 columns for most rows. So, let's specify the number of columns explicitly, following which, Stream will only consider rows that have 5 words, to decide on column boundaries.
::
>>> from camelot.pdf import Pdf
>>> from camelot.stream import Stream
>>> extractor = Stream(Pdf('missing_values.pdf'), ncolumns=5)
>>> tables = extractor.get_tables()
>>> print tables['page-1'][0]
>>> manager = Pdf(Stream(ncolumns=[5]), 'missing_values.pdf')
>>> tables = manager.extract()
>>> print tables['page-1']['table-1']['data']
.. csv-table::
@ -175,15 +175,15 @@ It guessed that the PDF has 3 columns, because there wasn't any data in the last
"14...","","","",""
"Chronic...","","","",""
We can also specify the column x-coordinates. We need to call Stream with debug=True and use matplotlib's interface to note down the column x-coordinates we need. Let's try it on this PDF.
We can also specify the column x-coordinates. We need to call Stream with debug=True and use matplotlib's interface to note down the column x-coordinates we need. Let's try it on this pdf file.
::
>>> from camelot.pdf import Pdf
>>> from camelot.stream import Stream
>>> extractor = Stream(Pdf('mexican_towns.pdf'), debug=True)
>>> extractor.plot_text()
>>> manager = Pdf(Stream(debug=True), 'mexican_towns.pdf')
>>> manager.debug_plot()
.. image:: assets/columns.png
:height: 674
@ -198,9 +198,9 @@ After getting the x-coordinates, we just need to pass them to Stream, like this.
>>> from camelot.pdf import Pdf
>>> from camelot.stream import Stream
>>> extractor = Stream(Pdf('mexican_towns.pdf'), columns='28,67,180,230,425,475,700')
>>> tables = extractor.get_tables()
>>> print tables['page-1'][0]
>>> manager = Pdf(Stream(columns=['28,67,180,230,425,475,700']), 'mexican_towns.pdf')
>>> tables = manager.extract()
>>> print tables['page-1']['table-1']['data']
.. csv-table::