A Python library to extract tabular data from PDFs
 
 
Go to file
Vinayak Mehta 066c5c6aca Add LICENSE and _templates 2018-09-11 18:47:29 +05:30
camelot Fix docstrings and interlinks 2018-09-11 08:31:37 +05:30
docs Add LICENSE and _templates 2018-09-11 18:47:29 +05:30
tests Port tests 2018-09-09 05:29:24 +05:30
.coveragerc Add coveragerc and update Makefile 2016-08-08 17:24:13 +05:30
.gitignore Add docstrings and update docs 2018-09-09 10:00:22 +05:30
LICENSE Add LICENSE and _templates 2018-09-11 18:47:29 +05:30
README.md Add LICENSE and _templates 2018-09-11 18:47:29 +05:30
requirements-dev.txt Fix setup.py 2018-09-11 08:31:37 +05:30
requirements.txt Fix setup.py 2018-09-11 08:31:37 +05:30
setup.cfg Add setup.cfg 2018-09-09 05:41:42 +05:30
setup.py Fix setup.py 2018-09-11 08:31:37 +05:30

README.md

Camelot: PDF Table Parsing for Humans

Camelot is a Python library and command-line tool for extracting tables from PDF files.

Usage

API

>>> import camelot
>>> tables = camelot.read_pdf("foo.pdf")
>>> tables
<TableList n=2>
>>> tables.export("foo.csv", f="csv", compress=True) # json, excel, html
>>> tables[0]
<Table shape=(3,4)>
>>> tables[0].parsing_report
{
    "accuracy": 96,
    "whitespace": 80,
    "order": 1,
    "page": 1
}
>>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
>>> tables[0].df

Command-line interface

Usage: camelot [OPTIONS] FILEPATH

Options:
  -p, --pages TEXT                Comma-separated page numbers to parse.
                                  Example: 1,3,4 or 1,4-end
  -o, --output TEXT               Output filepath.
  -f, --format [csv|json|excel|html]
                                  Output file format.
  -z, --zip                       Whether or not to create a ZIP archive.
  -m, --mesh                      Whether or not to use Lattice method of
                                  parsing. Stream is used by default.
  -T, --table_area TEXT           Table areas (x1,y1,x2,y2) to process.
                                  x1, y1
                                  -> left-top and x2, y2 -> right-bottom
  -split, --split_text            Whether or not to split text if it spans
                                  across multiple cells.
  -flag, --flag_size              (inactive) Whether or not to flag text which
                                  has uncommon size. (Useful to detect
                                  super/subscripts)
  -M, --margins <FLOAT FLOAT FLOAT>...
                                  char_margin, line_margin, word_margin for
                                  PDFMiner.
  -C, --columns TEXT              x-coordinates of column separators.
  -r, --row_close_tol INTEGER     Rows will be formed by combining text
                                  vertically within this tolerance.
  -c, --col_close_tol INTEGER     Columns will be formed by combining text
                                  horizontally within this tolerance.
  -back, --process_background     (with --mesh) Whether or not to process
                                  lines that are in background.
  -scale, --line_size_scaling INTEGER
                                  (with --mesh) Factor by which the page
                                  dimensions will be divided to get smallest
                                  length of detected lines.
  -copy, --copy_text [h|v]        (with --mesh) Specify direction in which
                                  text will be copied over in a spanning cell.
  -shift, --shift_text [l|r|t|b]  (with --mesh) Specify direction in which
                                  text in a spanning cell should flow.
  -l, --line_close_tol INTEGER    (with --mesh) Tolerance parameter used to
                                  merge close vertical lines and close
                                  horizontal lines.
  -j, --joint_close_tol INTEGER   (with --mesh) Tolerance parameter used to
                                  decide whether the detected lines and points
                                  lie close to each other.
  -block, --threshold_blocksize INTEGER
                                  (with --mesh) For adaptive thresholding,
                                  size of a pixel neighborhood that is used to
                                  calculate a threshold value for the pixel:
                                  3, 5, 7, and so on.
  -const, --threshold_constant INTEGER
                                  (with --mesh) For adaptive thresholding,
                                  constant subtracted from the mean or
                                  weighted mean.
                                  Normally, it is positive but
                                  may be zero or negative as well.
  -I, --iterations INTEGER        (with --mesh) Number of times for
                                  erosion/dilation is applied.
  -G, --geometry_type [text|table|contour|joint|line]
                                  Plot geometry found on pdf page for
                                  debugging.
                                  text: Plot text objects. (Useful to get
                                        table_area and columns coordinates)
                                  table: Plot parsed table.
                                  contour (with --mesh): Plot detected rectangles.
                                  joint (with --mesh): Plot detected line intersections.
                                  line (with --mesh): Plot detected lines.
  --help                          Show this message and exit.

Dependencies

The dependencies include tk and ghostscript.

Installation

Make sure you have the most updated versions for pip and setuptools. You can update them by

pip install -U pip setuptools

Installing dependencies

tk and ghostscript can be installed using your system's default package manager.

Linux

  • Ubuntu
sudo apt-get install python-tk ghostscript
  • Arch Linux
sudo pacman -S tk ghostscript

OS X

brew install tcl-tk ghostscript

Finally, cd into the project directory and install by

python setup.py install

Development

Code

You can check the latest sources with the command:

git clone https://github.com/socialcopsdev/camelot.git

Contributing

See Contributing guidelines.

Testing

python setup.py test