A Python library to extract tabular data from PDFs
 
 
Vinayak Mehta a43d5ca2c7 Replace chars with textlines
* Add split function

* Add split_text and shift_text params

* Change get_rotation

* Move get_column_index to utils

* Add split_text and shift_text

* Fix split_text
2016-10-12 13:17:02 +05:30
camelot Replace chars with textlines 2016-10-12 13:17:02 +05:30
docs Add logo 2016-10-04 20:59:52 +05:30
examples Adds documentation 2016-08-09 17:23:50 +05:30
tests Replace chars with textlines 2016-10-12 13:17:02 +05:30
tools Minor Stream fix 2016-09-27 17:27:34 +05:30
.coveragerc Add coveragerc and update Makefile 2016-08-08 17:24:13 +05:30
.gitignore Add Makefile 2016-08-08 16:32:05 +05:30
Makefile Fix Makefile spaces to tabs 2016-08-08 17:26:54 +05:30
README.md Update docs 2016-10-04 17:50:48 +05:30
requirements.txt Replace imagemagick with ghostscript 2016-09-13 17:35:07 +05:30
setup.py Create python package 2016-07-29 21:09:39 +05:30


camelot

Camelot is a Python 2.7 library and command-line tool for getting tables out of PDF files.

Usage

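from camelot.pdf import Pdf
from camelot.lattice import Lattice

# Pair the Lattice extraction method with the PDF to parse
manager = Pdf(Lattice(), "/path/to/pdf")
# Run the extraction and collect the tables that were found
tables = manager.extract()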

Camelot also comes with a CLI that lets you specify page numbers, the output format, the output directory and more. By default, output files are placed in the same directory as the PDF.

Camelot: PDF parsing made simpler!

usage:
 camelot [options] <method> [<args>...]

options:
 -h, --help                Show this screen.
 -v, --version             Show version.
 -V, --verbose             Verbose.
 -p, --pages <pageno>      Comma-separated list of page numbers.
                           Example: -p 1,3-6,10  [default: 1]
 -P, --parallel            Parallelize the parsing process.
 -f, --format <format>     Output format. (csv,tsv,html,json,xlsx) [default: csv]
 -l, --log                 Log to file.
 -o, --output <directory>  Output directory.
 -M, --cmargin <cmargin>   Char margin. Chars closer than cmargin are
                           grouped together to form a word. [default: 2.0]
 -L, --lmargin <lmargin>   Line margin. Lines closer than lmargin are
                           grouped together to form a textbox. [default: 0.5]
 -W, --wmargin <wmargin>   Word margin. Insert blank spaces between chars
                           if distance between words is greater than word
                           margin. [default: 0.1]
 -S, --print-stats         List stats on the parsing process.
 -T, --save-stats          Save stats to a file.
 -X, --plot <dist>         Plot distributions. (page,all,rc)

camelot methods:
 lattice  Looks for lines between data.
 stream   Looks for spaces between data.

See 'camelot <method> -h' for more information on a specific method.
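
The --pages option accepts a comma-separated list that may contain ranges, such as 1,3-6,10. To illustrate how such a spec expands into individual page numbers, here is a minimal sketch. The helper name parse_pages is hypothetical and is not part of camelot; camelot's own parsing may differ.

```python
def parse_pages(spec):
    """Expand a page spec such as '1,3-6,10' into a sorted list of page numbers.

    Hypothetical helper for illustration only; not camelot's implementation.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A range such as '3-6' includes both endpoints
            start, end = (int(n) for n in part.split("-", 1))
            if start > end:
                raise ValueError("invalid page range: %r" % part)
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_pages("1,3-6,10"))  # prints [1, 3, 4, 5, 6, 10]
```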

Dependencies

Currently, camelot works under Python 2.7.

The required dependencies include numpy, OpenCV and ImageMagick.
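
Before installing camelot, it can help to confirm that the Python-level dependencies are importable. The sketch below assumes the usual module names numpy and cv2 (OpenCV's Python bindings). The helper missing_modules is hypothetical, and it does not check ImageMagick, which is a system tool rather than a Python module.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# An empty list means both modules were found in the current environment.
print(missing_modules(["numpy", "cv2"]))
```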

Installation

Make sure you have the latest versions of pip and setuptools. You can update them with:

pip install -U pip setuptools

Installing dependencies

numpy can be installed using pip. OpenCV and ImageMagick can be installed using your system's default package manager.

Linux

  • Arch Linux
sudo pacman -S opencv imagemagick
  • Ubuntu
sudo apt-get install libopencv-dev python-opencv imagemagick

OS X

brew install homebrew/science/opencv imagemagick

Finally, cd into the project directory and install with:

make install

Development

Code

You can check out the latest sources with the command:

git clone https://github.com/socialcopsdev/camelot.git

Contributing

See the Contributing doc.

Testing

make test

License

BSD License