A Python library to extract tabular data from PDFs

Vinayak Mehta 941994f0bf Make present code work with new API		2018-09-04 23:34:49 +05:30

camelot

Camelot is a Python 2.7 library and command-line tool for getting tables out of PDF files.

Usage

>>> import camelot
>>> tables = camelot.read_pdf("foo.pdf")
>>> tables
<TableSet n=2>
>>> tables.to_csv(zip=True) # to_json, to_excel, to_html
>>> tables[0]
<Table shape=(3,4)>
>>> tables[0].parsing_report
{
    "accuracy": 96,
    "whitespace": 80,
    "time_taken": 0.5,
    "page": 1
}
>>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
>>> df = tables[0].to_df()
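
The parsing report shown above can be used to decide which tables are worth keeping. The following is a minimal sketch, not part of camelot: it assumes the returned TableSet can be iterated and that every table exposes `parsing_report` as a dict with an `"accuracy"` key, as in the example output. The helper name `select_reliable_tables` and the default threshold of 90 are hypothetical.

```python
def select_reliable_tables(tables, min_accuracy=90):
    """Return the tables whose parsing report meets the accuracy threshold.

    `tables` is any iterable whose items carry a `parsing_report` dict
    with an "accuracy" key (an assumption based on the example above).
    """
    return [t for t in tables if t.parsing_report["accuracy"] >= min_accuracy]
```

With the session above, `select_reliable_tables(tables)` would keep `tables[0]`, whose reported accuracy is 96.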

Camelot comes with a CLI where you can specify page numbers, output format, output directory, and other options. By default, the output files are placed in the same directory as the PDF.

Camelot: PDF parsing made simpler!

usage:
 camelot [options] <method> [<args>...]

options:
 -h, --help                Show this screen.
 -v, --version             Show version.
 -V, --verbose             Verbose.
 -p, --pages <pageno>      Comma-separated list of page numbers.
                           Example: -p 1,3-6,10  [default: 1]
 -P, --parallel            Parallelize the parsing process.
 -f, --format <format>     Output format. (csv,tsv,html,json,xlsx) [default: csv]
 -l, --log                 Log to file.
 -o, --output <directory>  Output directory.
 -M, --cmargin <cmargin>   Char margin. Chars closer than cmargin are
                           grouped together to form a word. [default: 2.0]
 -L, --lmargin <lmargin>   Line margin. Lines closer than lmargin are
                           grouped together to form a textbox. [default: 0.5]
 -W, --wmargin <wmargin>   Word margin. Insert blank spaces between chars
                           if distance between words is greater than word
                           margin. [default: 0.1]
 -J, --split_text          Split text lines if they span across multiple cells.
 -K, --flag_size           Flag substring if its size differs from the whole string.
                           Useful for super and subscripts.
 -X, --print-stats         List stats on the parsing process.
 -Y, --save-stats          Save stats to a file.
 -Z, --plot <dist>         Plot distributions. (page,all,rc)

camelot methods:
 lattice  Looks for lines between data.
 stream   Looks for spaces between data.

See 'camelot <method> -h' for more information on a specific method.
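
The `-p` option above takes a page spec such as `1,3-6,10`. As an illustration of how such a spec can be read, here is a small sketch that expands it into individual page numbers. It is not camelot's own parser; the function name `expand_pages` and its handling of stray commas and reversed ranges are assumptions made for this example.

```python
def expand_pages(spec):
    """Expand a page spec such as "1,3-6,10" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A range like "3-6" covers both endpoints.
            start, end = part.split("-", 1)
            start, end = int(start), int(end)
            if start > end:
                raise ValueError("invalid page range: %r" % part)
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_pages("1,3-6,10"))  # [1, 3, 4, 5, 6, 10]
```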

Dependencies

Currently, camelot works under Python 2.7.

The required dependencies include numpy, OpenCV and ghostscript.
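
Before installing, it can help to check which of these dependencies are already present. The script below is a rough sketch and not part of camelot. It assumes OpenCV's Python bindings are importable as `cv2` and that the ghostscript executable is named `gs`, as it usually is on Linux and OS X; the function names are hypothetical.

```python
import os


def module_available(name):
    """Return True if `name` can be imported in the current interpreter."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


def executable_on_path(name):
    """Return True if an executable called `name` is found on PATH."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return True
    return False


def check_dependencies():
    """Report which of the required dependencies appear to be installed."""
    return {
        "numpy": module_available("numpy"),
        "OpenCV": module_available("cv2"),  # OpenCV's Python module is cv2
        "ghostscript": executable_on_path("gs"),  # assumed binary name
    }


if __name__ == "__main__":
    for name, found in sorted(check_dependencies().items()):
        print("%-12s %s" % (name, "found" if found else "MISSING"))
```

The checks avoid Python 3 only helpers so the script can run under the Python 2.7 interpreter that camelot targets.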

Installation

Make sure you have the latest versions of pip and setuptools. You can update them with:

pip install -U pip setuptools

Installing dependencies

numpy can be installed using pip. OpenCV and ghostscript can be installed using your system's default package manager.

Linux

Arch Linux

sudo pacman -S opencv ghostscript

Ubuntu

sudo apt-get install libopencv-dev python-opencv python-tk ghostscript

OS X

brew install homebrew/science/opencv ghostscript

Finally, cd into the project directory and install with:

make install

Development

Code

You can check out the latest sources with the command:

git clone https://github.com/socialcopsdev/camelot.git

Contributing

See the Contributing doc.

Testing

make test

License

BSD License