Camelot: PDF Table Parsing for Humans

Camelot is a Python 2.7 library and command-line tool for extracting tabular data from PDF files.

Usage

>>> import camelot
>>> tables = camelot.read_pdf("foo.pdf")
>>> tables
<TableList n=2>
>>> tables.export("foo.csv", f="csv", compress=True) # json, excel, html
>>> tables[0]
<Table shape=(3,4)>
>>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
>>> tables[0].parsing_report
{
    "accuracy": 96,
    "whitespace": 80,
    "order": 1,
    "page": 1
}
>>> df = tables[0].df
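
The parsing report can also be used to decide which tables are worth keeping. The following is only a sketch: `pick_reliable` and its thresholds are hypothetical and not part of Camelot, and it assumes that a higher "accuracy" and a lower "whitespace" indicate a cleaner extraction. It operates on plain dicts shaped like the report above, so it runs without a PDF.

```python
def pick_reliable(reports, min_accuracy=90, max_whitespace=85):
    """Return the reports that pass both thresholds.

    Hypothetical helper; the default thresholds are illustrative
    assumptions, not values recommended by Camelot.
    """
    return [
        r for r in reports
        if r["accuracy"] >= min_accuracy and r["whitespace"] <= max_whitespace
    ]


# Stand-in data shaped like the parsing_report shown above. With a real
# PDF this list could be built as [t.parsing_report for t in tables],
# assuming the TableList can be iterated.
reports = [
    {"accuracy": 96, "whitespace": 80, "order": 1, "page": 1},
    {"accuracy": 41, "whitespace": 97, "order": 2, "page": 1},
]

reliable = pick_reliable(reports)
print([(r["page"], r["order"]) for r in reliable])  # prints [(1, 1)]
```

The "page" and "order" keys identify which table on which page each surviving report belongs to.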

Dependencies

The dependencies include tk and ghostscript.

Installation

Make sure you have the latest versions of pip and setuptools. You can update them with:

pip install -U pip setuptools

Installing dependencies

tk and ghostscript can be installed using your system's default package manager.

Linux

  • Ubuntu
sudo apt-get install python-tk ghostscript
  • Arch Linux
sudo pacman -S tk ghostscript

OS X

brew install tcl-tk ghostscript
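
To confirm that both system dependencies are visible to Python, a small check along these lines may help. It is a sketch under stated assumptions: that the Ghostscript executable is named `gs` (as is usual on Linux and OS X) and that the tk bindings import as `Tkinter` on Python 2.7 or `tkinter` on Python 3. The helper names are hypothetical and not part of Camelot.

```python
import importlib

try:
    from shutil import which  # Python 3.3+
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2.7


def has_executable(name):
    """Return True if an executable called `name` is on the PATH."""
    return which(name) is not None


def has_module(*names):
    """Return True if at least one of the given modules can be imported."""
    for name in names:
        try:
            importlib.import_module(name)
            return True
        except ImportError:
            continue
    return False


status = {
    # Assumption: the Ghostscript binary is installed as "gs".
    "ghostscript": has_executable("gs"),
    # Assumption: tk bindings are "Tkinter" (2.7) or "tkinter" (3.x).
    "tk": has_module("Tkinter", "tkinter"),
}

for name in sorted(status):
    print("{:<12} {}".format(name, "found" if status[name] else "MISSING"))
```

A "MISSING" entry suggests the corresponding package-manager step above did not complete or installed the dependency somewhere Python cannot see.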

Finally, cd into the project directory and install with:

python setup.py install

Development

Code

You can check out the latest sources with:

git clone https://github.com/socialcopsdev/camelot.git

Contributing

See Contributing guidelines.

Testing

python setup.py test

License

BSD License