A Python library to extract tabular data from PDFs
 
 
Vinayak Mehta 80f6870117 Update HISTORY.md 2018-10-05 20:21:02 +05:30


Camelot: PDF Table Extraction for Humans


Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!


Here's how you can extract tables from PDF files. Check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
>>> tables[0].df # get a pandas DataFrame!
Cycle Name  KI (1/km)  Distance (mi)  Percent Fuel Savings
                                      Improved Speed  Decreased Accel  Eliminate Stops  Decreased Idle
2012_2      3.30       1.3            5.9%            9.5%             29.2%            17.4%
2145_1      0.68       11.2           2.4%            0.1%             9.5%             2.7%
4234_1      0.59       58.7           8.5%            1.3%             8.5%             3.3%
2032_2      0.17       57.8           21.7%           0.3%             2.7%             1.2%
4171_1      0.07       173.9          58.1%           1.6%             2.1%             0.5%
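
The per-table export methods named in the example (to_csv, to_json, to_excel, to_html) share a naming pattern, so picking one at runtime can be sketched as below. This is only an illustration: export_table and SUPPORTED_FORMATS are hypothetical names, not part of Camelot, and the sketch assumes nothing beyond a table object that has those four methods, each accepting an output path.

```python
# Formats taken from the method names shown in the example above.
SUPPORTED_FORMATS = ('csv', 'json', 'excel', 'html')


def export_table(table, path, fmt='csv'):
    """Write one table to `path` by calling its `to_<fmt>` method.

    Hypothetical helper for illustration; not a Camelot function.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError('unsupported format: {!r}'.format(fmt))
    # Look up to_csv / to_json / to_excel / to_html by name.
    writer = getattr(table, 'to_' + fmt)
    return writer(path)
```

With the tables object from the example, export_table(tables[0], 'foo.json', fmt='json') would then be equivalent to calling tables[0].to_json('foo.json').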

There's a command-line interface too!

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

Why Camelot?

  • You are in control: Unlike other libraries and tools, which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
  • Export to multiple formats, including JSON, Excel and HTML.
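
Discarding bad tables by their metrics can be sketched as a small filter over parsing_report. This is a minimal sketch, not Camelot's own API: keep_good_tables and its threshold defaults are hypothetical, and it assumes only that each table exposes the parsing_report dict with the 'accuracy' and 'whitespace' keys shown in the example above, and that the collection of tables can be iterated.

```python
def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return the tables whose parsing report passes both thresholds.

    Hypothetical helper: the default thresholds are illustrative only
    and should be tuned for your own documents.
    """
    good = []
    for table in tables:
        report = table.parsing_report
        # Keep a table only if it is accurate enough and not mostly empty.
        if (report['accuracy'] >= min_accuracy
                and report['whitespace'] <= max_whitespace):
            good.append(table)
    return good
```

Called as keep_good_tables(tables), this returns a plain list of the tables that pass, so the rest can be dropped without opening each one by hand.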

See the comparison with other PDF table extraction libraries and tools.

Installation

After installing the dependencies (tk and ghostscript), you can simply use pip to install Camelot:

$ pip install camelot-py

Alternatively

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/socialcopsdev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install .

Note: Use a virtualenv if you don't want to affect your global Python installation.

Documentation

Great documentation is available at http://camelot-py.readthedocs.io/.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check the latest sources with:

$ git clone https://www.github.com/socialcopsdev/camelot

Setting up a development environment

You can install the development dependencies easily, using pip:

$ pip install "camelot-py[dev]"

Testing

After installation, you can run tests using:

$ python setup.py test

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License; see the LICENSE file for details.