A Python library to extract tabular data from PDFs

Camelot: PDF Table Parsing for Humans

Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!

Here's how you can extract tables from a PDF file. The example below reads a file named foo.pdf.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList tables=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
>>> tables[0].df # get a pandas DataFrame!

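| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings |                 |                 |                |
|------------|-----------|---------------|----------------------|-----------------|-----------------|----------------|
|            |           |               | Improved Speed       | Decreased Accel | Eliminate Stops | Decreased Idle |
| 2012_2     | 3.30      | 1.3           | 5.9%                 | 9.5%            | 29.2%           | 17.4%          |
| 2145_1     | 0.68      | 11.2          | 2.4%                 | 0.1%            | 9.5%            | 2.7%           |
| 4234_1     | 0.59      | 58.7          | 8.5%                 | 1.3%            | 8.5%            | 3.3%           |
| 2032_2     | 0.17      | 57.8          | 21.7%                | 0.3%            | 2.7%            | 1.2%           |
| 4171_1     | 0.07      | 173.9         | 58.1%                | 1.6%            | 2.1%            | 0.5%           |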
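
The extracted table carries a two-row header, because "Percent Fuel Savings" spans four columns. Before analysis you may want a single row of column names. The following is a minimal sketch in plain Python, not part of Camelot: `flatten_header` is a hypothetical helper, and the row layout (empty strings where the spanned header cells were) is an assumption about what `tables[0].df.values.tolist()` would return for this table.

```python
# Rows as they might come back from tables[0].df.values.tolist().
# Assumption: cells under a spanning header come back as empty strings.
rows = [
    ["Cycle Name", "KI (1/km)", "Distance (mi)", "Percent Fuel Savings", "", "", ""],
    ["", "", "", "Improved Speed", "Decreased Accel", "Eliminate Stops", "Decreased Idle"],
    ["2012_2", "3.30", "1.3", "5.9%", "9.5%", "29.2%", "17.4%"],
    ["2145_1", "0.68", "11.2", "2.4%", "0.1%", "9.5%", "2.7%"],
]


def flatten_header(top, sub):
    """Merge a two-row header into one list of column names (hypothetical helper)."""
    names = []
    current = ""
    for top_cell, sub_cell in zip(top, sub):
        if top_cell:
            # A non-empty top cell starts a new header group.
            current = top_cell
        # Blank top cells inherit the most recent group label.
        names.append(f"{current}: {sub_cell}" if sub_cell else current)
    return names


header = flatten_header(rows[0], rows[1])
# One dict per data row, keyed by the flattened column names.
records = [dict(zip(header, row)) for row in rows[2:]]

print(header)
print(records[0]["Percent Fuel Savings: Eliminate Stops"])
```

The same idea carries over to pandas directly: compute the names from the first two rows of the DataFrame, assign them to `df.columns`, and drop those two rows.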

There's a command-line interface too!

Why Camelot?

You are in control: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (Since everything in the real world, including PDF table extraction, is fuzzy.)
Metrics: Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
Each table is a pandas DataFrame, which enables seamless integration into data analysis workflows.
Export to multiple formats, including JSON, Excel and HTML.
Simple and elegant API, written in Python!

See comparison with other PDF parsing libraries and tools.
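
As a rough illustration of the metrics point above, tables can be filtered on the `accuracy` and `whitespace` values in their `parsing_report`. This is only a sketch: `keep_good_tables` is a hypothetical helper, the thresholds are arbitrary assumptions, and the stand-in objects below merely mimic the `parsing_report` attribute so the snippet runs without a PDF. The second report is made up for the example.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return the tables whose parsing_report passes both thresholds.

    Hypothetical helper; the default thresholds are arbitrary and should
    be tuned for your own documents.
    """
    good = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            good.append(table)
    return good


# Stand-ins for Camelot tables: only the parsing_report attribute is mimicked.
sample = [
    # Report taken from the example earlier in this README.
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    # Made-up report for a poorly parsed table.
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 80.0, "order": 2, "page": 1}),
]

good = keep_good_tables(sample)
print(len(good), "of", len(sample), "tables kept")
```

With a real PDF, the same function could be called on the `TableList` returned by `camelot.read_pdf`, assuming it can be iterated like a list.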

Installation

After installing the dependencies (tk and ghostscript), you can simply use pip to install Camelot:

$ pip install camelot-py

Alternatively

You can install the dependencies tk and ghostscript using your system's package manager. After that, clone the repo using:

$ git clone https://www.github.com/socialcopsdev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install .

Note: Use a virtualenv if you don't want to affect your global Python installation.

Documentation

Great documentation is available at insert link.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check out the latest sources with:

$ git clone https://www.github.com/socialcopsdev/camelot

Setting up a development environment

You can install the development dependencies easily, using pip:

$ pip install camelot-py[dev]

Testing

After installation, you can run tests using:

$ python setup.py test

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository.

License

This project is licensed under the MIT License; see the LICENSE file for details.