A Python library to extract tabular data from PDFs
 
 
Jose Vargas 52adbbd796 [parsers.stream] - Use fall back column coordinates.
The Stream class would raise an IndexError when the 'columns' argument was specified
and the number of tables identified was larger than the number of items in the
'columns' argument.

This IndexError makes the Stream parser cumbersome for PDFs that consist mainly of
known, consistent table structures of interest to the caller, but whose tables may
vary in height, starting position, or number.
This is especially true within an automated or programmatic context.
Either the caller must call 'camelot.read_pdf' once per page, or
manipulate the 'columns' argument so as to avoid the IndexError. The former
isn't guaranteed to work, as a single page can contain multiple tables;
in that situation, the caller must resort to the latter even when
extracting tables from a single page.

The Stream class continues to function exactly the same when the 'table_areas'
argument is provided; this commit only changes the behavior of the Stream parser
when 'table_areas' is not provided.

This commit allows all tables to be easily extracted by specifying 'pages=all'
and providing the appropriate 'columns' argument value to
'camelot.read_pdf'.
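
The fallback idea can be sketched in plain Python. This is an illustration, not the code in this commit: the helper name `columns_for_table` is hypothetical, and the choice of reusing the last 'columns' entry for extra tables is an assumption, since the message above does not say which coordinates serve as the fallback.

```python
def columns_for_table(columns, table_idx):
    """Pick the column-separator string for the table at table_idx.

    Hypothetical helper, not Camelot's actual implementation. Assumption:
    when more tables are detected than entries exist in `columns`, the
    last entry is reused instead of raising IndexError.
    """
    if not columns:
        return None  # nothing supplied; the parser would guess columns itself
    if table_idx < len(columns):
        return columns[table_idx]
    return columns[-1]  # assumed fallback for the extra tables


# One entry of comma-separated x-coordinates (illustrative values).
columns = ["72,95,209,327"]

# Three tables detected, but only one 'columns' entry supplied.
chosen = [columns_for_table(columns, i) for i in range(3)]
print(chosen)
```

With behavior along these lines, a call such as `camelot.read_pdf('foo.pdf', flavor='stream', pages='all', columns=['72,95,209,327'])` no longer depends on the number of detected tables matching the length of 'columns'.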

Extracting all tables from such a PDF is already possible with the
Lattice parser; this commit makes it possible with the Stream
parser as well.

Callers are responsible for filtering out any extraneous tables.
2020-01-31 20:29:26 -05:00
camelot [parsers.stream] - Use fall back column coordinates. 2020-01-31 20:29:26 -05:00
docs Add deepsource badge to docs 2019-12-24 13:08:46 +05:30
tests Moved the version tests to test_common PR #94 2019-11-14 20:26:20 -05:00
.coveragerc Update .coveragerc 2019-01-05 11:49:55 +05:30
.deepsource.toml Create .deepsource.toml 2019-12-24 13:03:45 +05:30
.editorconfig Update .editorconfig and HISTORY.md 2018-10-19 16:23:15 +05:30
.gitignore No need to monkey-patch Click.HelpFormatter 2019-07-04 13:13:32 +03:00
.travis.yml Update HISTORY.md 2018-10-11 23:51:05 +05:30
CODE_OF_CONDUCT.md Update CODE_OF_CONDUCT.md 2018-09-14 10:05:38 +05:30
CONTRIBUTING.md Update docs 2019-07-06 04:28:32 +05:30
HISTORY.md Update HISTORY.md 2019-07-28 21:46:55 +10:00
LICENSE Update LICENSE and fix travis 2019-07-03 20:46:18 +05:30
MANIFEST.in Update MANIFEST.in 2018-10-08 01:18:19 +05:30
Makefile Fix no table found warning and add tests for two tables 2018-11-23 19:28:55 +05:30
README.md Add deepsource badge 2019-12-24 13:07:11 +05:30
requirements.txt Update requirements.txt 2018-12-18 07:45:01 +05:30
setup.cfg Fix no table found warning and add tests for two tables 2018-11-23 19:28:55 +05:30
setup.py Add chardet to install_requires 2018-12-05 20:08:37 +05:30

README.md

Camelot: PDF Table Extraction for Humans


Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!

Note: You can also check out Excalibur, which is a web interface for Camelot!


Here's how you can extract tables from PDF files. Check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
Cycle Name  KI (1/km)  Distance (mi)  Percent Fuel Savings
                                      Improved Speed  Decreased Accel  Eliminate Stops  Decreased Idle
2012_2      3.30       1.3            5.9%            9.5%             29.2%            17.4%
2145_1      0.68       11.2           2.4%            0.1%             9.5%             2.7%
4234_1      0.59       58.7           8.5%            1.3%             8.5%             3.3%
2032_2      0.17       57.8           21.7%           0.3%             2.7%             1.2%
4171_1      0.07       173.9          58.1%           1.6%             2.1%             0.5%

There's a command-line interface too!

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

Why Camelot?

  • You are in control: Unlike other libraries and tools, which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
  • Export to multiple formats, including JSON, Excel, HTML and SQLite.
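
As a minimal sketch of discarding bad tables by their metrics, the snippet below filters on the `accuracy` and `whitespace` keys of `parsing_report` shown in the example above. The helper name `filter_tables` and both threshold values are assumptions chosen for illustration, and the stand-in objects (with made-up numbers for the second table) replace real Camelot tables so that the sketch runs without a PDF.

```python
from types import SimpleNamespace


def filter_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Keep tables whose parsing_report passes both thresholds.

    Hypothetical helper; the thresholds are arbitrary and should be tuned
    for the documents at hand.
    """
    kept = []
    for table in tables:
        report = table.parsing_report
        if (report["accuracy"] >= min_accuracy
                and report["whitespace"] <= max_whitespace):
            kept.append(table)
    return kept


# Stand-ins for Camelot Table objects; only parsing_report is modeled.
fake_tables = [
    SimpleNamespace(parsing_report={
        "accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={
        "accuracy": 41.5, "whitespace": 80.0, "order": 2, "page": 1}),
]

good = filter_tables(fake_tables)
print(len(good))
```

The same function should work on the list returned by `camelot.read_pdf`, since each table there exposes `parsing_report` as a dictionary with these keys.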

See comparison with other PDF table extraction libraries and tools.

Installation

Using conda

The easiest way to install Camelot is with conda, a package manager and environment management system for the Anaconda distribution.

$ conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (tk and ghostscript), you can simply use pip to install Camelot:

$ pip install "camelot-py[cv]"

From the source code

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/camelot-dev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install ".[cv]"

Documentation

Great documentation is available at http://camelot-py.readthedocs.io/.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check out the latest sources with:

$ git clone https://www.github.com/camelot-dev/camelot

Setting up a development environment

You can install the development dependencies easily, using pip:

$ pip install "camelot-py[dev]"

Testing

After installation, you can run tests using:

$ python setup.py test

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License; see the LICENSE file for details.

Support the development

You can support our work on Camelot with a one-time or monthly donation on OpenCollective. Organizations that use Camelot can also sponsor the project for an acknowledgement on our documentation site and in this README.

Special thanks to all the users, organizations and contributors that support Camelot!