camelot
Camelot is a Python 2.7 library and command-line tool for getting tables out of PDF files.
Usage
>>> import camelot
>>> tables = camelot.read_pdf("foo.pdf")
>>> tables
<TableSet n=2>
>>> tables.to_csv(zip=True) # to_json, to_excel, to_html
>>> tables[0]
<Table shape=(3,4)>
>>> tables[0].parsing_report
{
"accuracy": 96,
"whitespace": 80,
"time_taken": 0.5,
"page": 1
}
>>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
>>> df = tables[0].to_df()
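The parsing report shown above can be used to spot tables that may need a second look. The following is a minimal sketch, not part of camelot: `low_accuracy_pages` is a hypothetical helper, the threshold of 90 is an arbitrary assumption, and the second sample report is made up for illustration. It works on plain dicts shaped like `parsing_report`; in real use the dicts would be collected from each table's `parsing_report`.

```python
def low_accuracy_pages(reports, threshold=90):
    """Return sorted page numbers whose reported accuracy is below threshold.

    `reports` is a list of dicts shaped like the parsing_report shown above.
    This helper is hypothetical and not part of camelot.
    """
    return sorted({r["page"] for r in reports if r["accuracy"] < threshold})


# Illustrative values only: the first dict mirrors the report above,
# the second is invented for the example.
sample_reports = [
    {"accuracy": 96, "whitespace": 80, "time_taken": 0.5, "page": 1},
    {"accuracy": 70, "whitespace": 40, "time_taken": 0.7, "page": 2},
]
flagged = low_accuracy_pages(sample_reports)
```

Pages returned by the helper are candidates for re-running with the other method or with different margins.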
Camelot comes with a CLI where you can specify page numbers, output format, output directory, and more. By default, the output files are placed in the same directory as the PDF.
Camelot: PDF parsing made simpler!
usage:
camelot [options] <method> [<args>...]
options:
-h, --help Show this screen.
-v, --version Show version.
-V, --verbose Verbose.
-p, --pages <pageno> Comma-separated list of page numbers.
Example: -p 1,3-6,10 [default: 1]
-P, --parallel Parallelize the parsing process.
-f, --format <format> Output format. (csv,tsv,html,json,xlsx) [default: csv]
-l, --log Log to file.
-o, --output <directory> Output directory.
-M, --cmargin <cmargin> Char margin. Chars closer than cmargin are
grouped together to form a word. [default: 2.0]
-L, --lmargin <lmargin> Line margin. Lines closer than lmargin are
grouped together to form a textbox. [default: 0.5]
-W, --wmargin <wmargin> Word margin. Insert blank spaces between chars
if distance between words is greater than word
margin. [default: 0.1]
-J, --split_text Split text lines if they span across multiple cells.
-K, --flag_size Flag substring if its size differs from the whole string.
Useful for super and subscripts.
-X, --print-stats List stats on the parsing process.
-Y, --save-stats Save stats to a file.
-Z, --plot <dist> Plot distributions. (page,all,rc)
camelot methods:
lattice Looks for lines between data.
stream Looks for spaces between data.
See 'camelot <method> -h' for more information on a specific method.
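The `-p` option above takes a comma-separated list of page numbers and ranges, such as `1,3-6,10`. The sketch below shows one way such a string could be expanded into individual page numbers. It assumes ranges are inclusive; `parse_pages` is a hypothetical helper written for illustration and is not camelot's own parser.

```python
def parse_pages(spec):
    """Expand a page spec such as "1,3-6,10" into a list of page numbers.

    Hypothetical helper for illustration; ranges are assumed to be inclusive.
    """
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = part.split("-", 1)
            start, end = int(start), int(end)
            if start > end:
                raise ValueError("invalid page range: %s" % part)
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


pages = parse_pages("1,3-6,10")
```

Here `pages` holds pages 1, 3, 4, 5, 6 and 10, in the order given on the command line.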
Dependencies
Currently, camelot works under Python 2.7.
The required dependencies include numpy, OpenCV and ghostscript.
Installation
Make sure you have the latest versions of pip and setuptools. You can update them with:
pip install -U pip setuptools
Installing dependencies
numpy can be installed using pip. OpenCV and ghostscript can be installed using your system's default package manager.
Linux
- Arch Linux
sudo pacman -S opencv ghostscript
- Ubuntu
sudo apt-get install libopencv-dev python-opencv python-tk ghostscript
OS X
brew install homebrew/science/opencv ghostscript
Finally, cd into the project directory and install with:
make install
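After installing, it can help to confirm that the dependencies listed earlier are actually visible to Python. The following is a rough sketch under stated assumptions: OpenCV's Python bindings are imported as `cv2`, and the ghostscript executable is named `gs` (as it usually is on Linux and OS X). Both helper functions are hypothetical and not part of camelot.

```python
from __future__ import print_function

import importlib
import os


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def find_executable(name):
    """Return the full path of `name` on PATH, or None if it is not found."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    # Assumed names: "cv2" for OpenCV's bindings, "gs" for ghostscript.
    print("missing modules:", missing_modules(["numpy", "cv2"]))
    print("ghostscript:", find_executable("gs") or "not found")
```

An empty list of missing modules and a path for ghostscript suggest the dependencies are in place.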
Development
Code
You can check out the latest sources with the command:
git clone https://github.com/socialcopsdev/camelot.git
Contributing
See the Contributing doc.
Testing
make test
License
BSD License