|
|
||
|---|---|---|
| camelot | ||
| docs | ||
| tests | ||
| .coveragerc | ||
| .gitignore | ||
| LICENSE | ||
| README.md | ||
| requirements-dev.txt | ||
| requirements.txt | ||
| setup.cfg | ||
| setup.py | ||
README.md
Camelot: PDF Table Parsing for Humans
Camelot is a Python library and command-line tool for extracting tables from PDF files.
Usage
API
>>> import camelot
>>> tables = camelot.read_pdf("foo.pdf")
>>> tables
<TableList n=2>
>>> tables.export("foo.csv", f="csv", compress=True) # json, excel, html
>>> tables[0]
<Table shape=(3,4)>
>>> tables[0].parsing_report
{
"accuracy": 96,
"whitespace": 80,
"order": 1,
"page": 1
}
>>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
>>> tables[0].df
Command-line interface
Usage: camelot [OPTIONS] FILEPATH
Options:
-p, --pages TEXT Comma-separated page numbers to parse.
Example: 1,3,4 or 1,4-end
-o, --output TEXT Output filepath.
-f, --format [csv|json|excel|html]
Output file format.
-z, --zip Whether or not to create a ZIP archive.
-m, --mesh Whether or not to use Lattice method of
parsing. Stream is used by default.
-T, --table_area TEXT Table areas (x1,y1,x2,y2) to process.
x1, y1
-> left-top and x2, y2 -> right-bottom
-split, --split_text Whether or not to split text if it spans
across multiple cells.
-flag, --flag_size (inactive) Whether or not to flag text which
has uncommon size. (Useful to detect
super/subscripts)
-M, --margins <FLOAT FLOAT FLOAT>...
char_margin, line_margin, word_margin for
PDFMiner.
-C, --columns TEXT x-coordinates of column separators.
-r, --row_close_tol INTEGER Rows will be formed by combining text
vertically within this tolerance.
-c, --col_close_tol INTEGER Columns will be formed by combining text
horizontally within this tolerance.
-back, --process_background (with --mesh) Whether or not to process
lines that are in background.
-scale, --line_size_scaling INTEGER
(with --mesh) Factor by which the page
dimensions will be divided to get smallest
length of detected lines.
-copy, --copy_text [h|v] (with --mesh) Specify direction in which
text will be copied over in a spanning cell.
-shift, --shift_text [l|r|t|b] (with --mesh) Specify direction in which
text in a spanning cell should flow.
-l, --line_close_tol INTEGER (with --mesh) Tolerance parameter used to
merge close vertical lines and close
horizontal lines.
-j, --joint_close_tol INTEGER (with --mesh) Tolerance parameter used to
decide whether the detected lines and points
lie close to each other.
-block, --threshold_blocksize INTEGER
(with --mesh) For adaptive thresholding,
size of a pixel neighborhood that is used to
calculate a threshold value for the pixel:
3, 5, 7, and so on.
-const, --threshold_constant INTEGER
(with --mesh) For adaptive thresholding,
constant subtracted from the mean or
weighted mean.
Normally, it is positive but
may be zero or negative as well.
-I, --iterations INTEGER (with --mesh) Number of times for
erosion/dilation is applied.
-G, --geometry_type [text|table|contour|joint|line]
Plot geometry found on pdf page for
debugging.
text: Plot text objects. (Useful to get
table_area and columns coordinates)
table: Plot parsed table.
contour (with --mesh): Plot detected rectangles.
joint (with --mesh): Plot detected line intersections.
line (with --mesh): Plot detected lines.
--help Show this message and exit.
Dependencies
The dependencies include tk and ghostscript.
Installation
Make sure you have the most updated versions for pip and setuptools. You can update them by
pip install -U pip setuptools
Installing dependencies
tk and ghostscript can be installed using your system's default package manager.
Linux
- Ubuntu
sudo apt-get install python-tk ghostscript
- Arch Linux
sudo pacman -S tk ghostscript
OS X
brew install tcl-tk ghostscript
Finally, cd into the project directory and install by
python setup.py install
Development
Code
You can check the latest sources with the command:
git clone https://github.com/socialcopsdev/camelot.git
Contributing
See Contributing guidelines.
Testing
python setup.py test