Add LICENSE and _templates

pull/2/head
Vinayak Mehta 2018-09-11 18:47:29 +05:30
parent 17ea5f335e
commit 066c5c6aca
36 changed files with 282 additions and 162 deletions

7
LICENSE 100644

@ -0,0 +1,7 @@
Copyright (c) 2018 Peeply Private Ltd (Singapore)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


@ -14,7 +14,6 @@ Camelot is a Python library and command-line tool for extracting tables from PDF
>>> tables.export("foo.csv", f="csv", compress=True) # json, excel, html >>> tables.export("foo.csv", f="csv", compress=True) # json, excel, html
>>> tables[0] >>> tables[0]
<Table shape=(3,4)> <Table shape=(3,4)>
>>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
>>> tables[0].parsing_report >>> tables[0].parsing_report
{ {
"accuracy": 96, "accuracy": 96,
@ -22,7 +21,8 @@ Camelot is a Python library and command-line tool for extracting tables from PDF
"order": 1, "order": 1,
"page": 1 "page": 1
} }
>>> df = tables[0].df >>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
>>> tables[0].df
</pre> </pre>
### Command-line interface ### Command-line interface
@ -87,17 +87,12 @@ Options:
-G, --geometry_type [text|table|contour|joint|line] -G, --geometry_type [text|table|contour|joint|line]
Plot geometry found on pdf page for Plot geometry found on pdf page for
debugging. debugging.
text: Plot text objects. (Useful to get
text: Plot text objects. (Useful table_area and columns coordinates)
to get table_area and columns coordinates)
table: Plot parsed table. table: Plot parsed table.
contour (with contour (with --mesh): Plot detected rectangles.
--mesh): Plot detected rectangles. joint (with --mesh): Plot detected line intersections.
joint line (with --mesh): Plot detected lines.
(with --mesh): Plot detected line
intersections.
line (with --mesh): Plot
detected lines.
--help Show this message and exit. --help Show this message and exit.
</pre> </pre>
@ -161,8 +156,4 @@ See [Contributing guidelines]().
<pre> <pre>
python setup.py test python setup.py test
</pre> </pre>
## License
BSD License

15
docs/_templates/hacks.html vendored 100644

@ -0,0 +1,15 @@
<style type="text/css">
/* "Quick Search" should be capitalized. */
div#searchbox h3 {text-transform: capitalize;}
/* Make the document a little wider, less code is cut-off. */
div.document {width: 1008px;}
/* Much-improved spacing around code blocks. */
div.highlight pre {padding: 11px 14px;}
/* Remain Responsive! */
@media screen and (max-width: 1008px) {
div.sphinxsidebar {display: none;}
div.document {width: 100%!important;}
/* Have code blocks escape the document right-margin. */
div.highlight pre {margin-right: -30px;}
}
</style>


@ -0,0 +1,16 @@
<p class="logo">
<a href="{{ pathto(master_doc) }}">
<img class="logo" src="{{ pathto('_static/camelot.png', 1) }}"/>
</a>
</p>
<p>
<iframe src="https://ghbtns.com/github-btn.html?user=socialcopsdev&repo=camelot&type=watch&count=true&size=large"
allowtransparency="true" frameborder="0" scrolling="0" width="200px" height="35px"></iframe>
</p>
<h3>Useful Links</h3>
<ul>
<li><a href="https://github.com/socialcopsdev/camelot">Camelot @ GitHub</a></li>
<li><a href="https://pypi.org/project/camelot-py/">Camelot @ PyPI</a></li>
<li><a href="https://github.com/socialcopsdev/camelot/issues">Issue Tracker</a></li>
</ul>


@ -1,45 +1,35 @@
.. _api: .. _api:
=============
API Reference API Reference
============= =============
.. module:: camelot .. module:: camelot
camelot.read_pdf Main Interface
================ --------------
.. autofunction:: camelot.read_pdf .. autofunction:: camelot.read_pdf
camelot.plot_geometry
=====================
.. autofunction:: camelot.plot_geometry .. autofunction:: camelot.plot_geometry
camelot.handlers.PDFHandler Lower-Level Classes
=========================== -------------------
.. autoclass:: camelot.handlers.PDFHandler .. autoclass:: camelot.handlers.PDFHandler
:inherited-members: :inherited-members:
camelot.parsers.Stream
======================
.. autoclass:: camelot.parsers.Stream .. autoclass:: camelot.parsers.Stream
:inherited-members: :inherited-members:
camelot.parsers.Lattice
=======================
.. autoclass:: camelot.parsers.Lattice .. autoclass:: camelot.parsers.Lattice
:inherited-members: :inherited-members:
camelot.core.Cell Lower-Lower-Level Classes
================= -------------------------
.. autoclass:: camelot.core.Cell
.. autoclass:: camelot.core.TableList
:inherited-members: :inherited-members:
camelot.core.Table
==================
.. autoclass:: camelot.core.Table .. autoclass:: camelot.core.Table
:inherited-members: :inherited-members:
camelot.core.TableList .. autoclass:: camelot.core.Cell
======================
.. autoclass:: camelot.core.TableList
:inherited-members: :inherited-members:

23 binary image files (previous versions, 5.5 KiB to 37 KiB each) not shown.


@ -63,7 +63,7 @@ master_doc = 'index'
# General information about the project. # General information about the project.
project = u'Camelot' project = u'Camelot'
copyright = u'2018, SocialCops' copyright = u'2018, Peeply Private Ltd (Singapore)'
author = u'Vinayak Mehta' author = u'Vinayak Mehta'
# The version info for the project you're documenting, acts as replacement for # The version info for the project you're documenting, acts as replacement for
@ -189,10 +189,10 @@ html_use_smartypants = True
# Custom sidebar templates, maps document names to template names. # Custom sidebar templates, maps document names to template names.
html_sidebars = { html_sidebars = {
'index': ['sidebarlogo.html', 'relations.html', 'sourcelink.html', 'index': ['sidebarintro.html', 'relations.html', 'sourcelink.html',
'searchbox.html'], 'searchbox.html', 'hacks.html'],
'**': ['sidebarlogo.html', 'localtoc.html', 'relations.html', '**': ['sidebarlogo.html', 'localtoc.html', 'relations.html',
'sourcelink.html', 'searchbox.html'] 'sourcelink.html', 'searchbox.html', 'hacks.html']
} }
# Additional templates that should be rendered to pages, maps page names to # Additional templates that should be rendered to pages, maps page names to


@ -1,8 +1,7 @@
.. _contributing: .. _contributing:
======================= Contributor's Guide
Contributing guidelines ===================
=======================
The preferred way to contribute to Camelot is to fork this repository, and then submit a "pull request" (PR): The preferred way to contribute to Camelot is to fork this repository, and then submit a "pull request" (PR):
@ -27,3 +26,22 @@ The preferred way to contribute to Camelot is to fork this repository, and then
$ git push -u origin my-feature $ git push -u origin my-feature
Finally, go to the web page of your fork of the camelot repo and click Pull request to send your changes to the maintainers for review. Finally, go to the web page of your fork of the camelot repo and click Pull request to send your changes to the maintainers for review.
Code
----
You can check the latest sources with the command::
git clone https://github.com/socialcopsdev/camelot.git
Contributing
------------
See :doc:`Contributing guidelines <contributing>`.
Testing
-------
::
python setup.py test


@ -1,53 +1,50 @@
.. camelot documentation master file, created by .. Camelot documentation master file, created by
sphinx-quickstart on Tue Jul 19 13:44:18 2016. sphinx-quickstart on Tue Jul 19 13:44:18 2016.
You can adapt this file completely to your liking, but it should at least You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive. contain the root `toctree` directive.
=====================================
Camelot: PDF Table Parsing for Humans Camelot: PDF Table Parsing for Humans
===================================== =====================================
Camelot is a Python 2.7 library and command-line tool for extracting tabular data from PDF files. Release v\ |version|. (:ref:`Installation <install>`)
Why another pdf table parsing library? .. image:: https://img.shields.io/badge/license-MIT-lightgrey.svg
====================================== :target: https://pypi.org/project/camelot-py/
We tried a lot of tools available online to parse tables from pdf files. `PDFTables`_, `SolidConverter`_ are closed source, commercial products and a free trial doesn't last forever. `Tabula`_, which is open source, isn't very scalable. We found nothing that gave us complete control over the parsing process. In most cases, we didn't get the correct output and had to resort to writing custom scripts for each type of pdf. .. image:: https://img.shields.io/badge/python-2.7-blue.svg
:target: https://pypi.org/project/camelot-py/
.. _PDFTables: https://pdftables.com/ **Camelot** is a Python library and command-line tool for extracting tables from PDF files.
.. _SolidConverter: http://www.soliddocuments.com/pdf/-to-word-converter/304/1
.. _Tabula: http://tabula.technology/
Some background .. note:: Camelot only works with:
===============
PDF started as `The Camelot Project`_ when people wanted a cross-platform way for sending and viewing documents. A pdf file contains characters placed at specific x,y-coordinates. Spaces are simulated by placing characters relatively far apart. - Python 2, with **Python 3** support `on the way`_.
- Text-based PDFs and not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer, then your PDF is text-based. Support for image-based PDFs using **OCR** is `planned`_.
Camelot uses two methods to parse tables from PDFs, :doc:`lattice <lattice>` and :doc:`stream <stream>`. The names were taken from Tabula but the implementation is somewhat different, though it follows the same philosophy. Lattice looks for lines between text elements while stream looks for whitespace between text elements. .. _on the way: https://github.com/socialcopsdev/camelot/issues/81
.. _planned: https://github.com/socialcopsdev/camelot/issues/101
.. _The Camelot Project: http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf
Usage Usage
===== -----
:: ::
>>> import camelot >>> import camelot
>>> tables = camelot.read_pdf("foo.pdf") >>> tables = camelot.read_pdf("foo.pdf")
>>> tables >>> tables
<TableList n=2> <TableList n=2>
>>> tables.export("foo.csv", f="csv", compress=True) # json, excel, html >>> tables.export("foo.csv", f="csv", compress=True) # json, excel, html
>>> tables[0] >>> tables[0]
<Table shape=(3,4)> <Table shape=(3,4)>
>>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html >>> tables[0].parsing_report
>>> tables[0].parsing_report {
{ "accuracy": 96,
"accuracy": 96, "whitespace": 80,
"whitespace": 80, "order": 1,
"order": 1, "page": 1
"page": 1 }
} >>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
>>> df = tables[0].df >>> tables[0].df
.. csv-table:: .. csv-table::
:header: "Cycle Name","KI (1/km)","Distance (mi)","Percent Fuel Savings","","","" :header: "Cycle Name","KI (1/km)","Distance (mi)","Percent Fuel Savings","","",""
@ -59,87 +56,107 @@ Usage
"2032_2","0.17","57.8","21.7%","0.3%","2.7%","1.2%" "2032_2","0.17","57.8","21.7%","0.3%","2.7%","1.2%"
"4171_1","0.07","173.9","58.1%","1.6%","2.1%","0.5%" "4171_1","0.07","173.9","58.1%","1.6%","2.1%","0.5%"
Installation
============
Make sure you have the most updated versions for `pip` and `setuptools`. You can update them by::
pip install -U pip setuptools
The dependencies include `tk`_ and `ghostscript`_.
.. _tk: https://wiki.tcl.tk/3743
.. _ghostscript: https://www.ghostscript.com/
Installing dependencies
-----------------------
tk and ghostscript can be installed using your system's default package manager.
Linux
^^^^^
* Ubuntu
:: ::
sudo apt-get install python-opencv python-tk ghostscript Usage: camelot [OPTIONS] FILEPATH
* Arch Linux Options:
-p, --pages TEXT Comma-separated page numbers to parse.
Example: 1,3,4 or 1,4-end
-o, --output TEXT Output filepath.
-f, --format [csv|json|excel|html]
Output file format.
-z, --zip Whether or not to create a ZIP archive.
-m, --mesh Whether or not to use Lattice method of
parsing. Stream is used by default.
-T, --table_area TEXT Table areas (x1,y1,x2,y2) to process.
x1, y1
-> left-top and x2, y2 -> right-bottom
-split, --split_text Whether or not to split text if it spans
across multiple cells.
-flag, --flag_size (inactive) Whether or not to flag text which
has uncommon size. (Useful to detect
super/subscripts)
-M, --margins <FLOAT FLOAT FLOAT>...
char_margin, line_margin, word_margin for
PDFMiner.
-C, --columns TEXT x-coordinates of column separators.
-r, --row_close_tol INTEGER Rows will be formed by combining text
vertically within this tolerance.
-c, --col_close_tol INTEGER Columns will be formed by combining text
horizontally within this tolerance.
-back, --process_background (with --mesh) Whether or not to process
lines that are in background.
-scale, --line_size_scaling INTEGER
(with --mesh) Factor by which the page
dimensions will be divided to get smallest
length of detected lines.
-copy, --copy_text [h|v] (with --mesh) Specify direction in which
text will be copied over in a spanning cell.
-shift, --shift_text [l|r|t|b] (with --mesh) Specify direction in which
text in a spanning cell should flow.
-l, --line_close_tol INTEGER (with --mesh) Tolerance parameter used to
merge close vertical lines and close
horizontal lines.
-j, --joint_close_tol INTEGER (with --mesh) Tolerance parameter used to
decide whether the detected lines and points
lie close to each other.
-block, --threshold_blocksize INTEGER
(with --mesh) For adaptive thresholding,
size of a pixel neighborhood that is used to
calculate a threshold value for the pixel:
3, 5, 7, and so on.
-const, --threshold_constant INTEGER
(with --mesh) For adaptive thresholding,
constant subtracted from the mean or
weighted mean.
Normally, it is positive but
may be zero or negative as well.
-I, --iterations INTEGER (with --mesh) Number of times for
erosion/dilation is applied.
-G, --geometry_type [text|table|contour|joint|line]
Plot geometry found on pdf page for
debugging.
text: Plot text objects. (Useful to get
table_area and columns coordinates)
table: Plot parsed table.
contour (with --mesh): Plot detected rectangles.
joint (with --mesh): Plot detected line intersections.
line (with --mesh): Plot detected lines.
--help Show this message and exit.
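The ``--pages`` help text above gives ``1,3,4`` and ``1,4-end`` as example values. A minimal sketch of how such a string could be expanded into page numbers follows. The function name ``parse_pages`` and the ``page_count`` argument are hypothetical and are not taken from Camelot's source.

```python
def parse_pages(spec, page_count):
    """Expand a page string such as "1,4-end" into a sorted list of page numbers.

    Hypothetical helper for illustration; not Camelot's own parser.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            # "end" stands for the last page of the document.
            last = page_count if end == "end" else int(end)
            pages.update(range(int(start), last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_pages("1,4-end", page_count=6))
```

Collecting the numbers in a set means overlapping entries such as ``1,1-3`` yield each page only once.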
:: The User Guide
--------------
sudo pacman -S opencv tk ghostscript This part of the documentation, which is mostly prose, begins with some
background information about Camelot, then focuses on step-by-step
OS X instructions for getting the most out of Camelot.
^^^^
::
brew install homebrew/science/opencv ghostscript
Finally, `cd` into the project directory and install by::
python setup.py install
API Reference
=============
See :doc:`API doc <api>`.
Development
===========
Code
----
You can check the latest sources with the command::
git clone https://github.com/socialcopsdev/camelot.git
Contributing
------------
See :doc:`Contributing guidelines <contributing>`.
Testing
-------
::
python setup.py test
License
=======
MIT License
Sitemap
=======
.. toctree:: .. toctree::
:maxdepth: 2
lattice user/intro
stream user/install
contributing user/quickstart
api
The API Documentation / Guide
-----------------------------
If you are looking for information on a specific function, class, or method,
this part of the documentation is for you.
.. toctree::
:maxdepth: 2
api
The Contributor Guide
---------------------
If you want to contribute to the project, this part of the documentation is for
you.
.. toctree::
:maxdepth: 2
dev/contributing


@ -0,0 +1,44 @@
.. _install:
Installation
============
Make sure you have the latest versions of `pip` and `setuptools`. You can update them with::
pip install -U pip setuptools
The dependencies include `tk`_ and `ghostscript`_.
.. _tk: https://wiki.tcl.tk/3743
.. _ghostscript: https://www.ghostscript.com/
Installing dependencies
-----------------------
tk and ghostscript can be installed using your system's default package manager.
Linux
^^^^^
* Ubuntu
::
sudo apt-get install python-opencv python-tk ghostscript
* Arch Linux
::
sudo pacman -S opencv tk ghostscript
OS X
^^^^
::
brew install homebrew/science/opencv ghostscript
Finally, `cd` into the project directory and install with::
python setup.py install


@ -0,0 +1,19 @@
PDF started as `The Camelot Project`_ when people wanted a cross-platform way to send and view documents. A PDF file contains characters placed at specific x,y-coordinates. Spaces are simulated by placing characters relatively far apart.
Camelot uses two methods to parse tables from PDFs, :doc:`lattice <lattice>` and :doc:`stream <stream>`. The names were taken from Tabula but the implementation is somewhat different, though it follows the same philosophy. Lattice looks for lines between text elements while stream looks for whitespace between text elements.
.. _The Camelot Project: http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf
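To make the Stream idea concrete, here is a minimal sketch that groups positioned text into rows by vertical proximity, in the spirit of the ``--row_close_tol`` option. The ``(x, y, text)`` tuples, the function name ``group_rows`` and the default tolerance are assumptions for illustration; this is not Camelot's implementation.

```python
def group_rows(text_items, row_close_tol=2):
    """Group (x, y, text) items into rows when their y values lie within a tolerance.

    Hypothetical sketch: y grows upward, as in PDF coordinates.
    """
    rows = []
    # Visit items from top to bottom, then left to right.
    for x, y, text in sorted(text_items, key=lambda item: (-item[1], item[0])):
        if rows and abs(rows[-1]["y"] - y) <= row_close_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Order the cells of each row by their x-coordinate.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


items = [(10, 100, "Name"), (80, 101, "Qty"), (10, 80, "Apple"), (80, 79, "3")]
print(group_rows(items))
```

Each row is anchored at the y value of its first item, so a larger tolerance merges text that sits on slightly different baselines.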
Why another pdf table parsing library?
======================================
We tried many of the tools available online for parsing tables from PDF files. `PDFTables`_ and `SolidConverter`_ are closed-source commercial products, and a free trial doesn't last forever. `Tabula`_, which is open source, isn't very scalable. We found nothing that gave us complete control over the parsing process. In most cases, we didn't get the correct output and had to resort to writing custom scripts for each type of PDF.
.. _PDFTables: https://pdftables.com/
.. _SolidConverter: http://www.soliddocuments.com/pdf/-to-word-converter/304/1
.. _Tabula: http://tabula.technology/
License
=======
MIT License


@ -1,6 +1,5 @@
.. _lattice: .. _lattice:
=======
Lattice Lattice
======= =======
@ -18,7 +17,7 @@ Line segments are detected in the first step.
.. .. _this: insert link for us-030.pdf .. .. _this: insert link for us-030.pdf
.. image:: assets/line.png .. image:: ../_static/user/line.png
:height: 674 :height: 674
:width: 1366 :width: 1366
:scale: 50% :scale: 50%
@ -26,7 +25,7 @@ Line segments are detected in the first step.
The detected line segments are overlapped by `and` ing their pixel intensities to find intersections. The detected line segments are overlapped by `and` ing their pixel intensities to find intersections.
.. image:: assets/intersection.png .. image:: ../_static/user/intersection.png
:height: 674 :height: 674
:width: 1366 :width: 1366
:scale: 50% :scale: 50%
@ -34,7 +33,7 @@ The detected line segments are overlapped by `and` ing their pixel intensities t
The detected line segments are overlapped again, this time by `or` ing their pixel intensities and outermost contours are computed to identify potential table boundaries. This helps Lattice in detecting more than one table on a single page. The detected line segments are overlapped again, this time by `or` ing their pixel intensities and outermost contours are computed to identify potential table boundaries. This helps Lattice in detecting more than one table on a single page.
.. image:: assets/contour.png .. image:: ../_static/user/contour.png
:height: 674 :height: 674
:width: 1366 :width: 1366
:scale: 50% :scale: 50%
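The two overlap steps described above can be illustrated with plain 0/1 masks. This is only a sketch under stated assumptions: the 5x5 masks and the helper ``combine`` are hypothetical, and the real implementation (not shown in this diff) works on page images.

```python
from operator import and_, or_


def combine(mask_a, mask_b, op):
    """Apply a pixel-wise binary operator to two equally sized 0/1 masks."""
    return [[op(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


size = 5
# Hypothetical line masks for a single bordered cell: 1 marks a line pixel.
horizontal = [[1 if r in (0, size - 1) else 0 for _ in range(size)] for r in range(size)]
vertical = [[1 if c in (0, size - 1) else 0 for c in range(size)] for _ in range(size)]

# AND keeps only pixels present in both masks: the line intersections.
joints = combine(horizontal, vertical, and_)
# OR merges both masks into one outline whose outer contour bounds the table.
outline = combine(horizontal, vertical, or_)

joint_coords = [(r, c) for r, row in enumerate(joints) for c, v in enumerate(row) if v]
print(joint_coords)
```

Here the AND result contains just the four corner pixels, while the OR result is the full border of the cell.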
@ -42,7 +41,7 @@ The detected line segments are overlapped again, this time by `or` ing their pix
Since dimensions of a pdf and its image vary; table contours, intersections and segments are scaled and translated to the pdf's coordinate space. A representation of the table is then created using these scaled coordinates. Since dimensions of a pdf and its image vary; table contours, intersections and segments are scaled and translated to the pdf's coordinate space. A representation of the table is then created using these scaled coordinates.
.. image:: assets/table.png .. image:: ../_static/user/table.png
:height: 674 :height: 674
:width: 1366 :width: 1366
:scale: 50% :scale: 50%
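A minimal sketch of the scale-and-translate step follows. It assumes image pixels are measured from the top-left corner and PDF coordinates from the bottom-left, so the y-axis is flipped. The function name ``image_to_pdf`` and the example page sizes are hypothetical.

```python
def image_to_pdf(point, image_size, pdf_size):
    """Map an (x, y) image pixel to PDF coordinates.

    Assumes the image origin is top-left and the PDF origin is bottom-left.
    """
    x, y = point
    image_width, image_height = image_size
    pdf_width, pdf_height = pdf_size
    scale_x = pdf_width / image_width
    scale_y = pdf_height / image_height
    # Scale both axes, then flip y so it is measured from the bottom edge.
    return (x * scale_x, pdf_height - y * scale_y)


# The bottom-left image corner should land on the PDF origin.
print(image_to_pdf((0, 674), image_size=(1366, 674), pdf_size=(683.0, 337.0)))
```

The same mapping would be applied to every contour corner, intersection and segment endpoint before the table is built.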
@ -50,7 +49,7 @@ Since dimensions of a pdf and its image vary; table contours, intersections and
Spanning cells are then detected using the line segments and intersections. Spanning cells are then detected using the line segments and intersections.
.. image:: assets/table_span.png .. image:: ../_static/user/table_span.png
:height: 674 :height: 674
:width: 1366 :width: 1366
:scale: 50% :scale: 50%
@ -86,7 +85,7 @@ Let's consider this pdf file.
.. .. _this: insert link for row_span_1.pdf .. .. _this: insert link for row_span_1.pdf
.. image:: assets/scale_1.png .. image:: ../_static/user/scale_1.png
:height: 674 :height: 674
:width: 1366 :width: 1366
:scale: 50% :scale: 50%
@ -94,7 +93,7 @@ Let's consider this pdf file.
Clearly, it couldn't detect those small lines in the lower left part. Therefore, we need to increase the value of scale. Let's try a value of 40. Clearly, it couldn't detect those small lines in the lower left part. Therefore, we need to increase the value of scale. Let's try a value of 40.
.. image:: assets/scale_2.png .. image:: ../_static/user/scale_2.png
:height: 674 :height: 674
:width: 1366 :width: 1366
:scale: 50% :scale: 50%
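The CLI help describes the scaling factor as the value by which the page dimensions are divided to get the smallest length of detected lines. The arithmetic can be sketched as below. The use of integer division, the function name, and the values 674 and 10 are assumptions for illustration; only the value 40 comes from the text.

```python
def smallest_line_length(page_dimension, scale):
    """Return the shortest line length (in pixels) that would still be detected.

    Assumes integer division of the page dimension by the scaling factor.
    """
    return page_dimension // scale


# A larger scale lowers the threshold, so shorter lines are picked up.
for scale in (10, 40):
    print(scale, smallest_line_length(674, scale))
```

This is why raising the scale helps with the short lines in the lower left part of the page.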


@ -0,0 +1,5 @@
.. toctree::
:maxdepth: 2
lattice
stream


@ -1,6 +1,5 @@
.. _stream: .. _stream:
======
Stream Stream
====== ======
@ -69,7 +68,7 @@ We can also specify the column x-coordinates. We need to call Stream with debug=
>>> manager = Pdf(Stream(debug=True), 'mexican_towns.pdf') >>> manager = Pdf(Stream(debug=True), 'mexican_towns.pdf')
>>> manager.debug_plot() >>> manager.debug_plot()
.. image:: assets/columns.png .. image:: ../_static/user/columns.png
:height: 674 :height: 674
:width: 1366 :width: 1366
:scale: 50% :scale: 50%