Add LICENSE and _templates

2018-09-11 18:47:29 +05:30 · 2018-09-11 18:47:29 +05:30 · 066c5c6aca
parent 17ea5f335e
commit 066c5c6aca
36 changed files with 282 additions and 162 deletions
--- a/7
+++ b/7
@ -0,0 +1,7 @@
 Copyright (c) 2018 Peeply Private Ltd (Singapore)
 Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
 The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--- a/README.md
+++ b/README.md
@ -14,7 +14,6 @@ Camelot is a Python library and command-line tool for extracting tables from PDF
 >>> tables.export("foo.csv", f="csv", compress=True) # json, excel, html
 >>> tables[0]
 &lt;Table shape=(3,4)&gt;
 >>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
 >>> tables[0].parsing_report
 {
    "accuracy": 96,
@ -22,7 +21,8 @@ Camelot is a Python library and command-line tool for extracting tables from PDF
    "order": 1,
    "page": 1
 }
->>> df = tables[0].df
+>>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
 >>> tables[0].df
 </pre>
 ### Command-line interface
@ -87,17 +87,12 @@ Options:
  -G, --geometry_type [text|table|contour|joint|line]
                                  Plot geometry found on pdf page for
                                  debugging.
-
+                                  text: Plot text objects. (Useful to get
-                                  text: Plot text objects. (Useful
+                                        table_area and columns coordinates)
                                  to get table_area and columns coordinates)
                                  table: Plot parsed table.
-                                  contour (with
+                                  contour (with --mesh): Plot detected rectangles.
-                                  --mesh): Plot detected rectangles.
+                                  joint (with --mesh): Plot detected line intersections.
-                                  joint
+                                  line (with --mesh): Plot detected lines.
                                  (with --mesh): Plot detected line
                                  intersections.
                                  line (with --mesh): Plot
                                  detected lines.
  --help                          Show this message and exit.
 </pre>
@ -161,8 +156,4 @@ See [Contributing guidelines]().
 <pre>
 python setup.py test
-</pre>
+</pre>
 ## License
 BSD License
--- a/docs/_templates/hacks.html
+++ b/docs/_templates/hacks.html
@ -0,0 +1,15 @@
 <style type="text/css">
  /* "Quick Search" should be capitalized. */
  div#searchbox h3 {text-transform: capitalize;}
  /* Make the document a little wider, less code is cut-off. */
  div.document {width: 1008px;}
  /* Much-improved spacing around code blocks. */
  div.highlight pre {padding: 11px 14px;}
  /* Remain Responsive! */
  @media screen and (max-width: 1008px) {
    div.sphinxsidebar {display: none;}
    div.document {width: 100%!important;}
    /* Have code blocks escape the document right-margin. */
    div.highlight pre {margin-right: -30px;}
  }
 </style>
--- a/docs/_templates/sidebarintro.html
+++ b/docs/_templates/sidebarintro.html
@ -0,0 +1,16 @@
 <p class="logo">
  <a href="{{ pathto(master_doc) }}">
    <img class="logo" src="{{ pathto('_static/camelot.png', 1) }}"/>
  </a>
 </p>
 <p>
 <iframe src="https://ghbtns.com/github-btn.html?user=socialcopsdev&repo=camelot&type=watch&count=true&size=large"
  allowtransparency="true" frameborder="0" scrolling="0" width="200px" height="35px"></iframe>
 </p>
 <h3>Useful Links</h3>
 <ul>
  <li><a href="https://github.com/socialcopsdev/camelot">Camelot @ GitHub</a></li>
  <li><a href="https://pypi.org/project/camelot-py/">Camelot @ PyPI</a></li>
  <li><a href="https://github.com/socialcopsdev/camelot/issues">Issue Tracker</a></li>
 </ul>
--- a/docs/api.rst
+++ b/docs/api.rst
@ -1,45 +1,35 @@
 .. _api:
 =============
 API Reference
 =============
 .. module:: camelot
-camelot.read_pdf
+Main Interface
-================
+--------------
 .. autofunction:: camelot.read_pdf
 camelot.plot_geometry
 =====================
 .. autofunction:: camelot.plot_geometry
-camelot.handlers.PDFHandler
+Lower-Level Classes
-===========================
+-------------------
 .. autoclass:: camelot.handlers.PDFHandler
   :inherited-members:
 camelot.parsers.Stream
 ======================
 .. autoclass:: camelot.parsers.Stream
   :inherited-members:
 camelot.parsers.Lattice
 =======================
 .. autoclass:: camelot.parsers.Lattice
   :inherited-members:
-camelot.core.Cell
+Lower-Lower-Level Classes
-=================
+-------------------------
-.. autoclass:: camelot.core.Cell
+
 .. autoclass:: camelot.core.TableList
   :inherited-members:
 camelot.core.Table
 ==================
 .. autoclass:: camelot.core.Table
   :inherited-members:
-camelot.core.TableList
+.. autoclass:: camelot.core.Cell
 ======================
 .. autoclass:: camelot.core.TableList
   :inherited-members:
--- a/docs/assets/columns.png
+++ b/docs/assets/columns.png
--- a/docs/assets/contour.png
+++ b/docs/assets/contour.png
--- a/docs/assets/intersection.png
+++ b/docs/assets/intersection.png
--- a/docs/assets/lattice.png
+++ b/docs/assets/lattice.png
--- a/docs/assets/lattice_all.png
+++ b/docs/assets/lattice_all.png
--- a/docs/assets/lattice_all_ex.png
+++ b/docs/assets/lattice_all_ex.png
--- a/docs/assets/lattice_rc.png
+++ b/docs/assets/lattice_rc.png
--- a/docs/assets/lattice_rc_ex.png
+++ b/docs/assets/lattice_rc_ex.png
--- a/docs/assets/line.png
+++ b/docs/assets/line.png
--- a/docs/assets/scale_1.png
+++ b/docs/assets/scale_1.png
--- a/docs/assets/scale_2.png
+++ b/docs/assets/scale_2.png
--- a/docs/assets/stream1.png
+++ b/docs/assets/stream1.png
--- a/docs/assets/stream1_all.png
+++ b/docs/assets/stream1_all.png
--- a/docs/assets/stream1_page.png
+++ b/docs/assets/stream1_page.png
--- a/docs/assets/stream1_page_y.png
+++ b/docs/assets/stream1_page_y.png
--- a/docs/assets/stream1_rc.png
+++ b/docs/assets/stream1_rc.png
--- a/docs/assets/stream2.png
+++ b/docs/assets/stream2.png
--- a/docs/assets/stream2_all.png
+++ b/docs/assets/stream2_all.png
--- a/docs/assets/stream2_page.png
+++ b/docs/assets/stream2_page.png
--- a/docs/assets/stream2_page_y10_m8.png
+++ b/docs/assets/stream2_page_y10_m8.png
--- a/docs/assets/stream2_rc.png
+++ b/docs/assets/stream2_rc.png
--- a/docs/assets/table.png
+++ b/docs/assets/table.png
--- a/docs/assets/table_span.png
+++ b/docs/assets/table_span.png
--- a/docs/conf.py
+++ b/docs/conf.py
@ -63,7 +63,7 @@ master_doc = 'index'
 # General information about the project.
 project = u'Camelot'
-copyright = u'2018, SocialCops'
+copyright = u'2018, Peeply Private Ltd (Singapore)'
 author = u'Vinayak Mehta'
 # The version info for the project you're documenting, acts as replacement for
@ -189,10 +189,10 @@ html_use_smartypants = True
 # Custom sidebar templates, maps document names to template names.
 html_sidebars = {
-    'index': ['sidebarlogo.html', 'relations.html', 'sourcelink.html',
+    'index': ['sidebarintro.html', 'relations.html', 'sourcelink.html',
-              'searchbox.html'],
+              'searchbox.html', 'hacks.html'],
    '**': ['sidebarlogo.html', 'localtoc.html', 'relations.html',
-           'sourcelink.html', 'searchbox.html']
+           'sourcelink.html', 'searchbox.html', 'hacks.html']
 }
 # Additional templates that should be rendered to pages, maps page names to
--- a/docs/dev/contributing.rst
+++ b/docs/dev/contributing.rst
@ -1,8 +1,7 @@
 .. _contributing:
-=======================
+Contributor's Guide
-Contributing guidelines
+===================
 =======================
 The preferred way to contribute to Camelot is to fork this repository, and then submit a "pull request" (PR):
@ -27,3 +26,22 @@ The preferred way to contribute to Camelot is to fork this repository, and then
    $ git push -u origin my-feature
 Finally, go to the web page of your fork of the camelot repo, and click ‘Pull request’ to send your changes to the maintainers for review.
 Code
 ----
 You can check the latest sources with the command::
    git clone https://github.com/socialcopsdev/camelot.git
 Contributing
 ------------
 See :doc:`Contributing guidelines <contributing>`.
 Testing
 -------
 ::
    python setup.py test
--- a/docs/index.rst
+++ b/docs/index.rst
@ -1,53 +1,50 @@
-.. camelot documentation master file, created by
+.. Camelot documentation master file, created by
   sphinx-quickstart on Tue Jul 19 13:44:18 2016.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.
 =====================================
 Camelot: PDF Table Parsing for Humans
 =====================================
-Camelot is a Python 2.7 library and command-line tool for extracting tabular data from PDF files.
+Release v\ |version|. (:ref:`Installation <install>`)
-Why another pdf table parsing library?
+.. image:: https://img.shields.io/badge/license-MIT-lightgrey.svg
-======================================
+    :target: https://pypi.org/project/camelot-py/
-We tried a lot of tools available online to parse tables from pdf files. `PDFTables`_, `SolidConverter`_ are closed source, commercial products and a free trial doesn't last forever. `Tabula`_, which is open source, isn't very scalable. We found nothing that gave us complete control over the parsing process. In most cases, we didn't get the correct output and had to resort to writing custom scripts for each type of pdf.
+.. image:: https://img.shields.io/badge/python-2.7-blue.svg
    :target: https://pypi.org/project/camelot-py/
-.. _PDFTables: https://pdftables.com/
+**Camelot** is a Python library and command-line tool for extracting tables from PDF files.
 .. _SolidConverter: http://www.soliddocuments.com/pdf/-to-word-converter/304/1
 .. _Tabula: http://tabula.technology/
-Some background
+.. note:: Camelot only works with:
 ===============
-PDF started as `The Camelot Project`_ when people wanted a cross-platform way for sending and viewing documents. A pdf file contains characters placed at specific x,y-coordinates. Spaces are simulated by placing characters relatively far apart.
+          - Python 2, with **Python 3** support `on the way`_.
          - Text-based PDFs and not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer, then your PDF is text-based. Support for image-based PDFs using **OCR** is `planned`_.
-Camelot uses two methods to parse tables from PDFs, :doc:`lattice <lattice>` and :doc:`stream <stream>`. The names were taken from Tabula but the implementation is somewhat different, though it follows the same philosophy. Lattice looks for lines between text elements while stream looks for whitespace between text elements.
+.. _on the way: https://github.com/socialcopsdev/camelot/issues/81
-
+.. _planned: https://github.com/socialcopsdev/camelot/issues/101
 .. _The Camelot Project: http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf
 Usage
-=====
+-----
 ::
-    >>> import camelot
+  >>> import camelot
-    >>> tables = camelot.read_pdf("foo.pdf")
+  >>> tables = camelot.read_pdf("foo.pdf")
-    >>> tables
+  >>> tables
-    <TableList n=2>
+  <TableList n=2>
-    >>> tables.export("foo.csv", f="csv", compress=True) # json, excel, html
+  >>> tables.export("foo.csv", f="csv", compress=True) # json, excel, html
-    >>> tables[0]
+  >>> tables[0]
-    <Table shape=(3,4)>
+  <Table shape=(3,4)>
-    >>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
+  >>> tables[0].parsing_report
-    >>> tables[0].parsing_report
+  {
-    {
+      "accuracy": 96,
-        "accuracy": 96,
+      "whitespace": 80,
-        "whitespace": 80,
+      "order": 1,
-        "order": 1,
+      "page": 1
-        "page": 1
+  }
-    }
+  >>> tables[0].to_csv("foo.csv") # to_json, to_excel, to_html
-    >>> df = tables[0].df
+  >>> tables[0].df
 .. csv-table::
   :header: "Cycle Name","KI (1/km)","Distance (mi)","Percent Fuel Savings","","",""
@ -59,87 +56,107 @@ Usage
   "2032_2","0.17","57.8","21.7%","0.3%","2.7%","1.2%"
   "4171_1","0.07","173.9","58.1%","1.6%","2.1%","0.5%"
 Installation
 ============
 Make sure you have the most updated versions for `pip` and `setuptools`. You can update them by::
    pip install -U pip setuptools
 The dependencies include `tk`_ and `ghostscript`_.
 .. _tk: https://wiki.tcl.tk/3743
 .. _ghostscript: https://www.ghostscript.com/
 Installing dependencies
 -----------------------
 tk and ghostscript can be installed using your system's default package manager.
 Linux
 ^^^^^
 * Ubuntu
 ::
-    sudo apt-get install python-opencv python-tk ghostscript
+  Usage: camelot [OPTIONS] FILEPATH
-* Arch Linux
+  Options:
    -p, --pages TEXT                Comma-separated page numbers to parse.
                                    Example: 1,3,4 or 1,4-end
    -o, --output TEXT               Output filepath.
    -f, --format [csv|json|excel|html]
                                    Output file format.
    -z, --zip                       Whether or not to create a ZIP archive.
    -m, --mesh                      Whether or not to use Lattice method of
                                    parsing. Stream is used by default.
    -T, --table_area TEXT           Table areas (x1,y1,x2,y2) to process.
                                    x1, y1
                                    -> left-top and x2, y2 -> right-bottom
    -split, --split_text            Whether or not to split text if it spans
                                    across multiple cells.
    -flag, --flag_size              (inactive) Whether or not to flag text which
                                    has uncommon size. (Useful to detect
                                    super/subscripts)
    -M, --margins <FLOAT FLOAT FLOAT>...
                                    char_margin, line_margin, word_margin for
                                    PDFMiner.
    -C, --columns TEXT              x-coordinates of column separators.
    -r, --row_close_tol INTEGER     Rows will be formed by combining text
                                    vertically within this tolerance.
    -c, --col_close_tol INTEGER     Columns will be formed by combining text
                                    horizontally within this tolerance.
    -back, --process_background     (with --mesh) Whether or not to process
                                    lines that are in background.
    -scale, --line_size_scaling INTEGER
                                    (with --mesh) Factor by which the page
                                    dimensions will be divided to get smallest
                                    length of detected lines.
    -copy, --copy_text [h|v]        (with --mesh) Specify direction in which
                                    text will be copied over in a spanning cell.
    -shift, --shift_text [l|r|t|b]  (with --mesh) Specify direction in which
                                    text in a spanning cell should flow.
    -l, --line_close_tol INTEGER    (with --mesh) Tolerance parameter used to
                                    merge close vertical lines and close
                                    horizontal lines.
    -j, --joint_close_tol INTEGER   (with --mesh) Tolerance parameter used to
                                    decide whether the detected lines and points
                                    lie close to each other.
    -block, --threshold_blocksize INTEGER
                                    (with --mesh) For adaptive thresholding,
                                    size of a pixel neighborhood that is used to
                                    calculate a threshold value for the pixel:
                                    3, 5, 7, and so on.
    -const, --threshold_constant INTEGER
                                    (with --mesh) For adaptive thresholding,
                                    constant subtracted from the mean or
                                    weighted mean.
                                    Normally, it is positive but
                                    may be zero or negative as well.
    -I, --iterations INTEGER        (with --mesh) Number of times for
                                    erosion/dilation is applied.
    -G, --geometry_type [text|table|contour|joint|line]
                                    Plot geometry found on pdf page for
                                    debugging.
                                    text: Plot text objects. (Useful to get
                                          table_area and columns coordinates)
                                    table: Plot parsed table.
                                    contour (with --mesh): Plot detected rectangles.
                                    joint (with --mesh): Plot detected line intersections.
                                    line (with --mesh): Plot detected lines.
    --help                          Show this message and exit.
-::
+The User Guide
 --------------
-    sudo pacman -S opencv tk ghostscript
+This part of the documentation, which is mostly prose, begins with some
-
+background information about Camelot, then focuses on step-by-step
-OS X
+instructions for getting the most out of Camelot.
 ^^^^
 ::
    brew install homebrew/science/opencv ghostscript
 Finally, `cd` into the project directory and install by::
    python setup.py install
 API Reference
 =============
 See :doc:`API doc <api>`.
 Development
 ===========
 Code
 ----
 You can check the latest sources with the command::
    git clone https://github.com/socialcopsdev/camelot.git
 Contributing
 ------------
 See :doc:`Contributing guidelines <contributing>`.
 Testing
 -------
 ::
    python setup.py test
 License
 =======
 MIT License
 Sitemap
 =======
 .. toctree::
   :maxdepth: 2
-    lattice
+   user/intro
-    stream
+   user/install
-    contributing
+   user/quickstart
-    api
+
 The API Documentation / Guide
 -----------------------------
 If you are looking for information on a specific function, class, or method,
 this part of the documentation is for you.
 .. toctree::
   :maxdepth: 2
   api
 The Contributor Guide
 ---------------------
 If you want to contribute to the project, this part of the documentation is for
 you.
 .. toctree::
   :maxdepth: 2
   dev/contributing
--- a/docs/user/install.rst
+++ b/docs/user/install.rst
@ -0,0 +1,44 @@
 .. _install:
 Installation
 ============
 Make sure you have the latest versions of `pip` and `setuptools`. You can update them with::
    pip install -U pip setuptools
 The dependencies include `tk`_ and `ghostscript`_.
 .. _tk: https://wiki.tcl.tk/3743
 .. _ghostscript: https://www.ghostscript.com/
 Installing dependencies
 -----------------------
 tk and ghostscript can be installed using your system's default package manager.
 Linux
 ^^^^^
 * Ubuntu
 ::
    sudo apt-get install python-opencv python-tk ghostscript
 * Arch Linux
 ::
    sudo pacman -S opencv tk ghostscript
 OS X
 ^^^^
 ::
    brew install homebrew/science/opencv ghostscript
 Finally, `cd` into the project directory and install by::
    python setup.py install
--- a/docs/user/intro.rst
+++ b/docs/user/intro.rst
@ -0,0 +1,19 @@
 PDF started as `The Camelot Project`_ when people wanted a cross-platform way for sending and viewing documents. A pdf file contains characters placed at specific x,y-coordinates. Spaces are simulated by placing characters relatively far apart.
 Camelot uses two methods to parse tables from PDFs, :doc:`lattice <lattice>` and :doc:`stream <stream>`. The names were taken from Tabula but the implementation is somewhat different, though it follows the same philosophy. Lattice looks for lines between text elements while stream looks for whitespace between text elements.
 .. _The Camelot Project: http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf
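To make the point about simulated spaces concrete, here is a minimal sketch of how words could be recovered from positioned characters. The ``(character, x0, x1)`` tuples, the ``group_into_words`` helper and the ``gap_threshold`` value are illustrative assumptions, not part of Camelot's API.

```python
# Hypothetical input: (character, x0, x1) triples for one text line, in PDF points.
def group_into_words(chars, gap_threshold=2.0):
    """Join characters into words, starting a new word at each large horizontal gap."""
    words = []
    current = ""
    prev_x1 = None
    for ch, x0, x1 in sorted(chars, key=lambda c: c[1]):
        # A gap wider than the threshold is treated as a simulated space.
        if prev_x1 is not None and x0 - prev_x1 > gap_threshold:
            words.append(current)
            current = ""
        current += ch
        prev_x1 = x1
    if current:
        words.append(current)
    return words


chars = [
    ("P", 10.0, 16.0), ("D", 16.5, 23.0), ("F", 23.5, 29.0),
    ("t", 40.0, 43.0), ("a", 43.2, 48.0), ("b", 48.3, 53.0),
]
print(group_into_words(chars))  # ['PDF', 'tab']
```

The threshold is the knob that decides how far apart two characters must be before they count as separate words; a real text extractor would tune it per font size.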
 Why another pdf table parsing library?
 ======================================
 We tried a lot of tools available online to parse tables from pdf files. `PDFTables`_, `SolidConverter`_ are closed source, commercial products and a free trial doesn't last forever. `Tabula`_, which is open source, isn't very scalable. We found nothing that gave us complete control over the parsing process. In most cases, we didn't get the correct output and had to resort to writing custom scripts for each type of pdf.
 .. _PDFTables: https://pdftables.com/
 .. _SolidConverter: http://www.soliddocuments.com/pdf/-to-word-converter/304/1
 .. _Tabula: http://tabula.technology/
 License
 =======
 MIT License
--- a/docs/user/lattice.rst
+++ b/docs/user/lattice.rst
@ -1,6 +1,5 @@
 .. _lattice:
 =======
 Lattice
 =======
@ -18,7 +17,7 @@ Line segments are detected in the first step.
 .. .. _this: insert link for us-030.pdf
-.. image:: assets/line.png
+.. image:: ../_static/user/line.png
   :height: 674
   :width: 1366
   :scale: 50%
@ -26,7 +25,7 @@ Line segments are detected in the first step.
 The detected line segments are overlapped by `and` ing their pixel intensities to find intersections.
-.. image:: assets/intersection.png
+.. image:: ../_static/user/intersection.png
   :height: 674
   :width: 1366
   :scale: 50%
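The intersection step can be sketched on tiny binary masks. This is an illustration under stated assumptions, using plain Python lists rather than real page images; ``and_masks`` and the sample masks are hypothetical names and data, not Camelot's implementation.

```python
def and_masks(vertical, horizontal):
    """Pixel-wise AND of two binary masks of the same shape."""
    return [
        [v & h for v, h in zip(v_row, h_row)]
        for v_row, h_row in zip(vertical, horizontal)
    ]


# 1 marks a pixel that belongs to a detected line segment.
vertical = [
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
]
horizontal = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

joints = and_masks(vertical, horizontal)
# Collect (x, y) pixel coordinates where a vertical and a horizontal line meet.
points = [(x, y) for y, row in enumerate(joints) for x, value in enumerate(row) if value]
print(points)  # [(1, 0), (3, 0), (1, 2), (3, 2)]
```

Only pixels set in both masks survive the AND, which is why the result contains exactly the line crossings.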
@ -34,7 +33,7 @@ The detected line segments are overlapped by `and` ing their pixel intensities t
 The detected line segments are overlapped again, this time by `or` ing their pixel intensities and outermost contours are computed to identify potential table boundaries. This helps Lattice in detecting more than one table on a single page.
-.. image:: assets/contour.png
+.. image:: ../_static/user/contour.png
   :height: 674
   :width: 1366
   :scale: 50%
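A rough sketch of the boundary step, again on toy binary masks. It ORs the two masks and takes the bounding box of all set pixels, which is a simplification that only handles a single table; a real implementation would run contour detection to separate several tables. ``or_masks``, ``bounding_box`` and the sample data are hypothetical.

```python
def or_masks(a, b):
    """Pixel-wise OR of two binary masks of the same shape."""
    return [[p | q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) of all set pixels, or None if the mask is empty."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


vertical = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
horizontal = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

combined = or_masks(vertical, horizontal)
print(bounding_box(combined))  # (1, 1, 3, 3)
```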
@ -42,7 +41,7 @@ The detected line segments are overlapped again, this time by `or` ing their pix
 Since the dimensions of a pdf and its image differ, table contours, intersections and segments are scaled and translated to the pdf's coordinate space. A representation of the table is then created using these scaled coordinates.
-.. image:: assets/table.png
+.. image:: ../_static/user/table.png
   :height: 674
   :width: 1366
   :scale: 50%
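The scaling and translation can be sketched as follows. The sketch assumes the image origin is at the top-left with y growing downward and the pdf origin is at the bottom-left with y growing upward; ``image_to_pdf`` and the page sizes are illustrative, not taken from Camelot's source.

```python
def image_to_pdf(x, y, image_size, pdf_size):
    """Map an image pixel coordinate to pdf coordinate space.

    Assumes a top-left image origin and a bottom-left pdf origin.
    """
    img_w, img_h = image_size
    pdf_w, pdf_h = pdf_size
    scale_x = pdf_w / float(img_w)
    scale_y = pdf_h / float(img_h)
    # Scale both axes, then flip y so it is measured from the bottom of the page.
    return (x * scale_x, pdf_h - y * scale_y)


# A 612 x 792 point page rendered at twice its size.
print(image_to_pdf(100, 200, image_size=(1224, 1584), pdf_size=(612, 792)))  # (50.0, 692.0)
```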
@ -50,7 +49,7 @@ Since dimensions of a pdf and its image vary; table contours, intersections and
 Spanning cells are then detected using the line segments and intersections.
-.. image:: assets/table_span.png
+.. image:: ../_static/user/table_span.png
   :height: 674
   :width: 1366
   :scale: 50%
@ -86,7 +85,7 @@ Let's consider this pdf file.
 .. .. _this: insert link for row_span_1.pdf
-.. image:: assets/scale_1.png
+.. image:: ../_static/user/scale_1.png
   :height: 674
   :width: 1366
   :scale: 50%
@ -94,7 +93,7 @@ Let's consider this pdf file.
 Clearly, it couldn't detect those small lines in the lower left part. Therefore, we need to increase the value of scale. Let's try a value of 40.
-.. image:: assets/scale_2.png
+.. image:: ../_static/user/scale_2.png
   :height: 674
   :width: 1366
   :scale: 50%
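Why a larger scale picks up shorter lines can be seen with a little arithmetic. The sketch below assumes, following the ``--line_size_scaling`` help text, that the smallest detectable line length is the page dimension divided by the scale; the 1200-pixel dimension and the starting value of 15 are assumptions made for illustration.

```python
def smallest_line_length(dimension_px, scale):
    """Shortest line (in pixels) that would still be detected for a given scale."""
    return dimension_px // scale


for scale in (15, 40):
    print(scale, smallest_line_length(1200, scale))
# 15 80
# 40 30
```

Under these assumptions, raising the scale from 15 to 40 lowers the minimum detectable length from 80 to 30 pixels, so short ruling lines are no longer discarded.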
--- a/docs/user/quickstart.rst
+++ b/docs/user/quickstart.rst
@ -0,0 +1,5 @@
 .. toctree::
   :maxdepth: 2
   lattice
   stream
--- a/docs/user/stream.rst
+++ b/docs/user/stream.rst
@ -1,6 +1,5 @@
 .. _stream:
 ======
 Stream
 ======
@ -69,7 +68,7 @@ We can also specify the column x-coordinates. We need to call Stream with debug=
    >>> manager = Pdf(Stream(debug=True), 'mexican_towns.pdf')
    >>> manager.debug_plot()
-.. image:: assets/columns.png
+.. image:: ../_static/user/columns.png
   :height: 674
   :width: 1366
   :scale: 50%