Add install.rst
parent 76e671124d
commit 40404d1f4a
@ -31,13 +31,13 @@ There's a [command-line interface]() too!
|
|||
|
||||
- **You are in control**: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (Since everything in the real world, including PDF table extraction, is fuzzy.)
|
||||
- **Metrics**: *Bad* tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
|
||||
- Each table is a pandas DataFrame, which enables seamless integration into data analysis workflows.
|
||||
- Export to multiple formats, including json, excel and html.
|
||||
- Simple and Elegant API, written in Python!
|
||||
- Each table is a **pandas DataFrame**, which enables seamless integration into data analysis workflows.
|
||||
- **Export** to multiple formats, including json, excel and html.
|
||||
- Simple and Elegant API, written in **Python**!
|
||||
|
||||
## Installation
|
||||
|
||||
After [installing dependencies](), you can simply use pip:
|
||||
After [installing the dependencies](), you can simply use pip to install Camelot:
|
||||
|
||||
<pre>
|
||||
$ pip install camelot-py
|
||||
|
|
|
|||
|
|
@ -55,23 +55,30 @@ Why Camelot?
|
|||
------------
|
||||
- **You are in control**: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (Since everything in the real world, including PDF table extraction, is fuzzy.)
|
||||
- **Metrics**: *Bad* tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
|
||||
- Each table is a pandas DataFrame, which enables seamless integration into data analysis workflows.
|
||||
- Export to multiple formats, including json, excel and html.
|
||||
- Simple and Elegant API, written in Python!
|
||||
- Each table is a **pandas DataFrame**, which enables seamless integration into data analysis workflows.
|
||||
- **Export** to multiple formats, including json, excel and html.
|
||||
- Simple and Elegant API, written in **Python**!
|
||||
|
||||
The User Guide
|
||||
--------------
|
||||
|
||||
This part of the documentation begins with some background information about why Camelot was created, takes a small dip into the implementation details, and then focuses on step-by-step instructions for getting the most out of Camelot.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
user/intro
|
||||
user/install
|
||||
user/quickstart
|
||||
user/advanced
|
||||
user/cli
|
||||
|
||||
The API Documentation / Guide
|
||||
-----------------------------
|
||||
|
||||
If you are looking for information on a specific function, class, or method,
|
||||
this part of the documentation is for you.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
|
|
@ -80,6 +87,9 @@ The API Documentation / Guide
|
|||
The Contributor Guide
|
||||
---------------------
|
||||
|
||||
If you want to contribute to the project, this part of the documentation is for
|
||||
you.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
|
|
|
|||
|
|
@ -1,7 +1,10 @@
|
|||
.. _lattice:
|
||||
.. _advanced:
|
||||
|
||||
Advanced Usage
|
||||
==============
|
||||
|
||||
Lattice
|
||||
=======
|
||||
-------
|
||||
|
||||
Lattice method is designed to work on pdf files which have tables with well-defined grids. It looks for lines on a page to form a table.
|
||||
|
||||
|
|
@ -77,7 +80,7 @@ Finally, the characters found on the page are assigned to cells based on their x
|
|||
"4171_1","0.07","173.9","58.1%","1.6%","2.1%","0.5%"
|
||||
|
||||
Scale
|
||||
-----
|
||||
^^^^^
|
||||
|
||||
The scale parameter is used to determine the length of the structuring element used for morphological transformations. The length of vertical and horizontal structuring elements are found by dividing the image's height and width respectively, by `scale`. Large `scale` will lead to a smaller structuring element, which means that smaller lines will be detected. The default value for scale is 15.
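The relationship between `scale` and the structuring element length can be sketched as below. This is only an illustration of the arithmetic described above, not Camelot's own code: the function name `structuring_element_lengths` is hypothetical, and integer division is an assumption.

```python
def structuring_element_lengths(image_height, image_width, scale=15):
    """Return (vertical_length, horizontal_length) in pixels.

    Hypothetical helper: divides the image height and width by `scale`,
    as described in the text. Integer division is an assumption.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return image_height // scale, image_width // scale


# A larger scale gives shorter structuring elements, so shorter lines qualify.
default_lengths = structuring_element_lengths(1500, 1200, scale=15)
larger_scale_lengths = structuring_element_lengths(1500, 1200, scale=40)
print(default_lengths, larger_scale_lengths)
```

For a 1500 x 1200 pixel image, raising `scale` from 15 to 40 shrinks both lengths, which is why smaller lines start to be detected.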
|
||||
|
||||
|
|
@ -102,7 +105,7 @@ Clearly, it couldn't detected those small lines in the lower left part. Therefor
|
|||
Voila! It detected the smaller lines.
|
||||
|
||||
Fill
|
||||
----
|
||||
^^^^
|
||||
|
||||
In the file used above, you can see that some cells spanned a lot of rows, `fill` just copies the same value to all rows/columns of a spanning cell. You can apply fill horizontally, vertically or both. Let us fill the output for the file we used above, vertically.
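As a rough sketch of what a vertical fill does, the snippet below copies the value from the cell above into every empty cell. It is a simplification under stated assumptions: the helper `fill_vertical` is hypothetical, and it treats any empty string as part of a spanning cell, whereas Lattice knows which cells actually span.

```python
def fill_vertical(rows):
    """Copy the value from the cell above into each empty cell.

    Hypothetical helper: assumes every row has the same number of cells
    and that an empty string marks a cell covered by a spanning cell.
    """
    filled = [list(rows[0])] if rows else []
    for row in rows[1:]:
        above = filled[-1]
        filled.append(
            [cell if cell != "" else above[i] for i, cell in enumerate(row)]
        )
    return filled


table = [["A", "1"], ["", "2"], ["", "3"], ["B", "4"]]
filled_table = fill_vertical(table)
print(filled_table)
```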
|
||||
|
||||
|
|
@ -159,7 +162,7 @@ In the file used above, you can see that some cells spanned a lot of rows, `fill
|
|||
"Source: Data...","","",""
|
||||
|
||||
Invert
|
||||
------
|
||||
^^^^^^
|
||||
|
||||
To find line segments, Lattice needs the lines of the pdf file to be in foreground. So, if you encounter a file like this, just set invert to True.
|
||||
|
||||
|
|
@ -190,3 +193,134 @@ Lattice can also parse pdf files with tables like these that are rotated clockwi
|
|||
.. .. _these: insert link for left_rotated_table.pdf
|
||||
|
||||
You can call Lattice with debug={'line', 'intersection', 'contour', 'table'}, and call `debug_plot()` which will generate an image like the ones on this page, with the help of which you can modify various parameters. See :doc:`API doc <api>` for more information.
|
||||
|
||||
Stream
|
||||
------
|
||||
|
||||
Stream method is the complete opposite of Lattice and works on pdf files which have text placed uniformly apart across rows to simulate a table. It looks for spaces between text to form a table representation.
|
||||
|
||||
Stream builds on top of PDFMiner's functionality of grouping characters on a page into words and sentences. After getting these words, it groups them into rows based on their y-coordinates and tries to guess the number of columns a pdf table might have by calculating the mode of the number of words in each row. Additionally, the user can specify the number of columns or column x-coordinates.
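The row grouping and column guess described above can be sketched roughly as follows. This is a minimal illustration, not Stream's implementation: the functions `group_rows` and `guess_column_count`, the `(text, x, y)` word format, and the `y_tolerance` parameter are all assumptions made for the example.

```python
from collections import Counter


def group_rows(words, y_tolerance=2.0):
    """Group (text, x, y) tuples into rows of text, top of the page first.

    Hypothetical helper: words whose y-coordinates lie within
    `y_tolerance` of the current row's y are placed in that row.
    """
    rows = []
    # PDF y-coordinates grow upwards, so sort by descending y.
    for text, x, y in sorted(words, key=lambda w: (-w[2], w[1])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["words"].append((text, x))
        else:
            rows.append({"y": y, "words": [(text, x)]})
    # Order the words in each row from left to right.
    for row in rows:
        row["words"].sort(key=lambda w: w[1])
    return [[text for text, _ in row["words"]] for row in rows]


def guess_column_count(rows):
    """Return the mode of the number of words per row."""
    counts = Counter(len(row) for row in rows)
    return counts.most_common(1)[0][0]


words = [
    ("Variable", 10, 700), ("Mean", 100, 700), ("Max", 200, 700.5),
    ("Age", 10, 680), ("50.8", 100, 680), ("90", 200, 680),
    ("Source:", 10, 660),
]
rows = group_rows(words)
print(rows, guess_column_count(rows))
```

Using the mode means that a stray single-word row, such as a caption or a source note, does not change the guessed column count.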
|
||||
|
||||
Let's run it on this pdf.
|
||||
|
||||
::
|
||||
|
||||
>>> from camelot.pdf import Pdf
|
||||
>>> from camelot.stream import Stream
|
||||
|
||||
>>> manager = Pdf(Stream(), 'eu-027.pdf')
|
||||
>>> tables = manager.extract()
|
||||
>>> print tables['page-1']['table-1']['data']
|
||||
|
||||
.. .. _this: insert link for eu-027.pdf
|
||||
|
||||
.. csv-table::
|
||||
|
||||
"C","Appendix C:...","","",""
|
||||
"","Table C1:...","","",""
|
||||
"","This table...","","",""
|
||||
"Variable","Mean","Std. Dev.","Min","Max"
|
||||
"Age","50.8","15.9","21","90"
|
||||
"Men","0.47","0.50","0","1"
|
||||
"East","0.28","0.45","0","1"
|
||||
"Rural","0.15","0.36","0","1"
|
||||
"Married","0.57","0.50","0","1"
|
||||
"Single","0.21","0.40","0","1"
|
||||
"Divorced","0.13","0.33","0","1"
|
||||
"Widowed","0.08","0.26","0","1"
|
||||
"Separated","0.03","0.16","0","1"
|
||||
"Partner","0.65","0.48","0","1"
|
||||
"Employed","0.55","0.50","0","1"
|
||||
"Fulltime","0.34","0.47","0","1"
|
||||
"Parttime","0.20","0.40","0","1"
|
||||
"Unemployed","0.08","0.28","0","1"
|
||||
"Homemaker","0.19","0.40","0","1"
|
||||
"Retired","0.28","0.45","0","1"
|
||||
"Household size","2.43","1.22","1","9"
|
||||
"Households...","0.37","0.48","0","1"
|
||||
"Number of...","1.67","1.38","0","8"
|
||||
"Lower...","0.08","0.27","0","1"
|
||||
"Upper...","0.60","0.49","0","1"
|
||||
"Post...","0.12","0.33","0","1"
|
||||
"First...","0.17","0.38","0","1"
|
||||
"Other...","0.03","0.17","0","1"
|
||||
"Household...","2,127","1,389","22","22,500"
|
||||
"Gross...","187,281","384,198","0","7,720,000"
|
||||
"Gross...","38,855","114,128","0","2,870,000"
|
||||
"","Source:...","","",""
|
||||
"","","","","ECB"
|
||||
"","","","","Working..."
|
||||
"","","","","Febuary..."
|
||||
|
||||
We can also specify the column x-coordinates. We need to call Stream with debug=True and use matplotlib's interface to note down the column x-coordinates we need. Let's try it on this pdf file.
|
||||
|
||||
::
|
||||
|
||||
>>> from camelot.pdf import Pdf
|
||||
>>> from camelot.stream import Stream
|
||||
|
||||
>>> manager = Pdf(Stream(debug=True), 'mexican_towns.pdf')
|
||||
>>> manager.debug_plot()
|
||||
|
||||
.. image:: ../_static/png/columns.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
After getting the x-coordinates, we just need to pass them to Stream, like this.
|
||||
|
||||
::
|
||||
|
||||
>>> from camelot.pdf import Pdf
|
||||
>>> from camelot.stream import Stream
|
||||
|
||||
>>> manager = Pdf(Stream(columns=['28,67,180,230,425,475,700']), 'mexican_towns.pdf')
|
||||
>>> tables = manager.extract()
|
||||
>>> print tables['page-1']['table-1']['data']
|
||||
|
||||
.. csv-table::
|
||||
|
||||
"Clave","","Clave","","Clave",""
|
||||
"","Nombre Entidad","","Nombre Municipio","","Nombre Localidad"
|
||||
"Entidad","","Municipio","","Localidad",""
|
||||
"01","Aguascalientes","001","Aguascalientes","0094","Granja Adelita"
|
||||
"01","Aguascalientes","001","Aguascalientes","0096","Agua Azul"
|
||||
"01","Aguascalientes","001","Aguascalientes","0100","Rancho Alegre"
|
||||
"01","Aguascalientes","001","Aguascalientes","0102","Los Arbolitos [Rancho]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0104","Ardillas de Abajo (Las Ardillas)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0106","Arellano"
|
||||
"01","Aguascalientes","001","Aguascalientes","0112","Bajío los Vázquez"
|
||||
"01","Aguascalientes","001","Aguascalientes","0113","Bajío de Montoro"
|
||||
"01","Aguascalientes","001","Aguascalientes","0114","Residencial San Nicolás [Baños la Cantera]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0120","Buenavista de Peñuelas"
|
||||
"01","Aguascalientes","001","Aguascalientes","0121","Cabecita 3 Marías (Rancho Nuevo)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0125","Cañada Grande de Cotorina"
|
||||
"01","Aguascalientes","001","Aguascalientes","0126","Cañada Honda [Estación]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0127","Los Caños"
|
||||
"01","Aguascalientes","001","Aguascalientes","0128","El Cariñán"
|
||||
"01","Aguascalientes","001","Aguascalientes","0129","El Carmen [Granja]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0135","El Cedazo (Cedazo de San Antonio)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0138","Centro de Arriba (El Taray)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0139","Cieneguilla (La Lumbrera)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0141","Cobos"
|
||||
"01","Aguascalientes","001","Aguascalientes","0144","El Colorado (El Soyatal)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0146","El Conejal"
|
||||
"01","Aguascalientes","001","Aguascalientes","0157","Cotorina de Abajo"
|
||||
"01","Aguascalientes","001","Aguascalientes","0162","Coyotes"
|
||||
"01","Aguascalientes","001","Aguascalientes","0166","La Huerta (La Cruz)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0170","Cuauhtémoc (Las Palomas)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0171","Los Cuervos (Los Ojos de Agua)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0172","San José [Granja]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0176","La Chiripa"
|
||||
"01","Aguascalientes","001","Aguascalientes","0182","Dolores"
|
||||
"01","Aguascalientes","001","Aguascalientes","0183","Los Dolores"
|
||||
"01","Aguascalientes","001","Aguascalientes","0190","El Duraznillo"
|
||||
"01","Aguascalientes","001","Aguascalientes","0191","Los Durón"
|
||||
"01","Aguascalientes","001","Aguascalientes","0197","La Escondida"
|
||||
"01","Aguascalientes","001","Aguascalientes","0201","Brande Vin [Bodegas]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0207","Valle Redondo"
|
||||
"01","Aguascalientes","001","Aguascalientes","0209","La Fortuna"
|
||||
"01","Aguascalientes","001","Aguascalientes","0212","Lomas del Gachupín"
|
||||
"01","Aguascalientes","001","Aguascalientes","0213","El Carmen (Gallinas Güeras) [Rancho]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0216","La Gloria"
|
||||
|
|
@ -1,44 +1,38 @@
|
|||
.. _install:
|
||||
|
||||
Installation
|
||||
============
|
||||
Installation of Camelot
|
||||
=======================
|
||||
|
||||
Make sure you have the most updated versions for `pip` and `setuptools`. You can update them by::
|
||||
|
||||
$ pip install -U pip setuptools
|
||||
|
||||
The dependencies include `tk`_ and `ghostscript`_.
|
||||
This part of the documentation covers the installation of Camelot. First, you'll need to install the dependencies, which include `tk`_ and `ghostscript`_.
|
||||
|
||||
.. _tk: https://wiki.tcl.tk/3743
|
||||
.. _ghostscript: https://www.ghostscript.com/
|
||||
|
||||
Installing dependencies
|
||||
-----------------------
|
||||
|
||||
tk and ghostscript can be installed using your system's default package manager.
|
||||
|
||||
Linux
|
||||
^^^^^
|
||||
|
||||
* Ubuntu
|
||||
|
||||
These can be installed using your system's package manager. If you use Ubuntu, run the following:
|
||||
::
|
||||
|
||||
$ sudo apt-get install python-opencv python-tk ghostscript
|
||||
$ sudo apt install python-tk ghostscript
|
||||
|
||||
* Arch Linux
|
||||
$ pip install camelot-py
|
||||
------------------------
|
||||
|
||||
After installing the dependencies, you can simply use pip to install Camelot:
|
||||
::
|
||||
|
||||
$ sudo pacman -S opencv tk ghostscript
|
||||
$ pip install camelot-py
|
||||
|
||||
OS X
|
||||
^^^^
|
||||
Get the Source Code
|
||||
-------------------
|
||||
|
||||
Alternatively, you can install from source by:
|
||||
|
||||
1. Cloning the GitHub repository.
|
||||
::
|
||||
|
||||
$ brew install homebrew/science/opencv ghostscript
|
||||
$ git clone https://www.github.com/socialcopsdev/camelot
|
||||
|
||||
Finally, `cd` into the project directory and install by::
|
||||
2. And then simply using pip again.
|
||||
::
|
||||
|
||||
$ python setup.py install
|
||||
$ cd camelot
|
||||
$ pip install .
|
||||
|
|
@ -1,5 +1,4 @@
|
|||
.. toctree::
|
||||
:maxdepth: 2
|
||||
.. _quickstart:
|
||||
|
||||
lattice
|
||||
stream
|
||||
Quickstart
|
||||
==========
|
||||
|
|
@ -1,132 +0,0 @@
|
|||
.. _stream:
|
||||
|
||||
Stream
|
||||
======
|
||||
|
||||
Stream method is the complete opposite of Lattice and works on pdf files which have text placed uniformly apart across rows to simulate a table. It looks for spaces between text to form a table representation.
|
||||
|
||||
Stream builds on top of PDFMiner's functionality of grouping characters on a page into words and sentences. After getting these words, it groups them into rows based on their y-coordinates and tries to guess the number of columns a pdf table might have by calculating the mode of the number of words in each row. Additionally, the user can specify the number of columns or column x-coordinates.
|
||||
|
||||
Let's run it on this pdf.
|
||||
|
||||
::
|
||||
|
||||
>>> from camelot.pdf import Pdf
|
||||
>>> from camelot.stream import Stream
|
||||
|
||||
>>> manager = Pdf(Stream(), 'eu-027.pdf')
|
||||
>>> tables = manager.extract()
|
||||
>>> print tables['page-1']['table-1']['data']
|
||||
|
||||
.. .. _this: insert link for eu-027.pdf
|
||||
|
||||
.. csv-table::
|
||||
|
||||
"C","Appendix C:...","","",""
|
||||
"","Table C1:...","","",""
|
||||
"","This table...","","",""
|
||||
"Variable","Mean","Std. Dev.","Min","Max"
|
||||
"Age","50.8","15.9","21","90"
|
||||
"Men","0.47","0.50","0","1"
|
||||
"East","0.28","0.45","0","1"
|
||||
"Rural","0.15","0.36","0","1"
|
||||
"Married","0.57","0.50","0","1"
|
||||
"Single","0.21","0.40","0","1"
|
||||
"Divorced","0.13","0.33","0","1"
|
||||
"Widowed","0.08","0.26","0","1"
|
||||
"Separated","0.03","0.16","0","1"
|
||||
"Partner","0.65","0.48","0","1"
|
||||
"Employed","0.55","0.50","0","1"
|
||||
"Fulltime","0.34","0.47","0","1"
|
||||
"Parttime","0.20","0.40","0","1"
|
||||
"Unemployed","0.08","0.28","0","1"
|
||||
"Homemaker","0.19","0.40","0","1"
|
||||
"Retired","0.28","0.45","0","1"
|
||||
"Household size","2.43","1.22","1","9"
|
||||
"Households...","0.37","0.48","0","1"
|
||||
"Number of...","1.67","1.38","0","8"
|
||||
"Lower...","0.08","0.27","0","1"
|
||||
"Upper...","0.60","0.49","0","1"
|
||||
"Post...","0.12","0.33","0","1"
|
||||
"First...","0.17","0.38","0","1"
|
||||
"Other...","0.03","0.17","0","1"
|
||||
"Household...","2,127","1,389","22","22,500"
|
||||
"Gross...","187,281","384,198","0","7,720,000"
|
||||
"Gross...","38,855","114,128","0","2,870,000"
|
||||
"","Source:...","","",""
|
||||
"","","","","ECB"
|
||||
"","","","","Working..."
|
||||
"","","","","Febuary..."
|
||||
|
||||
We can also specify the column x-coordinates. We need to call Stream with debug=True and use matplotlib's interface to note down the column x-coordinates we need. Let's try it on this pdf file.
|
||||
|
||||
::
|
||||
|
||||
>>> from camelot.pdf import Pdf
|
||||
>>> from camelot.stream import Stream
|
||||
|
||||
>>> manager = Pdf(Stream(debug=True), 'mexican_towns.pdf')
|
||||
>>> manager.debug_plot()
|
||||
|
||||
.. image:: ../_static/png/columns.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
After getting the x-coordinates, we just need to pass them to Stream, like this.
|
||||
|
||||
::
|
||||
|
||||
>>> from camelot.pdf import Pdf
|
||||
>>> from camelot.stream import Stream
|
||||
|
||||
>>> manager = Pdf(Stream(columns=['28,67,180,230,425,475,700']), 'mexican_towns.pdf')
|
||||
>>> tables = manager.extract()
|
||||
>>> print tables['page-1']['table-1']['data']
|
||||
|
||||
.. csv-table::
|
||||
|
||||
"Clave","","Clave","","Clave",""
|
||||
"","Nombre Entidad","","Nombre Municipio","","Nombre Localidad"
|
||||
"Entidad","","Municipio","","Localidad",""
|
||||
"01","Aguascalientes","001","Aguascalientes","0094","Granja Adelita"
|
||||
"01","Aguascalientes","001","Aguascalientes","0096","Agua Azul"
|
||||
"01","Aguascalientes","001","Aguascalientes","0100","Rancho Alegre"
|
||||
"01","Aguascalientes","001","Aguascalientes","0102","Los Arbolitos [Rancho]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0104","Ardillas de Abajo (Las Ardillas)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0106","Arellano"
|
||||
"01","Aguascalientes","001","Aguascalientes","0112","Bajío los Vázquez"
|
||||
"01","Aguascalientes","001","Aguascalientes","0113","Bajío de Montoro"
|
||||
"01","Aguascalientes","001","Aguascalientes","0114","Residencial San Nicolás [Baños la Cantera]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0120","Buenavista de Peñuelas"
|
||||
"01","Aguascalientes","001","Aguascalientes","0121","Cabecita 3 Marías (Rancho Nuevo)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0125","Cañada Grande de Cotorina"
|
||||
"01","Aguascalientes","001","Aguascalientes","0126","Cañada Honda [Estación]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0127","Los Caños"
|
||||
"01","Aguascalientes","001","Aguascalientes","0128","El Cariñán"
|
||||
"01","Aguascalientes","001","Aguascalientes","0129","El Carmen [Granja]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0135","El Cedazo (Cedazo de San Antonio)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0138","Centro de Arriba (El Taray)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0139","Cieneguilla (La Lumbrera)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0141","Cobos"
|
||||
"01","Aguascalientes","001","Aguascalientes","0144","El Colorado (El Soyatal)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0146","El Conejal"
|
||||
"01","Aguascalientes","001","Aguascalientes","0157","Cotorina de Abajo"
|
||||
"01","Aguascalientes","001","Aguascalientes","0162","Coyotes"
|
||||
"01","Aguascalientes","001","Aguascalientes","0166","La Huerta (La Cruz)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0170","Cuauhtémoc (Las Palomas)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0171","Los Cuervos (Los Ojos de Agua)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0172","San José [Granja]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0176","La Chiripa"
|
||||
"01","Aguascalientes","001","Aguascalientes","0182","Dolores"
|
||||
"01","Aguascalientes","001","Aguascalientes","0183","Los Dolores"
|
||||
"01","Aguascalientes","001","Aguascalientes","0190","El Duraznillo"
|
||||
"01","Aguascalientes","001","Aguascalientes","0191","Los Durón"
|
||||
"01","Aguascalientes","001","Aguascalientes","0197","La Escondida"
|
||||
"01","Aguascalientes","001","Aguascalientes","0201","Brande Vin [Bodegas]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0207","Valle Redondo"
|
||||
"01","Aguascalientes","001","Aguascalientes","0209","La Fortuna"
|
||||
"01","Aguascalientes","001","Aguascalientes","0212","Lomas del Gachupín"
|
||||
"01","Aguascalientes","001","Aguascalientes","0213","El Carmen (Gallinas Güeras) [Rancho]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0216","La Gloria"
|
||||