Add quickstart
parent
40404d1f4a
commit
3a980a46c1
|
|
@ -0,0 +1,3 @@
|
|||
Be cordial or be on your way. -- Kenneth Reitz
|
||||
|
||||
https://www.kennethreitz.org/essays/be-cordial-or-be-on-your-way
|
||||
|
|
@ -0,0 +1 @@
|
|||
# Contribution Guidelines
|
||||
25
README.md
|
|
@ -4,20 +4,22 @@
|
|||
|
||||
**Camelot** is a Python library which makes it easy for *anyone* to extract tables from PDF files!
|
||||
|
||||
## Usage
|
||||
---
|
||||
|
||||
**Here's how you can extract tables from PDF files.** Check out the PDF used in this example, [here](docs/_static/pdf/foo.pdf).
|
||||
|
||||
<pre>
|
||||
>>> import camelot
|
||||
>>> tables = camelot.read_pdf('foo.pdf')
|
||||
>>> tables = camelot.read_pdf('foo.pdf', mesh=True)
|
||||
>>> tables
|
||||
<TableList n=2>
|
||||
<TableList tables=1>
|
||||
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
|
||||
>>> tables[0]
|
||||
<Table shape=(3,4)>
|
||||
<Table shape=(7, 7)>
|
||||
>>> tables[0].parsing_report
|
||||
{
|
||||
'accuracy': 96,
|
||||
'whitespace': 80,
|
||||
'accuracy': 99.02,
|
||||
'whitespace': 12.24,
|
||||
'order': 1,
|
||||
'page': 1
|
||||
}
|
||||
|
|
@ -25,6 +27,15 @@
|
|||
>>> tables[0].df # get a pandas DataFrame!
|
||||
</pre>
|
||||
|
||||
| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings | | | |
|
||||
|------------|-----------|---------------|----------------------|-----------------|-----------------|----------------|
|
||||
| | | | Improved Speed | Decreased Accel | Eliminate Stops | Decreased Idle |
|
||||
| 2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |
|
||||
| 2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |
|
||||
| 4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |
|
||||
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
|
||||
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |
|
||||
|
||||
There's a [command-line interface]() too!
|
||||
|
||||
## Why Camelot?
|
||||
|
|
@ -49,7 +60,7 @@ The documentation is available at [link]().
|
|||
|
||||
## Development
|
||||
|
||||
The [Contributor's Guide]() has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.
|
||||
The [contribution guidelines](CONTRIBUTING.md) have detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.
|
||||
|
||||
### Source code
|
||||
|
||||
|
|
|
|||
|
|
@ -144,8 +144,8 @@ class Table(object):
|
|||
"""
|
||||
# pretty?
|
||||
report = {
|
||||
'accuracy': self.accuracy,
|
||||
'whitespace': self.whitespace,
|
||||
'accuracy': round(self.accuracy, 2),
|
||||
'whitespace': round(self.whitespace, 2),
|
||||
'order': self.order,
|
||||
'page': self.page
|
||||
}
|
||||
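The effect of the rounding in this hunk can be sketched with a small stand-in class. This is a minimal sketch, not Camelot's actual ``Table``: the constructor, the use of a property, and the input values are assumptions made for illustration.

```python
class Table:
    """Hypothetical stand-in for Camelot's Table, reduced to the report fields."""

    def __init__(self, accuracy, whitespace, order, page):
        self.accuracy = accuracy
        self.whitespace = whitespace
        self.order = order
        self.page = page

    @property
    def parsing_report(self):
        # Round the two float metrics to two decimal places, as in the hunk above.
        return {
            'accuracy': round(self.accuracy, 2),
            'whitespace': round(self.whitespace, 2),
            'order': self.order,
            'page': self.page,
        }


# Illustrative values only; they are not taken from a real parse.
table = Table(accuracy=99.0196, whitespace=12.2449, order=1, page=1)
report = table.parsing_report
print(report)
```

Rounding inside the report keeps the raw metrics on the object untouched while giving the short values shown in the README example.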
|
|
|
|||
|
|
@ -24,27 +24,30 @@ Release v\ |version|. (:ref:`Installation <install>`)
|
|||
.. _on the way: https://github.com/socialcopsdev/camelot/issues/81
|
||||
.. _planned: https://github.com/socialcopsdev/camelot/issues/101
|
||||
|
||||
Usage
|
||||
-----
|
||||
------------------------
|
||||
|
||||
**Here's how you can extract tables from PDF files.** Check out the PDF used in this example, `here`_.
|
||||
|
||||
.. _here: _static/pdf/foo.pdf
|
||||
|
||||
::
|
||||
|
||||
>>> import camelot
|
||||
>>> tables = camelot.read_pdf('foo.pdf')
|
||||
>>> tables
|
||||
<TableList n=2>
|
||||
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
|
||||
>>> tables[0]
|
||||
<Table shape=(3,4)>
|
||||
>>> tables[0].parsing_report
|
||||
{
|
||||
'accuracy': 96,
|
||||
'whitespace': 80,
|
||||
'order': 1,
|
||||
'page': 1
|
||||
}
|
||||
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
|
||||
>>> tables[0].df # get a pandas DataFrame!
|
||||
>>> import camelot
|
||||
>>> tables = camelot.read_pdf('foo.pdf', mesh=True)
|
||||
>>> tables
|
||||
<TableList tables=1>
|
||||
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
|
||||
>>> tables[0]
|
||||
<Table shape=(7, 7)>
|
||||
>>> tables[0].parsing_report
|
||||
{
|
||||
'accuracy': 99.02,
|
||||
'whitespace': 12.24,
|
||||
'order': 1,
|
||||
'page': 1
|
||||
}
|
||||
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
|
||||
>>> tables[0].df # get a pandas DataFrame!
|
||||
|
||||
.. csv-table::
|
||||
:file: _static/csv/foo.csv
|
||||
|
|
@ -69,6 +72,7 @@ This part of the documentation, begins with some background information about wh
|
|||
|
||||
user/intro
|
||||
user/install
|
||||
user/how-it-works
|
||||
user/quickstart
|
||||
user/advanced
|
||||
user/cli
|
||||
|
|
|
|||
|
|
@ -3,84 +3,44 @@
|
|||
Advanced Usage
|
||||
==============
|
||||
|
||||
Lattice
|
||||
-------
|
||||
This page covers some of the more advanced configurations for :ref:`Stream <stream>` and :ref:`Lattice <lattice>`.
|
||||
|
||||
Lattice method is designed to work on pdf files which have tables with well-defined grids. It looks for lines on a page to form a table.
|
||||
Plot geometry
|
||||
-------------
|
||||
|
||||
Lattice uses OpenCV to apply a set of morphological transformations (erosion and dilation) to find horizontal and vertical line segments in a pdf page after converting it to an image using imagemagick.
|
||||
You can call Lattice with debug={'line', 'intersection', 'contour', 'table'}, and call `debug_plot()` which will generate an image like the ones on this page, with the help of which you can modify various parameters. See :doc:`API doc <api>` for more information.
|
||||
|
||||
.. note:: Currently, Lattice only works on pdf files that contain text. However, we plan to add `OCR support`_ in the future.
|
||||
Process background lines
|
||||
------------------------
|
||||
|
||||
.. _OCR support: https://github.com/socialcopsdev/camelot/issues/14
|
||||
|
||||
Let's see how Lattice processes this pdf, step by step.
|
||||
|
||||
Line segments are detected in the first step.
|
||||
|
||||
.. .. _this: insert link for us-030.pdf
|
||||
|
||||
.. image:: ../_static/png/line.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
The detected line segments are overlapped by `and` ing their pixel intensities to find intersections.
|
||||
|
||||
.. image:: ../_static/png/intersection.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
The detected line segments are overlapped again, this time by `or` ing their pixel intensities and outermost contours are computed to identify potential table boundaries. This helps Lattice in detecting more than one table on a single page.
|
||||
|
||||
.. image:: ../_static/png/contour.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
Since dimensions of a pdf and its image vary; table contours, intersections and segments are scaled and translated to the pdf's coordinate space. A representation of the table is then created using these scaled coordinates.
|
||||
|
||||
.. image:: ../_static/png/table.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
Spanning cells are then detected using the line segments and intersections.
|
||||
|
||||
.. image:: ../_static/png/table_span.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
Finally, the characters found on the page are assigned to cells based on their x,y coordinates.
|
||||
To find line segments, Lattice needs the lines of the pdf file to be in foreground. So, if you encounter a file like this, just set invert to True.
|
||||
|
||||
::
|
||||
|
||||
>>> from camelot.pdf import Pdf
|
||||
>>> from camelot.lattice import Lattice
|
||||
|
||||
>>> manager = Pdf(Lattice(), 'us-030.pdf')
|
||||
>>> manager = Pdf(Lattice(invert=True), 'lines_in_background_1.pdf')
|
||||
>>> tables = manager.extract()
|
||||
>>> print tables['page-1']['table-1']['data']
|
||||
|
||||
.. csv-table::
|
||||
:header: "Cycle Name","KI (1/km)","Distance (mi)","Percent Fuel Savings","","",""
|
||||
Specify table areas
|
||||
-------------------
|
||||
|
||||
"","","","Improved Speed","Decreased Accel","Eliminate Stops","Decreased Idle"
|
||||
"2012_2","3.30","1.3","5.9%","9.5%","29.2%","17.4%"
|
||||
"2145_1","0.68","11.2","2.4%","0.1%","9.5%","2.7%"
|
||||
"4234_1","0.59","58.7","8.5%","1.3%","8.5%","3.3%"
|
||||
"2032_2","0.17","57.8","21.7%","0.3%","2.7%","1.2%"
|
||||
"4171_1","0.07","173.9","58.1%","1.6%","2.1%","0.5%"
|
||||
Specify columns
|
||||
---------------
|
||||
|
||||
Scale
|
||||
^^^^^
|
||||
Split text in spanning cells
|
||||
----------------------------
|
||||
|
||||
Flag subscripts and superscripts
|
||||
--------------------------------
|
||||
|
||||
Control how text is grouped into rows
|
||||
-------------------------------------
|
||||
|
||||
Detect small lines
|
||||
------------------
|
||||
|
||||
The ``scale`` parameter is used to determine the length of the structuring element used for morphological transformations. The lengths of the vertical and horizontal structuring elements are found by dividing the image's height and width, respectively, by ``scale``. A larger ``scale`` leads to a smaller structuring element, which means that smaller lines will be detected. The default value for ``scale`` is 15.
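The arithmetic behind ``scale`` can be sketched as follows. This is only an illustration of the division described above: the function name is hypothetical, integer division is an assumption, and the 1366 x 674 image size is borrowed from the image directives in these docs rather than from a real page render.

```python
def structuring_element_lengths(image_width, image_height, scale=15):
    """Return (horizontal_length, vertical_length) in pixels.

    Hypothetical helper: the horizontal element is derived from the image
    width and the vertical element from the image height.
    """
    horizontal = image_width // scale
    vertical = image_height // scale
    return horizontal, vertical


# Default scale: longer structuring elements, so only longer lines survive.
default_lengths = structuring_element_lengths(1366, 674)
# Larger scale: shorter structuring elements, so smaller lines are detected.
finer_lengths = structuring_element_lengths(1366, 674, scale=40)
print(default_lengths, finer_lengths)
```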
|
||||
|
||||
|
|
@ -104,8 +64,11 @@ Clearly, it couldn't detected those small lines in the lower left part. Therefor
|
|||
|
||||
Voila! It detected the smaller lines.
|
||||
|
||||
Fill
|
||||
^^^^
|
||||
Detect faint lines
|
||||
------------------
|
||||
|
||||
Copy text in spanning cells
|
||||
---------------------------
|
||||
|
||||
In the file used above, you can see that some cells span a lot of rows. ``fill`` copies the same value to all rows/columns of a spanning cell. You can apply fill horizontally, vertically or both. Let us fill the output for the file we used above, vertically.
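The idea of a vertical fill can be sketched on plain lists. This is a simplified illustration, not Camelot's implementation: the function name is hypothetical, and it treats every empty string as part of a spanning cell, whereas the library knows which cells actually span.

```python
def fill_vertically(rows):
    """Copy the value from the row above into every empty cell."""
    filled = []
    previous = None
    for row in rows:
        if previous is None:
            new_row = list(row)
        else:
            # Keep non-empty cells; otherwise reuse the cell directly above.
            new_row = [cell if cell != '' else above
                       for cell, above in zip(row, previous)]
        filled.append(new_row)
        previous = new_row
    return filled


# Rows as they might look before filling: the spanning cell's text only
# appears in the first row it covers.
rows = [
    ['GMC', 'Sacramento', 'Anthem Blue Cross', '164,380'],
    ['', '', 'Health Net', '126,547'],
    ['', 'San Diego', 'Care 1st Health Plan', '71,831'],
]
filled_rows = fill_vertically(rows)
for row in filled_rows:
    print(row)
```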
|
||||
|
||||
|
|
@ -118,209 +81,8 @@ In the file used above, you can see that some cells spanned a lot of rows, `fill
|
|||
>>> tables = manager.extract()
|
||||
>>> print tables['page-1']['table-1']['data']
|
||||
|
||||
.. csv-table::
|
||||
:header: "Plan Type","County","Plan Name","Totals"
|
||||
Shift text in spanning cells
|
||||
----------------------------
|
||||
|
||||
"GMC","Sacramento","Anthem Blue Cross","164,380"
|
||||
"GMC","Sacramento","Health Net","126,547"
|
||||
"GMC","Sacramento","Kaiser Foundation","74,620"
|
||||
"GMC","Sacramento","Molina Healthcare","59,989"
|
||||
"GMC","San Diego","Care 1st Health Plan","71,831"
|
||||
"GMC","San Diego","Community...","264,639"
|
||||
"GMC","San Diego","Health Net","72,404"
|
||||
"GMC","San Diego","Kaiser","50,415"
|
||||
"GMC","San Diego","Molina Healthcare","206,430"
|
||||
"GMC","Total GMC...","","1,091,255"
|
||||
"COHS","Marin","Partnership Health...","36,006"
|
||||
"COHS","Mendocino","Partnership Health...","37,243"
|
||||
"COHS","Napa","Partnership Health...","28,398"
|
||||
"COHS","Solano","Partnership Health...","113,220"
|
||||
"COHS","Sonoma","Partnership Health...","112,271"
|
||||
"COHS","Yolo","Partnership Health...","52,674"
|
||||
"COHS","Del Norte","Partnership Health...","11,242"
|
||||
"COHS","Humboldt","Partnership Health...","49,911"
|
||||
"COHS","Lake","Partnership Health...","29,149"
|
||||
"COHS","Lassen","Partnership Health...","7,360"
|
||||
"COHS","Modoc","Partnership Health...","2,940"
|
||||
"COHS","Shasta","Partnership Health...","61,763"
|
||||
"COHS","Siskiyou","Partnership Health...","16,715"
|
||||
"COHS","Trinity","Partnership Health...","4,542"
|
||||
"COHS","Merced","Central California...","123,907"
|
||||
"COHS","Monterey","Central California...","147,397"
|
||||
"COHS","Santa Cruz","Central California...","69,458"
|
||||
"COHS","Santa Barbara","CenCal","117,609"
|
||||
"COHS","San Luis Obispo","CenCal","55,761"
|
||||
"COHS","Orange","CalOptima","783,079"
|
||||
"COHS","San Mateo","Health Plan...","113,202"
|
||||
"COHS","Ventura","Gold Coast...","202,217"
|
||||
"COHS","Total COHS...","","2,176,064"
|
||||
"Subtotal for...","","","10,132,022"
|
||||
"PCCM","Los Angeles","AIDS Healthcare...","828"
|
||||
"PCCM","San Francisco","Family Mosaic","25"
|
||||
"PCCM","Total PHP...","","853"
|
||||
"All Models...","","","10,132,875"
|
||||
"Source: Data...","","",""
|
||||
|
||||
Invert
|
||||
^^^^^^
|
||||
|
||||
To find line segments, Lattice needs the lines of the pdf file to be in foreground. So, if you encounter a file like this, just set invert to True.
|
||||
|
||||
.. .. _this: insert link for lines_in_background_1.pdf
|
||||
|
||||
::
|
||||
|
||||
>>> from camelot.pdf import Pdf
|
||||
>>> from camelot.lattice import Lattice
|
||||
|
||||
>>> manager = Pdf(Lattice(invert=True), 'lines_in_background_1.pdf')
|
||||
>>> tables = manager.extract()
|
||||
>>> print tables['page-1']['table-1']['data']
|
||||
|
||||
.. csv-table::
|
||||
:header: "State","Date","Halt stations","Halt days","Persons directly reached(in lakh)","Persons trained","Persons counseled","Persons testedfor HIV"
|
||||
|
||||
"Delhi","1.12.2009","8","17","1.29","3,665","2,409","1,000"
|
||||
"Rajasthan","2.12.2009 to 19.12.2009","","","","","",""
|
||||
"Gujarat","20.12.2009 to 3.1.2010","6","13","6.03","3,810","2,317","1,453"
|
||||
"Maharashtra","4.01.2010 to 1.2.2010","13","26","1.27","5,680","9,027","4,153"
|
||||
"Karnataka","2.2.2010 to 22.2.2010","11","19","1.80","5,741","3,658","3,183"
|
||||
"Kerala","23.2.2010 to 11.3.2010","9","17","1.42","3,559","2,173","855"
|
||||
"Total","","47","92","11.81","22,455","19,584","10,644"
|
||||
|
||||
Lattice can also parse pdf files with tables like these that are rotated clockwise/anti-clockwise by 90 degrees.
|
||||
|
||||
.. .. _these: insert link for left_rotated_table.pdf
|
||||
|
||||
You can call Lattice with debug={'line', 'intersection', 'contour', 'table'}, and call `debug_plot()` which will generate an image like the ones on this page, with the help of which you can modify various parameters. See :doc:`API doc <api>` for more information.
|
||||
|
||||
Stream
|
||||
------
|
||||
|
||||
Stream method is the complete opposite of Lattice and works on pdf files which have text placed uniformly apart across rows to simulate a table. It looks for spaces between text to form a table representation.
|
||||
|
||||
Stream builds on top of PDFMiner's functionality of grouping characters on a page into words and sentences. After getting these words, it groups them into rows based on their y-coordinates and tries to guess the number of columns a pdf table might have by calculating the mode of the number of words in each row. Additionally, the user can specify the number of columns or column x-coordinates.
|
||||
|
||||
Let's run it on this pdf.
|
||||
|
||||
::
|
||||
|
||||
>>> from camelot.pdf import Pdf
|
||||
>>> from camelot.stream import Stream
|
||||
|
||||
>>> manager = Pdf(Stream(), 'eu-027.pdf')
|
||||
>>> tables = manager.extract()
|
||||
>>> print tables['page-1']['table-1']['data']
|
||||
|
||||
.. .. _this: insert link for eu-027.pdf
|
||||
|
||||
.. csv-table::
|
||||
|
||||
"C","Appendix C:...","","",""
|
||||
"","Table C1:...","","",""
|
||||
"","This table...","","",""
|
||||
"Variable","Mean","Std. Dev.","Min","Max"
|
||||
"Age","50.8","15.9","21","90"
|
||||
"Men","0.47","0.50","0","1"
|
||||
"East","0.28","0.45","0","1"
|
||||
"Rural","0.15","0.36","0","1"
|
||||
"Married","0.57","0.50","0","1"
|
||||
"Single","0.21","0.40","0","1"
|
||||
"Divorced","0.13","0.33","0","1"
|
||||
"Widowed","0.08","0.26","0","1"
|
||||
"Separated","0.03","0.16","0","1"
|
||||
"Partner","0.65","0.48","0","1"
|
||||
"Employed","0.55","0.50","0","1"
|
||||
"Fulltime","0.34","0.47","0","1"
|
||||
"Parttime","0.20","0.40","0","1"
|
||||
"Unemployed","0.08","0.28","0","1"
|
||||
"Homemaker","0.19","0.40","0","1"
|
||||
"Retired","0.28","0.45","0","1"
|
||||
"Household size","2.43","1.22","1","9"
|
||||
"Households...","0.37","0.48","0","1"
|
||||
"Number of...","1.67","1.38","0","8"
|
||||
"Lower...","0.08","0.27","0","1"
|
||||
"Upper...","0.60","0.49","0","1"
|
||||
"Post...","0.12","0.33","0","1"
|
||||
"First...","0.17","0.38","0","1"
|
||||
"Other...","0.03","0.17","0","1"
|
||||
"Household...","2,127","1,389","22","22,500"
|
||||
"Gross...","187,281","384,198","0","7,720,000"
|
||||
"Gross...","38,855","114,128","0","2,870,000"
|
||||
"","Source:...","","",""
|
||||
"","","","","ECB"
|
||||
"","","","","Working..."
|
||||
"","","","","Febuary..."
|
||||
|
||||
We can also specify the column x-coordinates. We need to call Stream with debug=True and use matplotlib's interface to note down the column x-coordinates we need. Let's try it on this pdf file.
|
||||
|
||||
::
|
||||
|
||||
>>> from camelot.pdf import Pdf
|
||||
>>> from camelot.stream import Stream
|
||||
|
||||
>>> manager = Pdf(Stream(debug=True), 'mexican_towns.pdf')
|
||||
>>> manager.debug_plot()
|
||||
|
||||
.. image:: ../_static/png/columns.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
After getting the x-coordinates, we just need to pass them to Stream, like this.
|
||||
|
||||
::
|
||||
|
||||
>>> from camelot.pdf import Pdf
|
||||
>>> from camelot.stream import Stream
|
||||
|
||||
>>> manager = Pdf(Stream(columns=['28,67,180,230,425,475,700']), 'mexican_towns.pdf')
|
||||
>>> tables = manager.extract()
|
||||
>>> print tables['page-1']['table-1']['data']
|
||||
|
||||
.. csv-table::
|
||||
|
||||
"Clave","","Clave","","Clave",""
|
||||
"","Nombre Entidad","","Nombre Municipio","","Nombre Localidad"
|
||||
"Entidad","","Municipio","","Localidad",""
|
||||
"01","Aguascalientes","001","Aguascalientes","0094","Granja Adelita"
|
||||
"01","Aguascalientes","001","Aguascalientes","0096","Agua Azul"
|
||||
"01","Aguascalientes","001","Aguascalientes","0100","Rancho Alegre"
|
||||
"01","Aguascalientes","001","Aguascalientes","0102","Los Arbolitos [Rancho]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0104","Ardillas de Abajo (Las Ardillas)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0106","Arellano"
|
||||
"01","Aguascalientes","001","Aguascalientes","0112","Bajío los Vázquez"
|
||||
"01","Aguascalientes","001","Aguascalientes","0113","Bajío de Montoro"
|
||||
"01","Aguascalientes","001","Aguascalientes","0114","Residencial San Nicolás [Baños la Cantera]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0120","Buenavista de Peñuelas"
|
||||
"01","Aguascalientes","001","Aguascalientes","0121","Cabecita 3 Marías (Rancho Nuevo)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0125","Cañada Grande de Cotorina"
|
||||
"01","Aguascalientes","001","Aguascalientes","0126","Cañada Honda [Estación]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0127","Los Caños"
|
||||
"01","Aguascalientes","001","Aguascalientes","0128","El Cariñán"
|
||||
"01","Aguascalientes","001","Aguascalientes","0129","El Carmen [Granja]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0135","El Cedazo (Cedazo de San Antonio)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0138","Centro de Arriba (El Taray)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0139","Cieneguilla (La Lumbrera)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0141","Cobos"
|
||||
"01","Aguascalientes","001","Aguascalientes","0144","El Colorado (El Soyatal)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0146","El Conejal"
|
||||
"01","Aguascalientes","001","Aguascalientes","0157","Cotorina de Abajo"
|
||||
"01","Aguascalientes","001","Aguascalientes","0162","Coyotes"
|
||||
"01","Aguascalientes","001","Aguascalientes","0166","La Huerta (La Cruz)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0170","Cuauhtémoc (Las Palomas)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0171","Los Cuervos (Los Ojos de Agua)"
|
||||
"01","Aguascalientes","001","Aguascalientes","0172","San José [Granja]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0176","La Chiripa"
|
||||
"01","Aguascalientes","001","Aguascalientes","0182","Dolores"
|
||||
"01","Aguascalientes","001","Aguascalientes","0183","Los Dolores"
|
||||
"01","Aguascalientes","001","Aguascalientes","0190","El Duraznillo"
|
||||
"01","Aguascalientes","001","Aguascalientes","0191","Los Durón"
|
||||
"01","Aguascalientes","001","Aguascalientes","0197","La Escondida"
|
||||
"01","Aguascalientes","001","Aguascalientes","0201","Brande Vin [Bodegas]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0207","Valle Redondo"
|
||||
"01","Aguascalientes","001","Aguascalientes","0209","La Fortuna"
|
||||
"01","Aguascalientes","001","Aguascalientes","0212","Lomas del Gachupín"
|
||||
"01","Aguascalientes","001","Aguascalientes","0213","El Carmen (Gallinas Güeras) [Rancho]"
|
||||
"01","Aguascalientes","001","Aguascalientes","0216","La Gloria"
|
||||
Tweak PDFMiner margins
|
||||
----------------------
|
||||
|
|
@ -0,0 +1,84 @@
|
|||
.. _how_it_works:
|
||||
|
||||
How It Works
|
||||
============
|
||||
|
||||
This part of the documentation gives a high-level explanation of how Camelot extracts tables from PDF files.
|
||||
|
||||
You can choose between two table parsing methods, *Stream* and *Lattice*. The naming of the parsing methods inside Camelot (i.e. Stream and Lattice) was inspired by `Tabula`_.
|
||||
|
||||
.. _Tabula: https://github.com/tabulapdf/tabula
|
||||
|
||||
.. _stream:
|
||||
|
||||
Stream
|
||||
------
|
||||
|
||||
Stream can be used to parse tables that have whitespace between cells to simulate a table structure. It looks for these spaces between text to form a table representation.
|
||||
|
||||
It is built on top of PDFMiner's functionality of grouping characters on a page into words and sentences, using `margins`_. After getting the words given on a page, it groups them into rows based on their *y* coordinates and tries to guess the number of columns the table might have by calculating the mode of the number of words in each row. This mode is used to calculate *x* ranges for the table's columns. It then adds columns to this column range list based on any words that may lie outside or inside the current column *x* ranges.
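The row grouping and column guess described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not Camelot's code: the function names, the ``(text, x, y)`` word tuples, and the ``y_tolerance`` parameter are all hypothetical, and *y* is assumed to grow upwards as in PDF coordinate space.

```python
from collections import Counter


def group_into_rows(words, y_tolerance=2):
    """Group (text, x, y) words into rows of similar y, top row first."""
    rows = []
    # Highest y first (top of the page), then left to right.
    for text, x, y in sorted(words, key=lambda w: (-w[2], w[1])):
        if rows and abs(rows[-1]['y'] - y) <= y_tolerance:
            rows[-1]['words'].append((text, x))
        else:
            rows.append({'y': y, 'words': [(text, x)]})
    # Order the words inside each row by their x coordinate.
    return [sorted(row['words'], key=lambda w: w[1]) for row in rows]


def guess_column_count(rows):
    """Return the mode of the number of words per row."""
    counts = Counter(len(row) for row in rows)
    return counts.most_common(1)[0][0]


words = [
    ('Variable', 10, 100), ('Mean', 60, 100), ('Max', 110, 100),
    ('Age', 10, 90), ('50.8', 60, 91), ('90', 110, 90),
    ('Source:', 10, 80),
]
rows = group_into_rows(words)
column_count = guess_column_count(rows)
print(column_count, rows)
```

The stray one-word row does not affect the guess, because the mode only looks at the most common row length.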
|
||||
|
||||
.. _margins: https://euske.github.io/pdfminer/#tools
|
||||
|
||||
.. note:: By default, Stream treats the whole PDF page as a table. Automatic table detection for Stream is `in the works`_.
|
||||
|
||||
.. _in the works: https://github.com/socialcopsdev/camelot/issues/102
|
||||
|
||||
.. _lattice:
|
||||
|
||||
Lattice
|
||||
-------
|
||||
|
||||
Lattice is more deterministic in nature, and does not rely on guesses. It can be used to parse tables that have demarcated lines between cells.
|
||||
|
||||
It starts by converting the PDF page to an image using Ghostscript, and then processes the image to get horizontal and vertical line segments by applying a set of morphological transformations (erosion and dilation) using OpenCV.
|
||||
|
||||
Let's see how Lattice processes the `second page of this PDF`_, step-by-step.
|
||||
|
||||
.. _second page of this PDF: https://github.com/socialcopsdev/camelot/blob/docs/tests/files/tabula/icdar2013-dataset/competition-dataset-us/us-030.pdf
|
||||
|
||||
1. Line segments are detected.
|
||||
|
||||
.. image:: ../_static/png/line.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
2. Line intersections are detected, by overlapping the detected line segments and "`and`_"ing their pixel intensities.
|
||||
|
||||
.. _and: https://en.wikipedia.org/wiki/Logical_conjunction
|
||||
|
||||
.. image:: ../_static/png/intersection.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
3. Table boundaries are computed, by overlapping the detected line segments again, this time by "`or`_"ing their pixel intensities.
|
||||
|
||||
.. _or: https://en.wikipedia.org/wiki/Logical_disjunction
|
||||
|
||||
.. image:: ../_static/png/contour.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
4. Since the dimensions of the PDF page and its image differ, the detected table boundaries, line intersections and line segments are scaled and translated to the PDF page's coordinate space, and a representation of the table is created.
|
||||
|
||||
.. image:: ../_static/png/table.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
5. Spanning cells are detected using the line segments and line intersections.
|
||||
|
||||
.. image:: ../_static/png/table_span.png
|
||||
:height: 674
|
||||
:width: 1366
|
||||
:scale: 50%
|
||||
:align: left
|
||||
|
||||
6. Finally, the words found on the page are assigned to the table's cells based on their *x* and *y* coordinates.
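Steps 2 and 3 above can be illustrated on tiny binary masks. This is a toy sketch with nested lists standing in for images, not the OpenCV-based processing Camelot performs; the mask contents and the ``combine`` helper are made up for illustration.

```python
import operator

# 1 marks a pixel that belongs to a detected line segment.
horizontal = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
vertical = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
]


def combine(mask_a, mask_b, op):
    """Apply a bitwise operator pixel by pixel to two equally sized masks."""
    return [[op(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


# Step 2: "and" keeps only pixels present in both masks, i.e. intersections.
intersections = combine(horizontal, vertical, operator.and_)
# Step 3: "or" keeps pixels present in either mask, i.e. the full grid outline.
outline = combine(horizontal, vertical, operator.or_)

joints = [(r, c) for r, row in enumerate(intersections)
          for c, value in enumerate(row) if value]
print(joints)
```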
|
||||
|
|
@ -1,4 +1,4 @@
|
|||
.. _install:
|
||||
.. _install:
|
||||
|
||||
Installation of Camelot
|
||||
=======================
|
||||
|
|
|
|||
|
|
@ -1,4 +1,92 @@
|
|||
.. _quickstart:
|
||||
|
||||
Quickstart
|
||||
==========
|
||||
==========
|
||||
|
||||
In a hurry to extract tables from PDFs? This document gives a good introduction to help you get started with using Camelot.
|
||||
|
||||
Parse a PDF
|
||||
-----------
|
||||
|
||||
Parsing a PDF to extract tables with Camelot is very simple.
|
||||
|
||||
Begin by importing the Camelot module::
|
||||
|
||||
>>> import camelot
|
||||
|
||||
Now, let's try to read a PDF. You can check out the PDF used in this example, `here`_. Since the PDF has a table with clearly demarcated lines, we will use the :ref:`Lattice <lattice>` method here. To do that we will set the ``mesh`` keyword argument to ``True``.
|
||||
|
||||
.. note:: :ref:`Stream <stream>` is used by default.
|
||||
|
||||
.. _here: _static/pdf/foo.pdf
|
||||
|
||||
::
|
||||
|
||||
>>> tables = camelot.read_pdf('foo.pdf', mesh=True)
|
||||
>>> tables
|
||||
<TableList n=1>
|
||||
|
||||
Now, we have a :class:`TableList <camelot.core.TableList>` object called ``tables``, which is a list of :class:`Table <camelot.core.Table>` objects. We can get everything we need from this object.
|
||||
|
||||
We can access each table using its index. We can see that the ``tables`` object has only one table, since ``n=1``. Let's access the table using the index ``0`` and take a look at its ``shape``.
|
||||
|
||||
::
|
||||
|
||||
>>> tables[0]
|
||||
<Table shape=(7, 7)>
|
||||
|
||||
Let's print the parsing report.
|
||||
|
||||
::
|
||||
|
||||
>>> print(tables[0].parsing_report)
|
||||
{
|
||||
'accuracy': 99.02,
|
||||
'whitespace': 12.24,
|
||||
'order': 1,
|
||||
'page': 1
|
||||
}
|
||||
|
||||
Woah! The accuracy is high and the whitespace is low, which means the table was most probably parsed correctly. You can access the table as a pandas DataFrame by using its ``df`` attribute.
|
||||
|
||||
::
|
||||
|
||||
>>> tables[0].df
|
||||
|
||||
.. csv-table::
|
||||
:file: ../_static/csv/foo.csv
|
||||
|
||||
Looks good! You can export the table as a CSV file using its :meth:`to_csv() <camelot.core.Table.to_csv>` method. Alternatively, you can use the :meth:`to_json() <camelot.core.Table.to_json>`, :meth:`to_excel() <camelot.core.Table.to_excel>` or :meth:`to_html() <camelot.core.Table.to_html>` methods to export the table as JSON, Excel and HTML files respectively.
|
||||
|
||||
::
|
||||
|
||||
>>> tables[0].to_csv('foo.csv')
|
||||
|
||||
This will export the table as a CSV file at the path specified. In this case, it is ``foo.csv`` in the current directory.
|
||||
|
||||
You can also export all tables at once, using the ``tables`` object's :meth:`export() <camelot.core.TableList.export>` method.
|
||||
|
||||
::
|
||||
|
||||
>>> tables.export('foo.csv', f='csv')
|
||||
|
||||
This will export all tables as CSV files at the path specified. Alternatively, you can use ``f='json'``, ``f='excel'`` or ``f='html'``.
|
||||
|
||||
.. note:: The :meth:`export() <camelot.core.TableList.export>` method exports files with a ``page-*-table-*`` suffix. In the example above, the single table in the list will be exported to ``foo-page-1-table-1.csv``. If the list contains multiple tables, multiple files will be created. To avoid filling up your path with multiple files, you can use ``compress=True`` to add all exported files to a ZIP archive.
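The naming scheme in the note above can be sketched as follows. This only mirrors the ``page-*-table-*`` suffix described in the note; the helper name and its ``(page, order)`` input are hypothetical and not part of Camelot's API.

```python
import os


def export_paths(path, page_table_pairs):
    """Build one output path per (page, table order) pair."""
    root, ext = os.path.splitext(path)
    return ['{}-page-{}-table-{}{}'.format(root, page, order, ext)
            for page, order in page_table_pairs]


single = export_paths('foo.csv', [(1, 1)])
several = export_paths('foo.csv', [(1, 1), (1, 2), (2, 1)])
print(single)
print(several)
```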
|
||||
|
||||
.. note:: Camelot handles rotated PDF pages automatically. As an exercise, try to extract the table out of `this PDF file`_.
|
||||
|
||||
.. _this PDF file: ../_static/pdf/rotated.pdf
|
||||
|
||||
Specify page numbers
|
||||
--------------------
|
||||
|
||||
By default, Camelot only parses the first page of the PDF. To specify multiple pages, you can use the ``pages`` keyword argument::
|
||||
|
||||
>>> camelot.read_pdf('your.pdf', pages='1,2,3')
|
||||
|
||||
The ``pages`` keyword argument accepts a comma-separated string of page numbers. You can also specify page ranges, such as ``pages='1,4-10,20-30'`` or ``pages='1,4-10,20-end'``.
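How such a page string could be expanded into page numbers is sketched below. This is an illustration of the format only, not Camelot's own parser: the function name and the ``page_count`` parameter (needed to resolve ``end``) are assumptions.

```python
def expand_pages(pages, page_count):
    """Expand a string like '1,4-6,9-end' into a sorted list of page numbers."""
    result = []
    for part in pages.split(','):
        part = part.strip()
        if '-' in part:
            start, end = part.split('-')
            # 'end' stands for the last page of the document.
            last = page_count if end == 'end' else int(end)
            result.extend(range(int(start), last + 1))
        else:
            result.append(int(part))
    return sorted(set(result))


selected = expand_pages('1,4-6,9-end', page_count=10)
print(selected)
```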
|
||||
|
||||
------------------------
|
||||
|
||||
Ready for more? Check out the :ref:`advanced <advanced>` section.
|
||||