Add doc fixes

pull/2/head
Vinayak Mehta 2018-09-24 16:26:35 +05:30
parent 36b1dee5d9
commit 3600025a22
5 changed files with 27 additions and 28 deletions


@@ -3,8 +3,8 @@
 You can adapt this file completely to your liking, but it should at least
 contain the root `toctree` directive.
-Camelot: PDF Table Parsing for Humans
-=====================================
+Camelot: PDF Table Extraction for Humans
+========================================
 Release v\ |version|. (:ref:`Installation <install>`)
@@ -63,10 +63,10 @@ Why Camelot?
 - **Export** to multiple formats, including json, excel and html.
 - Simple and Elegant API, written in **Python**!
-See `comparison with other PDF parsing libraries and tools`_.
+See `comparison with other PDF table extraction libraries and tools`_.
 .. _ETL and data analysis workflows: https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873
-.. _comparison with other PDF parsing libraries and tools: https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Parsing-libraries-and-tools
+.. _comparison with other PDF table extraction libraries and tools: https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools
 The User Guide
 --------------


@@ -77,7 +77,7 @@ This, as we shall later see, is very helpful with :ref:`Stream <stream>`, for no
 table
 ^^^^^
-Let's plot the table (to see if it was detected correctly or not). This geometry type, along with contour, line and joint is useful for debugging and improving the parsing output, in case the table wasn't detected correctly. More on that later.
+Let's plot the table (to see if it was detected correctly or not). This geometry type, along with contour, line and joint is useful for debugging and improving the extraction output, in case the table wasn't detected correctly. More on that later.
 ::
@@ -220,7 +220,7 @@ In this case, the text that `other tools`_ return, will be ``24.912``. This is h
 You can solve this by passing ``flag_size=True``, which will enclose the superscripts and subscripts with ``<s></s>``, based on font size, as shown below.
-.. _other tools: https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Parsing-libraries-and-tools
+.. _other tools: https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools
 ::


@@ -9,27 +9,26 @@ You can print the help for the interface, by typing ``camelot --help`` in your f
 ::
-$ camelot --help
 Usage: camelot [OPTIONS] COMMAND [ARGS]...
+  Camelot: PDF Table Extraction for Humans
 Options:
   --version                       Show the version and exit.
-  -p, --pages TEXT                Comma-separated page numbers to parse.
-                                  Example: 1,3,4 or 1,4-end
+  -p, --pages TEXT                Comma-separated page numbers. Example: 1,3,4
+                                  or 1,4-end.
   -o, --output TEXT               Output file path.
   -f, --format [csv|json|excel|html]
                                   Output file format.
-  -z, --zip                       Whether or not to create a ZIP archive.
-  -split, --split_text            Whether or not to split text if it spans
-                                  across multiple cells.
-  -flag, --flag_size              (inactive) Whether or not to flag text which
-                                  has uncommon size. (Useful to detect
-                                  super/subscripts)
+  -z, --zip                       Create ZIP archive.
+  -split, --split_text            Split text that spans across multiple cells.
+  -flag, --flag_size              Flag text based on font size. Useful to
+                                  detect super/subscripts.
   -M, --margins <FLOAT FLOAT FLOAT>...
-                                  char_margin, line_margin, word_margin for
-                                  PDFMiner.
+                                  PDFMiner char_margin, line_margin and
+                                  word_margin.
   --help                          Show this message and exit.
 Commands:
-  lattice  Use lines between text to parse table.
-  stream   Use spaces between text to parse table.
+  lattice  Use lines between text to parse the table.
+  stream   Use spaces between text to parse the table.


@@ -14,8 +14,8 @@ Sadly, a lot of open data is given out as tables which are trapped inside PDF fi
 .. _PostScript: http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf
-Why another PDF Table Parsing library?
---------------------------------------
+Why another PDF Table Extraction library?
+-----------------------------------------
 There are both open (`Tabula`_, `pdf-table-extract`_) and closed-source (`smallpdf`_, `PDFTables`_) tools that are widely used, to extract tables from PDF files. They either give a nice output, or fail miserably. There is no in-between. This is not helpful, since everything in the real world, including PDF table extraction, is fuzzy, leading to creation of adhoc table extraction scripts for each different type of PDF that the user wants to parse.
@@ -27,7 +27,7 @@ Here is a `comparison`_ of Camelot's output with outputs from other open-source
 .. _pdf-table-extract: https://github.com/ashima/pdf-table-extract
 .. _PDFTables: https://pdftables.com/
 .. _Smallpdf: https://smallpdf.com
-.. _comparison: https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Parsing-libraries-and-tools
+.. _comparison: https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools
 What's in a name?
 -----------------


@@ -5,10 +5,10 @@ Quickstart
 In a hurry to extract tables from PDFs? This document gives a good introduction to help you get started with using Camelot.
-Parse a PDF
------------
+Read the PDF
+------------
-Parsing a PDF to extract tables with Camelot is very simple.
+Reading a PDF to extract tables with Camelot is very simple.
 Begin by importing the Camelot module::
@@ -47,7 +47,7 @@ Let's print the parsing report.
 'page': 1
 }
-Woah! The accuracy is top-notch and whitespace is less, that means the table was parsed correctly (most probably). You can access the table as a pandas DataFrame by using the :class:`table <camelot.core.Table>` object's ``df`` property.
+Woah! The accuracy is top-notch and whitespace is less, that means the table was extracted correctly (most probably). You can access the table as a pandas DataFrame by using the :class:`table <camelot.core.Table>` object's ``df`` property.
 ::
@@ -81,7 +81,7 @@ This will export all tables as CSV files at the path specified. Alternatively, y
 Specify page numbers
 --------------------
-By default, Camelot only parses the first page of the PDF. To specify multiple pages, you can use the ``pages`` keyword argument::
+By default, Camelot only uses the first page of the PDF to extract tables. To specify multiple pages, you can use the ``pages`` keyword argument::
 >>> camelot.read_pdf('your.pdf', pages='1,2,3')
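
Note on the page specification: the help text and the ``pages`` keyword above accept strings such as ``1,3,4`` or ``1,4-end``. The sketch below shows one way such a string could be expanded into page numbers. It is an illustration only, not Camelot's own code: ``expand_pages`` and its ``last_page`` parameter are hypothetical names, and it assumes ``end`` stands for the last page of the PDF.

```python
def expand_pages(spec, last_page):
    """Expand a page spec such as '1,4-end' into a sorted list of page numbers.

    Hypothetical helper for illustration; assumes 'end' means the last page.
    """
    pages = set()
    for part in spec.split(','):
        part = part.strip()
        if '-' in part:
            # A range such as '4-end' or '2-5' (both ends inclusive).
            start, end = part.split('-', 1)
            stop = last_page if end.strip() == 'end' else int(end)
            pages.update(range(int(start), stop + 1))
        else:
            # A single page number such as '3'.
            pages.add(int(part))
    return sorted(pages)


print(expand_pages('1,4-end', last_page=6))
```

Using a set means overlapping entries such as ``1,1-3`` collapse to each page once.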