From affd4a8f86f13016f72201a2ae38ab84f0ae7b28 Mon Sep 17 00:00:00 2001
From: Vinayak Mehta
Date: Wed, 12 Sep 2018 21:51:39 +0530
Subject: [PATCH] Update README and intro.rst

---
 README.md           |  8 ++++++++
 docs/index.rst      |  9 ++++++++-
 docs/user/intro.rst | 36 ++++++++++++++++++++++++++----------
 3 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 52bd87c..0b606e4 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,14 @@
 
 There's a [command-line interface]() too!
 
+## Why Camelot?
+
+- **You are in control**: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (Since everything in the real world, including PDF table extraction, is fuzzy.)
+- **Metrics**: *Bad* tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
+- Each table is a pandas DataFrame, which enables seamless integration into data analysis workflows.
+- Export to multiple formats, including json, excel and html.
+- Simple and Elegant API, written in Python!
+
 ## Installation
 
 After [installing dependencies](), you can simply use pip:
diff --git a/docs/index.rst b/docs/index.rst
index f577085..224340d 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -51,6 +51,14 @@ Usage
 
 There's a :ref:`command-line interface ` too!
 
+Why Camelot?
+------------
+- **You are in control**: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (Since everything in the real world, including PDF table extraction, is fuzzy.)
+- **Metrics**: *Bad* tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
+- Each table is a pandas DataFrame, which enables seamless integration into data analysis workflows.
+- Export to multiple formats, including json, excel and html.
+- Simple and Elegant API, written in Python!
+
 The User Guide
 --------------
 
@@ -60,7 +68,6 @@ The User Guide
    user/intro
    user/install
    user/quickstart
-   user/cli
 
 The API Documentation / Guide
 -----------------------------
diff --git a/docs/user/intro.rst b/docs/user/intro.rst
index ed87f78..ae117c5 100644
--- a/docs/user/intro.rst
+++ b/docs/user/intro.rst
@@ -1,19 +1,35 @@
-PDF started as `The Camelot Project`_ when people wanted a cross-platform way for sending and viewing documents. A pdf file contains characters placed at specific x,y-coordinates. Spaces are simulated by placing characters relatively far apart.
+.. _intro:
 
-Camelot uses two methods to parse tables from PDFs, :doc:`lattice ` and :doc:`stream `. The names were taken from Tabula but the implementation is somewhat different, though it follows the same philosophy. Lattice looks for lines between text elements while stream looks for whitespace between text elements.
+Introduction
+============
+
+The Camelot Project
+-------------------
+
+The Portable Document Format (PDF) was born out of `The Camelot Project`_ when a need was felt for "a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks". The goal was to make these documents viewable on any display and printable on any modern printer. The invention of the `PostScript`_ page description language, which enabled the creation of fixed-layout flat documents (with text, fonts, graphics, images encapsulated), solved the problem.
+
+At a very high level, PostScript defines instructions such as "place this character at this x,y coordinate on a plane". Spaces can be *simulated* by placing characters relatively far apart. Tables can be *simulated* in the same way by placing characters (and words) in two-dimensional grids. A PDF viewer just takes these instructions and draws everything for us to view. There is no table data structure which can be extracted and used for analysis; it's just characters on a plane!
+
+Sadly, a lot of open data is given out as tables which are trapped inside PDF files.
 
 .. _The Camelot Project: http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf
+.. _PostScript: http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf
 
-Why another pdf table parsing library?
-======================================
+Why another PDF Table Parsing library?
+--------------------------------------
 
-We tried a lot of tools available online to parse tables from pdf files. `PDFTables`_, `SolidConverter`_ are closed source, commercial products and a free trial doesn't last forever. `Tabula`_, which is open source, isn't very scalable. We found nothing that gave us complete control over the parsing process. In most cases, we didn't get the correct output and had to resort to writing custom scripts for each type of pdf.
+There are both open (`Tabula`_) and closed-source (`PDFTables`_, `Smallpdf`_) tools that are widely used to extract tables from PDF files. They either give you a nice output or fail miserably; there is no in-between. This does not help most users, since everything in the real world, including PDF table extraction, is fuzzy, which leads them to create ad hoc extraction scripts for each type of PDF they want to parse.
+
+Camelot was created with the goal of offering users complete control over table extraction. If users are not able to get the desired output with the default configuration, they should be able to tweak the parameters to get the data out!
+
+You can check out Camelot's `comparison with other libraries and tools`_.
 
-.. _PDFTables: https://pdftables.com/
-.. _SolidConverter: http://www.soliddocuments.com/pdf/-to-word-converter/304/1
 .. _Tabula: http://tabula.technology/
+.. _PDFTables: https://pdftables.com/
+.. _Smallpdf: https://smallpdf.com
+.. _comparison with other libraries and tools: https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Parsing-libraries-and-tools
 
-License
-=======
+Camelot License
+---------------
 
-MIT License
\ No newline at end of file
+ .. include:: ../../LICENSE
\ No newline at end of file