diff --git a/docs/user/advanced.rst b/docs/user/advanced.rst
index 9f3f127..204e9c0 100644
--- a/docs/user/advanced.rst
+++ b/docs/user/advanced.rst
@@ -42,9 +42,9 @@ The following geometries are available for plotting. You can pass them to the :m
 
 .. note:: The last three geometries can only be used with :ref:`Lattice `, i.e. when ``mesh=True``.
 
-Let's generate a plot for each geometry using this `PDF <_static/pdf/foo.pdf>`__ as an example.
+Let's generate a plot for each geometry using this `PDF <../_static/pdf/foo.pdf>`__ as an example.
 
-.. warning:: By default, :meth:`plot_geometry() ` will use the first page of the PDF. Since this method is useful only for debugging, it makes sense to use it for one page at a time. If you pass a page range to this method, multiple plots will be generated one by one, each popping up as you close the previous one. To abort, you can use ``Ctrl + C``.
+.. warning:: By default, :meth:`plot_geometry() ` will use the first page of the PDF. Since this method is useful only for debugging, it makes sense to use it for one page at a time. If you pass a page range to this method, multiple plots will be generated one by one, a new one popping up as you close the previous one. To abort, you can use ``Ctrl + C``.
 
 .. _geometry_text:
 
@@ -73,7 +73,7 @@ This, as we shall later see, is very helpful with :ref:`Stream `, for no
 table
 ^^^^^
 
-Passing ``geometry_type=text`` creates a plot for tables detected on a PDF page. This geometry, along with contour, line and joint is useful for debugging and improving the parsing output, as we shall see later.
+Passing ``geometry_type=table`` creates a plot for tables detected on a PDF page. This geometry, along with contour, line and joint is useful for debugging and improving the parsing output, as we shall see later.
 
 ::
 
@@ -91,7 +91,7 @@ Passing ``geometry_type=text`` creates a plot for tables detected on a PDF page.
 contour
 ^^^^^^^
 
-Passing ``geometry_type=text`` creates a plot for table boundaries detected on a PDF page.
+Passing ``geometry_type=contour`` creates a plot for table boundaries detected on a PDF page.
 
 ::
 
@@ -109,7 +109,7 @@ Passing ``geometry_type=text`` creates a plot for table boundaries detected on a
 line
 ^^^^
 
-Passing ``geometry_type=text`` creates a plot for lines detected on a PDF page.
+Passing ``geometry_type=line`` creates a plot for lines detected on a PDF page.
 
 ::
 
@@ -127,7 +127,7 @@ Passing ``geometry_type=text`` creates a plot for lines detected on a PDF page.
 joint
 ^^^^^
 
-Passing ``geometry_type=text`` creates a plot for line intersections detected on a PDF page.
+Passing ``geometry_type=joint`` creates a plot for line intersections detected on a PDF page.
 
 ::
 
@@ -143,9 +143,9 @@ Passing ``geometry_type=text`` creates a plot for line intersections detected on
 Specify table areas
 -------------------
 
-Since :ref:`Stream ` treats the whole page as a table, `for now`_, it's useful to specify table boundaries in cases such as this `PDF <_static/pdf/table_areas.pdf>`__. You can `plot the text `_ on this page and note the left-top and right-bottom coordinates of the table.
+Since :ref:`Stream ` treats the whole page as a table, `for now`_, it's useful to specify table boundaries in cases such as this `PDF <../_static/pdf/table_areas.pdf>`__. You can :ref:`plot the text ` on this page and note the left-top and right-bottom coordinates of the table.
 
-Table areas that you want Camelot to analyze can be passed as a list of comma-separated strings to :meth:`read_pdf() `.
+Table areas that you want Camelot to analyze can be passed as a list of comma-separated strings to :meth:`read_pdf() `, using the ``table_areas`` keyword argument.
 
 .. _for now: https://github.com/socialcopsdev/camelot/issues/102
 
@@ -160,15 +160,15 @@ Table areas that you want Camelot to analyze can be passed as a list of comma-se
 Specify column separators
 -------------------------
 
-In cases like this `PDF <_static/pdf/column_separators.pdf>`__, where the text is very close to each other, it is possible that Camelot may guess the column separators' coordinates incorrectly. To correct this, you can explicitly specify the *x* coordinate for each column separator by `plotting the text `_ on the page.
+In cases like this `PDF <../_static/pdf/column_separators.pdf>`__, where the text is very close to each other, it is possible that Camelot may guess the column separators' coordinates incorrectly. To correct this, you can explicitly specify the *x* coordinate for each column separator by :ref:`plotting the text ` on the page.
 
-You can pass the column separators as a list of comma-separated strings to :meth:`read_pdf() `.
+You can pass the column separators as a list of comma-separated strings to :meth:`read_pdf() `, using the ``columns`` keyword argument.
 
-In case you passed a single column separators string list, and no table area is specified, the separators will be applied to the whole page. When a list of table areas is specified and there is a need to specify column separators as well, the length of both lists should be equal, each table area will be mapped to each column separators' string using their indices.
+In case you passed a single column separators string list, and no table area is specified, the separators will be applied to the whole page. When a list of table areas is specified and there is a need to specify column separators as well, **the length of both lists should be equal**. Each table area will be mapped to each column separators' string using their indices.
 
 If you have specified two table areas, ``table_areas=['12,23,43,54', '20,33,55,67']``, and only want to specify column separators for the first table (since you can see by looking at the table that Camelot will be able to get it perfectly!), you can pass an empty string for the second table in the column separators' list, like this, ``columns=['10,120,200,400', '']``.
 
-Let's get back to the *x* coordinates we got from `plotting text `_ that exists on this `PDF <_static/pdf/column_separators.pdf>`__, and get the table out!
+Let's get back to the *x* coordinates we got from :ref:`plotting text ` that exists on this `PDF <../_static/pdf/column_separators.pdf>`__, and get the table out!
 
 ::
 
@@ -204,13 +204,13 @@ To deal with cases like the output from the previous section, you can pass ``spl
 Flag superscripts and subscripts
 --------------------------------
 
-There might be cases where you want to differentiate between the text and superscripts and subscripts, like this `PDF <_static/pdf/superscript.pdf>`_.
+There might be cases where you want to differentiate between the text and superscripts and subscripts, like this `PDF <../_static/pdf/superscript.pdf>`_.
 
 .. figure:: ../_static/png/superscript.png
     :alt: A PDF with superscripts
     :align: left
 
-In this case, the text that `other tools`_ return, will be ``24.912``. This is harmless as long as there is that decimal point involved. When it isn't, you'll be left wondering why the results of your data analysis were 10x bigger!
+In this case, the text that `other tools`_ return, will be ``24.912``. This is harmless as long as there is that decimal point involved. When it isn't there, you'll be left wondering why the results of your data analysis were 10x bigger!
 
 You can solve this by passing ``flag_size=True``, which will enclose the superscripts and subscripts with ````, based on font size, as shown below.
 
@@ -270,12 +270,14 @@ As you can already guess, the larger the ``line_size_scaling``, the smaller the
 
 .. warning:: Making ``line_size_scaling`` very large (>150) will lead to text getting detected as lines.
 
-Here's one `PDF <_static/pdf/short_lines.pdf>`__ where small lines separating the the headers don't get detected with the default value of 15. Let's `plot the table `_ for this PDF.
+Here's one `PDF <../_static/pdf/short_lines.pdf>`__ where small lines separating the the headers don't get detected with the default value of 15.
 
 .. figure:: ../_static/png/short_lines.png
     :alt: A PDF table with short lines
     :align: left
 
+Let's :ref:`plot the table ` for this PDF.
+
 ::
 
     >>> camelot.plot_geometry('short_lines.pdf', mesh=True, geometry_type='table')
@@ -294,7 +296,7 @@ Clearly, the smaller lines separating the headers, couldn't be detected. Let's t
     :alt: An improved plot of the PDF table with short lines
     :align: left
 
-Voila! Camelot can now see those lines. Let's using this value in :meth:`read_pdf() ` and get our table.
+Voila! Camelot can now see those lines. Let's use this value in :meth:`read_pdf() ` and get our table.
 
 ::
 
@@ -318,11 +320,11 @@ Voila! Camelot can now see those lines. Let's using this value in :meth:`read_pd
 Shift text in spanning cells
 ----------------------------
 
-By default, the :ref:`Lattice ` method shifts text in spanning cells, first to the left and then to the top, as you can observe in the output table above. However, this behavior can be changed using the ``shift_text`` keyword argument. Think of it as setting the *gravity* for a table, it decides where the text moves and finally comes to rest.
+By default, the :ref:`Lattice ` method shifts text in spanning cells, first to the left and then to the top, as you can observe in the output table above. However, this behavior can be changed using the ``shift_text`` keyword argument. Think of it as setting the *gravity* for a table, it decides the direction in which the text will move and finally come to rest.
 
 ``shift_text`` expects a list with one or more characters from the following set: ``('', l', 'r', 't', 'b')``, which are then applied *in order*. The default, as we discussed above, is ``['l', 't']``.
 
-We'll use the `PDF <_static/pdf/short_lines.pdf>`__ from the previous example. Let's pass ``shift_text=['']``, which basically means that the text will experience weightlessness! (It will remain in place.)
+We'll use the `PDF <../_static/pdf/short_lines.pdf>`__ from the previous example. Let's pass ``shift_text=['']``, which basically means that the text will experience weightlessness! (It will remain in place.)
 
 .. figure:: ../_static/png/short_lines.png
     :alt: A PDF table with short lines
@@ -347,7 +349,7 @@ We'll use the `PDF <_static/pdf/short_lines.pdf>`__ from the previous example. L
     "Knowledge &Practices on HTN &","2400","Men (≥ 18 yrs)","-","-","-","1728"
     "DM","2400","Women (≥ 18 yrs)","-","-","-","1728"
 
-No surprises there, it did remain in place. Let's pass ``shift_text=['r', 'b']``, to set the *gravity* to right-bottom, and move the text in that direction.
+No surprises there, it did remain in place (observe the strings "2400" and "All the available individuals"). Let's pass ``shift_text=['r', 'b']``, to set the *gravity* to right-bottom, and move the text in that direction.
 
 ::
 
@@ -371,11 +373,11 @@ No surprises there, it did remain in place. Let's pass ``shift_text=['r', 'b']``
 Copy text in spanning cells
 ---------------------------
 
-You can copy text in spanning cells when using :ref:`Lattice `, in either horizontal or vertical direction or both. This behavior is disabled by default.
+You can copy text in spanning cells when using :ref:`Lattice `, in either horizontal or vertical direction, or both. This behavior is disabled by default.
 
 ``copy_text`` expects a list with one or more characters from the following set: ``('v', 'h')``, which are then applied *in order*.
 
-Let's try it out on this `PDF <_static/pdf/copy_text.pdf>`__. First, let's check out the output table to see if we need to use any other configuration parameters.
+Let's try it out on this `PDF <../_static/pdf/copy_text.pdf>`__. First, let's check out the output table to see if we need to use any other configuration parameters.
 
 ::
 
@@ -392,7 +394,7 @@ Let's try it out on this `PDF <_static/pdf/copy_text.pdf>`__. First, let's check
     "","","Birbhum","v. Food Poisoning","199","0","31/12/13","31/12/13","Under control","..."
     "","","Howrah","vi. Viral Hepatitis A &E","85","0","26/12/13","27/12/13","Under surveillance","..."
 
-We don't need anything else. Now, let's pass ``copy_text=['v']`` to copy text in the vertical direction. This can save you some time by not having to do this in your cleaning script!
+We don't need anything else. Now, let's pass ``copy_text=['v']`` to copy text in the vertical direction. This can save you some time by not having to add this step in your cleaning script!
 
 ::
 
diff --git a/docs/user/how-it-works.rst b/docs/user/how-it-works.rst
index 5c9362b..5df9fd4 100644
--- a/docs/user/how-it-works.rst
+++ b/docs/user/how-it-works.rst
@@ -35,7 +35,7 @@ It starts by converting the PDF page to an image using ghostscript and then proc
 
 Let's see how Lattice processes the `second page of this PDF`_, step-by-step.
 
-.. _second page of this PDF: _static/pdf/us-030.pdf
+.. _second page of this PDF: ../_static/pdf/us-030.pdf
 
 1. Line segments are detected.
 
diff --git a/docs/user/install.rst b/docs/user/install.rst
index 7552afc..823e92a 100644
--- a/docs/user/install.rst
+++ b/docs/user/install.rst
@@ -5,7 +5,7 @@ Installation of Camelot
 
 This part of the documentation covers the installation of Camelot. First, you'll need to install the dependencies, which include `tk`_ and `ghostscript`_.
 
-.. _tk: https://wiki.tcl.tk/3743
+.. _tk: https://packages.ubuntu.com/trusty/python-tk
 .. _ghostscript: https://www.ghostscript.com/
 
 These can be installed using your system's package manager. If you use Ubuntu, run the following:
diff --git a/docs/user/intro.rst b/docs/user/intro.rst
index 902b1a2..5321648 100644
--- a/docs/user/intro.rst
+++ b/docs/user/intro.rst
@@ -17,9 +17,9 @@ Sadly, a lot of open data is given out as tables which are trapped inside PDF fi
 Why another PDF Table Parsing library?
 --------------------------------------
 
-There are both open (`Tabula`_) and closed-source (`PDFTables`_, `smallpdf`_) tools that are used widely to extract tables from PDF files. They either give nice output, or fail miserably. There is no in-between. This does not help most users, since everything in the real world, including PDF table extraction, is fuzzy. Which leads to creation adhoc table extraction scripts for each different type of PDF that the user wants to parse.
+There are both open (`Tabula`_) and closed-source (`PDFTables`_, `smallpdf`_) tools that are used widely to extract tables from PDF files. They either give a nice output, or fail miserably. There is no in-between. This does not help most users, since everything in the real world, including PDF table extraction, is fuzzy. Which leads to creation of adhoc table extraction scripts for each different type of PDF that the user wants to parse.
 
-Camelot was created with the goal of offering its users complete control over table extraction. If the users are not able to the desired output with the default configuration, they should be able to tweak the parameters and get the tables out!
+Camelot was created with the goal of offering its users complete control over table extraction. If the users are not able to get the desired output with the default configuration, they should be able to tweak the parameters and get the tables out!
 
 Here is a `comparison`_ of Camelot's output with outputs from other PDF parsing libraries and tools.
 
diff --git a/docs/user/quickstart.rst b/docs/user/quickstart.rst
index c94a28c..c5e2e3a 100644
--- a/docs/user/quickstart.rst
+++ b/docs/user/quickstart.rst
@@ -18,7 +18,7 @@ Now, let's try to read a PDF. You can check out the PDF used in this example, `h
 
 .. note:: :ref:`Stream ` is used by default.
 
-.. _here: _static/pdf/foo.pdf
+.. _here: ../_static/pdf/foo.pdf
 
 ::
 
@@ -28,7 +28,7 @@ Now, let's try to read a PDF. You can check out the PDF used in this example, `h
 
 Now, we have a :class:`TableList ` object called ``tables``, which is a list of :class:`Table ` objects. We can get everything we need from this object.
 
-We can access each table using its index. We can see that the ``tables`` object has only one table, since ``n=1``. Let's access the table using the index ``0`` and take a look at its ``shape``.
+We can access each table using its index. From the code snippet above, we can see that the ``tables`` object has only one table, since ``n=1``. Let's access the table using the index ``0`` and take a look at its ``shape``.
 
 ::
 
@@ -39,7 +39,7 @@ Let's print the parsing report.
 
 ::
 
-    >>> print(tables[0].parsing_report)
+    >>> print tables[0].parsing_report
     {
         'accuracy': 99.02,
         'whitespace': 12.24,
@@ -47,7 +47,7 @@ Let's print the parsing report.
         'page': 1
     }
 
-Woah! The accuracy is top-notch and whitespace is less, that means the table was parsed correctly (most probably). You can access the table as a pandas DataFrame by using its ``df``.
+Woah! The accuracy is top-notch and whitespace is less, that means the table was parsed correctly (most probably). You can access the table as a pandas DataFrame by using the :class:`table object's` ``df`` property.
 
 ::
 
@@ -72,7 +72,7 @@ You can also export all tables at once, using the ``tables`` object's :meth:`exp
 
 This will export all tables as CSV files at the path specified. Alternatively, you can use ``f='json'``, ``f='excel'`` or ``f='html'``.
 
-.. note:: The :meth:`export() ` method exports files with a ``page-*-table-*`` suffix. In the example above, the single table in the list will be exported to ``foo-page-1-table-1.csv``. If the list contains multiple tables, multiple files will be created. To avoid filling up your path with multiple files, you can use ``compress=True`` to add all exported files to a ZIP archive.
+.. note:: The :meth:`export() ` method exports files with a ``page-*-table-*`` suffix. In the example above, the single table in the list will be exported to ``foo-page-1-table-1.csv``. If the list contains multiple tables, multiple files will be created. To avoid filling up your path with multiple files, you can use ``compress=True``, which will create a single ZIP archive at your path with all the exported files.
 
 .. note:: Camelot handles rotated PDF pages automatically. As an exercise, try to extract the table out of `this PDF file`_.