
.. _quickstart:

Quickstart
==========

In a hurry to extract tables from PDFs? This document is a short introduction to get you started with Camelot.

Parse a PDF
-----------

Parsing a PDF to extract tables with Camelot is very simple.

Begin by importing the Camelot module::

    >>> import camelot

Now, let's try to read a PDF. You can check out the PDF used in this example `here`_. Since the table in this PDF has clearly demarcated lines, we will use the :ref:`Lattice <lattice>` method. To do that, we set the ``mesh`` keyword argument to ``True``.

.. note:: :ref:`Stream <stream>` is used by default.

.. _here: ../_static/pdf/foo.pdf

::

    >>> tables = camelot.read_pdf('foo.pdf', mesh=True)
    >>> tables
    <TableList n=1>

Now, we have a :class:`TableList <camelot.core.TableList>` object called ``tables``, which is a list of :class:`Table <camelot.core.Table>` objects. We can get everything we need from this object.

We can access each table using its index. We can see that the ``tables`` object has only one table, since ``n=1``. Let's access the table using the index ``0`` and take a look at its ``shape``.

::

    >>> tables[0]
    <Table shape=(7, 7)>

Let's print the parsing report.

::

    >>> print(tables[0].parsing_report)
    {
        'accuracy': 99.02,
        'whitespace': 12.24,
        'order': 1,
        'page': 1
    }
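
The report printed above can also be checked in code before a table is trusted. The following is a minimal sketch, assuming the report behaves like a dictionary with the keys shown above. The helper ``looks_reliable`` and its threshold values are hypothetical and not part of Camelot.

```python
# Hypothetical helper, not part of Camelot: decide whether a parsing
# report looks trustworthy. The threshold values are illustrative only.
def looks_reliable(report, min_accuracy=90.0, max_whitespace=30.0):
    """Return True when accuracy is high enough and whitespace is low enough."""
    return (
        report["accuracy"] >= min_accuracy
        and report["whitespace"] <= max_whitespace
    )


# Stand-in for tables[0].parsing_report, using the values printed above.
report = {"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}
print(looks_reliable(report))  # prints True
```

Sensible thresholds depend on your PDFs, so treat the defaults here as placeholders.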

Woah! The accuracy is high and the whitespace is low, which means the table was most probably parsed correctly. You can access the table as a pandas DataFrame through its ``df`` attribute.

::

    >>> tables[0].df

.. csv-table::
  :file: ../_static/csv/foo.csv

Looks good! You can export the table as a CSV file using its :meth:`to_csv() <camelot.core.Table.to_csv>` method. Alternatively, you can use the :meth:`to_json() <camelot.core.Table.to_json>`, :meth:`to_excel() <camelot.core.Table.to_excel>` or :meth:`to_html() <camelot.core.Table.to_html>` methods to export the table as a JSON, Excel or HTML file respectively.

::

    >>> tables[0].to_csv('foo.csv')

This will export the table as a CSV file at the path specified. In this case, it is ``foo.csv`` in the current directory.

You can also export all tables at once, using the ``tables`` object's :meth:`export() <camelot.core.TableList.export>` method.

::

    >>> tables.export('foo.csv', f='csv')

This will export all tables as CSV files at the path specified. Alternatively, you can use ``f='json'``, ``f='excel'`` or ``f='html'``.

.. note:: The :meth:`export() <camelot.core.TableList.export>` method exports files with a ``page-*-table-*`` suffix. In the example above, the single table in the list will be exported to ``foo-page-1-table-1.csv``. If the list contains multiple tables, multiple files will be created. To avoid cluttering the output directory with many files, you can use ``compress=True`` to add all exported files to a ZIP archive.
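
To make the naming scheme concrete, here is a minimal sketch of how an exported file name could be derived from the path you pass in. The helper ``export_filename`` is hypothetical and only mirrors the ``page-*-table-*`` pattern described above. It is not Camelot's own code.

```python
import os


# Hypothetical helper, not part of Camelot: build the per-table file name
# by inserting a "-page-<page>-table-<order>" suffix before the extension.
def export_filename(path, page, order):
    root, ext = os.path.splitext(path)  # "foo.csv" -> ("foo", ".csv")
    return "{}-page-{}-table-{}{}".format(root, page, order, ext)


print(export_filename("foo.csv", 1, 1))  # prints foo-page-1-table-1.csv
```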

.. note:: Camelot handles rotated PDF pages automatically. As an exercise, try to extract the table out of `this PDF file`_.

.. _this PDF file: ../_static/pdf/rotated.pdf

Specify page numbers
--------------------

By default, Camelot only parses the first page of the PDF. To specify multiple pages, you can use the ``pages`` keyword argument::

    >>> camelot.read_pdf('your.pdf', pages='1,2,3')

The ``pages`` keyword argument accepts a comma-separated string of page numbers. You can also specify page ranges, such as ``pages='1,4-10,20-30'`` or ``pages='1,4-10,20-end'``.
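
If the format of the ``pages`` string is unclear, the sketch below shows one way such a string could be expanded into individual page numbers. The function ``expand_pages`` and its ``page_count`` parameter are hypothetical and only illustrate the syntax. Camelot's own parsing may differ.

```python
# Hypothetical helper, not part of Camelot: expand a page specification
# such as "1,4-6,9-end" into a sorted list of page numbers.
def expand_pages(spec, page_count):
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            first = int(start)
            # "end" stands for the last page of the document.
            last = page_count if end == "end" else int(end)
            pages.update(range(first, last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


# For a 10-page document:
print(expand_pages("1,4-6,9-end", 10))  # prints [1, 4, 5, 6, 9, 10]
```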

------------------------

Ready for more? Check out the :ref:`advanced <advanced>` section.