Merge pull request #2 from camelot-dev/master

Update fork
pull/216/head
anakin87 2020-12-08 18:37:55 +01:00 committed by GitHub
commit 644e17edec
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
18 changed files with 358 additions and 190 deletions

@ -0,0 +1,48 @@
---
name: Bug report
about: Please follow this template to submit bug reports.
title: ''
labels: bug
assignees: ''
---
<!-- Please read the filing issues section of the contributor's guide first: https://camelot-py.readthedocs.io/en/master/dev/contributing.html -->
**Describe the bug**
A clear and concise description of what the bug is.
**Steps to reproduce the bug**
Steps used to install `camelot`:
1. Add step here (you can add more steps too)
Steps to reproduce the behavior:
1. Add step here (you can add more steps too)
**Expected behavior**
A clear and concise description of what you expected to happen.
**Code**
Add the Camelot code snippet that you used.
```
import camelot
# add your code here
```
**PDF**
Add the PDF file that you want to extract tables from.
**Screenshots**
If applicable, add screenshots to help explain your problem.
**Environment**
- OS: [e.g. MacOS]
- Python version:
- Numpy version:
- OpenCV version:
- Ghostscript version:
- Camelot version:
**Additional context**
Add any other context about the problem here.

@ -1,12 +1,7 @@
MIT License
Modifications:
Copyright (c) 2019 Camelot Developers
Original project:
Copyright (c) 2018 Peeply Private Ltd (Singapore)
Copyright (c) 2019-2020 Camelot Developers
Copyright (c) 2018-2019 Peeply Private Ltd (Singapore)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

@ -10,13 +10,13 @@
[![image](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black) [![image](https://img.shields.io/badge/continous%20quality-deepsource-lightgrey)](https://deepsource.io/gh/camelot-dev/camelot/?ref=repository-badge)
**Camelot** is a Python library that makes it easy for *anyone* to extract tables from PDF files!
**Camelot** is a Python library that can help you extract tables from PDFs!
**Note:** You can also check out [Excalibur](https://github.com/camelot-dev/excalibur), which is a web interface for Camelot!
**Note:** You can also check out [Excalibur](https://github.com/camelot-dev/excalibur), the web interface to Camelot!
---
**Here's how you can extract tables from PDF files.** Check out the PDF used in this example [here](https://github.com/camelot-dev/camelot/blob/master/docs/_static/pdf/foo.pdf).
**Here's how you can extract tables from PDFs.** You can check out the PDF used in this example [here](https://github.com/camelot-dev/camelot/blob/master/docs/_static/pdf/foo.pdf).
<pre>
>>> import camelot
@ -46,24 +46,27 @@
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |
There's a [command-line interface](https://camelot-py.readthedocs.io/en/master/user/cli.html) too!
Camelot also comes packaged with a [command-line interface](https://camelot-py.readthedocs.io/en/master/user/cli.html)!
**Note:** Camelot only works with text-based PDFs and not scanned documents. (As Tabula [explains](https://github.com/tabulapdf/tabula#why-tabula), "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
## Why Camelot?
- **You are in control.**: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
- *Bad* tables can be discarded based on **metrics** like accuracy and whitespace, without ever having to manually look at each table.
- Each table is a **pandas DataFrame**, which seamlessly integrates into [ETL and data analysis workflows](https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873).
- **Export** to multiple formats, including JSON, Excel, HTML and Sqlite.
- **Configurability**: Camelot gives you control over the table extraction process with its [tweakable settings](https://camelot-py.readthedocs.io/en/master/user/advanced.html).
- **Metrics**: Bad tables can be discarded based on metrics like accuracy and whitespace, without having to manually look at each table.
- **Output**: Each table is extracted into a **pandas DataFrame**, which seamlessly integrates into [ETL and data analysis workflows](https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873). You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML and Sqlite.
See [comparison with other PDF table extraction libraries and tools](https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).
See [comparison with similar libraries and tools](https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).
## Support the development
If Camelot has helped you, please consider supporting its development with a one-time or monthly donation [on OpenCollective](https://opencollective.com/camelot).
## Installation
### Using conda
The easiest way to install Camelot is to install it with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.
The easiest way to install Camelot is with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.
<pre>
$ conda install -c conda-forge camelot-py
@ -71,7 +74,7 @@ $ conda install -c conda-forge camelot-py
### Using pip
After [installing the dependencies](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can simply use pip to install Camelot:
After [installing the dependencies](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can also just use pip to install Camelot:
<pre>
$ pip install "camelot-py[cv]"
@ -94,40 +97,16 @@ $ pip install ".[cv]"
## Documentation
Great documentation is available at [http://camelot-py.readthedocs.io/](http://camelot-py.readthedocs.io/).
## Development
The [Contributor's Guide](https://camelot-py.readthedocs.io/en/master/dev/contributing.html) has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.
### Source code
You can check the latest sources with:
<pre>
$ git clone https://www.github.com/camelot-dev/camelot
</pre>
### Setting up a development environment
You can install the development dependencies easily, using pip:
<pre>
$ pip install "camelot-py[dev]"
</pre>
### Testing
After installation, you can run tests using:
<pre>
$ python setup.py test
</pre>
The documentation is available at [http://camelot-py.readthedocs.io/](http://camelot-py.readthedocs.io/).
## Wrappers
- [camelot-php](https://github.com/randomstate/camelot-php) provides a [PHP](https://www.php.net/) wrapper on Camelot.
## Contributing
The [Contributor's Guide](https://camelot-py.readthedocs.io/en/master/dev/contributing.html) has detailed information about contributing issues, documentation, code, and tests.
## Versioning
Camelot uses [Semantic Versioning](https://semver.org/). For the available versions, see the tags on this repository. For the changelog, you can check out [HISTORY.md](https://github.com/camelot-dev/camelot/blob/master/HISTORY.md).
@ -135,9 +114,3 @@ Camelot uses [Semantic Versioning](https://semver.org/). For the available versi
## License
This project is licensed under the MIT License, see the [LICENSE](https://github.com/camelot-dev/camelot/blob/master/LICENSE) file for details.
## Support the development
You can support our work on Camelot with a one-time or monthly donation [on OpenCollective](https://opencollective.com/camelot). Organizations who use camelot can also sponsor the project for an acknowledgement on [our documentation site](https://camelot-py.readthedocs.io/en/master/) and this README.
Special thanks to all the users, organizations and contributors that support Camelot!

@ -70,7 +70,8 @@ class PDFHandler(object):
if pages == "1":
page_numbers.append({"start": 1, "end": 1})
else:
infile = PdfFileReader(open(filepath, "rb"), strict=False)
instream = open(filepath, "rb")
infile = PdfFileReader(instream, strict=False)
if infile.isEncrypted:
infile.decrypt(self.password)
if pages == "all":
@ -84,6 +85,7 @@ class PDFHandler(object):
page_numbers.append({"start": int(a), "end": int(b)})
else:
page_numbers.append({"start": int(r), "end": int(r)})
instream.close()
P = []
for p in page_numbers:
P.extend(range(p["start"], p["end"] + 1))
@ -122,7 +124,8 @@ class PDFHandler(object):
if rotation != "":
fpath_new = "".join([froot.replace("page", "p"), "_rotated", fext])
os.rename(fpath, fpath_new)
infile = PdfFileReader(open(fpath_new, "rb"), strict=False)
instream = open(fpath_new, "rb")
infile = PdfFileReader(instream, strict=False)
if infile.isEncrypted:
infile.decrypt(self.password)
outfile = PdfFileWriter()
@ -134,6 +137,7 @@ class PDFHandler(object):
outfile.addPage(p)
with open(fpath, "wb") as f:
outfile.write(f)
instream.close()
def parse(
self, flavor="lattice", suppress_stdout=False, layout_kwargs={}, **kwargs
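
The two hunks above replace an inline `open(...)` with a named `instream` that is closed explicitly once the reader is done. A minimal sketch of the same idea with a context manager, which also closes the handle if an exception is raised in between. `read_header` and the temporary file are hypothetical stand-ins for illustration, not part of Camelot or PyPDF2:

```python
import os
import tempfile


def read_header(filepath, nbytes=5):
    """Read the first bytes of a file and report whether the handle got closed."""
    with open(filepath, "rb") as instream:
        header = instream.read(nbytes)
    # Leaving the `with` block closes the stream, even on an exception.
    return header, instream.closed


# Hypothetical stand-in for a PDF on disk.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    path = tmp.name

header, closed = read_header(path)
os.remove(path)
```

Any reader that pulls data from the stream lazily has to finish its work inside the `with` block, which may be why the change keeps the stream open until the end of each method.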

@ -121,6 +121,7 @@ class Stream(BaseParser):
row_y = 0
rows = []
temp = []
for t in text:
# is checking for upright necessary?
# if t.get_text().strip() and all([obj.upright for obj in t._objs if
@ -131,8 +132,10 @@ class Stream(BaseParser):
temp = []
row_y = t.y0
temp.append(t)
rows.append(sorted(temp, key=lambda t: t.x0))
__ = rows.pop(0) # TODO: hacky
if len(rows) > 1:
__ = rows.pop(0) # TODO: hacky
return rows
@staticmethod
@ -345,43 +348,46 @@ class Stream(BaseParser):
else:
# calculate mode of the list of number of elements in
# each row to guess the number of columns
ncols = max(set(elements), key=elements.count)
if ncols == 1:
# if mode is 1, the page usually contains no tables
# but there can be cases where the list can be skewed,
# try to remove all 1s from list in this case and
# see if the list contains elements, if yes, then use
# the mode after removing 1s
elements = list(filter(lambda x: x != 1, elements))
if len(elements):
ncols = max(set(elements), key=elements.count)
else:
warnings.warn(
f"No tables found in table area {table_idx + 1}"
if not len(elements):
cols = [(text_x_min, text_x_max)]
else:
ncols = max(set(elements), key=elements.count)
if ncols == 1:
# if mode is 1, the page usually contains no tables
# but there can be cases where the list can be skewed,
# try to remove all 1s from list in this case and
# see if the list contains elements, if yes, then use
# the mode after removing 1s
elements = list(filter(lambda x: x != 1, elements))
if len(elements):
ncols = max(set(elements), key=elements.count)
else:
warnings.warn(
f"No tables found in table area {table_idx + 1}"
)
cols = [(t.x0, t.x1) for r in rows_grouped if len(r) == ncols for t in r]
cols = self._merge_columns(sorted(cols), column_tol=self.column_tol)
inner_text = []
for i in range(1, len(cols)):
left = cols[i - 1][1]
right = cols[i][0]
inner_text.extend(
[
t
for direction in self.t_bbox
for t in self.t_bbox[direction]
if t.x0 > left and t.x1 < right
]
)
cols = [(t.x0, t.x1) for r in rows_grouped if len(r) == ncols for t in r]
cols = self._merge_columns(sorted(cols), column_tol=self.column_tol)
inner_text = []
for i in range(1, len(cols)):
left = cols[i - 1][1]
right = cols[i][0]
inner_text.extend(
[
t
for direction in self.t_bbox
for t in self.t_bbox[direction]
if t.x0 > left and t.x1 < right
]
)
outer_text = [
t
for direction in self.t_bbox
for t in self.t_bbox[direction]
if t.x0 > cols[-1][1] or t.x1 < cols[0][0]
]
inner_text.extend(outer_text)
cols = self._add_columns(cols, inner_text, self.row_tol)
cols = self._join_columns(cols, text_x_min, text_x_max)
outer_text = [
t
for direction in self.t_bbox
for t in self.t_bbox[direction]
if t.x0 > cols[-1][1] or t.x1 < cols[0][0]
]
inner_text.extend(outer_text)
cols = self._add_columns(cols, inner_text, self.row_tol)
cols = self._join_columns(cols, text_x_min, text_x_max)
return cols, rows
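
The column guess in the hunk above takes the mode of the per-row element counts and falls back to the mode without the 1s when the mode itself is 1. A minimal standalone sketch of that logic, assuming `elements` is a plain list of ints. `guess_ncols` is a hypothetical helper name, and returning `None` for an empty list is an assumption of this sketch (the diff instead falls back to a single column spanning the text area):

```python
def guess_ncols(elements):
    """Guess the number of columns from the number of text elements per row."""
    if not elements:
        # Assumption of this sketch: signal "nothing to go on" with None.
        return None
    # Mode of the list: the row length that occurs most often.
    ncols = max(set(elements), key=elements.count)
    if ncols == 1:
        # A mode of 1 usually means no table, but the list can be skewed
        # by many single-element rows, so retry without them.
        filtered = [n for n in elements if n != 1]
        if filtered:
            ncols = max(set(filtered), key=filtered.count)
    return ncols
```

For example, `guess_ncols([1, 1, 1, 4, 4, 2])` skips the three single-element rows and settles on 4.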

@ -353,7 +353,7 @@ def text_in_bbox(bbox, text):
Returns
-------
t_bbox : list
List of PDFMiner text objects that lie inside table.
List of PDFMiner text objects that lie inside table, discarding the overlapping ones
"""
lb = (bbox[0], bbox[1])
@ -364,7 +364,97 @@ def text_in_bbox(bbox, text):
if lb[0] - 2 <= (t.x0 + t.x1) / 2.0 <= rt[0] + 2
and lb[1] - 2 <= (t.y0 + t.y1) / 2.0 <= rt[1] + 2
]
return t_bbox
# Avoid duplicate text by discarding overlapping boxes
rest = {t for t in t_bbox}
for ba in t_bbox:
for bb in rest.copy():
if ba == bb:
continue
if bbox_intersect(ba, bb):
# if the intersection is larger than 80% of ba's size, we keep the longest
if (bbox_intersection_area(ba, bb) / bbox_area(ba)) > 0.8:
if bbox_longer(bb, ba):
rest.discard(ba)
unique_boxes = list(rest)
return unique_boxes
def bbox_intersection_area(ba, bb) -> float:
"""Returns area of the intersection of the bounding boxes of two PDFMiner objects.
Parameters
----------
ba : PDFMiner text object
bb : PDFMiner text object
Returns
-------
intersection_area : float
Area of the intersection of the bounding boxes of both objects
"""
x_left = max(ba.x0, bb.x0)
y_top = min(ba.y1, bb.y1)
x_right = min(ba.x1, bb.x1)
y_bottom = max(ba.y0, bb.y0)
if x_right < x_left or y_bottom > y_top:
return 0.0
intersection_area = (x_right - x_left) * (y_top - y_bottom)
return intersection_area
def bbox_area(bb) -> float:
"""Returns area of the bounding box of a PDFMiner object.
Parameters
----------
bb : PDFMiner text object
Returns
-------
area : float
Area of the bounding box of the object
"""
return (bb.x1 - bb.x0) * (bb.y1 - bb.y0)
def bbox_intersect(ba, bb) -> bool:
"""Returns True if the bounding boxes of two PDFMiner objects intersect.
Parameters
----------
ba : PDFMiner text object
bb : PDFMiner text object
Returns
-------
overlaps : bool
True if the bounding boxes intersect
"""
return ba.x1 >= bb.x0 and bb.x1 >= ba.x0 and ba.y1 >= bb.y0 and bb.y1 >= ba.y0
def bbox_longer(ba, bb) -> bool:
"""Returns True if the bounding box of the first PDFMiner object is longer or equal to the second.
Parameters
----------
ba : PDFMiner text object
bb : PDFMiner text object
Returns
-------
longer : bool
True if the bounding box of the first object is longer or equal
"""
return (ba.x1 - ba.x0) >= (bb.x1 - bb.x0)
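
The helpers above can be exercised without PDFMiner. The sketch below follows the same rule as the new `text_in_bbox` code (discard a box when more than 80% of its area is covered by a box that is at least as wide), using a hypothetical `Box` namedtuple in place of a PDFMiner text object. `drop_overlapping` and the zero-area guard are assumptions of this sketch, not names or behavior from the diff:

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object (only the bbox fields).
Box = namedtuple("Box", "x0 y0 x1 y1")


def intersection_area(ba, bb):
    """Area shared by two boxes, 0.0 if they do not overlap."""
    width = min(ba.x1, bb.x1) - max(ba.x0, bb.x0)
    height = min(ba.y1, bb.y1) - max(ba.y0, bb.y0)
    if width < 0 or height < 0:
        return 0.0
    return float(width * height)


def area(b):
    return float((b.x1 - b.x0) * (b.y1 - b.y0))


def drop_overlapping(boxes, threshold=0.8):
    """Keep the wider box of every heavily overlapping pair."""
    keep = set(boxes)
    for ba in boxes:
        if area(ba) == 0:
            continue  # guard added in this sketch to avoid dividing by zero
        for bb in list(keep):
            if ba == bb:
                continue
            covered = intersection_area(ba, bb) / area(ba)
            if covered > threshold and (bb.x1 - bb.x0) >= (ba.x1 - ba.x0):
                keep.discard(ba)
    # Preserve the input order of the surviving boxes.
    return [b for b in boxes if b in keep]


short = Box(0, 0, 10, 10)
longer = Box(1, 0, 20, 10)
far = Box(50, 50, 60, 60)
unique = drop_overlapping([short, longer, far])
```

Here 90% of `short` is covered by `longer`, so `short` is dropped while `longer` and `far` survive.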
def merge_close_lines(ar, line_tol=2):
@ -411,7 +501,7 @@ def text_strip(text, strip=""):
return text
stripped = re.sub(
fr"[{''.join(map(re.escape, strip))}]", "", text, re.UNICODE
fr"[{''.join(map(re.escape, strip))}]", "", text, flags=re.UNICODE
)
return stripped
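
The one-word change in this hunk matters because the fourth positional parameter of `re.sub` is `count`, not `flags`. A short sketch of the difference; the function mirrors the fixed call above, and the sample strings are made up for illustration:

```python
import re


def text_strip(text, strip=""):
    """Remove every character listed in `strip` from `text`."""
    if not strip:
        return text
    pattern = f"[{''.join(map(re.escape, strip))}]"
    return re.sub(pattern, "", text, flags=re.UNICODE)


cleaned = text_strip("$1,234", "$,")

# Passing the flag in the fourth position is the same as passing it as
# `count`, so only int(re.UNICODE) replacements are made.
sample = "a" * 40
limited = re.sub("a", "", sample, count=re.UNICODE)
remaining = len(limited)
```

With the old call, any text containing more matches than the integer value of `re.UNICODE` would have been stripped only partially.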

@ -63,7 +63,7 @@ master_doc = 'index'
# General information about the project.
project = u'Camelot'
copyright = u'2019, Camelot Developers'
copyright = u'2020, Camelot Developers'
author = u'Vinayak Mehta'
# The version info for the project you're documenting, acts as replacement for

View File

@ -36,15 +36,15 @@ Release v\ |version|. (:ref:`Installation <install>`)
.. image:: https://img.shields.io/badge/continous%20quality-deepsource-lightgrey
:target: https://deepsource.io/gh/camelot-dev/camelot/?ref=repository-badge
**Camelot** is a Python library that makes it easy for *anyone* to extract tables from PDF files!
**Camelot** is a Python library that can help you extract tables from PDFs!
.. note:: You can also check out `Excalibur`_, which is a web interface for Camelot!
.. note:: You can also check out `Excalibur`_, the web interface to Camelot!
.. _Excalibur: https://github.com/camelot-dev/excalibur
----
**Here's how you can extract tables from PDF files.** Check out the PDF used in this example `here`_.
**Here's how you can extract tables from PDFs.** You can check out the PDF used in this example `here`_.
.. _here: _static/pdf/foo.pdf
@ -70,7 +70,7 @@ Release v\ |version|. (:ref:`Installation <install>`)
.. csv-table::
:file: _static/csv/foo.csv
There's a :ref:`command-line interface <cli>` too!
Camelot also comes packaged with a :ref:`command-line interface <cli>`!
.. note:: Camelot only works with text-based PDFs and not scanned documents. (As Tabula `explains`_, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
@ -79,27 +79,27 @@ There's a :ref:`command-line interface <cli>` too!
Why Camelot?
------------
- **You are in control.** Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
- *Bad* tables can be discarded based on **metrics** like accuracy and whitespace, without ever having to manually look at each table.
- Each table is a **pandas DataFrame**, which seamlessly integrates into `ETL and data analysis workflows`_.
- **Export** to multiple formats, including JSON, Excel and HTML.
See `comparison with other PDF table extraction libraries and tools`_.
- **Configurability**: Camelot gives you control over the table extraction process with its :ref:`tweakable settings <advanced>`.
- **Metrics**: Bad tables can be discarded based on metrics like accuracy and whitespace, without having to manually look at each table.
- **Output**: Each table is extracted into a **pandas DataFrame**, which seamlessly integrates into `ETL and data analysis workflows`_. You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML and Sqlite.
.. _ETL and data analysis workflows: https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873
.. _comparison with other PDF table extraction libraries and tools: https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools
Support us on OpenCollective
----------------------------
See `comparison with similar libraries and tools`_.
If Camelot helped you extract tables from PDFs, please consider supporting its development by `becoming a backer or a sponsor on OpenCollective`_!
.. _comparison with similar libraries and tools: https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools
.. _becoming a backer or a sponsor on OpenCollective: https://opencollective.com/camelot
Support the development
-----------------------
If Camelot has helped you, please consider supporting its development with a one-time or monthly donation `on OpenCollective`_!
.. _on OpenCollective: https://opencollective.com/camelot
The User Guide
--------------
This part of the documentation begins with some background information about why Camelot was created, takes a small dip into the implementation details and then focuses on step-by-step instructions for getting the most out of Camelot.
This part of the documentation begins with some background information about why Camelot was created, takes you through some implementation details, and then focuses on step-by-step instructions for getting the most out of Camelot.
.. toctree::
:maxdepth: 2
@ -115,8 +115,7 @@ This part of the documentation begins with some background information about why
The API Documentation/Guide
---------------------------
If you are looking for information on a specific function, class, or method,
this part of the documentation is for you.
If you are looking for information on a specific function, class, or method, this part of the documentation is for you.
.. toctree::
:maxdepth: 2
@ -126,8 +125,7 @@ this part of the documentation is for you.
The Contributor Guide
---------------------
If you want to contribute to the project, this part of the documentation is for
you.
If you want to contribute to the project, this part of the documentation is for you.
.. toctree::
:maxdepth: 2

@ -3,72 +3,59 @@
Installation of dependencies
============================
The dependencies `Tkinter`_ and `ghostscript`_ can be installed using your system's package manager. You can run one of the following, based on your OS.
.. _Tkinter: https://wiki.python.org/moin/TkInter
.. _ghostscript: https://www.ghostscript.com
The dependencies `Ghostscript <https://www.ghostscript.com>`_ and `Tkinter <https://wiki.python.org/moin/TkInter>`_ can be installed using your system's package manager or by running their installer.
OS-specific instructions
------------------------
For Ubuntu
^^^^^^^^^^
Ubuntu
^^^^^^
::
$ apt install python-tk ghostscript
$ apt install ghostscript python3-tk
Or for Python 3::
$ apt install python3-tk ghostscript
For macOS
^^^^^^^^^
MacOS
^^^^^
::
$ brew install tcl-tk ghostscript
$ brew install ghostscript tcl-tk
For Windows
^^^^^^^^^^^
Windows
^^^^^^^
For Tkinter, you can download the `ActiveTcl Community Edition`_ from ActiveState. For ghostscript, you can get the installer at the `ghostscript downloads page`_.
For Ghostscript, you can get the installer at their `downloads page <https://www.ghostscript.com/download/gsdnld.html>`_. And for Tkinter, you can download the `ActiveTcl Community Edition <https://www.activestate.com/activetcl/downloads>`_ from ActiveState.
.. _ActiveTcl Community Edition: https://www.activestate.com/activetcl/downloads
.. _ghostscript downloads page: https://www.ghostscript.com/download/gsdnld.html
.. _as shown here: https://java.com/en/download/help/path.xml
Checks to see if dependencies are installed correctly
-----------------------------------------------------
Checks to see if dependencies were installed correctly
------------------------------------------------------
You can run the following checks to see if the dependencies were installed correctly.
You can do the following checks to see if the dependencies were installed correctly.
For Ghostscript
^^^^^^^^^^^^^^^
Open the Python REPL and run the following:
For Ubuntu/MacOS::
>>> from ctypes.util import find_library
>>> find_library("gs")
"libgs.so.9"
For Windows::
>>> import ctypes
>>> from ctypes.util import find_library
>>> find_library("".join(("gsdll", str(ctypes.sizeof(ctypes.c_voidp) * 8), ".dll")))
<name-of-ghostscript-library-on-windows>
**Check:** The output of the ``find_library`` function should not be empty.
If the output is empty, then it's possible that the Ghostscript library is not available on one of the ``LD_LIBRARY_PATH``/``DYLD_LIBRARY_PATH``/``PATH`` variables, depending on your operating system. In this case, you may have to modify one of those path variables.
For Tkinter
^^^^^^^^^^^
Launch Python, and then at the prompt, type::
>>> import Tkinter
Or in Python 3::
Launch Python and then import Tkinter::
>>> import tkinter
If you have Tkinter, Python will not print an error message, and if not, you will see an ``ImportError``.
For ghostscript
^^^^^^^^^^^^^^^
Run the following to check the ghostscript version.
For Ubuntu/macOS::
$ gs -version
For Windows::
C:\> gswin64c.exe -version
Or for Windows 32-bit::
C:\> gswin32c.exe -version
If you have ghostscript, you should see the ghostscript version and copyright information.
**Check:** Importing ``tkinter`` should not raise an import error.
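
Both checks above can be combined into one small script. This is a sketch under stated assumptions: `check_dependencies` is a hypothetical helper, not something Camelot ships, and the library name ``"gs"`` is the Ubuntu/MacOS one (Windows uses the ``gsdll`` name shown earlier):

```python
from ctypes.util import find_library


def check_dependencies():
    """Report which of the two system dependencies can be located."""
    try:
        import tkinter  # noqa: F401
        has_tkinter = True
    except ImportError:
        has_tkinter = False
    return {
        # find_library returns None when the shared library is not found.
        "ghostscript": find_library("gs") is not None,
        "tkinter": has_tkinter,
    }


status = check_dependencies()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```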

@ -5,42 +5,35 @@ Installation of Camelot
This part of the documentation covers the steps to install Camelot.
Using conda
-----------
After :ref:`installing the dependencies <install_deps>`, which include `Ghostscript <https://www.ghostscript.com>`_ and `Tkinter <https://wiki.python.org/moin/TkInter>`_, you can use one of the following methods to install Camelot:
The easiest way to install Camelot is to install it with `conda`_, which is a package manager and environment management system for the `Anaconda`_ distribution.
::
.. warning:: The ``lattice`` flavor will fail to run if Ghostscript is not installed. You may run into errors as shown in `issue #193 <https://github.com/camelot-dev/camelot/issues/193>`_.
$ conda install -c conda-forge camelot-py
pip
---
.. note:: Camelot is available for Python 2.7, 3.5, 3.6 and 3.7 on Linux, macOS and Windows. For Windows, you will need to install ghostscript which you can get from their `downloads page`_.
.. _conda: https://conda.io/docs/
.. _Anaconda: http://docs.continuum.io/anaconda/
.. _downloads page: https://www.ghostscript.com/download/gsdnld.html
.. _conda-forge: https://conda-forge.org/
Using pip
---------
After :ref:`installing the dependencies <install_deps>`, which include `Tkinter`_ and `ghostscript`_, you can simply use pip to install Camelot::
To install Camelot from PyPI using ``pip``, please include the extra ``cv`` requirement as shown::
$ pip install "camelot-py[cv]"
.. _Tkinter: https://wiki.python.org/moin/TkInter
.. _ghostscript: https://www.ghostscript.com
conda
-----
`conda`_ is a package manager and environment management system for the `Anaconda <https://anaconda.org>`_ distribution. It can be used to install Camelot from the ``conda-forge`` channel::
$ conda install -c conda-forge camelot-py
From the source code
--------------------
After :ref:`installing the dependencies <install_deps>`, you can install from the source by:
After :ref:`installing the dependencies <install_deps>`, you can install Camelot from source by:
1. Cloning the GitHub repository.
::
$ git clone https://www.github.com/camelot-dev/camelot
2. Then simply using pip again.
2. And then simply using pip again.
::
$ cd camelot

@ -2798,3 +2798,51 @@ data_stream_layout_kwargs = [
["A.O.P Cornas", ""],
["Domaine Lionnet « Terre Brûlée » 2012", "15 €"],
]
data_stream_duplicated_text = [
['', '2012 BETTER VARIETIES Harvest Report for Minnesota Central [ MNCE ]', '', '', '', '', '', '', '', '',
'ALL SEASON TEST'],
['', 'Doug Toreen, Renville County, MN 55310 [ BIRD ISLAND ]', '', '', '', '', '', '', '', '',
'1.3 - 2.0 MAT. GROUP'],
['PREV. CROP/HERB:', 'Corn / Surpass, Roundup', '', '', '', '', '', '', '', '', 'S2MNCE01'],
['SOIL DESCRIPTION:', '', 'Canisteo clay loam, mod. well drained, non-irrigated', '', '', '', '', '', '', '', ''],
['SOIL CONDITIONS:', '', 'High P, high K, 6.7 pH, 3.9% OM, Low SCN', '', '', '', '', '', '', '', '30" ROW SPACING'],
['TILLAGE/CULTIVATION:', 'conventional w/ fall till', '', '', '', '', '', '', '', '', ''],
['PEST MANAGEMENT:', 'Roundup twice', '', '', '', '', '', '', '', '', ''],
['SEEDED - RATE:', 'May 15', '140 000 /A', '', '', '', '', '', '', 'TOP 30 for YIELD of 63 TESTED', ''],
['HARVESTED - STAND:', 'Oct 3', '122 921 /A', '', '', '', '', '', '', 'AVERAGE of (3) REPLICATIONS', ''],
['', '', '', '', 'SCN', 'Seed', 'Yield', 'Moisture', 'Lodging', 'Stand', 'Gross'],
['Company/Brand', 'Product/Brand†', 'Technol.†', 'Mat.', 'Resist.', 'Trmt.†', 'Bu/A', '%', '%', '(x 1000)',
'Income'], ['Kruger', 'K2 1901', 'RR2Y', '1.9', 'R', 'Ac,PV', '56.4', '7.6', '0', '126.3', '$846'],
['Stine', '19RA02 §', 'RR2Y', '1.9', 'R', 'CMB', '55.3', '7.6', '0', '120.0', '$830'],
['Wensman', 'W 3190NR2', 'RR2Y', '1.9', 'R', 'Ac', '54.5', '7.6', '0', '119.5', '$818'],
['Hefty', 'H17Y12', 'RR2Y', '1.7', 'MR', 'I', '53.7', '7.7', '0', '124.4', '$806'],
['Dyna-Gro', 'S15RY53', 'RR2Y', '1.5', 'R', 'Ac', '53.6', '7.7', '0', '126.8', '$804'],
['LG Seeds', 'C2050R2', 'RR2Y', '2.1', 'R', 'Ac', '53.6', '7.7', '0', '123.9', '$804'],
['Titan Pro', '19M42', 'RR2Y', '1.9', 'R', 'CMB', '53.6', '7.7', '0', '121.0', '$804'],
['Stine', '19RA02 (2) §', 'RR2Y', '1.9', 'R', 'CMB', '53.4', '7.7', '0', '123.9', '$801'],
['Asgrow', 'AG1832 §', 'RR2Y', '1.8', 'MR', 'Ac,PV', '52.9', '7.7', '0', '122.0', '$794'],
['Prairie Brand', 'PB-1566R2', 'RR2Y', '1.5', 'R', 'CMB', '52.8', '7.7', '0', '122.9', '$792'],
['Channel', '1901R2', 'RR2Y', '1.9', 'R', 'Ac,PV', '52.8', '7.6', '0', '123.4', '$791'],
['Titan Pro', '20M1', 'RR2Y', '2.0', 'R', 'Am', '52.5', '7.5', '0', '124.4', '$788'],
['Kruger', 'K2-2002', 'RR2Y', '2.0', 'R', 'Ac,PV', '52.4', '7.9', '0', '125.4', '$786'],
['Channel', '1700R2', 'RR2Y', '1.7', 'R', 'Ac,PV', '52.3', '7.9', '0', '123.9', '$784'],
['Hefty', 'H16Y11', 'RR2Y', '1.6', 'MR', 'I', '51.4', '7.6', '0', '123.9', '$771'],
['Anderson', '162R2Y', 'RR2Y', '1.6', 'R', 'None', '51.3', '7.5', '0', '119.5', '$770'],
['Titan Pro', '15M22', 'RR2Y', '1.5', 'R', 'CMB', '51.3', '7.8', '0', '125.4', '$769'],
['Dairyland', 'DSR-1710R2Y', 'RR2Y', '1.7', 'R', 'CMB', '51.3', '7.7', '0', '122.0', '$769'],
['Hefty', 'H20R3', 'RR2Y', '2.0', 'MR', 'I', '50.5', '8.2', '0', '121.0', '$757'],
['Prairie Brand', 'PB 1743R2', 'RR2Y', '1.7', 'R', 'CMB', '50.2', '7.7', '0', '125.8', '$752'],
['Gold Country', '1741', 'RR2Y', '1.7', 'R', 'Ac', '50.1', '7.8', '0', '123.9', '$751'],
['Trelay', '20RR43', 'RR2Y', '2.0', 'R', 'Ac,Ex', '49.9', '7.6', '0', '127.8', '$749'],
['Hefty', 'H14R3', 'RR2Y', '1.4', 'MR', 'I', '49.7', '7.7', '0', '122.9', '$746'],
['Prairie Brand', 'PB-2099NRR2', 'RR2Y', '2.0', 'R', 'CMB', '49.6', '7.8', '0', '126.3', '$743'],
['Wensman', 'W 3174NR2', 'RR2Y', '1.7', 'R', 'Ac', '49.3', '7.6', '0', '122.5', '$740'],
['Kruger', 'K2 1602', 'RR2Y', '1.6', 'R', 'Ac,PV', '48.7', '7.6', '0', '125.4', '$731'],
['NK Brand', 'S18-C2 §', 'RR2Y', '1.8', 'R', 'CMB', '48.7', '7.7', '0', '126.8', '$731'],
['Kruger', 'K2 1902', 'RR2Y', '1.9', 'R', 'Ac,PV', '48.7', '7.5', '0', '124.4', '$730'],
['Prairie Brand', 'PB-1823R2', 'RR2Y', '1.8', 'R', 'None', '48.5', '7.6', '0', '121.0', '$727'],
['Gold Country', '1541', 'RR2Y', '1.5', 'R', 'Ac', '48.4', '7.6', '0', '110.4', '$726'],
['', '', '', '', '', 'Test Average =', '47.6', '7.7', '0', '122.9', '$713'],
['', '', '', '', '', 'LSD (0.10) =', '5.7', '0.3', 'ns', '37.8', '566.4']
]

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

@ -160,8 +160,8 @@ def test_cli_output_format():
def test_cli_quiet():
with TemporaryDirectory() as tempdir:
infile = os.path.join(testdir, "blank.pdf")
outfile = os.path.join(tempdir, "blank.csv")
infile = os.path.join(testdir, "empty.pdf")
outfile = os.path.join(tempdir, "empty.csv")
runner = CliRunner()
result = runner.invoke(

@ -314,3 +314,11 @@ def test_version_generation_with_prerelease_revision():
generate_version(version, prerelease=prerelease, revision=revision)
== "0.7.3-alpha.2"
)
def test_stream_duplicated_text():
df = pd.DataFrame(data_stream_duplicated_text)
filename = os.path.join(testdir, "birdisland.pdf")
tables = camelot.read_pdf(filename, flavor="stream")
assert_frame_equal(df, tables[0].df)

@ -55,15 +55,33 @@ def test_image_warning():
)
def test_no_tables_found():
filename = os.path.join(testdir, "blank.pdf")
def test_lattice_no_tables_on_page():
filename = os.path.join(testdir, "empty.pdf")
with warnings.catch_warnings():
warnings.simplefilter("error")
with pytest.raises(UserWarning) as e:
tables = camelot.read_pdf(filename)
tables = camelot.read_pdf(filename, flavor="lattice")
assert str(e.value) == "No tables found on page-1"
def test_stream_no_tables_on_page():
filename = os.path.join(testdir, "empty.pdf")
with warnings.catch_warnings():
warnings.simplefilter("error")
with pytest.raises(UserWarning) as e:
tables = camelot.read_pdf(filename, flavor="stream")
assert str(e.value) == "No tables found on page-1"
def test_stream_no_tables_in_area():
filename = os.path.join(testdir, "only_page_number.pdf")
with warnings.catch_warnings():
warnings.simplefilter("error")
with pytest.raises(UserWarning) as e:
tables = camelot.read_pdf(filename, flavor="stream")
assert str(e.value) == "No tables found in table area 1"
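
The three tests above share one pattern: turn warnings into exceptions with `warnings.simplefilter("error")`, then assert on the message. A standalone sketch of that pattern without pytest or Camelot; `find_tables` is a hypothetical stand-in for a parser that warns when it finds nothing, and `warning_message` is a made-up helper name:

```python
import warnings


def find_tables(n_tables, page=1):
    """Hypothetical parser stand-in: warns when a page has no tables."""
    if n_tables == 0:
        warnings.warn(f"No tables found on page-{page}")
    return list(range(n_tables))


def warning_message(func, *args):
    """Return the text of the first UserWarning raised by func, else None."""
    with warnings.catch_warnings():
        # Inside this block every warning is raised as an exception.
        warnings.simplefilter("error")
        try:
            func(*args)
        except UserWarning as e:
            return str(e)
    return None


empty_page = warning_message(find_tables, 0)
page_with_tables = warning_message(find_tables, 2)
```

Because `catch_warnings` restores the previous filters on exit, the "error" filter does not leak into other tests.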
def test_no_tables_found_logs_suppressed():
filename = os.path.join(testdir, "foo.pdf")
with warnings.catch_warnings():
@ -77,7 +95,7 @@ def test_no_tables_found_logs_suppressed():
def test_no_tables_found_warnings_suppressed():
filename = os.path.join(testdir, "blank.pdf")
filename = os.path.join(testdir, "empty.pdf")
with warnings.catch_warnings():
# the test should fail if any warning is thrown
warnings.simplefilter("error")