Merge pull request #2 from camelot-dev/master

Update fork
pull/216/head
anakin87 2020-12-08 18:37:55 +01:00 committed by GitHub
commit 644e17edec
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
18 changed files with 358 additions and 190 deletions

@ -0,0 +1,48 @@
---
name: Bug report
about: Please follow this template to submit bug reports.
title: ''
labels: bug
assignees: ''
---
<!-- Please read the filing issues section of the contributor's guide first: https://camelot-py.readthedocs.io/en/master/dev/contributing.html -->
**Describe the bug**
A clear and concise description of what the bug is.
**Steps to reproduce the bug**
Steps used to install `camelot`:
1. Add step here (you can add more steps too)
Steps to reproduce the behavior:
1. Add step here (you can add more steps too)
**Expected behavior**
A clear and concise description of what you expected to happen.
**Code**
Add the Camelot code snippet that you used.
```
import camelot
# add your code here
```
**PDF**
Add the PDF file that you want to extract tables from.
**Screenshots**
If applicable, add screenshots to help explain your problem.
**Environment**
- OS: [e.g. MacOS]
- Python version:
- Numpy version:
- OpenCV version:
- Ghostscript version:
- Camelot version:
**Additional context**
Add any other context about the problem here.

@ -1,12 +1,7 @@
MIT License MIT License
Modifications: Copyright (c) 2019-2020 Camelot Developers
Copyright (c) 2018-2019 Peeply Private Ltd (Singapore)
Copyright (c) 2019 Camelot Developers
Original project:
Copyright (c) 2018 Peeply Private Ltd (Singapore)
Permission is hereby granted, free of charge, to any person obtaining a copy Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal of this software and associated documentation files (the "Software"), to deal

@ -10,13 +10,13 @@
[![image](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black) [![image](https://img.shields.io/badge/continous%20quality-deepsource-lightgrey)](https://deepsource.io/gh/camelot-dev/camelot/?ref=repository-badge) [![image](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black) [![image](https://img.shields.io/badge/continous%20quality-deepsource-lightgrey)](https://deepsource.io/gh/camelot-dev/camelot/?ref=repository-badge)
**Camelot** is a Python library that makes it easy for *anyone* to extract tables from PDF files! **Camelot** is a Python library that can help you extract tables from PDFs!
**Note:** You can also check out [Excalibur](https://github.com/camelot-dev/excalibur), which is a web interface for Camelot! **Note:** You can also check out [Excalibur](https://github.com/camelot-dev/excalibur), the web interface to Camelot!
--- ---
**Here's how you can extract tables from PDF files.** Check out the PDF used in this example [here](https://github.com/camelot-dev/camelot/blob/master/docs/_static/pdf/foo.pdf). **Here's how you can extract tables from PDFs.** You can check out the PDF used in this example [here](https://github.com/camelot-dev/camelot/blob/master/docs/_static/pdf/foo.pdf).
<pre> <pre>
>>> import camelot >>> import camelot
@ -46,24 +46,27 @@
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% | | 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% | | 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |
There's a [command-line interface](https://camelot-py.readthedocs.io/en/master/user/cli.html) too! Camelot also comes packaged with a [command-line interface](https://camelot-py.readthedocs.io/en/master/user/cli.html)!
**Note:** Camelot only works with text-based PDFs and not scanned documents. (As Tabula [explains](https://github.com/tabulapdf/tabula#why-tabula), "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".) **Note:** Camelot only works with text-based PDFs and not scanned documents. (As Tabula [explains](https://github.com/tabulapdf/tabula#why-tabula), "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
## Why Camelot? ## Why Camelot?
- **You are in control.**: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.) - **Configurability**: Camelot gives you control over the table extraction process with its [tweakable settings](https://camelot-py.readthedocs.io/en/master/user/advanced.html).
- *Bad* tables can be discarded based on **metrics** like accuracy and whitespace, without ever having to manually look at each table. - **Metrics**: Bad tables can be discarded based on metrics like accuracy and whitespace, without having to manually look at each table.
- Each table is a **pandas DataFrame**, which seamlessly integrates into [ETL and data analysis workflows](https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873). - **Output**: Each table is extracted into a **pandas DataFrame**, which seamlessly integrates into [ETL and data analysis workflows](https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873). You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML and Sqlite.
- **Export** to multiple formats, including JSON, Excel, HTML and Sqlite.
See [comparison with other PDF table extraction libraries and tools](https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools). See [comparison with similar libraries and tools](https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).
## Support the development
If Camelot has helped you, please consider supporting its development with a one-time or monthly donation [on OpenCollective](https://opencollective.com/camelot).
## Installation ## Installation
### Using conda ### Using conda
The easiest way to install Camelot is to install it with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution. The easiest way to install Camelot is with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.
<pre> <pre>
$ conda install -c conda-forge camelot-py $ conda install -c conda-forge camelot-py
@ -71,7 +74,7 @@ $ conda install -c conda-forge camelot-py
### Using pip ### Using pip
After [installing the dependencies](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can simply use pip to install Camelot: After [installing the dependencies](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can also just use pip to install Camelot:
<pre> <pre>
$ pip install "camelot-py[cv]" $ pip install "camelot-py[cv]"
@ -94,40 +97,16 @@ $ pip install ".[cv]"
## Documentation ## Documentation
Great documentation is available at [http://camelot-py.readthedocs.io/](http://camelot-py.readthedocs.io/). The documentation is available at [http://camelot-py.readthedocs.io/](http://camelot-py.readthedocs.io/).
## Development
The [Contributor's Guide](https://camelot-py.readthedocs.io/en/master/dev/contributing.html) has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.
### Source code
You can check the latest sources with:
<pre>
$ git clone https://www.github.com/camelot-dev/camelot
</pre>
### Setting up a development environment
You can install the development dependencies easily, using pip:
<pre>
$ pip install "camelot-py[dev]"
</pre>
### Testing
After installation, you can run tests using:
<pre>
$ python setup.py test
</pre>
## Wrappers ## Wrappers
- [camelot-php](https://github.com/randomstate/camelot-php) provides a [PHP](https://www.php.net/) wrapper on Camelot. - [camelot-php](https://github.com/randomstate/camelot-php) provides a [PHP](https://www.php.net/) wrapper on Camelot.
## Contributing
The [Contributor's Guide](https://camelot-py.readthedocs.io/en/master/dev/contributing.html) has detailed information about contributing issues, documentation, code, and tests.
## Versioning ## Versioning
Camelot uses [Semantic Versioning](https://semver.org/). For the available versions, see the tags on this repository. For the changelog, you can check out [HISTORY.md](https://github.com/camelot-dev/camelot/blob/master/HISTORY.md). Camelot uses [Semantic Versioning](https://semver.org/). For the available versions, see the tags on this repository. For the changelog, you can check out [HISTORY.md](https://github.com/camelot-dev/camelot/blob/master/HISTORY.md).
@ -135,9 +114,3 @@ Camelot uses [Semantic Versioning](https://semver.org/). For the available versi
## License ## License
This project is licensed under the MIT License, see the [LICENSE](https://github.com/camelot-dev/camelot/blob/master/LICENSE) file for details. This project is licensed under the MIT License, see the [LICENSE](https://github.com/camelot-dev/camelot/blob/master/LICENSE) file for details.
## Support the development
You can support our work on Camelot with a one-time or monthly donation [on OpenCollective](https://opencollective.com/camelot). Organizations who use camelot can also sponsor the project for an acknowledgement on [our documentation site](https://camelot-py.readthedocs.io/en/master/) and this README.
Special thanks to all the users, organizations and contributors that support Camelot!

@ -70,7 +70,8 @@ class PDFHandler(object):
if pages == "1": if pages == "1":
page_numbers.append({"start": 1, "end": 1}) page_numbers.append({"start": 1, "end": 1})
else: else:
infile = PdfFileReader(open(filepath, "rb"), strict=False) instream = open(filepath, "rb")
infile = PdfFileReader(instream, strict=False)
if infile.isEncrypted: if infile.isEncrypted:
infile.decrypt(self.password) infile.decrypt(self.password)
if pages == "all": if pages == "all":
@ -84,6 +85,7 @@ class PDFHandler(object):
page_numbers.append({"start": int(a), "end": int(b)}) page_numbers.append({"start": int(a), "end": int(b)})
else: else:
page_numbers.append({"start": int(r), "end": int(r)}) page_numbers.append({"start": int(r), "end": int(r)})
instream.close()
P = [] P = []
for p in page_numbers: for p in page_numbers:
P.extend(range(p["start"], p["end"] + 1)) P.extend(range(p["start"], p["end"] + 1))
@ -122,7 +124,8 @@ class PDFHandler(object):
if rotation != "": if rotation != "":
fpath_new = "".join([froot.replace("page", "p"), "_rotated", fext]) fpath_new = "".join([froot.replace("page", "p"), "_rotated", fext])
os.rename(fpath, fpath_new) os.rename(fpath, fpath_new)
infile = PdfFileReader(open(fpath_new, "rb"), strict=False) instream = open(fpath_new, "rb")
infile = PdfFileReader(instream, strict=False)
if infile.isEncrypted: if infile.isEncrypted:
infile.decrypt(self.password) infile.decrypt(self.password)
outfile = PdfFileWriter() outfile = PdfFileWriter()
@ -134,6 +137,7 @@ class PDFHandler(object):
outfile.addPage(p) outfile.addPage(p)
with open(fpath, "wb") as f: with open(fpath, "wb") as f:
outfile.write(f) outfile.write(f)
instream.close()
def parse( def parse(
self, flavor="lattice", suppress_stdout=False, layout_kwargs={}, **kwargs self, flavor="lattice", suppress_stdout=False, layout_kwargs={}, **kwargs
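
The change above keeps a reference to the opened stream so that it can be closed once the reader is done with it. A `with` block is a common alternative that also closes the stream if an exception is raised partway through. The sketch below uses only the standard library; `read_header` is a hypothetical helper for illustration and is not part of Camelot or PyPDF2.

```python
import os
import tempfile


def read_header(filepath, nbytes=5):
    """Read the first bytes of a file and report whether the stream was closed."""
    # The with block closes instream on exit, even if read() raises.
    with open(filepath, "rb") as instream:
        header = instream.read(nbytes)
    return header, instream.closed


# Write a tiny stand-in file so the example runs without a real PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    tmp_path = tmp.name

header, closed = read_header(tmp_path)
print(header, closed)
os.remove(tmp_path)
```

Whether to restructure the handler this way is a style choice; the explicit `close()` calls in the diff achieve the same result on the normal path.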

@ -121,6 +121,7 @@ class Stream(BaseParser):
row_y = 0 row_y = 0
rows = [] rows = []
temp = [] temp = []
for t in text: for t in text:
# is checking for upright necessary? # is checking for upright necessary?
# if t.get_text().strip() and all([obj.upright for obj in t._objs if # if t.get_text().strip() and all([obj.upright for obj in t._objs if
@ -131,8 +132,10 @@ class Stream(BaseParser):
temp = [] temp = []
row_y = t.y0 row_y = t.y0
temp.append(t) temp.append(t)
rows.append(sorted(temp, key=lambda t: t.x0)) rows.append(sorted(temp, key=lambda t: t.x0))
__ = rows.pop(0) # TODO: hacky if len(rows) > 1:
__ = rows.pop(0) # TODO: hacky
return rows return rows
@staticmethod @staticmethod
@ -345,43 +348,46 @@ class Stream(BaseParser):
else: else:
# calculate mode of the list of number of elements in # calculate mode of the list of number of elements in
# each row to guess the number of columns # each row to guess the number of columns
ncols = max(set(elements), key=elements.count) if not len(elements):
if ncols == 1: cols = [(text_x_min, text_x_max)]
# if mode is 1, the page usually contains no tables else:
# but there can be cases where the list can be skewed, ncols = max(set(elements), key=elements.count)
# try to remove all 1s from list in this case and if ncols == 1:
# see if the list contains elements, if yes, then use # if mode is 1, the page usually contains no tables
# the mode after removing 1s # but there can be cases where the list can be skewed,
elements = list(filter(lambda x: x != 1, elements)) # try to remove all 1s from list in this case and
if len(elements): # see if the list contains elements, if yes, then use
ncols = max(set(elements), key=elements.count) # the mode after removing 1s
else: elements = list(filter(lambda x: x != 1, elements))
warnings.warn( if len(elements):
f"No tables found in table area {table_idx + 1}" ncols = max(set(elements), key=elements.count)
else:
warnings.warn(
f"No tables found in table area {table_idx + 1}"
)
cols = [(t.x0, t.x1) for r in rows_grouped if len(r) == ncols for t in r]
cols = self._merge_columns(sorted(cols), column_tol=self.column_tol)
inner_text = []
for i in range(1, len(cols)):
left = cols[i - 1][1]
right = cols[i][0]
inner_text.extend(
[
t
for direction in self.t_bbox
for t in self.t_bbox[direction]
if t.x0 > left and t.x1 < right
]
) )
cols = [(t.x0, t.x1) for r in rows_grouped if len(r) == ncols for t in r] outer_text = [
cols = self._merge_columns(sorted(cols), column_tol=self.column_tol) t
inner_text = [] for direction in self.t_bbox
for i in range(1, len(cols)): for t in self.t_bbox[direction]
left = cols[i - 1][1] if t.x0 > cols[-1][1] or t.x1 < cols[0][0]
right = cols[i][0] ]
inner_text.extend( inner_text.extend(outer_text)
[ cols = self._add_columns(cols, inner_text, self.row_tol)
t cols = self._join_columns(cols, text_x_min, text_x_max)
for direction in self.t_bbox
for t in self.t_bbox[direction]
if t.x0 > left and t.x1 < right
]
)
outer_text = [
t
for direction in self.t_bbox
for t in self.t_bbox[direction]
if t.x0 > cols[-1][1] or t.x1 < cols[0][0]
]
inner_text.extend(outer_text)
cols = self._add_columns(cols, inner_text, self.row_tol)
cols = self._join_columns(cols, text_x_min, text_x_max)
return cols, rows return cols, rows
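
The guard added above avoids calling `max()` on an empty list, which raises `ValueError`. Below is a minimal sketch of the mode-based guess, assuming `elements` holds the number of text elements found in each row. `guess_ncols` is a hypothetical name, and returning `None` for the empty case is a simplification of the diff's fallback to a single column spanning the text.

```python
def guess_ncols(elements):
    """Guess the column count as the mode of per-row element counts."""
    if not elements:
        # Nothing to vote on; the caller falls back to a single column.
        return None
    ncols = max(set(elements), key=elements.count)
    if ncols == 1:
        # A mode of 1 usually means no table, but the list may be skewed,
        # so retry with the single-element rows removed.
        rest = [n for n in elements if n != 1]
        if rest:
            ncols = max(set(rest), key=rest.count)
    return ncols


print(guess_ncols([]))
print(guess_ncols([1, 1, 1, 3, 3]))
print(guess_ncols([2, 2, 4]))
```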

@ -353,7 +353,7 @@ def text_in_bbox(bbox, text):
Returns Returns
------- -------
t_bbox : list t_bbox : list
List of PDFMiner text objects that lie inside table. List of PDFMiner text objects that lie inside table, discarding the overlapping ones
""" """
lb = (bbox[0], bbox[1]) lb = (bbox[0], bbox[1])
@ -364,7 +364,97 @@ def text_in_bbox(bbox, text):
if lb[0] - 2 <= (t.x0 + t.x1) / 2.0 <= rt[0] + 2 if lb[0] - 2 <= (t.x0 + t.x1) / 2.0 <= rt[0] + 2
and lb[1] - 2 <= (t.y0 + t.y1) / 2.0 <= rt[1] + 2 and lb[1] - 2 <= (t.y0 + t.y1) / 2.0 <= rt[1] + 2
] ]
return t_bbox
# Avoid duplicate text by discarding overlapping boxes
rest = {t for t in t_bbox}
for ba in t_bbox:
for bb in rest.copy():
if ba == bb:
continue
if bbox_intersect(ba, bb):
# if the intersection is larger than 80% of ba's size, we keep the longest
if (bbox_intersection_area(ba, bb) / bbox_area(ba)) > 0.8:
if bbox_longer(bb, ba):
rest.discard(ba)
unique_boxes = list(rest)
return unique_boxes
def bbox_intersection_area(ba, bb) -> float:
"""Returns area of the intersection of the bounding boxes of two PDFMiner objects.
Parameters
----------
ba : PDFMiner text object
bb : PDFMiner text object
Returns
-------
intersection_area : float
Area of the intersection of the bounding boxes of both objects
"""
x_left = max(ba.x0, bb.x0)
y_top = min(ba.y1, bb.y1)
x_right = min(ba.x1, bb.x1)
y_bottom = max(ba.y0, bb.y0)
if x_right < x_left or y_bottom > y_top:
return 0.0
intersection_area = (x_right - x_left) * (y_top - y_bottom)
return intersection_area
def bbox_area(bb) -> float:
"""Returns area of the bounding box of a PDFMiner object.
Parameters
----------
bb : PDFMiner text object
Returns
-------
area : float
Area of the bounding box of the object
"""
return (bb.x1 - bb.x0) * (bb.y1 - bb.y0)
def bbox_intersect(ba, bb) -> bool:
"""Returns True if the bounding boxes of two PDFMiner objects intersect.
Parameters
----------
ba : PDFMiner text object
bb : PDFMiner text object
Returns
-------
overlaps : bool
True if the bounding boxes intersect
"""
return ba.x1 >= bb.x0 and bb.x1 >= ba.x0 and ba.y1 >= bb.y0 and bb.y1 >= ba.y0
def bbox_longer(ba, bb) -> bool:
"""Returns True if the bounding box of the first PDFMiner object is longer or equal to the second.
Parameters
----------
ba : PDFMiner text object
bb : PDFMiner text object
Returns
-------
longer : bool
True if the bounding box of the first object is longer or equal
"""
return (ba.x1 - ba.x0) >= (bb.x1 - bb.x0)
def merge_close_lines(ar, line_tol=2): def merge_close_lines(ar, line_tol=2):
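
The overlap filter added to `text_in_bbox` can be exercised without PDFMiner. The sketch below mirrors its logic with a hypothetical `Box` named tuple standing in for PDFMiner text objects. The 0.8 threshold is the one used in the diff; the helper names and sample boxes are illustrative only.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object: only the bbox corners.
Box = namedtuple("Box", ["x0", "y0", "x1", "y1"])


def area(b):
    return (b.x1 - b.x0) * (b.y1 - b.y0)


def intersection_area(a, b):
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    if width < 0 or height < 0:
        return 0.0
    return width * height


def drop_overlapping(boxes, threshold=0.8):
    """Discard a box when a box at least as wide covers more than `threshold` of it."""
    kept = set(boxes)
    for a in boxes:
        # Compare only against boxes that have not been discarded yet.
        for b in kept.copy():
            if a == b:
                continue
            covered = intersection_area(a, b) / area(a)
            if covered > threshold and (b.x1 - b.x0) >= (a.x1 - a.x0):
                kept.discard(a)
    return sorted(kept)


boxes = [
    Box(0, 0, 10, 2),   # full text line
    Box(0, 0, 9, 2),    # near-duplicate lying inside the first box
    Box(20, 0, 30, 2),  # separate cell
]
print(drop_overlapping(boxes))
```

Like the code in the diff, this divides by the area of the first box, so a zero-area box would need a separate guard.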
@ -411,7 +501,7 @@ def text_strip(text, strip=""):
return text return text
stripped = re.sub( stripped = re.sub(
fr"[{''.join(map(re.escape, strip))}]", "", text, re.UNICODE fr"[{''.join(map(re.escape, strip))}]", "", text, flags=re.UNICODE
) )
return stripped return stripped
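
The reason for the keyword argument: the fourth positional parameter of `re.sub` is `count`, not `flags`, so passing `re.UNICODE` positionally caps the number of substitutions at the flag's integer value instead of setting a flag. A short demonstration with a made-up input string:

```python
import re

text = "a-" * 40  # 40 hyphens to strip
pattern = r"-"

# Keyword form: every hyphen is removed.
as_flags = re.sub(pattern, "", text, flags=re.UNICODE)

# Equivalent of the old positional call: the flag value lands in count,
# so only that many hyphens are removed.
as_count = re.sub(pattern, "", text, count=int(re.UNICODE))

print(len(as_flags), len(as_count))
```

With short inputs the old call appears to work, which is why this kind of bug tends to surface only on long strings.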

@ -63,7 +63,7 @@ master_doc = 'index'
# General information about the project. # General information about the project.
project = u'Camelot' project = u'Camelot'
copyright = u'2019, Camelot Developers' copyright = u'2020, Camelot Developers'
author = u'Vinayak Mehta' author = u'Vinayak Mehta'
# The version info for the project you're documenting, acts as replacement for # The version info for the project you're documenting, acts as replacement for

@ -36,15 +36,15 @@ Release v\ |version|. (:ref:`Installation <install>`)
.. image:: https://img.shields.io/badge/continous%20quality-deepsource-lightgrey .. image:: https://img.shields.io/badge/continous%20quality-deepsource-lightgrey
:target: https://deepsource.io/gh/camelot-dev/camelot/?ref=repository-badge :target: https://deepsource.io/gh/camelot-dev/camelot/?ref=repository-badge
**Camelot** is a Python library that makes it easy for *anyone* to extract tables from PDF files! **Camelot** is a Python library that can help you extract tables from PDFs!
.. note:: You can also check out `Excalibur`_, which is a web interface for Camelot! .. note:: You can also check out `Excalibur`_, the web interface to Camelot!
.. _Excalibur: https://github.com/camelot-dev/excalibur .. _Excalibur: https://github.com/camelot-dev/excalibur
---- ----
**Here's how you can extract tables from PDF files.** Check out the PDF used in this example `here`_. **Here's how you can extract tables from PDFs.** You can check out the PDF used in this example `here`_.
.. _here: _static/pdf/foo.pdf .. _here: _static/pdf/foo.pdf
@ -70,7 +70,7 @@ Release v\ |version|. (:ref:`Installation <install>`)
.. csv-table:: .. csv-table::
:file: _static/csv/foo.csv :file: _static/csv/foo.csv
There's a :ref:`command-line interface <cli>` too! Camelot also comes packaged with a :ref:`command-line interface <cli>`!
.. note:: Camelot only works with text-based PDFs and not scanned documents. (As Tabula `explains`_, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".) .. note:: Camelot only works with text-based PDFs and not scanned documents. (As Tabula `explains`_, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
@ -79,27 +79,27 @@ There's a :ref:`command-line interface <cli>` too!
Why Camelot? Why Camelot?
------------ ------------
- **You are in control.** Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.) - **Configurability**: Camelot gives you control over the table extraction process with its :ref:`tweakable settings <advanced>`.
- *Bad* tables can be discarded based on **metrics** like accuracy and whitespace, without ever having to manually look at each table. - **Metrics**: Bad tables can be discarded based on metrics like accuracy and whitespace, without having to manually look at each table.
- Each table is a **pandas DataFrame**, which seamlessly integrates into `ETL and data analysis workflows`_. - **Output**: Each table is extracted into a **pandas DataFrame**, which seamlessly integrates into `ETL and data analysis workflows`_. You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML and Sqlite.
- **Export** to multiple formats, including JSON, Excel and HTML.
See `comparison with other PDF table extraction libraries and tools`_.
.. _ETL and data analysis workflows: https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873 .. _ETL and data analysis workflows: https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873
.. _comparison with other PDF table extraction libraries and tools: https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools
Support us on OpenCollective See `comparison with similar libraries and tools`_.
----------------------------
If Camelot helped you extract tables from PDFs, please consider supporting its development by `becoming a backer or a sponsor on OpenCollective`_! .. _comparison with similar libraries and tools: https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools
.. _becoming a backer or a sponsor on OpenCollective: https://opencollective.com/camelot Support the development
-----------------------
If Camelot has helped you, please consider supporting its development with a one-time or monthly donation `on OpenCollective`_!
.. _on OpenCollective: https://opencollective.com/camelot
The User Guide The User Guide
-------------- --------------
This part of the documentation begins with some background information about why Camelot was created, takes a small dip into the implementation details and then focuses on step-by-step instructions for getting the most out of Camelot. This part of the documentation begins with some background information about why Camelot was created, takes you through some implementation details, and then focuses on step-by-step instructions for getting the most out of Camelot.
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2
@ -115,8 +115,7 @@ This part of the documentation begins with some background information about why
The API Documentation/Guide The API Documentation/Guide
--------------------------- ---------------------------
If you are looking for information on a specific function, class, or method, If you are looking for information on a specific function, class, or method, this part of the documentation is for you.
this part of the documentation is for you.
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2
@ -126,8 +125,7 @@ this part of the documentation is for you.
The Contributor Guide The Contributor Guide
--------------------- ---------------------
If you want to contribute to the project, this part of the documentation is for If you want to contribute to the project, this part of the documentation is for you.
you.
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2

@ -3,72 +3,59 @@
Installation of dependencies Installation of dependencies
============================ ============================
The dependencies `Tkinter`_ and `ghostscript`_ can be installed using your system's package manager. You can run one of the following, based on your OS. The dependencies `Ghostscript <https://www.ghostscript.com>`_ and `Tkinter <https://wiki.python.org/moin/TkInter>`_ can be installed using your system's package manager or by running their installer.
.. _Tkinter: https://wiki.python.org/moin/TkInter
.. _ghostscript: https://www.ghostscript.com
OS-specific instructions OS-specific instructions
------------------------ ------------------------
For Ubuntu Ubuntu
^^^^^^^^^^ ^^^^^^
:: ::
$ apt install python-tk ghostscript $ apt install ghostscript python3-tk
Or for Python 3:: MacOS
^^^^^
$ apt install python3-tk ghostscript
For macOS
^^^^^^^^^
:: ::
$ brew install tcl-tk ghostscript $ brew install ghostscript tcl-tk
For Windows Windows
^^^^^^^^^^^ ^^^^^^^
For Tkinter, you can download the `ActiveTcl Community Edition`_ from ActiveState. For ghostscript, you can get the installer at the `ghostscript downloads page`_. For Ghostscript, you can get the installer at their `downloads page <https://www.ghostscript.com/download/gsdnld.html>`_. And for Tkinter, you can download the `ActiveTcl Community Edition <https://www.activestate.com/activetcl/downloads>`_ from ActiveState.
.. _ActiveTcl Community Edition: https://www.activestate.com/activetcl/downloads Checks to see if dependencies are installed correctly
.. _ghostscript downloads page: https://www.ghostscript.com/download/gsdnld.html -----------------------------------------------------
.. _as shown here: https://java.com/en/download/help/path.xml
Checks to see if dependencies were installed correctly You can run the following checks to see if the dependencies were installed correctly.
------------------------------------------------------
You can do the following checks to see if the dependencies were installed correctly. For Ghostscript
^^^^^^^^^^^^^^^
Open the Python REPL and run the following:
For Ubuntu/MacOS::
>>> from ctypes.util import find_library
>>> find_library("gs")
"libgs.so.9"
For Windows::
>>> import ctypes
>>> from ctypes.util import find_library
>>> find_library("".join(("gsdll", str(ctypes.sizeof(ctypes.c_voidp) * 8), ".dll")))
<name-of-ghostscript-library-on-windows>
**Check:** The output of the ``find_library`` function should not be empty.
If the output is empty, the Ghostscript library may not be available on any of the directories listed in the ``LD_LIBRARY_PATH``/``DYLD_LIBRARY_PATH``/``PATH`` variable for your operating system. In this case, you may have to modify the relevant path variable.
For Tkinter For Tkinter
^^^^^^^^^^^ ^^^^^^^^^^^
Launch Python, and then at the prompt, type:: Launch Python and then import Tkinter::
>>> import Tkinter
Or in Python 3::
>>> import tkinter >>> import tkinter
If you have Tkinter, Python will not print an error message, and if not, you will see an ``ImportError``. **Check:** Importing ``tkinter`` should not raise an import error.
For ghostscript
^^^^^^^^^^^^^^^
Run the following to check the ghostscript version.
For Ubuntu/macOS::
$ gs -version
For Windows::
C:\> gswin64c.exe -version
Or for Windows 32-bit::
C:\> gswin32c.exe -version
If you have ghostscript, you should see the ghostscript version and copyright information.
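
The `find_library` checks added in this file can be folded into one script. This is a sketch, not part of Camelot: `ghostscript_library_name` is a hypothetical helper, and the library names are the ones used in the documentation snippets.

```python
import ctypes
import sys
from ctypes.util import find_library


def ghostscript_library_name(platform=sys.platform):
    """Return the name to pass to find_library for the given platform."""
    if platform.startswith("win"):
        # Pointer size in bits selects gsdll32.dll or gsdll64.dll.
        bits = ctypes.sizeof(ctypes.c_void_p) * 8
        return f"gsdll{bits}.dll"
    return "gs"


name = ghostscript_library_name()
found = find_library(name)
if found:
    print(f"Ghostscript library found: {found}")
else:
    print(f"Ghostscript library {name!r} not found; check your library path")
```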

@ -5,42 +5,35 @@ Installation of Camelot
This part of the documentation covers the steps to install Camelot. This part of the documentation covers the steps to install Camelot.
Using conda After :ref:`installing the dependencies <install_deps>`, which include `Ghostscript <https://www.ghostscript.com>`_ and `Tkinter <https://wiki.python.org/moin/TkInter>`_, you can use one of the following methods to install Camelot:
-----------
The easiest way to install Camelot is to install it with `conda`_, which is a package manager and environment management system for the `Anaconda`_ distribution. .. warning:: The ``lattice`` flavor will fail to run if Ghostscript is not installed. You may run into errors as shown in `issue #193 <https://github.com/camelot-dev/camelot/issues/193>`_.
::
$ conda install -c conda-forge camelot-py pip
---
.. note:: Camelot is available for Python 2.7, 3.5, 3.6 and 3.7 on Linux, macOS and Windows. For Windows, you will need to install ghostscript which you can get from their `downloads page`_. To install Camelot from PyPI using ``pip``, please include the extra ``cv`` requirement as shown::
.. _conda: https://conda.io/docs/
.. _Anaconda: http://docs.continuum.io/anaconda/
.. _downloads page: https://www.ghostscript.com/download/gsdnld.html
.. _conda-forge: https://conda-forge.org/
Using pip
---------
After :ref:`installing the dependencies <install_deps>`, which include `Tkinter`_ and `ghostscript`_, you can simply use pip to install Camelot::
$ pip install "camelot-py[cv]" $ pip install "camelot-py[cv]"
.. _Tkinter: https://wiki.python.org/moin/TkInter conda
.. _ghostscript: https://www.ghostscript.com -----
`conda`_ is a package manager and environment management system for the `Anaconda <https://anaconda.org>`_ distribution. It can be used to install Camelot from the ``conda-forge`` channel::
$ conda install -c conda-forge camelot-py
From the source code From the source code
-------------------- --------------------
After :ref:`installing the dependencies <install_deps>`, you can install from the source by: After :ref:`installing the dependencies <install_deps>`, you can install Camelot from source by:
1. Cloning the GitHub repository. 1. Cloning the GitHub repository.
:: ::
$ git clone https://www.github.com/camelot-dev/camelot $ git clone https://www.github.com/camelot-dev/camelot
2. Then simply using pip again. 2. And then simply using pip again.
:: ::
$ cd camelot $ cd camelot

@@ -2798,3 +2798,51 @@ data_stream_layout_kwargs = [
     ["A.O.P Cornas", ""],
     ["Domaine Lionnet « Terre Brûlée » 2012", "15 €"],
 ]
+data_stream_duplicated_text = [
+    ['', '2012 BETTER VARIETIES Harvest Report for Minnesota Central [ MNCE ]', '', '', '', '', '', '', '', '',
+     'ALL SEASON TEST'],
+    ['', 'Doug Toreen, Renville County, MN 55310 [ BIRD ISLAND ]', '', '', '', '', '', '', '', '',
+     '1.3 - 2.0 MAT. GROUP'],
+    ['PREV. CROP/HERB:', 'Corn / Surpass, Roundup', '', '', '', '', '', '', '', '', 'S2MNCE01'],
+    ['SOIL DESCRIPTION:', '', 'Canisteo clay loam, mod. well drained, non-irrigated', '', '', '', '', '', '', '', ''],
+    ['SOIL CONDITIONS:', '', 'High P, high K, 6.7 pH, 3.9% OM, Low SCN', '', '', '', '', '', '', '', '30" ROW SPACING'],
+    ['TILLAGE/CULTIVATION:', 'conventional w/ fall till', '', '', '', '', '', '', '', '', ''],
+    ['PEST MANAGEMENT:', 'Roundup twice', '', '', '', '', '', '', '', '', ''],
+    ['SEEDED - RATE:', 'May 15', '140 000 /A', '', '', '', '', '', '', 'TOP 30 for YIELD of 63 TESTED', ''],
+    ['HARVESTED - STAND:', 'Oct 3', '122 921 /A', '', '', '', '', '', '', 'AVERAGE of (3) REPLICATIONS', ''],
+    ['', '', '', '', 'SCN', 'Seed', 'Yield', 'Moisture', 'Lodging', 'Stand', 'Gross'],
+    ['Company/Brand', 'Product/Brand†', 'Technol.†', 'Mat.', 'Resist.', 'Trmt.†', 'Bu/A', '%', '%', '(x 1000)',
+     'Income'], ['Kruger', 'K2 1901', 'RR2Y', '1.9', 'R', 'Ac,PV', '56.4', '7.6', '0', '126.3', '$846'],
+    ['Stine', '19RA02 §', 'RR2Y', '1.9', 'R', 'CMB', '55.3', '7.6', '0', '120.0', '$830'],
+    ['Wensman', 'W 3190NR2', 'RR2Y', '1.9', 'R', 'Ac', '54.5', '7.6', '0', '119.5', '$818'],
+    ['Hefty', 'H17Y12', 'RR2Y', '1.7', 'MR', 'I', '53.7', '7.7', '0', '124.4', '$806'],
+    ['Dyna-Gro', 'S15RY53', 'RR2Y', '1.5', 'R', 'Ac', '53.6', '7.7', '0', '126.8', '$804'],
+    ['LG Seeds', 'C2050R2', 'RR2Y', '2.1', 'R', 'Ac', '53.6', '7.7', '0', '123.9', '$804'],
+    ['Titan Pro', '19M42', 'RR2Y', '1.9', 'R', 'CMB', '53.6', '7.7', '0', '121.0', '$804'],
+    ['Stine', '19RA02 (2) §', 'RR2Y', '1.9', 'R', 'CMB', '53.4', '7.7', '0', '123.9', '$801'],
+    ['Asgrow', 'AG1832 §', 'RR2Y', '1.8', 'MR', 'Ac,PV', '52.9', '7.7', '0', '122.0', '$794'],
+    ['Prairie Brand', 'PB-1566R2', 'RR2Y', '1.5', 'R', 'CMB', '52.8', '7.7', '0', '122.9', '$792'],
+    ['Channel', '1901R2', 'RR2Y', '1.9', 'R', 'Ac,PV', '52.8', '7.6', '0', '123.4', '$791'],
+    ['Titan Pro', '20M1', 'RR2Y', '2.0', 'R', 'Am', '52.5', '7.5', '0', '124.4', '$788'],
+    ['Kruger', 'K2-2002', 'RR2Y', '2.0', 'R', 'Ac,PV', '52.4', '7.9', '0', '125.4', '$786'],
+    ['Channel', '1700R2', 'RR2Y', '1.7', 'R', 'Ac,PV', '52.3', '7.9', '0', '123.9', '$784'],
+    ['Hefty', 'H16Y11', 'RR2Y', '1.6', 'MR', 'I', '51.4', '7.6', '0', '123.9', '$771'],
+    ['Anderson', '162R2Y', 'RR2Y', '1.6', 'R', 'None', '51.3', '7.5', '0', '119.5', '$770'],
+    ['Titan Pro', '15M22', 'RR2Y', '1.5', 'R', 'CMB', '51.3', '7.8', '0', '125.4', '$769'],
+    ['Dairyland', 'DSR-1710R2Y', 'RR2Y', '1.7', 'R', 'CMB', '51.3', '7.7', '0', '122.0', '$769'],
+    ['Hefty', 'H20R3', 'RR2Y', '2.0', 'MR', 'I', '50.5', '8.2', '0', '121.0', '$757'],
+    ['Prairie Brand', 'PB 1743R2', 'RR2Y', '1.7', 'R', 'CMB', '50.2', '7.7', '0', '125.8', '$752'],
+    ['Gold Country', '1741', 'RR2Y', '1.7', 'R', 'Ac', '50.1', '7.8', '0', '123.9', '$751'],
+    ['Trelay', '20RR43', 'RR2Y', '2.0', 'R', 'Ac,Ex', '49.9', '7.6', '0', '127.8', '$749'],
+    ['Hefty', 'H14R3', 'RR2Y', '1.4', 'MR', 'I', '49.7', '7.7', '0', '122.9', '$746'],
+    ['Prairie Brand', 'PB-2099NRR2', 'RR2Y', '2.0', 'R', 'CMB', '49.6', '7.8', '0', '126.3', '$743'],
+    ['Wensman', 'W 3174NR2', 'RR2Y', '1.7', 'R', 'Ac', '49.3', '7.6', '0', '122.5', '$740'],
+    ['Kruger', 'K2 1602', 'RR2Y', '1.6', 'R', 'Ac,PV', '48.7', '7.6', '0', '125.4', '$731'],
+    ['NK Brand', 'S18-C2 §', 'RR2Y', '1.8', 'R', 'CMB', '48.7', '7.7', '0', '126.8', '$731'],
+    ['Kruger', 'K2 1902', 'RR2Y', '1.9', 'R', 'Ac,PV', '48.7', '7.5', '0', '124.4', '$730'],
+    ['Prairie Brand', 'PB-1823R2', 'RR2Y', '1.8', 'R', 'None', '48.5', '7.6', '0', '121.0', '$727'],
+    ['Gold Country', '1541', 'RR2Y', '1.5', 'R', 'Ac', '48.4', '7.6', '0', '110.4', '$726'],
+    ['', '', '', '', '', 'Test Average =', '47.6', '7.7', '0', '122.9', '$713'],
+    ['', '', '', '', '', 'LSD (0.10) =', '5.7', '0.3', 'ns', '37.8', '566.4']
+]

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.


@@ -160,8 +160,8 @@ def test_cli_output_format():
 def test_cli_quiet():
     with TemporaryDirectory() as tempdir:
-        infile = os.path.join(testdir, "blank.pdf")
-        outfile = os.path.join(tempdir, "blank.csv")
+        infile = os.path.join(testdir, "empty.pdf")
+        outfile = os.path.join(tempdir, "empty.csv")
         runner = CliRunner()
         result = runner.invoke(


@@ -314,3 +314,11 @@ def test_version_generation_with_prerelease_revision():
         generate_version(version, prerelease=prerelease, revision=revision)
         == "0.7.3-alpha.2"
     )
+def test_stream_duplicated_text():
+    df = pd.DataFrame(data_stream_duplicated_text)
+    filename = os.path.join(testdir, "birdisland.pdf")
+    tables = camelot.read_pdf(filename, flavor="stream")
+    assert_frame_equal(df, tables[0].df)
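
Reviewer note: `test_stream_duplicated_text` builds its expected frame straight from the `data_stream_duplicated_text` fixture, so a row with a missing or extra cell would make the comparison fail for reasons unrelated to extraction. The sketch below is a hypothetical pre-check, not part of Camelot or of this commit; `column_counts` is an illustrative name, and the two sample rows are copied from the fixture.

```python
def column_counts(rows):
    """Return the set of distinct row lengths in a list-of-lists fixture."""
    return {len(row) for row in rows}


# Two rows copied from data_stream_duplicated_text (header row and one data row).
sample_rows = [
    ['', '', '', '', 'SCN', 'Seed', 'Yield', 'Moisture', 'Lodging', 'Stand', 'Gross'],
    ['Stine', '19RA02 §', 'RR2Y', '1.9', 'R', 'CMB', '55.3', '7.6', '0', '120.0', '$830'],
]

# A rectangular fixture yields exactly one distinct row length.
widths = column_counts(sample_rows)
is_rectangular = len(widths) == 1
```

Running the same check over the full fixture before constructing the frame would catch a transcription slip in the expected data early.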


@@ -55,15 +55,33 @@ def test_image_warning():
         )
-def test_no_tables_found():
-    filename = os.path.join(testdir, "blank.pdf")
+def test_lattice_no_tables_on_page():
+    filename = os.path.join(testdir, "empty.pdf")
     with warnings.catch_warnings():
         warnings.simplefilter("error")
         with pytest.raises(UserWarning) as e:
-            tables = camelot.read_pdf(filename)
+            tables = camelot.read_pdf(filename, flavor="lattice")
         assert str(e.value) == "No tables found on page-1"
+def test_stream_no_tables_on_page():
+    filename = os.path.join(testdir, "empty.pdf")
+    with warnings.catch_warnings():
+        warnings.simplefilter("error")
+        with pytest.raises(UserWarning) as e:
+            tables = camelot.read_pdf(filename, flavor="stream")
+        assert str(e.value) == "No tables found on page-1"
+def test_stream_no_tables_in_area():
+    filename = os.path.join(testdir, "only_page_number.pdf")
+    with warnings.catch_warnings():
+        warnings.simplefilter("error")
+        with pytest.raises(UserWarning) as e:
+            tables = camelot.read_pdf(filename, flavor="stream")
+        assert str(e.value) == "No tables found in table area 1"
 def test_no_tables_found_logs_suppressed():
     filename = os.path.join(testdir, "foo.pdf")
     with warnings.catch_warnings():
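
Reviewer note: the no-table tests in this hunk depend on `warnings.simplefilter("error")` escalating a warning into an exception whose message can then be inspected. Below is a minimal standard-library sketch of that mechanism. `find_tables` is a hypothetical stand-in for `camelot.read_pdf`, and a plain `try`/`except` takes the place of `pytest.raises`; neither helper exists in Camelot.

```python
import warnings


def find_tables(items):
    """Hypothetical parser stand-in: warns when there is nothing to return."""
    if not items:
        # warnings.warn with a plain string emits a UserWarning by default.
        warnings.warn("No tables found on page-1")
    return list(items)


def warning_message_as_error(func, *args):
    """Call func with warnings escalated to exceptions.

    Returns the warning message if a UserWarning was raised, else None.
    """
    with warnings.catch_warnings():
        # Inside this context every warning is raised as an exception.
        warnings.simplefilter("error")
        try:
            func(*args)
        except UserWarning as exc:
            return str(exc)
    return None


empty_result = warning_message_as_error(find_tables, [])
found_result = warning_message_as_error(find_tables, ["table-1"])
```

Because `catch_warnings` restores the previous filter state on exit, the escalation stays local to each test and does not leak into the rest of the suite.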
@@ -77,7 +95,7 @@ def test_no_tables_found_logs_suppressed():
 def test_no_tables_found_warnings_suppressed():
-    filename = os.path.join(testdir, "blank.pdf")
+    filename = os.path.join(testdir, "empty.pdf")
     with warnings.catch_warnings():
         # the test should fail if any warning is thrown
         warnings.simplefilter("error")