75 Commits (42f8321c8caaf42dcb99bb6ff2c9211f02f65927)

Author SHA1 Message Date
Frh 9c971a18f0 Linting 2020-06-14 12:36:24 -07:00
Frh 92322e1545 Address post-merge linting issues. 2020-06-14 12:21:01 -07:00
Frh b43aca8ff5 Merge branch 'master' into hybrid-parser 2020-06-14 08:53:43 -07:00
Frh 9abdd00cec Enable process_background option for hybrid
Trim empty cols and lines
2020-06-11 17:20:37 -07:00
Frh 4a761611bf WIP: Introduce actual hybrid parser
Create hybrid parser leveraging both lattice and network techniques.
Simplify plotting of pdf in lattice case.
Rename "parser.table_bbox" into "parser.table_bbox_parses", since it
represents not a bbox but a dict of bbox to corresponding parsing data.

Still missing: more unit tests, plotting of steps.
2020-06-11 17:20:37 -07:00
Frh 4b3eee4b05 Linting 2020-06-11 17:20:37 -07:00
Frh 55fd459634 Minor linting 2020-06-11 17:20:37 -07:00
Frh ada4809a59 Improve column detection for hybrid flavor
No longer rely on the mode but on the parsing analysis during network
detection.
Added unit test for complex table with vertical header and mixed
horizontal / vertical text.
2020-06-11 17:20:37 -07:00
Frh e31e978ebe Fix off-by-one error in column identification 2020-06-11 17:20:37 -07:00
Frh 21dc6a46a0 Improve hybrid table body discovery algo
While searching for table body boundaries, exclude rows that include
cells crossing previously discovered rows.
2020-06-11 17:20:37 -07:00
Frh e1572a10c9 Linting 2020-06-11 17:20:36 -07:00
Frh 9eb4f65fc9 Remove f-strings, fix url based unit tests
f-strings fail unit tests in Python <3.6, replaced them with .format.
Made download_url simulate Mozilla/5.0 to restore unit tests, since
the targeted server was returning 403.
2020-06-11 17:20:36 -07:00
Frh a401d33fd9 Refactor out _text_bbox 2020-06-11 17:20:36 -07:00
Frh 7ad5b843ab Move generic code to utils 2020-06-11 17:20:36 -07:00
Frh 14cd328644 Refactor common code hybrid / stream 2020-06-11 17:20:36 -07:00
Frh db645627ff Prefer showing diffs at the row level 2020-06-11 17:20:36 -07:00
Frh a2a831110e Fix in table diff 2020-06-11 17:20:36 -07:00
Frh 1a47c3df89 Prettier plotting, improve gaps calculation 2020-06-11 17:20:36 -07:00
Frh e0e3ff4e07 Add support for region/area for hybrid 2020-06-11 17:20:36 -07:00
Frh 64576fd836 More refactoring / linting 2020-06-11 17:20:36 -07:00
Frh f37ed50fed More linting, refactor 2020-06-11 17:20:36 -07:00
Frh 20f18b478f Lint, refactor 2020-06-11 17:20:36 -07:00
Frh 37483ca202 Prep work for new hybrid parser introduction
Refactor parsers by moving common code to the base class
Maintain Python 3.5 compatibility by removing f"{}"
2020-06-11 17:20:36 -07:00
Frh 161f71230d Refactor base classes and improve plotting
Move common code to base class to reduce duplication
Stream plots display pdf background for better context
2020-06-11 17:20:36 -07:00
Frh bd2aab5b2d Fix unit tests, lint, drop Python 2 support
Drop EOL Python 2 support. Resolve unit test discrepancies.
Update unit tests to pass in Travis across all supported Python versions.
Linting.
2020-06-11 17:20:35 -07:00
Vinayak Mehta 52b2a595b4 Add f-strings and remove python3.5 test job 2020-05-24 18:14:43 +05:30
Vinayak Mehta f725f04223 Remove future imports 2020-05-24 17:33:13 +05:30
Vinayak Mehta 3afb72b872 Fix read_pdf(url) and test data 2020-05-24 17:26:52 +05:30
Vinayak Mehta a97b50ef21 Update flavor kwargs 2019-07-06 22:59:51 +05:30
Dimiter Naydenov 240ea6c411 Fixed strip_text argument getting ignored 2019-07-04 12:12:52 +03:00
Vinayak Mehta 2115a0e177 Blacken code 2019-07-03 23:47:42 +05:30
Vinayak Mehta ce727d9558 Fix split text bug 2019-03-22 02:28:29 +05:30
Vinayak Mehta 03f301b25c Add table regions support 2019-01-04 19:17:54 +05:30
Vinayak Mehta 9d90cadac0 Fix variable name 2019-01-03 15:47:05 +05:30
Vinayak Mehta f605bd8f94 Fix #239 2019-01-03 14:55:47 +05:30
Vinayak Mehta 62ed4753cd Make python2 compat 2018-12-24 13:10:48 +05:30
Vinayak Mehta 2b3461deab Add support to read from url 2018-12-24 12:55:52 +05:30
Vinayak Mehta 50b4468aff Rename kwargs and add tests 2018-12-21 15:09:37 +05:30
Vinayak Mehta f6aa21c31f Add strip_text 2018-12-20 16:32:16 +05:30
Vinayak Mehta ca6cefa362 Add extra_kwargs 2018-12-17 11:49:05 +05:30
Vinayak Mehta 5e71f0b0e6 Fix #192 2018-12-13 12:50:30 +05:30
Oshawk 90aaba6eec [MRG + 1] Make pep8 (#125)
* Make setup.py pep8

Add new line at end of file, fix bare except, remove unused import.

* Make tests/*.py pep8

Add some newlines at end of files and a visual indent.

* Make docs/*.py pep8

Fix block comments and add new lines at end of files.

* Make camelot/*.py pep8

Fixed unused import, a few weirdly ordered imports, a docstring typo and many newlines at the end of files.

* Fix imports

Fix import order and remove a couple more unused imports.

* Fix indents

Fix indentation (no opening delimiter alignment).

* Add newlines
2018-10-05 16:55:43 +05:30
Vinayak Mehta 6e8079df84 [MRG] Add tests for output formats and parser kwargs (#126)
* Remove unused image processing code

* Add opencv back-compat comment

* Add tests for parser special cases

* Fix lattice table area test

* Add tests for output format

* Add openpyxl dep
2018-10-05 16:15:30 +05:30
Vinayak Mehta c5bde5e2ad [MRG] Add error/warning tests (#113)
* Add unknown flavor test

* Add input kwargs test

* Remove unused utils

* Add unsupported format test

* Add stream unequal tables-columns length test

* Add python3 compat

* Add no tables found test

* Convert util info log to warning
2018-10-02 19:28:42 +05:30
Vinayak Mehta fc0542bd3c Add Python 3 compatibility (#109)
* Add python3 compat

* Update .gitignore

* Update .gitignore again

* Remove debugging return

* Add unicode_literals import

* Bump version

* Add python3-tk note
2018-09-28 21:58:29 +05:30
Vinayak Mehta 3170a9689f Add flavors 2018-09-23 10:53:32 +05:30
Vinayak Mehta 17ea5f335e Fix docstrings and interlinks 2018-09-11 08:31:37 +05:30
Vinayak Mehta 7bb1aee9b6 Add CLI 2018-09-10 15:16:41 +05:30
Vinayak Mehta d3beaafc99 Add temporary directory context manager 2018-09-09 18:10:55 +05:30
Vinayak Mehta 9a6ed555c8 Fix get_rotation 2018-09-09 10:04:54 +05:30