Commit Graph

614 Commits (9c971a18f01892c2124c4e7b0e5c4c8fefa8177c)

Author SHA1 Message Date
Frh 1ccaa0630d Improve hybrid plotting
* plot info passed through debug_info
* display each text edge
2020-06-11 17:20:36 -07:00
Frh e0e3ff4e07 Add support for region/area for hybrid 2020-06-11 17:20:36 -07:00
Frh f5fe92c22e Interim check-in, test failing and lots of todos 2020-06-11 17:20:36 -07:00
Frh c1c9358778 More linting 2020-06-11 17:20:36 -07:00
Frh 931b2f20f6 Try to silence bandit messages on valid asserts 2020-06-11 17:20:36 -07:00
Frh 878ef96fa7 More linting 2020-06-11 17:20:36 -07:00
Frh 07e2e1640d Linting 2020-06-11 17:20:36 -07:00
Frh e8e80a8cbb Fix unit test 2020-06-11 17:20:36 -07:00
Frh f9a6543c36 Initial Hybrid parser, for now identical to Stream 2020-06-11 17:20:36 -07:00
Frh 64576fd836 More refactoring / linting 2020-06-11 17:20:36 -07:00
Frh 8ed4cdf399 Fix unit test with plotting 2020-06-11 17:20:36 -07:00
Frh f37ed50fed More linting, refactor 2020-06-11 17:20:36 -07:00
Frh 20f18b478f Lint, refactor 2020-06-11 17:20:36 -07:00
Frh ff2ce6f47c Further refactor
Move common parse error stats computation to base parser
Move copy_spanning_text logic to the table
2020-06-11 17:20:36 -07:00
Frh 37483ca202 Prep work for new hybrid parser introduction
Refactor parsers by moving common code to the base class
Maintain Python 3.5 compatibility by removing f"{}"
2020-06-11 17:20:36 -07:00
Frh 161f71230d Refactor base classes and improve plotting
Move common code to base class to reduce duplication
Stream plots display pdf background for better context
2020-06-11 17:20:36 -07:00
Frh bd2aab5b2d Fix unit tests, lint, drop Python 2 support
Drop EOL Python 2 support. Resolve unit test discrepancies.
Update unit tests to pass in Travis across all supported Python versions.
Linting.
2020-06-11 17:20:35 -07:00
Vinayak Mehta 5efbcdcebb Update requirements.txt 2020-05-24 19:04:50 +05:30
Vinayak Mehta 189fe58bf2 Update requirements.txt 2020-05-24 19:01:03 +05:30
Vinayak Mehta 1575ec1bf0 Add .readthedocs.yml 2020-05-24 18:56:33 +05:30
Vinayak Mehta d5d6a5962b Bump version and update HISTORY.md 2020-05-24 18:36:13 +05:30
Vinayak Mehta 420d5aa624 Merge pull request #146 from camelot-dev/add-python38-travis
[MRG] Fix test data and drop python2 support
2020-05-24 18:31:27 +05:30
Vinayak Mehta a22fa63c4e Fix syntax errors 2020-05-24 18:19:48 +05:30
Vinayak Mehta 52b2a595b4 Add f-strings and remove python3.5 test job 2020-05-24 18:14:43 +05:30
Vinayak Mehta afa1ba7c1f Fix test indent 2020-05-24 17:38:48 +05:30
Vinayak Mehta f725f04223 Remove future imports 2020-05-24 17:33:13 +05:30
Vinayak Mehta 3afb72b872 Fix read_pdf(url) and test data 2020-05-24 17:26:52 +05:30
Vinayak Mehta 6dd9b6ce01 Create FUNDING.yml 2020-05-24 16:14:43 +05:30
Vinayak Mehta fc1b6f6227 Add python38 test job for travis 2020-05-24 15:27:48 +05:30
Frh ba5169b33d Enable process_background option for hybrid
Trim empty cols and lines
2020-05-08 15:08:12 -07:00
Frh ae429fc248 Hybrid parser fixes
Improve parser comparison notebook to flag identical parses, display
multiple tables correctly
Fix tolerance parameter inclusion for hybrid.
2020-05-04 18:52:11 -07:00
Frh 79ea4adcd1 Add baseline test for hybrid
Fix first split merge issue
2020-05-04 17:41:57 -07:00
Frh 77d289bd86 WIP: Introduce actual hybrid parser
Create hybrid parser leveraging both lattice and network techniques.
Simplify plotting of pdf in lattice case.
Rename "parser.table_bbox" into "parser.table_bbox_parses", since it
represents not a bbox but a dict of bbox to corresponding parsing data.

Still missing: more unit tests, plotting of steps.
2020-05-04 16:27:01 -07:00
Frh 6711f877bf Rename WIP parser "network", actual Hybrid to come 2020-05-02 16:14:03 -07:00
Frh c7ab3a4c32 Raise tolerance of plot differences 2020-04-30 17:06:45 -07:00
Frh d663dd18fd Fix plotting unit tests
Enforce order of textline plotting for unit test consistency in 3.6.
Create wrapper around camelot plot that enforces backwards consistency
with older versions of matplotlib.
2020-04-30 16:54:37 -07:00
Frh f3aded5b17 Linting 2020-04-29 13:52:58 -07:00
Frh 8a63e8e794 Minor linting 2020-04-29 12:31:02 -07:00
Frh c0903b8ca9 Improve column detection for hybrid flavor
No longer rely on the mode but on the parsing analysis during network
detection.
Added unit test for complex table with vertical header and mixed
horizontal / vertical text.
2020-04-29 11:46:40 -07:00
Frh 04fc542dc3 Fix off by one error in column identification 2020-04-29 09:45:55 -07:00
Frh 918416e7e4 Improve hybrid table body discovery algo
While searching for table body boundaries, exclude rows that include
cells crossing previously discovered rows.
2020-04-28 22:43:55 -07:00
Frh 3220b02ebc Create notebook to help debug hybrid parser algo
Plot vertical col anchors found by hybrid parser
Include vertical text in col/row generation
2020-04-28 12:26:12 -07:00
Frh 6add19ae27 Prep for vertical text improvements
plot.text shows vertical text in red
_generate_columns_and_rows split between hybrid and stream
2020-04-28 11:46:12 -07:00
Frh c51c24a416 Linting 2020-04-25 22:47:23 -07:00
Frh a2c5ee7f06 Add parser comparison notebook 2020-04-25 21:55:21 -07:00
Frh 30a0b2e4bc Add Parser comparison notebook to help visualizing 2020-04-25 21:55:01 -07:00
Frh 56dd31090c Remove another f-string 2020-04-25 21:33:15 -07:00
Frh 2624010197 Remove f-strings, fix url based unit tests
f-strings fail unit tests in Python <3.7, replaced them with .format.
Made download_url simulate Mozilla/5.0 to restore unit tests, since
the targeted server was returning 403.
2020-04-25 21:14:56 -07:00
Frh 016776939e Plot improvements, address #132
Plot takes an optional axes parameter, allowing notebooks more
flexibility.
Header heuristic in hybrid won't include headers which span the
entire table.
Added unit test for issue #132

Fixes https://github.com/camelot-dev/camelot/issues/132
2020-04-25 20:51:00 -07:00
Frh 84ec5c6acd Rename member for clarity, fixed unit test
_textlines_alignments becomes _textline_to_alignments
2020-04-25 17:15:16 -07:00