598 Commits (4145361907c8cf8e95c3bb8c60e240df712063df)

Author SHA1 Message Date
Frh 931b2f20f6 Try to silence bandit messages on valid asserts 2020-06-11 17:20:36 -07:00
Frh 878ef96fa7 More linting 2020-06-11 17:20:36 -07:00
Frh 07e2e1640d Linting 2020-06-11 17:20:36 -07:00
Frh e8e80a8cbb Fix unit test 2020-06-11 17:20:36 -07:00
Frh f9a6543c36 Initial Hybrid parser, for now identical to Stream 2020-06-11 17:20:36 -07:00
Frh 64576fd836 More refactoring / linting 2020-06-11 17:20:36 -07:00
Frh 8ed4cdf399 Fix unit test with plotting 2020-06-11 17:20:36 -07:00
Frh f37ed50fed More linting, refactor 2020-06-11 17:20:36 -07:00
Frh 20f18b478f Lint, refactor 2020-06-11 17:20:36 -07:00
Frh ff2ce6f47c Further refactor
Move common parse error stats computation to base parser
Move copy_spanning_text logic to the table
2020-06-11 17:20:36 -07:00
Frh 37483ca202 Prep work for new hybrid parser introduction
Refactor parsers by moving common code to the base class
Maintain Python 3.5 compatibility by removing f"{}"
2020-06-11 17:20:36 -07:00
Frh 161f71230d Refactor base classes and improve plotting
Move common code to base class to reduce duplication
Stream plots display pdf background for better context
2020-06-11 17:20:36 -07:00
Frh bd2aab5b2d Fix unit tests, lint, drop Python 2 support
Drop EOL Python 2 support. Resolve unit test discrepancies.
Update unit tests to pass in Travis across all supported Py.
Linting.
2020-06-11 17:20:35 -07:00
Frh ba5169b33d Enable process_background option for hybrid
Trim empty cols and lines
2020-05-08 15:08:12 -07:00
Frh ae429fc248 Hybrid parser fixes
Improve parser comparison notebook to flag identical parses, display
multiple tables correctly
Fix tolerance parameter inclusion for hybrid.
2020-05-04 18:52:11 -07:00
Frh 79ea4adcd1 Add baseline test for hybrid
Fix first split merge issue
2020-05-04 17:41:57 -07:00
Frh 77d289bd86 WIP: Introduce actual hybrid parser
Create hybrid parser leveraging both lattice and network techniques.
Simplify plotting of pdf in lattice case.
Rename "parser.table_bbox" to "parser.table_bbox_parses", since it
represents not a bbox but a dict mapping each bbox to its parsing data.

Still missing: more unit tests, plotting of steps.
2020-05-04 16:27:01 -07:00
Frh 6711f877bf Rename WIP parser "network", actual Hybrid to come 2020-05-02 16:14:03 -07:00
Frh c7ab3a4c32 Raise tolerance of plot differences 2020-04-30 17:06:45 -07:00
Frh d663dd18fd Fix plotting unit tests
Enforce order of textline plotting for unit test consistency in 3.6.
Create wrapper around camelot plot that enforces backwards consistency
with older versions of matplotlib.
2020-04-30 16:54:37 -07:00
Frh f3aded5b17 Linting 2020-04-29 13:52:58 -07:00
Frh 8a63e8e794 Minor linting 2020-04-29 12:31:02 -07:00
Frh c0903b8ca9 Improve column detection for hybrid flavor
No longer rely on the mode but on the parsing analysis during network
detection.
Added unit test for complex table with vertical header and mixed
horizontal / vertical text.
2020-04-29 11:46:40 -07:00
Frh 04fc542dc3 Fix off by one error in column identification 2020-04-29 09:45:55 -07:00
Frh 918416e7e4 Improve hybrid table body discovery algo
While searching for table body boundaries, exclude rows that include
cells crossing previously discovered rows.
2020-04-28 22:43:55 -07:00
Frh 3220b02ebc Create notebook to help debug hybrid parser algo
Plot vertical col anchors found by hybrid parser
Include vertical text in col/row generation
2020-04-28 12:26:12 -07:00
Frh 6add19ae27 Prep for vertical text improvements
plot.text shows vertical text in red
_generate_columns_and_rows split between hybrid and stream
2020-04-28 11:46:12 -07:00
Frh c51c24a416 Linting 2020-04-25 22:47:23 -07:00
Frh a2c5ee7f06 Add parser comparison notebook 2020-04-25 21:55:21 -07:00
Frh 30a0b2e4bc Add Parser comparison notebook to help visualizing 2020-04-25 21:55:01 -07:00
Frh 56dd31090c Remove another f-string 2020-04-25 21:33:15 -07:00
Frh 2624010197 Remove f-strings, fix url based unit tests
f-strings fail unit tests in Python <3.7; replaced them with .format.
Made download_url simulate Mozilla/5.0 to restore unit tests, since
the targeted server was returning 403.
2020-04-25 21:14:56 -07:00
Frh 016776939e Plot improvements, address 132
Plot takes an optional axes parameter, allowing notebooks more
flexibility.
Header heuristic in hybrid won't include headers which span the
entire table.
Added unit test for issue #132

Fixes https://github.com/camelot-dev/camelot/issues/132
2020-04-25 20:51:00 -07:00
Frh 84ec5c6acd Rename member for clarity, fixed unit test
_textlines_alignments becomes _textline_to_alignments
2020-04-25 17:15:16 -07:00
Frh 22f4287788 Improve edgeplot for hybrid 2020-04-25 13:31:10 -07:00
Frh bb842f21b9 Further refactoring 2020-04-24 21:11:31 -07:00
Frh f42557ab8b Common parent TextBaseParser for Stream and Hybrid 2020-04-24 15:54:58 -07:00
Frh 5290fb6a7d Refactor out _text_bbox 2020-04-24 15:18:38 -07:00
Frh 8ad9e569cf Further simplification 2020-04-24 12:48:51 -07:00
Frh efe81292ca Enforce text_edge as subcase of text_alignment
TextNetworks is a list of TextAlignments
2020-04-24 12:42:13 -07:00
Frh 58b2c1d0fd Define TextEdge as a bounded TextAlignment 2020-04-23 18:26:55 -07:00
Frh 3ea8d81900 Update test to reflect different order of edges 2020-04-23 14:45:35 -07:00
Frh 5db49d4fde More refactoring across stream and hybrid.
Stream now much faster, whole test is 72s instead of 92s
2020-04-23 14:42:13 -07:00
Frh adb14d3522 Refactoring TextEdges code across hybrid and stream 2020-04-23 12:55:09 -07:00
Frh 414708d8c7 Move generic code to utils 2020-04-22 19:08:06 -07:00
Frh 36d5a09ad6 Refactor common code hybrid / stream 2020-04-22 17:33:15 -07:00
Frh 489e996bd8 Address last unit test 2020-04-22 16:02:49 -07:00
Frh ec0ca1e009 Unit test fixes 2020-04-22 15:36:37 -07:00
Frh 6962c714f9 Unit test fix 2020-04-22 14:50:59 -07:00
Frh 7b0ac03f8e Prefer showing diffs at the row level 2020-04-22 14:50:45 -07:00