508 Commits (6711f877bf284148edb5e1ff0ca8306a4677531e)

Author SHA1 Message Date
Frh 6711f877bf Rename WIP parser "network", actual Hybrid to come 2020-05-02 16:14:03 -07:00
Frh c7ab3a4c32 Raise tolerance of plot differences 2020-04-30 17:06:45 -07:00
Frh d663dd18fd Fix plotting unit tests
Enforce order of textline plotting for unit test consistency in 3.6.
Create wrapper around camelot plot that enforces backwards consistency
with older versions of matplotlib.
2020-04-30 16:54:37 -07:00
Frh f3aded5b17 Linting 2020-04-29 13:52:58 -07:00
Frh 8a63e8e794 Minor linting 2020-04-29 12:31:02 -07:00
Frh c0903b8ca9 Improve column detection for hybrid flavor
No longer rely on the mode but on the parsing analysis during network
detection.
Added unit test for complex table with vertical header and mixed
horizontal / vertical text.
2020-04-29 11:46:40 -07:00
Frh 04fc542dc3 Fix off by one error in column identification 2020-04-29 09:45:55 -07:00
Frh 918416e7e4 Improve hybrid table body discovery algo
While searching for table body boundaries, exclude rows that include
cells crossing previously discovered rows.
2020-04-28 22:43:55 -07:00
Frh 3220b02ebc Create notebook to help debug hybrid parser algo
Plot vertical col anchors found by hybrid parser
Include vertical text in col/row generation
2020-04-28 12:26:12 -07:00
Frh 6add19ae27 Prep for vertical text improvements
plot.text shows vertical text in red
_generate_columns_and_rows split between hybrid and stream
2020-04-28 11:46:12 -07:00
Frh c51c24a416 Linting 2020-04-25 22:47:23 -07:00
Frh a2c5ee7f06 Add parser comparison notebook 2020-04-25 21:55:21 -07:00
Frh 30a0b2e4bc Add Parser comparison notebook to help with visualization 2020-04-25 21:55:01 -07:00
Frh 56dd31090c Remove another f-string 2020-04-25 21:33:15 -07:00
Frh 2624010197 Remove f-strings, fix url based unit tests
f-strings fail unit tests in Python <3.7, removed them for .format.
Made download_url simulate Mozilla/5.0 to restore unit tests, since
the targeted server was returning 403.
2020-04-25 21:14:56 -07:00
Frh 016776939e Plot improvements, address #132
Plot takes an optional axes parameter, allowing notebooks more
flexibility.
Header heuristic in hybrid won't include headers which span the
entire table.
Added unit test for issue #132

Fixes https://github.com/camelot-dev/camelot/issues/132
2020-04-25 20:51:00 -07:00
Frh 84ec5c6acd Rename member for clarity, fixed unit test
_textlines_alignments becomes _textline_to_alignments
2020-04-25 17:15:16 -07:00
Frh 22f4287788 Improve edgeplot for hybrid 2020-04-25 13:31:10 -07:00
Frh bb842f21b9 Further refactoring 2020-04-24 21:11:31 -07:00
Frh f42557ab8b Common parent TextBaseParser for Stream and Hybrid 2020-04-24 15:54:58 -07:00
Frh 5290fb6a7d Refactor out _text_bbox 2020-04-24 15:18:38 -07:00
Frh 8ad9e569cf Further simplification 2020-04-24 12:48:51 -07:00
Frh efe81292ca Enforce text_edge as subcase of text_alignment
TextNetworks is a list of TextAlignments
2020-04-24 12:42:13 -07:00
Frh 58b2c1d0fd Define TextEdge as a bounded TextAlignment 2020-04-23 18:26:55 -07:00
Frh 3ea8d81900 Update test to reflect different order of edges 2020-04-23 14:45:35 -07:00
Frh 5db49d4fde More refactoring across stream and hybrid.
Stream now much faster, whole test is 72s instead of 92s
2020-04-23 14:42:13 -07:00
Frh adb14d3522 Refactoring TextEdges code across hybrid and stream 2020-04-23 12:55:09 -07:00
Frh 414708d8c7 Move generic code to utils 2020-04-22 19:08:06 -07:00
Frh 36d5a09ad6 Refactor common code hybrid / stream 2020-04-22 17:33:15 -07:00
Frh 489e996bd8 Address last unit test 2020-04-22 16:02:49 -07:00
Frh ec0ca1e009 Unit test fixes 2020-04-22 15:36:37 -07:00
Frh 6962c714f9 Unit test fix 2020-04-22 14:50:59 -07:00
Frh 7b0ac03f8e Prefer showing diffs at the row level 2020-04-22 14:50:45 -07:00
Frh fab13ee5b8 Unit test fix 2020-04-22 14:25:03 -07:00
Frh df3d28837d Loosen cells header expansion algorithm
Accept cells if they're at least 50% within the table's bounds.
2020-04-22 14:24:47 -07:00
Frh 0be58de1cb Fix in table diff 2020-04-22 14:23:52 -07:00
Frh 9a82408a9a Prettier plotting, improve gaps calculation 2020-04-22 14:08:22 -07:00
Frh cd338ff4e2 Draw parse constraints for easier debug
* Display regions and areas rectangles
2020-04-21 14:24:44 -07:00
Frh ad27a11d35 Refactor code in plotting 2020-04-21 13:57:12 -07:00
Frh fb69bd9299 Improve hybrid plotting
* plot info passed through debug_info
* display each text edge
2020-04-20 16:54:06 -07:00
Frh 175655d31b Add support for region/area for hybrid 2020-04-20 11:20:59 -07:00
Frh 57c5957bad Interim check-in, test failing and lots of todos 2020-04-19 18:26:38 -07:00
Frh d0bd1cfd1f More linting 2020-04-19 17:35:19 -07:00
Frh dec8f2d0eb Try to silence bandit messages on valid asserts 2020-04-19 17:17:25 -07:00
Frh 69c7728867 More linting 2020-04-19 17:05:33 -07:00
Frh 89fe090ec4 Linting 2020-04-19 16:40:14 -07:00
Frh e59b3f5efb Fix unit test 2020-04-19 16:38:25 -07:00
Frh d520a77bb7 Initial Hybrid parser, for now identical to Stream 2020-04-19 16:27:01 -07:00
Frh 58823e57e9 More refactoring / linting 2020-04-19 15:41:45 -07:00
Frh d673a3b6e0 Fix unit test with plotting 2020-04-19 15:07:59 -07:00