Frh
ae429fc248
Hybrid parser fixes
...
Improve parser comparison notebook to flag identical parses and display
multiple tables correctly.
Fix tolerance parameter inclusion for hybrid.
2020-05-04 18:52:11 -07:00
Frh
79ea4adcd1
Add baseline test for hybrid
...
Fix first split merge issue
2020-05-04 17:41:57 -07:00
Frh
77d289bd86
WIP: Introduce actual hybrid parser
...
Create hybrid parser leveraging both lattice and network techniques.
Simplify plotting of pdf in lattice case.
Rename "parser.table_bbox" to "parser.table_bbox_parses", since it
is not a bbox but a dict mapping each bbox to its parsing data.
Still missing: more unit tests, plotting of steps.
2020-05-04 16:27:01 -07:00
Frh
6711f877bf
Rename WIP parser "network", actual Hybrid to come
2020-05-02 16:14:03 -07:00
Frh
c7ab3a4c32
Raise tolerance of plot differences
2020-04-30 17:06:45 -07:00
Frh
d663dd18fd
Fix plotting unit tests
...
Enforce order of textline plotting for unit test consistency in 3.6.
Create wrapper around camelot plot that enforces backwards consistency
with older versions of matplotlib.
2020-04-30 16:54:37 -07:00
Frh
f3aded5b17
Linting
2020-04-29 13:52:58 -07:00
Frh
8a63e8e794
Minor linting
2020-04-29 12:31:02 -07:00
Frh
c0903b8ca9
Improve column detection for hybrid flavor
...
No longer rely on the mode but on the parsing analysis during network
detection.
Added unit test for complex table with vertical header and mixed
horizontal / vertical text.
2020-04-29 11:46:40 -07:00
Frh
04fc542dc3
Fix off-by-one error in column identification
2020-04-29 09:45:55 -07:00
Frh
918416e7e4
Improve hybrid table body discovery algo
...
While searching for table body boundaries, exclude rows that include
cells crossing previously discovered rows.
2020-04-28 22:43:55 -07:00
Frh
3220b02ebc
Create notebook to help debug hybrid parser algo
...
Plot vertical col anchors found by hybrid parser
Include vertical text in col/row generation
2020-04-28 12:26:12 -07:00
Frh
6add19ae27
Prep for vertical text improvements
...
plot.text shows vertical text in red
_generate_columns_and_rows split between hybrid and stream
2020-04-28 11:46:12 -07:00
Frh
c51c24a416
Linting
2020-04-25 22:47:23 -07:00
Frh
a2c5ee7f06
Add parser comparison notebook
2020-04-25 21:55:21 -07:00
Frh
30a0b2e4bc
Add parser comparison notebook to help with visualization
2020-04-25 21:55:01 -07:00
Frh
56dd31090c
Remove another f-string
2020-04-25 21:33:15 -07:00
Frh
2624010197
Remove f-strings, fix url based unit tests
...
f-strings fail unit tests in Python <3.7, so replaced them with .format().
Made download_url send a Mozilla/5.0 User-Agent to restore unit tests, since
the targeted server was returning 403.
2020-04-25 21:14:56 -07:00
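The two changes in this commit can be illustrated with a minimal sketch. It is not the actual camelot code: `build_request`, `fetch`, and `describe` are hypothetical helpers, and only the "Mozilla/5.0" User-Agent value and the switch to `.format` come from the commit message.

```python
from urllib.request import Request, urlopen

# Value named in the commit message; some servers answer 403 to the
# default Python-urllib User-Agent.
BROWSER_USER_AGENT = "Mozilla/5.0"


def build_request(url):
    """Return a Request carrying a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": BROWSER_USER_AGENT})


def fetch(url, timeout=10):
    """Hypothetical helper: download the body at `url` as bytes."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


def describe(url, size):
    """Build a message with str.format rather than an f-string."""
    return "Downloaded {} bytes from {}".format(size, url)
```

Building the request separately from sending it keeps the header logic checkable without network access.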
Frh
016776939e
Plot improvements, address #132
...
Plot takes an optional axes parameter, allowing notebooks more
flexibility.
Header heuristic in hybrid won't include headers which span the
entire table.
Added unit test for issue #132
Fixes https://github.com/camelot-dev/camelot/issues/132
2020-04-25 20:51:00 -07:00
Frh
84ec5c6acd
Rename member for clarity, fixed unit test
...
_textlines_alignments becomes _textline_to_alignments
2020-04-25 17:15:16 -07:00
Frh
22f4287788
Improve edgeplot for hybrid
2020-04-25 13:31:10 -07:00
Frh
bb842f21b9
Further refactoring
2020-04-24 21:11:31 -07:00
Frh
f42557ab8b
Common parent TextBaseParser for Stream and Hybrid
2020-04-24 15:54:58 -07:00
Frh
5290fb6a7d
Refactor out _text_bbox
2020-04-24 15:18:38 -07:00
Frh
8ad9e569cf
Further simplification
2020-04-24 12:48:51 -07:00
Frh
efe81292ca
Enforce text_edge as subcase of text_alignment
...
TextNetworks is a list of TextAlignments
2020-04-24 12:42:13 -07:00
Frh
58b2c1d0fd
Define TextEdge as a bounded TextAlignment
2020-04-23 18:26:55 -07:00
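The relationship described in the two commits above (a TextEdge is a bounded TextAlignment, and a network is a list of alignments) might look roughly like the sketch below. The field names `coord`, `align`, `y0`, and `y1` are assumptions for illustration, not the real class definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextAlignment:
    """An axis position shared by several textlines (fields are assumed)."""
    coord: float   # position of the alignment along its axis
    align: str     # e.g. "left", "right", "middle"


@dataclass
class TextEdge(TextAlignment):
    """A TextAlignment restricted to a bounded span (y0..y1 assumed)."""
    y0: float
    y1: float

    def height(self):
        # Extent of the bounded span
        return self.y1 - self.y0


# A network is modelled here as a plain list of alignments.
TextNetwork = List[TextAlignment]
```

Because TextEdge subclasses TextAlignment, any code accepting a list of alignments also accepts edges.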
Frh
3ea8d81900
Update test to reflect different order of edges
2020-04-23 14:45:35 -07:00
Frh
5db49d4fde
More refactoring across stream and hybrid.
...
Stream is now much faster; the whole test run takes 72s instead of 92s
2020-04-23 14:42:13 -07:00
Frh
adb14d3522
Refactoring TextEdges code across hybrid and stream
2020-04-23 12:55:09 -07:00
Frh
414708d8c7
Move generic code to utils
2020-04-22 19:08:06 -07:00
Frh
36d5a09ad6
Refactor common code hybrid / stream
2020-04-22 17:33:15 -07:00
Frh
489e996bd8
Address last unit test
2020-04-22 16:02:49 -07:00
Frh
ec0ca1e009
Unit test fixes
2020-04-22 15:36:37 -07:00
Frh
6962c714f9
Unit test fix
2020-04-22 14:50:59 -07:00
Frh
7b0ac03f8e
Prefer showing diffs at the row level
2020-04-22 14:50:45 -07:00
Frh
fab13ee5b8
Unit test fix
2020-04-22 14:25:03 -07:00
Frh
df3d28837d
Loosen cells header expansion algorithm
...
Accept cells if they're at least 50% within the table's bounds.
2020-04-22 14:24:47 -07:00
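The 50% rule in this commit can be sketched as an area-overlap check. This is an illustration under assumptions: bounding boxes are taken to be `(x0, y0, x1, y1)` tuples, "within the table's bounds" is read as overlap by area, and the function names are hypothetical rather than taken from the parser.

```python
def area(bbox):
    """Area of an (x0, y0, x1, y1) box; empty or inverted boxes give 0."""
    x0, y0, x1, y1 = bbox
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def fraction_inside(cell, table):
    """Share of the cell's area that lies inside the table's box."""
    intersection = (
        max(cell[0], table[0]),
        max(cell[1], table[1]),
        min(cell[2], table[2]),
        min(cell[3], table[3]),
    )
    cell_area = area(cell)
    if cell_area == 0:
        return 0.0
    return area(intersection) / cell_area


def accept_cell(cell, table, min_fraction=0.5):
    """Accept the cell if at least `min_fraction` of it is in the table."""
    return fraction_inside(cell, table) >= min_fraction
```

A cell that straddles the table border with exactly half its area inside is accepted; one with less than half inside is rejected.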
Frh
0be58de1cb
Fix in table diff
2020-04-22 14:23:52 -07:00
Frh
9a82408a9a
Prettier plotting, improve gaps calculation
2020-04-22 14:08:22 -07:00
Frh
cd338ff4e2
Draw parse constraints for easier debug
...
* Display regions and areas rectangles
2020-04-21 14:24:44 -07:00
Frh
ad27a11d35
Refactor code in plotting
2020-04-21 13:57:12 -07:00
Frh
fb69bd9299
Improve hybrid plotting
...
* plot info passed through debug_info
* display each text edge
2020-04-20 16:54:06 -07:00
Frh
175655d31b
Add support for region/area for hybrid
2020-04-20 11:20:59 -07:00
Frh
57c5957bad
Interim check-in, test failing and lots of todos
2020-04-19 18:26:38 -07:00
Frh
d0bd1cfd1f
More linting
2020-04-19 17:35:19 -07:00
Frh
dec8f2d0eb
Try to silence bandit messages on valid asserts
2020-04-19 17:17:25 -07:00
Frh
69c7728867
More linting
2020-04-19 17:05:33 -07:00
Frh
89fe090ec4
Linting
2020-04-19 16:40:14 -07:00
Frh
e59b3f5efb
Fix unit test
2020-04-19 16:38:25 -07:00