Frh
b43aca8ff5
Merge branch 'master' into hybrid-parser
2020-06-14 08:53:43 -07:00
Frh
4fb1e93efd
Bump dev libraries requirements to avoid conflicts
...
* If Travis uses pytest-cov >= 2.10, it also needs pytest >= 4.6
2020-06-12 18:23:06 -07:00
Frh
4145361907
Merge branch 'hybrid-parser' of https://github.com/FrancoisHuet/camelot into hybrid-parser
2020-06-12 17:32:39 -07:00
Frh
1813b80b8a
Merge fix
2020-06-12 17:12:24 -07:00
Frh
529ea36904
Updated comparison notebook
2020-06-11 17:20:37 -07:00
Frh
9abdd00cec
Enable process_background option for hybrid
...
Trim empty cols and lines
2020-06-11 17:20:37 -07:00
Frh
63adfd5468
Hybrid parser fixes
...
Improve parser comparison notebook to flag identical parses, display
multiple tables correctly
Fix tolerance parameter inclusion for hybrid.
2020-06-11 17:20:37 -07:00
Frh
7fae107560
Add baseline test for hybrid
...
Fix first split merge issue
2020-06-11 17:20:37 -07:00
Frh
4a761611bf
WIP: Introduce actual hybrid parser
...
Create hybrid parser leveraging both lattice and network techniques.
Simplify plotting of PDF in lattice case.
Rename "parser.table_bbox" to "parser.table_bbox_parses", since it
represents not a bbox but a dict of bbox to corresponding parsing data.
Still missing: more unit tests, plotting of steps.
2020-06-11 17:20:37 -07:00
Frh
edad1efd1b
Rename WIP parser "network", actual Hybrid to come
2020-06-11 17:20:37 -07:00
Frh
2867aecb5e
Raise tolerance of plot differences
2020-06-11 17:20:37 -07:00
Frh
9e385bf8fc
Fix plotting unit tests
...
Enforce order of textline plotting for unit test consistency in 3.6.
Create wrapper around camelot plot that enforces backwards consistency
with older versions of matplotlib.
2020-06-11 17:20:37 -07:00
Frh
4b3eee4b05
Linting
2020-06-11 17:20:37 -07:00
Frh
55fd459634
Minor linting
2020-06-11 17:20:37 -07:00
Frh
ada4809a59
Improve column detection for hybrid flavor
...
No longer rely on the mode but on the parsing analysis during network
detection.
Added unit test for complex table with vertical header and mixed
horizontal / vertical text.
2020-06-11 17:20:37 -07:00
Frh
e31e978ebe
Fix off by one error in column identification
2020-06-11 17:20:37 -07:00
Frh
21dc6a46a0
Improve hybrid table body discovery algo
...
While searching for table body boundaries, exclude rows that include
cells crossing previously discovered rows.
2020-06-11 17:20:37 -07:00
Frh
a04e7702b2
Create notebook to help debug hybrid parser algo
...
Plot vertical col anchors found by hybrid parser
Include vertical text in col/row generation
2020-06-11 17:20:37 -07:00
Frh
8f5e2bba4d
Prep for vertical text improvements
...
plot.text shows vertical text in red
_generate_columns_and_rows split between hybrid and stream
2020-06-11 17:20:37 -07:00
Frh
e1572a10c9
Linting
2020-06-11 17:20:36 -07:00
Frh
f7aafcd05c
Add parser comparison notebook
2020-06-11 17:20:36 -07:00
Frh
90f8d11d47
Add parser comparison notebook to help with visualization
2020-06-11 17:20:36 -07:00
Frh
15d99b1d00
Remove another f-string
2020-06-11 17:20:36 -07:00
Frh
9eb4f65fc9
Remove f-strings, fix url based unit tests
...
f-strings fail unit tests in Python <3.7; replaced them with .format.
Made download_url simulate Mozilla/5.0 to restore unit tests, since
the targeted server was returning 403.
2020-06-11 17:20:36 -07:00
Frh
81de841ca0
Plot improvements, address 132
...
Plot takes an optional axes parameter, allowing notebooks more
flexibility.
Header heuristic in hybrid won't include headers which span the
entire table.
Added unit test for issue #132
Fixes https://github.com/camelot-dev/camelot/issues/132
2020-06-11 17:20:36 -07:00
Frh
dbaab66e43
Rename member for clarity, fixed unit test
...
_textlines_alignments becomes _textline_to_alignments
2020-06-11 17:20:36 -07:00
Frh
a0e46916e2
Improve edgeplot for hybrid
2020-06-11 17:20:36 -07:00
Frh
c9a73a1ad7
Further refactoring
2020-06-11 17:20:36 -07:00
Frh
18581640be
Common parent TextBaseParser for Stream and Hybrid
2020-06-11 17:20:36 -07:00
Frh
a401d33fd9
Refactor out _text_bbox
2020-06-11 17:20:36 -07:00
Frh
87d95a098c
Further simplification
2020-06-11 17:20:36 -07:00
Frh
22b6e33efa
Enforce text_edge as subcase of text_alignment
...
TextNetworks is a list of TextAlignments
2020-06-11 17:20:36 -07:00
Frh
2d97fbc036
Define TextEdge as a bounded TextAlignment
2020-06-11 17:20:36 -07:00
Frh
0b8aac977a
Update test to reflect different order of edges
2020-06-11 17:20:36 -07:00
Frh
8903ef77d4
More refactoring across stream and hybrid.
...
Stream now much faster, whole test is 72s instead of 92s
2020-06-11 17:20:36 -07:00
Frh
92c8abdca3
Refactoring TextEdges code across hybrid and stream
2020-06-11 17:20:36 -07:00
Frh
7ad5b843ab
Move generic code to utils
2020-06-11 17:20:36 -07:00
Frh
14cd328644
Refactor common code hybrid / stream
2020-06-11 17:20:36 -07:00
Frh
bfc2719aff
Address last unit test
2020-06-11 17:20:36 -07:00
Frh
d3d625a08d
Unit test fixes
2020-06-11 17:20:36 -07:00
Frh
13268beb6f
Unit test fix
2020-06-11 17:20:36 -07:00
Frh
db645627ff
Prefer showing diffs at the row level
2020-06-11 17:20:36 -07:00
Frh
549ab0ebe6
Unit test fix
2020-06-11 17:20:36 -07:00
Frh
356af846db
Loosen cells header expansion algorithm
...
Accept cells if they're at least 50% within the table's bounds.
2020-06-11 17:20:36 -07:00
Frh
a2a831110e
Fix in table diff
2020-06-11 17:20:36 -07:00
Frh
1a47c3df89
Prettier plotting, improve gaps calculation
2020-06-11 17:20:36 -07:00
Frh
d2cf8520cb
Draw parse constraints for easier debug
...
* Display regions and areas rectangles
2020-06-11 17:20:36 -07:00
Frh
310a8cd80a
Refactor code in plotting
2020-06-11 17:20:36 -07:00
Frh
1ccaa0630d
Improve hybrid plotting
...
* plot info passed through debug_info
* display each text edge
2020-06-11 17:20:36 -07:00
Frh
e0e3ff4e07
Add support for region/area for hybrid
2020-06-11 17:20:36 -07:00