28 Commits (ada4809a59fd8765408b67133d825271794f0c1e)

Author SHA1 Message Date
Frh ada4809a59 Improve column detection for hybrid flavor
No longer rely on the mode but on the parsing analysis during network
detection.
Added unit test for complex table with vertical header and mixed
horizontal / vertical text.
2020-06-11 17:20:37 -07:00
Frh 21dc6a46a0 Improve hybrid table body discovery algo
While searching for table body boundaries, exclude rows that include
cells crossing previously discovered rows.
2020-06-11 17:20:37 -07:00
Frh a04e7702b2 Create notebook to help debug hybrid parser algo
Plot vertical col anchors found by hybrid parser
Include vertical text in col/row generation
2020-06-11 17:20:37 -07:00
Frh 8f5e2bba4d Prep for vertical text improvements
plot.text shows vertical text in red
_generate_columns_and_rows split between hybrid and stream
2020-06-11 17:20:37 -07:00
Frh e1572a10c9 Linting 2020-06-11 17:20:36 -07:00
Frh 15d99b1d00 Remove another f-string 2020-06-11 17:20:36 -07:00
Frh 9eb4f65fc9 Remove f-strings, fix url based unit tests
f-strings fail unit tests in Python <3.7, replaced them with .format.
Made download_url send a Mozilla/5.0 User-Agent to restore unit tests,
since the target server was returning 403.
2020-06-11 17:20:36 -07:00
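
To illustrate the two changes in the commit above, here is a minimal sketch using only the standard library. The helper names `build_request` and `describe` are hypothetical and are not taken from the repository; the real `download_url` may be structured differently. The sketch only shows the general technique: attaching a browser-like User-Agent header to a request, and using `str.format` where an f-string would otherwise appear.

```python
from urllib.request import Request


def build_request(url, user_agent="Mozilla/5.0"):
    """Hypothetical helper: build a request carrying a browser-like User-Agent.

    Some servers answer 403 to urllib's default User-Agent, so the header
    is set explicitly. No network call happens here.
    """
    return Request(url, headers={"User-Agent": user_agent})


def describe(name, count):
    """Hypothetical helper: str.format in place of f"{name}: {count} tables"."""
    return "{}: {} tables".format(name, count)


req = build_request("https://example.com/sample.pdf")
# urllib stores header names capitalized, hence "User-agent" on lookup.
print(req.get_header("User-agent"))
print(describe("sample.pdf", 2))
```

The request object would then be passed to `urllib.request.urlopen` to perform the download; that step is left out so the sketch runs without network access.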
Frh 81de841ca0 Plot improvements, address 132
Plot takes an optional axes parameter, allowing notebooks more
flexibility.
Header heuristic in hybrid won't include headers which span the
entire table.
Added unit test for issue #132

Fixes https://github.com/camelot-dev/camelot/issues/132
2020-06-11 17:20:36 -07:00
Frh dbaab66e43 Rename member for clarity, fixed unit test
_textlines_alignments becomes _textline_to_alignments
2020-06-11 17:20:36 -07:00
Frh a0e46916e2 Improve edgeplot for hybrid 2020-06-11 17:20:36 -07:00
Frh c9a73a1ad7 Further refactoring 2020-06-11 17:20:36 -07:00
Frh 18581640be Common parent TextBaseParser for Stream and Hybrid 2020-06-11 17:20:36 -07:00
Frh a401d33fd9 Refactor out _text_bbox 2020-06-11 17:20:36 -07:00
Frh 87d95a098c Further simplification 2020-06-11 17:20:36 -07:00
Frh 22b6e33efa Enforce text_edge as subcase of text_alignment
TextNetworks is a list of TextAlignments
2020-06-11 17:20:36 -07:00
Frh 2d97fbc036 Define TextEdge as a bounded TextAlignment 2020-06-11 17:20:36 -07:00
Frh 8903ef77d4 More refactoring across stream and hybrid.
Stream now much faster, whole test is 72s instead of 92s
2020-06-11 17:20:36 -07:00
Frh 92c8abdca3 Refactoring TextEdges code across hybrid and stream 2020-06-11 17:20:36 -07:00
Frh 7ad5b843ab Move generic code to utils 2020-06-11 17:20:36 -07:00
Frh 14cd328644 Refactor common code hybrid / stream 2020-06-11 17:20:36 -07:00
Frh bfc2719aff Address last unit test 2020-06-11 17:20:36 -07:00
Frh 356af846db Loosen cells header expansion algorithm
Accept cells if they're at least 50% within the table's bounds.
2020-06-11 17:20:36 -07:00
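
The 50% rule in the commit above can be sketched as an overlap test between bounding boxes. This is an assumption-laden illustration, not the repository's code: the commit message does not say whether "50% within" is measured by area, width, or height, so the sketch assumes area, and the names `fraction_within` and `accept_cell` are hypothetical.

```python
def fraction_within(cell, table):
    """Fraction of the cell's area that lies inside the table's bounds.

    Both boxes are (x0, y0, x1, y1) tuples with x0 <= x1 and y0 <= y1.
    Measuring by area is an assumption made for this sketch.
    """
    cx0, cy0, cx1, cy1 = cell
    tx0, ty0, tx1, ty1 = table
    area = (cx1 - cx0) * (cy1 - cy0)
    if area <= 0:
        return 0.0
    # Width and height of the intersection, clamped at zero when disjoint.
    width = max(0.0, min(cx1, tx1) - max(cx0, tx0))
    height = max(0.0, min(cy1, ty1) - max(cy0, ty0))
    return (width * height) / area


def accept_cell(cell, table, threshold=0.5):
    """Accept the cell if at least `threshold` of it falls within the table."""
    return fraction_within(cell, table) >= threshold


table = (5, 0, 20, 10)
print(accept_cell((0, 0, 10, 10), table))  # half of the cell overlaps the table
print(accept_cell((0, 0, 8, 10), table))   # less than half overlaps
```

Comparing with `>=` matches the "at least 50%" wording, so a cell lying exactly half inside the bounds is accepted.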
Frh 1a47c3df89 Prettier plotting, improve gaps calculation 2020-06-11 17:20:36 -07:00
Frh 1ccaa0630d Improve hybrid plotting
* plot info passed through debug_info
* display each text edge
2020-06-11 17:20:36 -07:00
Frh e0e3ff4e07 Add support for region/area for hybrid 2020-06-11 17:20:36 -07:00
Frh f5fe92c22e Interim check-in, test failing and lots of todos 2020-06-11 17:20:36 -07:00
Frh 07e2e1640d Linting 2020-06-11 17:20:36 -07:00
Frh f9a6543c36 Initial Hybrid parser, for now identical to Stream 2020-06-11 17:20:36 -07:00