Commit Graph

6 Commits (4a761611bf0164c6e176f3082b8d5b478629fe65)

Author SHA1 Message Date
Frh 4a761611bf WIP: Introduce actual hybrid parser
Create hybrid parser leverage both lattice and network techniques.
Simplify plotting of pdf in lattice case.
Rename "parser.table_bbox" into "parser.table_bbox_parses", since it
represents not a bbox but a dict of bbox to corresponding parsing data.

Still missing: more unit tests, plotting of steps.
2020-06-11 17:20:37 -07:00
Frh edad1efd1b Rename WIP parser "network", actual Hybrid to come 2020-06-11 17:20:37 -07:00
Frh ada4809a59 Improve column detection for hybrid flavor
No longer rely on the mode but on the parsing analysis during network
detection.
Added unit test for complex table with vertical header and mixed
horizontal / vertical text.
2020-06-11 17:20:37 -07:00
Frh 21dc6a46a0 Improve hybrid table body discovery algo
While searching for table body boundaries, exclude rows that include
cells crossing previously discovered rows.
2020-06-11 17:20:37 -07:00
Frh a04e7702b2 Create notebook to help debug hybrid parser algo
Plot vertical col anchors found by hybrid parser
Include vertical text in col/row generation
2020-06-11 17:20:37 -07:00
Frh f7aafcd05c Add parser comparizon notebook 2020-06-11 17:20:36 -07:00