Add line_overlap and boxes_flow to LAParams

pull/219/head
Arnie97 2020-12-17 22:12:24 +08:00
parent 7709e58d64
commit 0dee385578
2 changed files with 9 additions and 3 deletions


@@ -838,23 +838,27 @@ def compute_whitespace(d):

 def get_page_layout(
     filename,
+    line_overlap=0.5,
     char_margin=1.0,
     line_margin=0.5,
     word_margin=0.1,
+    boxes_flow=0.5,
     detect_vertical=True,
     all_texts=True,
 ):
     """Returns a PDFMiner LTPage object and page dimension of a single
-    page pdf. See https://euske.github.io/pdfminer/ to get definitions
-    of kwargs.
+    page pdf. To get the definitions of kwargs, see
+    https://pdfminersix.rtfd.io/en/latest/reference/composable.html.

     Parameters
     ----------
     filename : string
         Path to pdf file.
+    line_overlap : float
     char_margin : float
     line_margin : float
     word_margin : float
+    boxes_flow : float
     detect_vertical : bool
     all_texts : bool

@@ -872,9 +876,11 @@ def get_page_layout(
         if not document.is_extractable:
             raise PDFTextExtractionNotAllowed(f"Text extraction is not allowed: {filename}")
         laparams = LAParams(
+            line_overlap=line_overlap,
             char_margin=char_margin,
             line_margin=line_margin,
             word_margin=word_margin,
+            boxes_flow=boxes_flow,
             detect_vertical=detect_vertical,
             all_texts=all_texts,
         )
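The hunk above forwards two more keyword arguments into the layout parameters object. As a rough, standard-library-only sketch of that forwarding pattern: `LayoutParams` and `build_layout_params` below are hypothetical stand-ins invented for illustration, not PDFMiner's `LAParams` or any Camelot function. The field names and defaults are copied from the `get_page_layout()` signature in this diff, and the rejection of unknown keys is an assumption of the sketch rather than documented library behaviour.

```python
from dataclasses import dataclass, fields


@dataclass
class LayoutParams:
    """Hypothetical stand-in for PDFMiner's LAParams (not the real class).

    Field names and defaults mirror the get_page_layout() signature
    shown in the diff above.
    """

    line_overlap: float = 0.5
    char_margin: float = 1.0
    line_margin: float = 0.5
    word_margin: float = 0.1
    boxes_flow: float = 0.5
    detect_vertical: bool = True
    all_texts: bool = True


def build_layout_params(layout_kwargs=None):
    """Merge caller overrides into the defaults and reject unknown keys."""
    layout_kwargs = dict(layout_kwargs or {})
    known = {f.name for f in fields(LayoutParams)}
    unknown = sorted(set(layout_kwargs) - known)
    if unknown:
        raise TypeError(f"Unsupported layout kwargs: {unknown}")
    return LayoutParams(**layout_kwargs)


# Override only the two keywords this commit adds; the rest keep their defaults.
params = build_layout_params({"line_overlap": 0.3, "boxes_flow": 0.8})
print(params)
```

With the real libraries, the equivalent dict would presumably be passed as ``layout_kwargs`` to ``camelot.read_pdf``, as the documentation hunk in this commit describes.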


@@ -618,7 +618,7 @@ Tweak layout generation

 Camelot is built on top of PDFMiner's functionality of grouping characters on a page into words and sentences. In some cases (such as `#170 <https://github.com/camelot-dev/camelot/issues/170>`_ and `#215 <https://github.com/camelot-dev/camelot/issues/215>`_), PDFMiner can group characters that should belong to the same sentence into separate sentences.

-To deal with such cases, you can tweak PDFMiner's `LAParams kwargs <https://github.com/euske/pdfminer/blob/master/pdfminer/layout.py#L33>`_ to improve layout generation, by passing the keyword arguments as a dict using ``layout_kwargs`` in :meth:`read_pdf() <camelot.read_pdf>`. To know more about the parameters you can tweak, you can check out `PDFMiner docs <https://euske.github.io/pdfminer/>`_.
+To deal with such cases, you can tweak PDFMiner's `LAParams kwargs <https://github.com/euske/pdfminer/blob/master/pdfminer/layout.py#L33>`_ to improve layout generation, by passing the keyword arguments as a dict using ``layout_kwargs`` in :meth:`read_pdf() <camelot.read_pdf>`. To know more about the parameters you can tweak, you can check out `PDFMiner docs <https://pdfminersix.rtfd.io/en/latest/reference/composable.html>`_.

 ::
