Merge pull request #219 from Arnie97/master
[MRG] Add line_overlap and boxes_flow to LAParams
pull/222/head^2
commit ec21904595
@@ -838,23 +838,27 @@ def compute_whitespace(d):
 
 def get_page_layout(
     filename,
+    line_overlap=0.5,
     char_margin=1.0,
     line_margin=0.5,
     word_margin=0.1,
+    boxes_flow=0.5,
     detect_vertical=True,
     all_texts=True,
 ):
     """Returns a PDFMiner LTPage object and page dimension of a single
-    page pdf. See https://euske.github.io/pdfminer/ to get definitions
-    of kwargs.
+    page pdf. To get the definitions of kwargs, see
+    https://pdfminersix.rtfd.io/en/latest/reference/composable.html.
 
     Parameters
     ----------
     filename : string
         Path to pdf file.
+    line_overlap : float
     char_margin : float
     line_margin : float
     word_margin : float
+    boxes_flow : float
     detect_vertical : bool
     all_texts : bool
 
@@ -874,9 +878,11 @@ def get_page_layout(
                 f"Text extraction is not allowed: {filename}"
             )
         laparams = LAParams(
+            line_overlap=line_overlap,
             char_margin=char_margin,
             line_margin=line_margin,
             word_margin=word_margin,
+            boxes_flow=boxes_flow,
             detect_vertical=detect_vertical,
             all_texts=all_texts,
         )
@@ -618,7 +618,7 @@ Tweak layout generation
 
 Camelot is built on top of PDFMiner's functionality of grouping characters on a page into words and sentences. In some cases (such as `#170 <https://github.com/camelot-dev/camelot/issues/170>`_ and `#215 <https://github.com/camelot-dev/camelot/issues/215>`_), PDFMiner can group characters that should belong to the same sentence into separate sentences.
 
-To deal with such cases, you can tweak PDFMiner's `LAParams kwargs <https://github.com/euske/pdfminer/blob/master/pdfminer/layout.py#L33>`_ to improve layout generation, by passing the keyword arguments as a dict using ``layout_kwargs`` in :meth:`read_pdf() <camelot.read_pdf>`. To know more about the parameters you can tweak, you can check out `PDFMiner docs <https://euske.github.io/pdfminer/>`_.
+To deal with such cases, you can tweak PDFMiner's `LAParams kwargs <https://github.com/euske/pdfminer/blob/master/pdfminer/layout.py#L33>`_ to improve layout generation, by passing the keyword arguments as a dict using ``layout_kwargs`` in :meth:`read_pdf() <camelot.read_pdf>`. To know more about the parameters you can tweak, you can check out `PDFMiner docs <https://pdfminersix.rtfd.io/en/latest/reference/composable.html>`_.
 
 ::
 
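
For readers of this change, here is a minimal sketch of how a caller might assemble the ``layout_kwargs`` dict that the documentation hunk describes, now that ``line_overlap`` and ``boxes_flow`` are accepted. The default values are copied from the ``get_page_layout`` signature in the diff. The helper ``build_layout_kwargs`` and the file name ``foo.pdf`` are hypothetical and not part of Camelot. The ``read_pdf`` call is left commented out so the block runs without Camelot installed.

```python
# Defaults copied from the get_page_layout signature shown in the diff.
LAYOUT_DEFAULTS = {
    "line_overlap": 0.5,
    "char_margin": 1.0,
    "line_margin": 0.5,
    "word_margin": 0.1,
    "boxes_flow": 0.5,
    "detect_vertical": True,
    "all_texts": True,
}


def build_layout_kwargs(**overrides):
    """Hypothetical helper (not a Camelot API): merge overrides into the
    defaults and reject names that get_page_layout would not accept."""
    unknown = set(overrides) - set(LAYOUT_DEFAULTS)
    if unknown:
        raise TypeError(f"unexpected layout kwargs: {sorted(unknown)}")
    return {**LAYOUT_DEFAULTS, **overrides}


# Override only the two parameters this change exposes.
layout_kwargs = build_layout_kwargs(line_overlap=0.3, boxes_flow=0.8)

# With Camelot installed, the dict would be passed through read_pdf
# ("foo.pdf" is a placeholder path):
# import camelot
# tables = camelot.read_pdf("foo.pdf", layout_kwargs=layout_kwargs)
```

Validating the names up front is a design choice for this sketch: a misspelled key fails when the dict is built, not later when it is unpacked into ``get_page_layout``.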