extract_table ignores ordering defined while loading the document #153

Open
paulopaixaoamaral opened this issue Jan 13, 2021 · 1 comment

@paulopaixaoamaral
Collaborator

Bug Report

extract_table re-orders the table rows along the y axis (top to bottom), which works in most cases.

The problem arises when the table's header sits lower on the page than some of the table's other elements, for example when the table is on a page split into two columns:
[Image: extract_table_bug]

In the above case, even if element_ordering is properly set in load to adjust to the page split, extract_table would return:

    [["C", "D"], ["E", "F"], ["HEADER 1", "HEADER 2"], ["A", "B"]]

Should we make extract_table respect the order in which the document elements are defined? Or should we add some sort of rows_sort and columns_sort options to the function?

@jstockwin
Owner

jstockwin commented Jan 14, 2021

Hmm, good point. You're right that this is an issue, but I'm not quite sure what the fix is.

In extract_table we do:

    sorted_rows = sorted(
        rows,
        key=lambda row: (
            row[0].page_number,
            max(-(elem.bounding_box.y1) for elem in row),
        ),
    )
    sorted_cols = sorted(
        cols, key=lambda col: max(elem.bounding_box.x0 for elem in col)
    )

so for the rows we're only interested in the y value, and for the cols we're only interested in the x value.
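To make the effect concrete, here is a minimal sketch that applies the row sort above to the two-column layout from the report. The `make_elem` helper and all coordinates are hypothetical stand-ins (not the library's element class), and the rows are built by hand rather than detected; I'm assuming PDF-style coordinates where y grows upwards.

```python
from types import SimpleNamespace


def make_elem(text, x0, y1, page_number=1):
    """Hypothetical stand-in for a PDF element; not the library's class."""
    return SimpleNamespace(
        text=text,
        page_number=page_number,
        bounding_box=SimpleNamespace(x0=x0, y1=y1),
    )


# Left page column: header and first row, low on the page.
# Right page column: the continuation of the table, high on the page.
rows = [
    [make_elem("HEADER 1", 50, 200), make_elem("HEADER 2", 150, 200)],
    [make_elem("A", 50, 180), make_elem("B", 150, 180)],
    [make_elem("C", 350, 700), make_elem("D", 450, 700)],
    [make_elem("E", 350, 680), make_elem("F", 450, 680)],
]

# Same key as the snippet above: page first, then top-to-bottom by y1.
sorted_rows = sorted(
    rows,
    key=lambda row: (
        row[0].page_number,
        max(-(elem.bounding_box.y1) for elem in row),
    ),
)
result = [[elem.text for elem in row] for row in sorted_rows]
print(result)
```

Because only the y value is compared within a page, the rows from the right-hand column end up ahead of the header, which matches the output in the report.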

Ideally, we'd make extract_table use the element_ordering function (or just use the element indexes, which achieves the same thing). However, I'm not quite sure what this looks like. It doesn't feel like it nicely splits into rows and cols somehow.

We might be able to keep the current one, but add an additional sort. Something like:

    sorted_rows = sorted(
        rows,
        key=lambda row: (
            row[0].page_number,
            max(elem.index for elem in row),  # <-----
            max(-(elem.bounding_box.y1) for elem in row),
        ),
    )
    sorted_cols = sorted(
        cols,
        key=lambda col: (
            max(elem.bounding_box.x0 for elem in col),
            max(elem.index for elem in col),  # <-----
        ),
    )

This basically means: do what we did before, but if all of the elements in a row come after another row, keep that row after it. I think this fixes your example.
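As a rough check, here is a sketch of the index-aware row key on hand-built data. The `Element` and `BoundingBox` dataclasses, the `index` attribute, and the coordinates are all assumptions for illustration, not the library's real classes. Note that because the index comes before the y value in the key tuple, y only breaks ties between rows with the same maximum index.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """Hypothetical box; only the fields used by the sort key."""
    x0: float
    y1: float


@dataclass
class Element:
    """Hypothetical element; `index` is its position in the loaded order."""
    text: str
    index: int
    bounding_box: BoundingBox
    page_number: int = 1


# Indexes follow the element ordering: left page column first, then right.
rows = [
    [Element("C", 4, BoundingBox(350, 700)), Element("D", 5, BoundingBox(450, 700))],
    [Element("E", 6, BoundingBox(350, 680)), Element("F", 7, BoundingBox(450, 680))],
    [Element("HEADER 1", 0, BoundingBox(50, 200)), Element("HEADER 2", 1, BoundingBox(150, 200))],
    [Element("A", 2, BoundingBox(50, 180)), Element("B", 3, BoundingBox(150, 180))],
]

sorted_rows = sorted(
    rows,
    key=lambda row: (
        row[0].page_number,
        max(elem.index for elem in row),  # element ordering takes priority
        max(-(elem.bounding_box.y1) for elem in row),  # then top to bottom
    ),
)
result = [[elem.text for elem in row] for row in sorted_rows]
print(result)
```

On this data the header row comes first, followed by the rest in element order. It says nothing yet about whether the rows were detected correctly in the first place.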

However, the other thing I've noticed given your example is if the table in the second col is long enough to overlap the start of the table:
[Image: the table in the second column extends down far enough to overlap the start of the table]

then the way we detect rows is going to pick up [header A, header B, E, F] as a row, and then we have even bigger problems.

I think it's very hard to make these functions cope with this case. It might be that we have to simply extract each table separately and join them. The pdf-parser isn't really properly aware of the layout of the page. Element ordering does a lot, but you can't extract tables just from the order; it has to be done from the location on the page. Given that it's done from the location on the page, I think strange ordering will never play nicely with table extraction...
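A minimal sketch of the "extract separately and join" idea, using plain dicts rather than the library's API. The `join_column_tables` helper, the `split_x` boundary, and the row dicts are all hypothetical; the data deliberately includes the overlap case, with `["E", "F"]` at the same height as the header.

```python
def join_column_tables(rows, split_x):
    """Sort each page column top to bottom, then join left before right.

    `rows` are hypothetical pre-extracted rows: text cells plus the row's
    left edge (x0) and top edge (y1), with y growing upwards.
    """
    left = [row for row in rows if row["x0"] < split_x]
    right = [row for row in rows if row["x0"] >= split_x]

    def top_first(row):
        return -row["y1"]

    ordered = sorted(left, key=top_first) + sorted(right, key=top_first)
    return [row["cells"] for row in ordered]


rows = [
    {"cells": ["C", "D"], "x0": 350.0, "y1": 700.0},
    {"cells": ["E", "F"], "x0": 350.0, "y1": 200.0},  # level with the header
    {"cells": ["HEADER 1", "HEADER 2"], "x0": 50.0, "y1": 200.0},
    {"cells": ["A", "B"], "x0": 50.0, "y1": 180.0},
]

table = join_column_tables(rows, split_x=300.0)
print(table)
```

The point of splitting first is that each part is handled purely by position within its own page column, so rows from different columns never need to be compared by y. In practice the split would have to happen on the elements before row detection, which this sketch doesn't attempt.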

We can think on this further but should remember to document the outcome (I think the likely outcome may be that this isn't supported, and we should document this).
