extract_table ignores ordering defined while loading the document #153

paulopaixaoamaral · 2021-01-13T18:51:27Z

Bug Report

extract_table re-orders the table rows by the y axis (top to bottom), which works for most cases.

The issue arises when a table's header sits below some of the table's other elements on the page, for example when the table continues across a page split into 2 columns.

In that case, even if element_ordering is properly set in load to account for the page split, extract_table would return:

[["C", "D"], ["E", "F"],["HEADER 1", "HEADER 2"], ["A", "B"]]
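A minimal sketch of why this happens, using a stand-in `Elem` class rather than the library's own element type. The coordinates are made up for illustration: the header and the A/B row sit at the bottom of the first column, and the C/D and E/F rows sit at the top of the second column.

```python
from dataclasses import dataclass


@dataclass
class Elem:
    """Hypothetical stand-in for a PDF element (not the library's class)."""
    text: str
    page_number: int
    y1: float  # top edge; PDF y coordinates grow upward


# Rows listed in reading order: header first, then A/B, C/D, E/F.
rows = [
    [Elem("HEADER 1", 1, 200), Elem("HEADER 2", 1, 200)],
    [Elem("A", 1, 150), Elem("B", 1, 150)],
    [Elem("C", 1, 700), Elem("D", 1, 700)],
    [Elem("E", 1, 650), Elem("F", 1, 650)],
]

# Sorting purely top to bottom puts the second column's rows first.
sorted_rows = sorted(
    rows,
    key=lambda row: (row[0].page_number, max(-e.y1 for e in row)),
)
result = [[e.text for e in row] for row in sorted_rows]
print(result)
```

With these coordinates the header ends up third, matching the output reported above.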

Should we make extract_table obey the order in which the document elements are defined? Or should we add some sort of rows_sort and columns_sort options to the function?


jstockwin · 2021-01-14T09:40:18Z

Hmm, good point. You're right that this is an issue, but I'm not quite sure what the fix is.

In extract_table we do:

    sorted_rows = sorted(
        rows,
        key=lambda row: (
            row[0].page_number,
            max(-(elem.bounding_box.y1) for elem in row),
        ),
    )
    sorted_cols = sorted(
        cols, key=lambda col: max(elem.bounding_box.x0 for elem in col)
    )

so for the rows we're only interested in the y value, and for the cols we're only interested in the x value.

Ideally, we'd make extract_table use the element_ordering function (or just use the element indexes, which achieves the same thing). However, I'm not quite sure what this looks like. It doesn't feel like it nicely splits into rows and cols somehow.

We might be able to keep the current one, but add an additional sort. Something like:

    sorted_rows = sorted(
        rows,
        key=lambda row: (
            row[0].page_number,
            max(elem.index for elem in row),  # <-----
            max(-(elem.bounding_box.y1) for elem in row),
        ),
    )
    sorted_cols = sorted(
        cols, key=lambda col: (
            max(elem.bounding_box.x0 for elem in col),
            max(elem.index for elem in col),  # <-----
        )
    )

This basically means: do what we did before, but if all of the elements in a row come after another row, keep that row after it. I think this fixes your example.
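To check the row part of that idea against the example, here is a small sketch with a stand-in `Elem` class (hypothetical, not the library's type). The `index` values assume element_ordering has already numbered the elements in reading order, and the coordinates are invented.

```python
from dataclasses import dataclass


@dataclass
class Elem:
    """Hypothetical stand-in for a PDF element with an ordering index."""
    text: str
    page_number: int
    index: int  # assumed position given by element_ordering
    y1: float   # top edge; PDF y coordinates grow upward


# Rows listed in page position order (top of the second column first).
rows = [
    [Elem("C", 1, 4, 700), Elem("D", 1, 5, 700)],
    [Elem("E", 1, 6, 650), Elem("F", 1, 7, 650)],
    [Elem("HEADER 1", 1, 0, 200), Elem("HEADER 2", 1, 1, 200)],
    [Elem("A", 1, 2, 150), Elem("B", 1, 3, 150)],
]

sorted_rows = sorted(
    rows,
    key=lambda row: (
        row[0].page_number,
        max(e.index for e in row),  # compared before the y value
        max(-e.y1 for e in row),
    ),
)
ordered = [[e.text for e in row] for row in sorted_rows]
print(ordered)
```

One thing the sketch makes visible: because each element has a unique index and the index term is compared first, the y term never gets to break a tie here, so the rows are effectively sorted by index alone.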

However, the other thing I've noticed from your example is that if the table in the second col is long enough to overlap the start of the table, then the way we detect rows is going to pick up [header A, header B, E, F] as a row, and then we have even bigger problems.
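A rough sketch of that failure mode. It assumes rows are detected by grouping elements whose vertical extents overlap, which is an assumption about the approach and not the library's actual code. `Box` and `group_rows` are hypothetical names and the coordinates are invented.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for an element's vertical extent."""
    text: str
    y0: float  # bottom edge
    y1: float  # top edge


def group_rows(elems):
    """Group elements into rows when their vertical ranges overlap."""
    rows = []
    for e in sorted(elems, key=lambda e: -e.y1):
        for row in rows:
            if any(e.y0 < o.y1 and o.y0 < e.y1 for o in row):
                row.append(e)
                break
        else:
            rows.append([e])
    return rows


elems = [
    # First column: the table starts low on the page.
    Box("header A", 180, 200), Box("header B", 180, 200),
    Box("A", 130, 150), Box("B", 130, 150),
    # Second column: long enough that E/F reach the header's height.
    Box("C", 230, 250), Box("D", 230, 250),
    Box("E", 180, 200), Box("F", 180, 200),
]

detected = [[b.text for b in row] for row in group_rows(elems)]
print(detected)
```

Once the header shares a y range with E and F, no sort key applied afterwards can separate them, since they already belong to the same row.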

I think it's very hard to make these functions cope with this case. It might be that we have to simply extract each table separately and join them. The pdf-parser isn't really properly aware of the layout of the page. Element ordering does a lot, but you can't extract tables just from the order; it has to be done from the location on the page. Given that, I think strange ordering will never play nicely with table extraction...
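The "extract separately and join" workaround could look roughly like this. `join_tables` is a hypothetical helper, not part of the library, and the two inputs stand in for whatever extract_table returns for each column's part of the table; how each part gets isolated is left out because it depends on the document.

```python
def join_tables(parts):
    """Concatenate table fragments in the given order.

    Hypothetical helper: each part is a list of rows, and every row
    must have the same number of cells.
    """
    width = None
    joined = []
    for part in parts:
        for row in part:
            if width is None:
                width = len(row)
            elif len(row) != width:
                raise ValueError(
                    f"expected {width} cells per row, got {len(row)}"
                )
            joined.append(list(row))
    return joined


# Fragments as they might come back from two separate extractions.
first_column_part = [["HEADER 1", "HEADER 2"], ["A", "B"]]
second_column_part = [["C", "D"], ["E", "F"]]

table = join_tables([first_column_part, second_column_part])
print(table)
```

The caller decides the order of the parts, so the reading order no longer has to be recovered from page coordinates.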

We can think on this further but should remember to document the outcome (I think the likely outcome may be that this isn't supported, and we should document this).

paulopaixaoamaral added the bug label Jan 13, 2021

jstockwin added the component: tables label Jul 29, 2021
