extract_table re-orders the table rows by the y axis (top to bottom), which works for most cases.
The issue arises when a table's header sits below some of the table's other elements, for example when the table is on a page split into 2 columns:
In the above case, even if element_ordering is properly set in load to adjust to the page split, extract_table would return:
Should we make extract_table obey the ordering in which the document elements are defined? Or should we add some sort of rows_sort and columns_sort options to the function?
Hum good point. You're right that this is an issue, but I'm not quite sure what the fix is.
In extract_table we do:
sorted_rows = sorted(
    rows,
    key=lambda row: (
        row[0].page_number,
        max(-(elem.bounding_box.y1) for elem in row),
    ),
)
sorted_cols = sorted(
    cols, key=lambda col: max(elem.bounding_box.x0 for elem in col)
)
so for the rows we're only interested in the y value, and for the cols we're only interested in the x value.
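To make the effect of the y-only row sort concrete, here is a minimal sketch that applies the same sort key to stand-in objects. `Box`, `Elem`, `make_row` and all coordinates are hypothetical and are not part of the library; the sketch also assumes PDF-style coordinates where y grows upwards, so a larger y1 means higher on the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    # Hypothetical stand-in for an element's bounding box.
    x0: float
    y1: float


@dataclass(frozen=True)
class Elem:
    # Hypothetical stand-in for a document element.
    text: str
    page_number: int
    bounding_box: Box


def sort_rows_by_y(rows):
    # Same key as the snippet above: page first, then top to bottom.
    return sorted(
        rows,
        key=lambda row: (
            row[0].page_number,
            max(-(elem.bounding_box.y1) for elem in row),
        ),
    )


def make_row(texts, y1, x_start):
    # Build one row of elements on page 1, spaced 100 units apart.
    return [Elem(t, 1, Box(x_start + 100 * i, y1)) for i, t in enumerate(texts)]


# Reading order: the table starts at the bottom of the left page column
# and continues at the top of the right page column.
rows_in_reading_order = [
    make_row(["header A", "header B"], y1=200, x_start=50),
    make_row(["A", "B"], y1=150, x_start=50),
    make_row(["C", "D"], y1=700, x_start=350),
    make_row(["E", "F"], y1=650, x_start=350),
]

sorted_texts = [
    [elem.text for elem in row] for row in sort_rows_by_y(rows_in_reading_order)
]
print(sorted_texts)
```

In this toy layout the header row ends up third, after the two rows from the right page column, which is the kind of reordering described in the report.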
Ideally, we'd make extract_table use the element_ordering function (or just use the element indexes, which achieves the same thing). However, I'm not quite sure what this looks like. It doesn't feel like it nicely splits into rows and cols somehow.
We might be able to keep the current one, but add an additional sort. Something like:
sorted_rows = sorted(
    rows,
    key=lambda row: (
        row[0].page_number,
        max(elem.index for elem in row),  # <-----
        max(-(elem.bounding_box.y1) for elem in row),
    ),
)
sorted_cols = sorted(
    cols, key=lambda col: (
        max(elem.bounding_box.x0 for elem in col),
        max(elem.index for elem in col),  # <-----
    )
)
This basically means do what we did before, but if all of the elements in a row come after another row, it'll keep it after. I think this fixes your example.
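As a rough check of the row part of that idea, the sketch below runs the suggested key over stand-in objects. `Box`, `Elem` and the coordinates are hypothetical, and the `index` attribute is assumed to hold each element's position in the document's element ordering, as in the snippet above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    # Hypothetical stand-in for an element's bounding box.
    x0: float
    y1: float


@dataclass(frozen=True)
class Elem:
    # Hypothetical stand-in; `index` is the assumed reading-order position.
    text: str
    index: int
    page_number: int
    bounding_box: Box


def sort_rows_by_index_then_y(rows):
    # Key from the suggestion above: page, then max element index, then y.
    return sorted(
        rows,
        key=lambda row: (
            row[0].page_number,
            max(elem.index for elem in row),
            max(-(elem.bounding_box.y1) for elem in row),
        ),
    )


# (text, index, x0, y1): the table starts low in the left page column
# and continues at the top of the right page column.
specs = [
    [("header A", 0, 50, 200), ("header B", 1, 150, 200)],
    [("A", 2, 50, 150), ("B", 3, 150, 150)],
    [("C", 4, 350, 700), ("D", 5, 450, 700)],
    [("E", 6, 350, 650), ("F", 7, 450, 650)],
]
rows = [[Elem(t, i, 1, Box(x, y)) for (t, i, x, y) in spec] for spec in specs]

# Feed the rows in reverse to show the result does not depend on input order.
sorted_texts = [
    [elem.text for elem in row]
    for row in sort_rows_by_index_then_y(list(reversed(rows)))
]
print(sorted_texts)
```

In this toy layout the header comes first again. Note that tuples compare element by element, and rows made of distinct elements never share a maximum index, so with this key the y term only ever acts as a tie-breaker.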
However, the other thing I've noticed given your example is if the table in the second col is long enough to overlap the start of the table:
then the way we detect rows is going to pick up [header A, header B, E, F] as a row, and then we have even bigger problems.
I think it's very hard to make these functions cope with this case. It might be that we have to simply extract each table separately and join them. The pdf-parser isn't really properly aware of the layout of the page. Element ordering does a lot; however, you can't extract tables just from the order, it has to be done from the location on the page. Given it's done from the location on the page, I think strange ordering will never play nice with table extraction...
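The "extract separately and join" workaround could look roughly like the sketch below. It assumes each part of the table has already been extracted on its own as a list of rows of strings; `join_tables` is a hypothetical helper, not a library function, and the sample data is made up.

```python
from typing import List

Table = List[List[str]]


def join_tables(*tables: Table) -> Table:
    """Concatenate separately extracted table parts, in the order given."""
    joined: Table = []
    for table in tables:
        for row in table:
            # All parts must have the same number of columns to line up.
            if joined and len(row) != len(joined[0]):
                raise ValueError(
                    f"Expected {len(joined[0])} columns, got {len(row)}: {row}"
                )
            joined.append(list(row))
    return joined


# Made-up results of extracting each page column's part of the table.
left_part = [["header A", "header B"], ["A", "B"]]
right_part = [["C", "D"], ["E", "F"]]

joined = join_tables(left_part, right_part)
print(joined)
```

The caller decides the order of the parts, so the reading order comes from how the parts are passed in and not from positions on the page.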
We can think on this further but should remember to document the outcome (I think the likely outcome may be that this isn't supported, and we should document this).