EATEN: Entity-aware Attention for Single Shot Visual Text Extraction

He Guo, Xiameng Qin, Jiaming Liu, Junyu Han, Jingtuo Liu, Errui Ding

@article{guo2019eaten,
   title={EATEN: Entity-Aware Attention for Single Shot Visual Text Extraction},
   ISBN={9781728130149},
   url={http://dx.doi.org/10.1109/ICDAR.2019.00049},
   DOI={10.1109/icdar.2019.00049},
   journal={2019 International Conference on Document Analysis and Recognition (ICDAR)},
   publisher={IEEE},
   author={Guo, He and Qin, Xiameng and Liu, Jiaming and Han, Junyu and Liu, Jingtuo and Ding, Errui},
   year={2019},
   month={Sep}
}

Pipeline

| Receipt detection | Receipt localization | Receipt normalization | Text line segmentation | Optical character recognition | Semantic analysis |
| --- | --- | --- | --- | --- | --- |
| ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |

Semantic analysis

Fields extracted:
- train ticket:
  - Ticket number,
  - Starting station,
  - Train number,
  - Destination station,
  - Date,
  - Ticket rates,
  - Seat category,
  - Name
- passport:
  - passport number,
  - name,
  - gender,
  - birth date,
  - birth place,
  - issue place,
  - expiry date
- business card:
  - telephone,
  - postcode,
  - mobile,
  - url,
  - email,
  - fax,
  - address,
  - name,
  - title,
  - company

The authors design an entity-aware attention network with multiple decoders and a state transition between contiguous decoders, so that the entities of interest (EoIs) can be located and extracted quickly without complicated post-processing.
The CNN-based backbone extracts high-level visual features from the image; the entity-aware attention network learns the entity layout of the image automatically and decodes the content of the predefined EoIs with entity-aware decoders.
The backbone is Inception v3.
To build semantic relations between neighboring EoIs, the last state of the previous decoder is used to initialize the current decoder. An initial-state warm-up is also used to boost the performance of the attention mechanism.
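
The state hand-off between neighboring decoders can be sketched as below. This is only an illustration under loud assumptions: `rnn_step` is a toy scalar recurrent cell standing in for the LSTM, the weights `w_x` and `w_s` are arbitrary, and the names `run_decoder` and `run_chain` are hypothetical, not taken from the paper. The warm-up step is not modeled.

```python
import math

def rnn_step(x, state, w_x=0.5, w_s=0.9):
    """Toy scalar recurrent cell (assumed stand-in for an LSTM unit)."""
    return math.tanh(w_x * x + w_s * state)

def run_decoder(inputs, init_state):
    """Run one decoder over its inputs and return its last state."""
    state = init_state
    for x in inputs:
        state = rnn_step(x, state)
    return state

def run_chain(per_entity_inputs, first_state=0.0, transfer=True):
    """Run the decoders in order.

    With transfer=True each decoder starts from the last state of the
    previous decoder; with transfer=False every decoder starts from
    first_state, i.e. the decoders are independent.
    """
    final_states = []
    state = first_state
    for inputs in per_entity_inputs:
        init = state if transfer else first_state
        state = run_decoder(inputs, init)
        final_states.append(state)
    return final_states

# Three hypothetical entities; the second one receives all-zero inputs.
inputs = [[1.0, 0.5], [0.0, 0.0], [-1.0, 0.2]]
chained = run_chain(inputs, transfer=True)
independent = run_chain(inputs, transfer=False)
```

In this toy setup the second decoder sees only zero inputs, so without the hand-off its final state is exactly zero, while with the hand-off it still carries information from the first decoder.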
In each decoding step, the entity-aware decoder first uses the entity-aware attention mechanism to obtain the corresponding context feature. The context feature, combined with the previously predicted character, is fed into an LSTM unit as input. The LSTM then updates the context feature and predicts the current character.
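
A single decoding step of this kind might look like the following pure-Python sketch. It is not the paper's implementation: additive (Bahdanau-style) attention is assumed as a stand-in for the entity-aware attention, the weights are random, and all names and sizes (`attend`, `lstm_step`, `decode_step`, `FEAT`, `HID`, `ATT`, `VOCAB`, `POS`) are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def attend(features, h, W_f, W_h, v):
    """Additive attention: score every feature-map position against state h."""
    proj_h = matvec(W_h, h)
    scores = []
    for f in features:
        proj_f = matvec(W_f, f)
        scores.append(sum(vi * math.tanh(a + b) for vi, a, b in zip(v, proj_f, proj_h)))
    weights = softmax(scores)
    dim = len(features[0])
    # Context feature = attention-weighted sum of the position features.
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return context, weights

def lstm_step(x, h, c, W):
    """One LSTM step; W maps a gate name to a matrix over the concatenation [x, h]."""
    z = x + h  # list concatenation
    i = [sigmoid(a) for a in matvec(W["i"], z)]
    f = [sigmoid(a) for a in matvec(W["f"], z)]
    o = [sigmoid(a) for a in matvec(W["o"], z)]
    g = [math.tanh(a) for a in matvec(W["g"], z)]
    c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
    h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
    return h_new, c_new

def decode_step(features, prev_char_vec, h, c, params):
    """Attention -> [context, previous character] -> LSTM -> character distribution."""
    context, weights = attend(features, h, params["W_f"], params["W_h"], params["v"])
    h, c = lstm_step(context + prev_char_vec, h, c, params["lstm"])
    probs = softmax(matvec(params["W_out"], h))
    return probs, h, c, weights

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
FEAT, HID, ATT, VOCAB, POS = 4, 3, 5, 6, 8  # illustrative sizes only
params = {
    "W_f": rand_matrix(rng, ATT, FEAT),
    "W_h": rand_matrix(rng, ATT, HID),
    "v": [rng.uniform(-0.1, 0.1) for _ in range(ATT)],
    "lstm": {gate: rand_matrix(rng, HID, FEAT + VOCAB + HID) for gate in "ifog"},
    "W_out": rand_matrix(rng, VOCAB, HID),
}
features = rand_matrix(rng, POS, FEAT)   # flattened backbone feature map
prev_char = [1.0] + [0.0] * (VOCAB - 1)  # one-hot vector for a start token
probs, h, c, weights = decode_step(features, prev_char, [0.0] * HID, [0.0] * HID, params)
```

Repeating `decode_step` with the argmax of `probs` as the next `prev_char` until an end token appears would yield the character sequence for one EoI.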

Notes

Limitation: the number of decoders is fixed, so the model cannot handle a varying number of items on a receipt.
The framework is end-to-end trainable, rather than a multi-stage procedure.
EATEN has no explicit text recognition step; the entity-aware decoders decode the corresponding EoIs directly. No lexicon is used in this work.
EATEN is trained without bounding-box or full-text annotations and predicts the target entities of an input image directly in one shot.
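
The fixed-decoder limitation noted above can be made concrete with a small sketch. It assumes, purely for illustration, one output slot per predefined field; the names `EOI_SCHEMA`, `empty_prediction` and `fits_schema` are hypothetical, and the field keys are the fields listed under "Semantic analysis".

```python
# Fixed set of entities of interest per document type.
EOI_SCHEMA = {
    "train_ticket": [
        "ticket_number", "starting_station", "train_number",
        "destination_station", "date", "ticket_rates", "seat_category", "name",
    ],
    "passport": [
        "passport_number", "name", "gender", "birth_date",
        "birth_place", "issue_place", "expiry_date",
    ],
    "business_card": [
        "telephone", "postcode", "mobile", "url", "email",
        "fax", "address", "name", "title", "company",
    ],
}

def empty_prediction(doc_type):
    """One string slot per predefined field; the slot count never changes."""
    return {field: "" for field in EOI_SCHEMA[doc_type]}

def fits_schema(doc_type, record):
    """True only if the record has exactly the predefined fields."""
    return set(record) == set(EOI_SCHEMA[doc_type])

ticket = empty_prediction("train_ticket")

# A receipt with an open-ended number of line items has no fixed field set,
# so it cannot be expressed as a fixed list of slots like the ones above.
receipt = {"item_1": "coffee", "item_2": "bagel", "item_3": "juice"}
```

Because the output slots are decided before training, a document with a variable number of repeated entries would need a different output structure than this fixed mapping.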
