Feedback Prize is a Kaggle competition, and below is the approach I followed in it.
Before we get into the approach, let's talk a bit about the BigBird Transformer.
We already know that BERT cannot handle sequences longer than 512 tokens, and there are other models, such as Longformer, that can. Here, though, let's focus on BigBird.
In BERT, each token attends to every other token in self-attention in order to re-represent itself, which costs O(n^2).
What if the same is applied to a longer sequence? Consider 4000+ tokens: it would take forever to compute and would require a lot of computing power.
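To put rough numbers on it: with 512 tokens, full self-attention computes 512^2 ≈ 262K inner products per head, while with 4096 tokens it is 4096^2 ≈ 16.8M, about 64 times more.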
To solve this, the BigBird Transformer uses the following three attention patterns:
- Global Attention
- Random Attention
- Window Attention
Below is an image of the BigBird attention matrix. As you can see, the attention score is computed only for the highlighted entries.
Because of this sparse attention matrix, the time complexity drops to O(n) inner products.
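To make the three patterns concrete, here is a minimal NumPy sketch of my own (an illustration, not the actual BigBird implementation) of how a sparse attention mask combining global, window, and random attention could be built:

# Minimal sketch (not BigBird's real code) of a sparse attention mask that
# combines global, window, and random attention. True = score is computed.
import numpy as np

def bigbird_style_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    # Global attention: the first num_global tokens attend to everything,
    # and every token attends to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    # Window attention: each token attends to `window` neighbours on each side.
    for i in range(seq_len):
        mask[i, max(0, i - window):min(seq_len, i + window + 1)] = True
    # Random attention: each token attends to a few randomly chosen tokens.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True
    return mask

mask = bigbird_style_mask(16)
print(mask.sum(), "of", mask.size, "attention scores computed")

The number of computed entries grows only by a few per row, so it scales roughly linearly with the sequence length instead of quadratically, which is where the O(n) behaviour comes from.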
Time & Memory Complexity
BigBird's global attention comes in two variants:
- ITC (Internal Transformer Construction)
- ETC (Extended Transformer Construction)
The difference: in the image above showing global attention, only one token (the start token) attends to everything; in the other variant, instead of one token, this is increased to 3 global tokens. That's the only difference.
Pretty cool, isn't it?
BigBird comes with 2 implementations: original_full & block_sparse. For sequence lengths < 1024, using original_full is advised, as there is no benefit in using block_sparse attention.
- If Sequence length < 1024 ( Go with original_full)
- If Sequence length > 1024 ( Go with block_sparse)
Let's see a code example below:
from transformers import BigBirdModel

model = BigBirdModel.from_pretrained("google/bigbird-roberta-base")
This will initialize the model with the default configuration, i.e. attention_type="block_sparse", num_random_blocks=3, block_size=64.
# block_size is the size of each query/key block; num_random_blocks is the
# number of random blocks each query block attends to in every layer
model = BigBirdModel.from_pretrained("google/bigbird-roberta-base", num_random_blocks=2, block_size=16)
By setting attention_type to "original_full", BigBird will rely on full attention of O(n^2) complexity. This way BigBird is 99.9% similar to BERT.
model = BigBirdModel.from_pretrained("google/bigbird-roberta-base", attention_type="original_full")
That's pretty much it about BigBird. Note that it can only handle sequences of at most 4096 tokens.
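As a quick sanity check on that limit, here is a small sketch of mine that tokenizes a long text with the public google/bigbird-roberta-base tokenizer and truncates at 4096 tokens; the example text is just a placeholder:

# Sketch: tokenizing a long text with a 4096-token cap. In the competition
# the input would be the full essay rather than this placeholder string.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
enc = tokenizer(
    "some very long essay text " * 1000,
    max_length=4096,
    truncation=True,
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # at most (1, 4096)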
Approach for the Feedback Prize Dataset:
Credit goes to Nicholas; I was inspired by his kernel.
The dataset contains essay_id, discourse_text, discourse_type, and discourse_effectiveness, where discourse_effectiveness is the class label to predict.
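A quick look at the data could be done like this (the file name train.csv is an assumption about the competition files):

# Sketch: inspecting the training data; the file path is an assumption.
import pandas as pd

train = pd.read_csv("train.csv")
print(train[["essay_id", "discourse_text", "discourse_type",
             "discourse_effectiveness"]].head())
print(train["discourse_effectiveness"].value_counts())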
With this approach, we create a [CLS_discourse_type] and an [END_discourse_type] marker around each discourse_text and use the [CLS_discourse_type] token to predict the class of that discourse_text.
We build the output labels by assigning the label value 0, 1, or 2 (depending on the class) to the CLS token only, and we backpropagate only that loss.
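Below is a minimal sketch of how such inputs and labels could be built. The marker names, the label mapping, and the helper build_example are my own illustrative assumptions, not the exact code from the notebook:

# Sketch of the input/label construction described above. Marker names,
# label mapping, and helper names are assumptions made for illustration.
from transformers import AutoTokenizer

LABEL2ID = {"Ineffective": 0, "Adequate": 1, "Effective": 2}
DISCOURSE_TYPES = ["Lead", "Position", "Claim", "Counterclaim",
                   "Rebuttal", "Evidence", "Concluding Statement"]

def marker(prefix, d_type):
    return f"[{prefix}_{d_type.replace(' ', '_')}]"

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
tokenizer.add_special_tokens({
    "additional_special_tokens": [marker(p, t) for t in DISCOURSE_TYPES
                                  for p in ("CLS", "END")]
})
# After adding tokens, remember model.resize_token_embeddings(len(tokenizer)).

def build_example(discourses, max_len=4096):
    """discourses: list of (discourse_type, discourse_text, effectiveness) for one essay."""
    input_ids, labels = [], []
    for d_type, d_text, d_eff in discourses:
        text_ids = tokenizer(d_text, add_special_tokens=False)["input_ids"]
        input_ids += ([tokenizer.convert_tokens_to_ids(marker("CLS", d_type))]
                      + text_ids
                      + [tokenizer.convert_tokens_to_ids(marker("END", d_type))])
        # Only the [CLS_...] position carries a real label (0/1/2); all other
        # positions get -100 so only the CLS-token loss is backpropagated.
        labels += [LABEL2ID[d_eff]] + [-100] * (len(text_ids) + 1)
    return input_ids[:max_len], labels[:max_len]

With -100 at the non-CLS positions, a standard token-classification head with cross-entropy loss automatically ignores everything except the per-discourse CLS tokens.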
That is an overview of the approach; for a detailed walkthrough, please check out my notebook.