We developed WassOs, an unsupervised opinion summarization model based on VAE and the Wasserstein barycenter. To capture the main meaning of different kinds of documents, we disentangle the document distributions into separate semantic and syntactic spaces. We introduce these distributions into the Wasserstein space and construct the summary distribution using the Wasserstein barycenter. This strategy reduces the mutual interference of semantic and syntactic information and identifies the representative summary distribution from multiple noisy documents. We developed THVAE, an unsupervised timeline summarization model based on hierarchical VAE.
Our code is based on the Copycat framework. Please follow this link to build the conda environment.
We experimented on 3 datasets with different types of content (social media posts, reviews) to allow for a thorough evaluation across domains. The social media posts are from Twitter and Reddit. The reviews are from Amazon. We experimented on talk-life datasets.
The expected input format is provided in artifacts and looks like this:
group_id | review_text | category | review_tag |
---|---|---|---|
B000WJ3I1M | I have this cupholder mounted ... | reviews_electronics_5 | PRP VBP DT NN VBD IN DT ... |
We parse each document into the tag sequence with Zpar.
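As a rough illustration of how rows in this format could be loaded and grouped, here is a minimal sketch. It assumes the input is a tab-separated file with the four columns shown in the table above; the delimiter, the `read_groups` helper, and the output layout are hypothetical and not taken from the repository.

```python
import csv
import io
from collections import defaultdict

# Column order copied from the example table; treat it as an assumption.
COLUMNS = ["group_id", "review_text", "category", "review_tag"]


def read_groups(handle, delimiter="\t"):
    """Group rows by group_id (hypothetical helper, tab delimiter assumed)."""
    reader = csv.DictReader(
        handle, fieldnames=COLUMNS, delimiter=delimiter, quoting=csv.QUOTE_NONE
    )
    groups = defaultdict(list)
    for row in reader:
        if row["group_id"] == "group_id":
            continue  # skip a header row if the file has one
        groups[row["group_id"]].append(
            {
                "tokens": row["review_text"].split(),
                "tags": row["review_tag"].split(),  # POS tag sequence
                "category": row["category"],
            }
        )
    return dict(groups)


# Tiny in-memory sample standing in for a real data file.
sample = io.StringIO(
    "group_id\treview_text\tcategory\treview_tag\n"
    "B000WJ3I1M\tI have this cupholder\treviews_electronics_5\tPRP VBP DT NN\n"
)
groups = read_groups(sample)
print(groups["B000WJ3I1M"][0]["tags"])
```

Grouping by `group_id` keeps all documents about the same item together, which is the unit a multi-document summarizer would consume.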
The method for extracting key phrases is in the file 'read_timeline'.
We developed two strategies, 'T_center' and 'O_center', for this model. The first strategy, 'T_center', uses the two Wasserstein barycenters from the semantic and syntactic spaces to construct the summary distribution, and it performs better on social media posts. The second strategy, 'O_center', uses only one Wasserstein barycenter from the semantic space and constructs the syntactic distribution with the method for each document. It performs better on the reviews dataset. You can select a strategy in model_hp.py based on your data.
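To make the barycenter idea concrete, here is a minimal sketch of a Wasserstein barycenter for one latent space. It assumes each document is encoded as a Gaussian with diagonal covariance, in which case the 2-Wasserstein barycenter has a closed form: the weighted average of the means and the weighted average of the standard deviations. The `gaussian_barycenter` function and the sample numbers are illustrative only; whether the repository computes the barycenter this way is an assumption.

```python
def gaussian_barycenter(means, stds, weights=None):
    """2-Wasserstein barycenter of diagonal Gaussians (illustrative sketch).

    means, stds: one list of floats per document, all of the same length.
    weights: optional per-document weights summing to 1 (uniform if omitted).
    """
    n = len(means)
    if weights is None:
        weights = [1.0 / n] * n
    dim = len(means[0])
    # For diagonal covariances the barycenter is computed per dimension:
    # average the means, and average the standard deviations (not variances).
    bary_mean = [sum(w * m[d] for w, m in zip(weights, means)) for d in range(dim)]
    bary_std = [sum(w * s[d] for w, s in zip(weights, stds)) for d in range(dim)]
    return bary_mean, bary_std


# Two hypothetical document posteriors in a 2-dimensional semantic space.
sem_means = [[0.0, 2.0], [2.0, 4.0]]
sem_stds = [[1.0, 0.5], [3.0, 1.5]]
mean, std = gaussian_barycenter(sem_means, sem_stds)
print(mean, std)  # [1.0, 3.0] [2.0, 1.0]
```

Under this reading, 'T_center' would call such a function once for the semantic space and once for the syntactic space, while 'O_center' would call it for the semantic space only.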