Dealing with cold start users click history #11

Open
igor17400 opened this issue Mar 26, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@igor17400
Contributor

Hello! I'm currently handling a dataset where the histories column might initially be empty, especially for users who are accessing the system for the first time.

Given this context, I'm seeking advice on how to approach a particular situation highlighted in the code found at this GitHub link. The process involves tokenizing the titles of previously clicked news articles, but I'm facing a potential cold start issue for new users without any history. In these instances, should I consider tokenizing empty titles, abstracts, etc.?

igor17400 changed the title from "Dealing with cold start users" to "Dealing with cold start users click history" on Mar 26, 2024
@igor17400
Contributor Author

igor17400 commented Mar 27, 2024

Here is an update on what I did.

Inside __getitem__ in rec_dataset.py, I added the following condition:

if history.size == 1 and history[0] == '':
    history = self._initialize_cold_start()
else:
    history = self.news.loc[history]

where _initialize_cold_start is defined as the following:

def _initialize_cold_start(self):
    """
    In cold-start cases the history can be empty, so we return a
    single-row DataFrame of placeholder values for the embedding.
    """
    # Build the placeholder row directly (DataFrame.append was removed in pandas 2.0)
    history = pd.DataFrame({
        'title': [''],
        'abstract': [''],
        'sentiment_class': [0],
        'sentiment_score': [0.0]
    })

    # Explicitly set the data types for the entire DataFrame
    history = history.astype({
        'title': 'object',
        'abstract': 'object',
        'sentiment_class': 'int64',
        'sentiment_score': 'float64'
    })

    return history

This may be useful for other people who are trying to solve the same problem.
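
For readers who want to try the idea outside the library, here is a minimal standalone sketch of the branch described above. The toy news table, the news IDs (N1, N2), and the free-function names initialize_cold_start and lookup_history are assumptions made for illustration; they are not part of the library's API. The placeholder row is built directly rather than with DataFrame.append, which was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

# Column dtypes for the placeholder row, matching the comment above.
PLACEHOLDER_DTYPES = {
    "title": "object",
    "abstract": "object",
    "sentiment_class": "int64",
    "sentiment_score": "float64",
}


def initialize_cold_start() -> pd.DataFrame:
    """Return a one-row DataFrame of placeholder values for an empty history."""
    row = {
        "title": [""],
        "abstract": [""],
        "sentiment_class": [0],
        "sentiment_score": [0.0],
    }
    return pd.DataFrame(row).astype(PLACEHOLDER_DTYPES)


def lookup_history(news: pd.DataFrame, history: np.ndarray) -> pd.DataFrame:
    """Use the placeholder row for an empty history, else look up the clicked news."""
    if history.size == 1 and history[0] == "":
        return initialize_cold_start()
    return news.loc[history]


# Toy news table indexed by hypothetical news IDs.
news = pd.DataFrame(
    {
        "title": ["Title A", "Title B"],
        "abstract": ["Abstract A", "Abstract B"],
        "sentiment_class": [1, 2],
        "sentiment_score": [0.3, -0.1],
    },
    index=["N1", "N2"],
)

cold = lookup_history(news, np.array([""]))          # new user, no clicks
warm = lookup_history(news, np.array(["N2", "N1"]))  # user with two clicks
```

Here cold is a single row of empty strings and zeros, while warm contains the two clicked articles in click order. Note that the check only fires when the history is encoded as a one-element array holding an empty string, which is the encoding assumed in the condition above.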

@andreeaiana
Owner

Hi @igor17400,

Thanks for raising this issue. Indeed, the original code did not work with empty user histories, but implementing this functionality should be useful for many users.

I think your solution is simple and elegant. I can have a look at it over the weekend, to test it with both pretrained word embeddings and PLMs, and streamline it across the data preprocessing functions for all datasets. Would you like to open a PR with your proposed solution?

andreeaiana added the enhancement label on Mar 27, 2024
@igor17400
Contributor Author

Hi @andreeaiana,

I'm in the process of implementing PP-Rec, as outlined in PR #12. I'm still working through it, making sure the blocks are accurate and checking the scores and behaviors for MIND large and Adressa, so it is not ready to be merged yet. However, I wanted to let you know that in this PR I'm adding the _initialize_cold_start idea, along with the previously mentioned spinner for score calculation to avoid the terminal freezing.

@andreeaiana
Owner

Great, thanks for letting me know and for your contributions to the library.

@igor17400
Contributor Author

@andreeaiana I noticed you filter out cold start users (those with empty histories). Why is that?

Link to the code

I'm wondering if it might be better to use a strategy like the one I previously mentioned (_initialize_cold_start) to pre-populate these cold start users with some placeholder news articles, rather than removing them. But maybe my thinking is wrong.

@andreeaiana
Owner

> @andreeaiana I noticed you filter out cold start users (those with empty histories). Why is that?
>
> Link to the code
>
> I'm wondering if it might be better to use a strategy like the one I previously mentioned (_initialize_cold_start) to pre-populate these cold start users with some placeholder news articles, rather than removing them. But maybe my thinking is wrong.

I think that's a good idea, we can try it. I know that some models originally do that, but not all of them.
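
The two options discussed above (filtering out cold-start users versus keeping them with a placeholder history) can be compared on a toy impression log. This is only a sketch under stated assumptions: the column names user and history, the IDs, and the empty-string placeholder are hypothetical and do not reflect the library's actual preprocessing code.

```python
import pandas as pd

# Toy impression log; column names and IDs are hypothetical.
behaviors = pd.DataFrame(
    {
        "user": ["U1", "U2", "U3"],
        "history": [["N1", "N2"], [], ["N3"]],
    }
)

# A user is treated as cold-start when the click history is empty.
is_cold = behaviors["history"].apply(len) == 0

# Option 1: filter out cold-start users entirely.
filtered = behaviors[~is_cold].reset_index(drop=True)

# Option 2: keep every user and give cold-start users a one-item
# placeholder history (an empty string stands in for "no article").
PLACEHOLDER = ""
padded = behaviors.copy()
padded["history"] = padded["history"].apply(
    lambda h: h if len(h) > 0 else [PLACEHOLDER]
)
```

With option 1 the model never sees U2 during training or evaluation; with option 2 U2 stays in the data and its placeholder entry can be mapped to an empty row downstream.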

Poseidondon added a commit to Poseidondon/newsreclib-ru that referenced this issue Jun 24, 2024