How does Collie deal with mixed datatypes in df, such as cont, cat, text, datetime? #40
-
Hi Collie team, great work on making this amazing tool. I like how clean and simple the API is. I have a couple of questions about how to use Collie to prep input data with various data types, such as continuous, categorical (both nominal and ordinal), text, datetime, list (of tags), image, etc.
Learning with mixed datatypes is called multimodal learning. Multiple packages (Google AutoML Tables, AWS AutoGluon) can already solve classification and regression problems this way. Here I am interested in seeing such a feature for building recommendation systems.
Replies: 1 comment
-
Hi @wjlgatech.

Great question, I'll do my best to answer this!

Our plan with Collie was to build a flexible framework for recommendations such that we could include mixed data in the model with ease (images, text, and even tabular data). Right now, the only supported way to do this is to include it as item metadata that is passed to a hybrid model (see the docs here for information on that).
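To make the data-prep side of your question concrete, here's a minimal sketch of collapsing mixed-dtype item features into one numeric matrix. This is plain pandas/scikit-learn/NumPy, not Collie's API, and the columns are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical item-level DataFrame, one row per item ID, with mixed dtypes.
items = pd.DataFrame({
    'price': [9.99, 24.50, 3.25],                      # continuous
    'category': ['shoes', 'shirts', 'shoes'],          # nominal categorical
    'description': ['red running shoe', 'cotton tee',  # free text
                    'kids sandal'],
    'released': pd.to_datetime(['2020-01-15', '2021-06-01', '2019-11-20']),
})

# Continuous columns: standardize.
cont = StandardScaler().fit_transform(items[['price']])

# Nominal categoricals: one-hot encode.
cat = pd.get_dummies(items['category']).to_numpy(dtype=np.float32)

# Text: any fixed-width representation works; TF-IDF is a simple stand-in
# for real text embeddings.
text = TfidfVectorizer().fit_transform(items['description']).toarray()

# Datetimes: convert to numeric features (days since epoch, month, etc.).
days = (items['released'] - pd.Timestamp('1970-01-01')).dt.days.to_numpy()
dates = StandardScaler().fit_transform(
    np.column_stack([days, items['released'].dt.month]))

# One dense num_items x num_features matrix, rows aligned with item IDs,
# ready to hand to a hybrid model as item metadata.
item_metadata = np.hstack([cont, cat, text, dates]).astype(np.float32)
```

The only real requirement is that the rows line up with the item IDs used in your interactions.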
Basically, with this, the model takes in the user embedding and item embedding it is trained to optimize for, concatenates these with the item metadata (i.e. image embeddings, text embeddings, etc.), and uses this full representation to predict the ranking for that user-item pair. This isn't perfect or fully complete, but it's the start we currently have implemented in the library.
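For illustration, here's a toy PyTorch version of that concatenation. It is not Collie's actual implementation; the class, layer sizes, and MLP head are all made up for the sketch:

```python
import torch
from torch import nn

class HybridScorer(nn.Module):
    """Toy version of the flow described above: learned embeddings are
    concatenated with fixed item metadata, and an MLP scores the pair."""

    def __init__(self, num_users, num_items, embed_dim, item_metadata):
        super().__init__()
        self.user_embeddings = nn.Embedding(num_users, embed_dim)
        self.item_embeddings = nn.Embedding(num_items, embed_dim)
        # Non-learnable num_items x num_features metadata matrix.
        self.register_buffer(
            'item_metadata',
            torch.as_tensor(item_metadata, dtype=torch.float32),
        )
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim + item_metadata.shape[1], 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, user_ids, item_ids):
        full_representation = torch.cat([
            self.user_embeddings(user_ids),   # learned user factors
            self.item_embeddings(item_ids),   # learned item factors
            self.item_metadata[item_ids],     # image/text/tabular features
        ], dim=-1)
        return self.mlp(full_representation).squeeze(-1)  # score per pair
```

From there, training against a ranking loss on user-item pairs works the same as for any other embedding model.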
In previous hack weeks, I have experimented with including a full-on language model (something like a fine-tuned BERT model) as a learnable part of the model in a custom architecture. Ideally, it's easy to do this by just inheriting the base model class and overriding the pieces you need; we sorta try to show how this works in the tutorial notebooks.
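As a rough sketch of that hack-week idea (my own illustration using Hugging Face transformers, not something Collie ships), the language model becomes a learnable item encoder whose weights update along with the rest of the recommender:

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class TextAwareScorer(nn.Module):
    """Sketch of a custom architecture with a language model as a
    learnable item encoder, fine-tuned jointly with the recommender."""

    def __init__(self, num_users, embed_dim=64, lm_name='bert-base-uncased'):
        super().__init__()
        self.user_embeddings = nn.Embedding(num_users, embed_dim)
        self.tokenizer = AutoTokenizer.from_pretrained(lm_name)
        self.text_encoder = AutoModel.from_pretrained(lm_name)
        self.project = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def forward(self, user_ids, item_texts):
        # Tokenize raw item text on the fly (fine for a sketch; cache in practice).
        tokens = self.tokenizer(item_texts, padding=True, truncation=True,
                                return_tensors='pt')
        # Use the [CLS] token's hidden state as the item representation.
        cls = self.text_encoder(**tokens).last_hidden_state[:, 0]
        item_vectors = self.project(cls)
        # Score = dot product of user factors and text-derived item factors.
        return (self.user_embeddings(user_ids) * item_vectors).sum(dim=-1)
```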
Future improvements to the library should include more dedicated hybrid models that better encode this data into the model, plus the ability to also include user metadata, so my goal is to one day add these to the library.
Let me know if any of this doesn't make sense, if you have ideas about how we can improve this, or if you want to chat more about the exciting world of multimodal learning. Cheers!