Feature Engineering in High-Cardinality Space:
Replicating and Extending a Paper by Pargent et al. (2021)

Overview

This project assesses the reproducibility of the paper "Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features" by Pargent et al. (2021). We also extend the original work by adding a Neural Network model alongside the machine learning algorithms used in the paper.

Objective

The primary goal of this project was to replicate and validate the findings of the paper, and to widen its scope by introducing a Neural Network model. We focused on how different encodings of categorical variables affect the performance of various classification algorithms, so the main task was feature engineering. Categorical variables are often a challenge for machine learning models, especially when a feature has high cardinality, that is, a large number of distinct levels.
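To make the high-cardinality problem concrete, the following minimal Python sketch compares the width of a one-hot encoding with that of an integer encoding. It is an illustration only and not part of the project's R code; the `zip_codes` feature and both helper functions are hypothetical.

```python
def one_hot_encode(values):
    """Return (levels, rows) with one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1  # exactly one indicator is set per observation
        rows.append(row)
    return levels, rows


def integer_encode(values):
    """Map each distinct level to an arbitrary integer code (a single column)."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    return [index[v] for v in values]


# Hypothetical categorical feature: 5000 observations, 1000 distinct levels.
zip_codes = [f"zip_{i % 1000}" for i in range(5000)]

levels, one_hot = one_hot_encode(zip_codes)
integers = integer_encode(zip_codes)

print(len(one_hot[0]))  # 1000 columns, one per level
print(len(integers))    # 5000 values in a single column
```

One-hot encoding adds one column per level, so the design matrix grows with the cardinality of the feature, while integer encoding stays at one column but imposes an arbitrary order on the levels.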

Encoding Techniques Explored

Frequency Encoding
One-Hot Encoding
Integer Encoding
Dummy Encoding
Hash Encoding
Regularized Impact Encoding
Leaf Encoding
GLMM Encoding
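Two of the encodings listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's R implementation: the additive smoothing formula, the `smoothing` parameter, the deterministic index-based folds, and all function names are hypothetical choices made for the example. The idea behind the regularized variant is that each observation is encoded using target statistics computed on the other folds only, which limits leakage of its own label into the feature.

```python
from collections import Counter, defaultdict


def frequency_encode(values):
    """Replace each level by the number of times it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]


def smoothed_means(values, targets, smoothing):
    """Per-level target mean, shrunk towards the overall mean (the prior)."""
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = Counter()
    for v, y in zip(values, targets):
        sums[v] += y
        counts[v] += 1
    encoding = {
        v: (sums[v] + smoothing * prior) / (counts[v] + smoothing)
        for v in counts
    }
    return encoding, prior


def cv_impact_encode(values, targets, n_folds=5, smoothing=1.0):
    """Out-of-fold impact encoding for a binary (0/1) target."""
    n = len(values)
    if not 2 <= n_folds <= n:
        raise ValueError("n_folds must be between 2 and the number of rows")
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Assumption: folds are assigned by row index; real code would shuffle.
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        encoding, prior = smoothed_means(
            [values[i] for i in train], [targets[i] for i in train], smoothing
        )
        for i in held_out:
            # Levels unseen in the training folds fall back to the prior.
            encoded[i] = encoding.get(values[i], prior)
    return encoded


values = ["a", "a", "a", "b", "b", "c"]
targets = [1, 1, 0, 0, 1, 1]

print(frequency_encode(values))
print(cv_impact_encode(values, targets, n_folds=3))
```

In this toy data the level "c" occurs once, so it never appears in the training folds for its own row and is encoded with the prior of those folds.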

Pseudocode

To make our approach easier to follow, the pseudocode of the pipeline is provided in pseudocode_pipeline.png.

Repository Structure

Please refer to the "Consegna" folder only. The top level of the repository contains:

Consegna
dataframe_results
.DS_Store
13.RegularizedTargetEncoding.pdf
NN_finale.R
NN_function.R
README.md
SDS_project13_Craciun_Ferri_Picchianti.zip
final_results.rds
final_tests.R
first_prep.R
glmm_definitivo.R
pipeline_con_NN.R
pipeline_vecchia.R
prova_dataset_churn.R
pseudocode_pipeline.png
traffic_violations_pipeline_con_NN.R

diletta-ferri/statistics-project

Folders and files

Latest commit

History

Repository files navigation

Feature Engineering in High Cardinality space: Replicating and Extending a paper by Pargent et al (2021)

Overview

Objective

Encoding Techniques Explored

Pseudocode

Repository Structure

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Feature Engineering in High Cardinality space:
Replicating and Extending a paper by Pargent et al (2021)

Packages