This repository is a collection of projects focused on cleaning, preprocessing, and preparing datasets for further analysis or modeling. Data cleaning is a crucial step in any data workflow, and this repository showcases the use of Python libraries to handle a range of data quality challenges.
I want this repository to serve as a resource for anyone interested in learning how to clean and preprocess datasets.
- Handling missing values
- Detecting and removing duplicates
- Standardizing text
- Parsing and formatting dates
- Feature engineering
- Outlier detection
- Data type conversions and validations
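
Several of the steps listed above can be sketched in a few lines of pandas. This is a minimal illustration, not code taken from the projects in this repository: the toy DataFrame, its column names (`name`, `signup`, `age`), and the `clean` helper are all hypothetical, and choices such as filling missing ages with the median are assumptions that will not suit every dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
raw = pd.DataFrame({
    "name": [" Alice ", "BOB", "bob", None, "Carol"],
    "signup": ["2023-01-05", "2023-02-10", "2023-02-10", "2023-02-20", "not a date"],
    "age": ["34", "41", "41", "29", None],
})


def clean(df):
    """Apply a few common cleaning steps and return a new DataFrame."""
    out = df.copy()
    # Standardize text: trim whitespace and normalize capitalization.
    out["name"] = out["name"].str.strip().str.title()
    # Parse dates: strings that do not match the format become NaT.
    out["signup"] = pd.to_datetime(out["signup"], format="%Y-%m-%d", errors="coerce")
    # Convert types: non-numeric or missing entries become NaN.
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    # Handle missing values: drop rows without a name,
    # then fill missing ages with the median (an assumed policy).
    out = out.dropna(subset=["name"])
    out = out.assign(age=out["age"].fillna(out["age"].median()))
    # Remove exact duplicate rows, which only show up after standardization.
    return out.drop_duplicates().reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

The order of the steps matters here: "BOB" and "bob" only become exact duplicates once the text has been standardized, so deduplication runs last.
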
- Pandas: for data manipulation and analysis
- NumPy: for numerical operations
- FuzzyWuzzy: for text matching and cleaning
- Matplotlib & Seaborn: for visualizing data during the cleaning process
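
As a rough sketch of what the numerical and text-matching tools are used for, the example below flags outliers with NumPy and maps misspelled labels onto a list of canonical values. Two assumptions to note: the interquartile-range (IQR) rule is only one of several possible outlier criteria, and the standard library's `difflib` is used here as a stand-in for FuzzyWuzzy so the snippet runs without extra dependencies. The helper names, the sample data, and the `0.8` cutoff are hypothetical.

```python
from difflib import get_close_matches

import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean array marking values outside the IQR fences."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    return (arr < q1 - k * iqr) | (arr > q3 + k * iqr)


def standardize_label(label, canonical, cutoff=0.8):
    """Map a label to its closest canonical spelling, if one is close enough."""
    # Compare in lowercase so capitalization differences do not matter.
    lookup = {c.lower(): c for c in canonical}
    matches = get_close_matches(label.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    # Leave the label unchanged when nothing is similar enough.
    return lookup[matches[0]] if matches else label


# Hypothetical sample data.
amounts = [10, 12, 11, 13, 12, 95, 11, 10]
mask = iqr_outlier_mask(amounts)

cities = ["New York", "Los Angeles", "Chicago"]
fixed = [standardize_label(c, cities) for c in ["new york", "Los Angles", "Chcago", "Boston"]]

print(mask)
print(fixed)
```

Labels with no sufficiently close match (such as "Boston" here) are returned unchanged, which keeps unknown values visible for manual review instead of silently forcing them into a category.
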