Abstract

The pandemic has put financial pressure on every industry, including the movie business. Predicting a movie's gross earnings is therefore essential for reducing financial risk. The goal of this project is to build a regression model that predicts the gross earnings of US movies from several features: IMDb rating, certification, duration, number of votes, and Metascore. Anyone interested in financing or investing in a movie, including individual producers, film studios, and private investors, can benefit from this model.

Design

The project scrapes movie information from the IMDb website and uses the data to build and select the most accurate predictive model for a movie's gross earnings based on its features. Several regression models, including linear, ridge, and lasso regression, are compared on their training and validation scores. The model with the best scores is then run on the test data.

Data

The data was scraped from IMDb.com using Python libraries including Requests and BeautifulSoup. Initially, the data contained 3000 movies with 9 features. After data cleaning, feature engineering added an interaction feature and removed irrelevant columns, leaving 2951 movies with 20 features.
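The parsing step can be sketched as follows. This is a minimal illustration, not the project's actual scraper: the HTML snippet, the CSS class names, and the helper names `text_or_none` and `parse_movies` are all hypothetical stand-ins, since IMDb's real markup is not described here. It shows how BeautifulSoup can turn one page of movie entries into rows while tolerating missing fields, which is one plausible reason the row count drops during cleaning.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these class names are placeholders, not IMDb's real HTML.
SAMPLE_HTML = """
<div class="movie">
  <h3 class="title">Example Movie A</h3>
  <span class="certificate">PG-13</span>
  <span class="runtime">120 min</span>
  <span class="rating">7.5</span>
  <span class="metascore">68</span>
  <span class="votes">150,000</span>
  <span class="gross">$45.20M</span>
</div>
<div class="movie">
  <h3 class="title">Example Movie B</h3>
  <span class="certificate">R</span>
  <span class="runtime">95 min</span>
  <span class="rating">6.1</span>
  <span class="votes">20,500</span>
</div>
"""


def text_or_none(card, css_class):
    """Return the stripped text of the first tag with this class, or None."""
    tag = card.find(class_=css_class)
    return tag.get_text(strip=True) if tag else None


def parse_movies(html):
    """Parse every movie card in the page into a dict of typed fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="movie"):
        runtime = text_or_none(card, "runtime")
        rating = text_or_none(card, "rating")
        metascore = text_or_none(card, "metascore")
        votes = text_or_none(card, "votes")
        gross = text_or_none(card, "gross")
        rows.append({
            "title": text_or_none(card, "title"),
            "certificate": text_or_none(card, "certificate"),
            "duration_min": int(runtime.split()[0]) if runtime else None,
            "rating": float(rating) if rating else None,
            "metascore": int(metascore) if metascore else None,
            "votes": int(votes.replace(",", "")) if votes else None,
            # Gross in millions of USD, e.g. "$45.20M" -> 45.2
            "gross_musd": float(gross.strip("$M")) if gross else None,
        })
    return rows


movies = parse_movies(SAMPLE_HTML)
print(movies[0])
```

For a live page, the HTML string would come from something like `requests.get(url).text`, and the selectors would need to match the site's actual structure. Rows with a missing gross or Metascore can then be dropped or imputed during cleaning.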

Algorithms

• Feature Engineering:

-The categorical feature (certificate) was converted to dummy variables.

-An interaction term was derived by subtraction to make the year feature more relevant.

• Models:

-Several regression models were explored to select the best predictive model.

-The data was split into 60% training, 20% validation, and 20% testing.

-The results of the various models are as follows:

| Regression model | Training score | Validation score |
| --- | --- | --- |
| Normal Linear Regression | 0.389285939 | 0.461151975 |
| K-fold Linear Regression | 0.389285939 | 0.379048417 |
| Polynomial Regression Degree 2 | 0.00471088 | -0.07468319 |
| Polynomial Regression Interaction | 0.50838276 | 0.3800856 |
| Ridge Regression | 0.46113795 | 0.38928325 |
| Ridge Regression Cross-Validation | 0.389283247 | 0.379102316 |
| Lasso Regression Cross-Validation | 0.389285939 | 0.379048764 |
| Ridge Regression on polynomial interaction only | 0.61451453 | 0.54744346 |

-The selected model was ridge regression on interaction-only polynomial features. Its scores were:

Training score: 0.61451453

Validation score: 0.54744346

Testing score: 0.59082610
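The workflow described in this section can be sketched with scikit-learn as below. This is an illustration under stated assumptions, not the project's code: the data is synthetic, the column names are placeholders, and the standardization step and `alpha=1.0` are choices made for the sketch. It shows dummy encoding of the certificate, a 60/20/20 split done as two successive splits, and ridge regression on interaction-only polynomial features. The `score` method of scikit-learn regressors returns R², which is presumably the metric reported in the table.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the scraped data (assumption: column names are placeholders).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "rating": rng.uniform(1, 10, n),
    "duration": rng.integers(70, 200, n),
    "votes": rng.integers(1_000, 1_000_000, n),
    "metascore": rng.integers(10, 100, n),
    "certificate": rng.choice(["G", "PG", "PG-13", "R"], n),
})
# Made-up target that depends on a votes x rating interaction plus noise.
df["gross"] = 1e-4 * df["votes"] * df["rating"] / 10 + rng.normal(0, 5, n)

# Convert the categorical certificate column to dummy variables.
X = pd.get_dummies(
    df.drop(columns="gross"), columns=["certificate"], drop_first=True, dtype=float
)
y = df["gross"]

# 60/20/20 split: hold out 20% for testing, then 25% of the remaining 80% for validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

# Ridge regression on interaction-only degree-2 features (no squared terms).
model = make_pipeline(
    StandardScaler(),  # assumption: scaling added so the ridge penalty treats features evenly
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    Ridge(alpha=1.0),  # assumption: alpha not tuned here
)
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)
val_r2 = model.score(X_val, y_val)
print(f"train R^2: {train_r2:.3f}, validation R^2: {val_r2:.3f}")
```

The test set is left untouched until a model has been chosen on the validation score, at which point `model.score(X_test, y_test)` gives the final estimate. In practice `alpha` would be tuned, for example with `RidgeCV`.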

Tools

• Python libraries including Requests, BeautifulSoup, Pandas, and NumPy for web scraping, parsing, and data manipulation.

• Scikit-learn for modeling and evaluation metrics.

• Matplotlib and Seaborn for data visualizations.

Communication

As stated above, ridge regression on interaction-only polynomial features was the best model for predicting gross earnings. The following charts were used to analyze and communicate the results:

Heatmap Correlations

image

Pairplot of features and target variable

image

Gross based on metascore, rating and duration

image

Top 10 Movies by Votes on IMDb

image