Background: The used car industry has grown consistently over the past decade as car usage has increased. Used cars are attracting more attention because they are more affordable than new ones. This situation demands high-performance algorithms that can predict the prices of used cars. Many machine learning algorithms are used to predict car prices.
Objectives: This thesis aims to identify the features that affect the price prediction of used cars, and experiments are performed to find an optimal algorithm for predicting used car prices. The algorithms selected for the experiments are Linear Regression (LR), Light Gradient Boosted Machine (LGBM), Random Forest Regression (RFR), and Decision Tree Regression (DTR). These algorithms are then compared using performance metrics for regression models.
Methods: The initial step in this study is to gather a suitable dataset and apply preprocessing techniques to the data. Feature selection is performed using a correlation matrix visualized as a heat map. Label encoding is applied to convert the categorical values into numerical values. A new dataset is created from the original dataset based on the feature "region". The train-test-split technique is used to divide the original dataset into train and test data in the ratio 80:20. The new dataset is manually divided into train and test data consisting of unique regions. The selected machine learning algorithms are trained on both datasets. The accuracy score of each selected algorithm is derived using performance metrics. An optimal algorithm is identified by comparing the derived accuracy scores.
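The steps described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the thesis code: it assumes scikit-learn and pandas, uses synthetic data, and every column name other than "region" is hypothetical. LGBM is left out to keep the sketch dependency-light; if the lightgbm package is installed, `lightgbm.LGBMRegressor` could be added to the `models` dictionary in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the used car data (assumption: the real dataset
# has different columns and values; only "region" is named in the abstract).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "region": rng.choice(["north", "south", "east", "west"], size=n),
    "year": rng.integers(2000, 2021, size=n),
    "odometer": rng.integers(5_000, 250_000, size=n),
})
df["price"] = (
    8000
    + 500 * (df["year"] - 2000)
    - 0.02 * df["odometer"]
    + rng.normal(0, 500, size=n)
)

# Label encoding: map each category of "region" to an integer.
df["region"] = LabelEncoder().fit_transform(df["region"])

# Correlation matrix used for feature selection; it could be drawn as a
# heat map, for example with seaborn.heatmap(corr).
corr = df.corr()

# 80:20 train-test split.
X = df.drop(columns="price")
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train each candidate regressor and score it with R2 on the test data.
models = {
    "LR": LinearRegression(),
    "DTR": DecisionTreeRegressor(random_state=0),
    "RFR": RandomForestRegressor(n_estimators=50, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# The model with the highest test R2 is taken as the best candidate.
best = max(scores, key=scores.get)
print(scores, best)
```

The sketch covers only the random 80:20 split of the original dataset; the manual split of the new dataset by unique regions is not reproduced here.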
Results: Light Gradient Boosted Machine is considered the optimal algorithm based on the R2 score. On the original dataset it obtained 91.12% on test data, and on the new dataset it achieved 85.30% on test data. The feature "region" has the highest feature importance among all features. It has a feature importance of 55220 with respect to the number of instances, i.e., 568654.
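For readers unfamiliar with the metric, the following is a small sketch of how an R2 score can be computed, assuming the score reported above is the usual coefficient of determination (one minus the ratio of the residual sum of squares to the total sum of squares). The prices below are made up for illustration and are not taken from the thesis data.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted prices.
actual = [5000.0, 7000.0, 9000.0, 11000.0]
predicted = [5200.0, 6900.0, 9100.0, 10800.0]

score = r2_score(actual, predicted)
print(f"{score:.2%}")  # prints 99.50%
```

A score of 1 means the predictions match the actual prices exactly, so a value such as 91.12% indicates that most of the variance in price is explained by the model.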
Conclusions: Among the selected algorithms, Light Gradient Boosted Machine obtained a higher R2 score than the other algorithms on both the original and new datasets. The feature "region" has a significant impact on predicting the price of a used car, and this is supported by the feature importance analysis performed on Light Gradient Boosted Machine.
2021.