
About the Data

The data used here comes from a two-stage preprocessing of the original data.

The preprocessing notebooks for the two stages can be found here:

The original data can be found here.

 

1. Decide the model configuration

In [1]:
use_review_text = True  # Change it to False, if you don't want review text included for training
use_count_vectorization = True  # Change it to False to exclude count_vectorization
In [2]:
if not use_review_text:
    # Without review text.
    df_types_filename = '../input/airline-reviews-eda-and-preprocessing-pt-1/PreprocessedDataLightTypes.csv'
    df_filename = '../input/airline-reviews-eda-and-preprocessing-pt-1/PreprocessedDataLight.csv'
    df_out_filename = './Preds-WithoutText.csv'
else:
    # With review text.
    df_types_filename = '../input/airline-review-data-preprocessing-pt-2-nlp/NLPFinalDataLightTypes.csv'
    df_filename = '../input/airline-review-data-preprocessing-pt-2-nlp/NLPFinalDataLight.csv'
    df_out_filename = './Preds-WithText.csv'
In [3]:
# Define numerical and categorical features.
if not use_review_text:
    # Without review text.
    num_feats = ['date_flown_month',
                 'date_flown_year',
                 'review_date_date_flown_distance_days',
                 'review_characters',
                 'has_layover_num',
                 'seat_comfort',
                 'cabin_service',
                 'food_bev',
                 'entertainment',
                 'ground_service',
                 'value_for_money']
    cat_feats = ['airline',
                 'traveller_type',
                 'cabin']
else:
    # With review text.
    if not use_count_vectorization:
        num_feats = ['date_flown_month',
                     'date_flown_year',
                     'review_date_date_flown_distance_days',
                     'review_characters',
                     'has_layover_num',
                     'seat_comfort',
                     'cabin_service',
                     'food_bev',
                     'entertainment',
                     'ground_service',
                     'value_for_money',
                     'polarity']
    else:
        with open('../input/airline-review-data-preprocessing-pt-2-nlp/VecReviewTextCleanFeats.csv','r') as f:
            vec_feats = f.read()
            vec_feats = vec_feats.split(', ')
        num_feats = ['date_flown_month',
                     'date_flown_year',
                     'review_date_date_flown_distance_days',
                     'review_characters',
                     'has_layover_num',
                     'seat_comfort',
                     'cabin_service',
                     'food_bev',
                     'entertainment',
                     'ground_service',
                     'value_for_money',
                     'polarity'] + vec_feats
    cat_feats = ['airline',
                 'traveller_type',
                 'cabin']

feats = num_feats + cat_feats
In [4]:
# Set this variable to the desired method for data transformation.
# Possible options are: scaling_and_one_hot_encoding, label_encoding, no_transformation.
transform_dataset = 'label_encoding'
 

2. Necessary Imports

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('Set2')
import scipy.sparse

import datetime as dt
import dateutil

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

from sklearn.metrics import roc_curve, accuracy_score, roc_auc_score, confusion_matrix 

import lightgbm as lgb

import importlib
 

3. Load the input data

In [6]:
# Type of each field in the input data.
df_dtype = pd.read_csv(df_types_filename)
dict_dtype = df_dtype[['index','dtypes']].set_index('index').to_dict()['dtypes']
dict_dtype['recommended'] = 'bool'
In [7]:
# Input data.
df = pd.read_csv(df_filename, dtype=dict_dtype, keep_default_na=False, na_values=['_'])
df.drop(columns=['Unnamed: 0'],inplace=True)
In [8]:
df.head()
Out[8]:
   airline           review_score  traveller_type  cabin          seat_comfort  cabin_service  food_bev  entertainment  ground_service  value_for_money  ...  count_yet  count_yoghurt  count_york  count_young  count_youre  count_yvr  count_yyz  count_zero  count_zone  count_zurich
0  Turkish Airlines  7             Business        Economy Class  4             5              4         4              2               4                ...  0          0              0           0            0            0          0          0           0           0
1  Turkish Airlines  2             Family Leisure  Economy Class  4             1              1         1              1               1                ...  0          0              0           0            0            0          0          0           0           0
2  Turkish Airlines  3             Business        Economy Class  1             4              1         3              1               2                ...  0          0              0           0            0            0          0          0           0           0
3  Turkish Airlines  10            Solo Leisure    Economy Class  4             5              5         5              5               5                ...  0          0              0           0            0            0          0          0           0           0
4  Turkish Airlines  1             Solo Leisure    Economy Class  1             1              1         1              1               1                ...  0          0              0           0            0            0          0          0           0           0

5 rows × 2466 columns

In [9]:
df.shape
Out[9]:
(22822, 2466)
In [10]:
n_reviews = df.shape[0]
print('Number of customer reviews in the dataset: {:d}'.format(n_reviews))
 
Number of customer reviews in the dataset: 22822
 

4. Model training and prediction for the recommendation of airline by customers

 

4.1 Label the data on the basis of 'recommended' column

In [11]:
# Utility function to assign the label to our dataset
def assign_label_recommended(df_row):
    """
    Return 0 if not recommended and 1 otherwise.
    """
    label_recommended = None
    if df_row['recommended'] == True:
        label_recommended = 1
    elif df_row['recommended'] == False:
        label_recommended = 0
    else:
        label_recommended = None
    return label_recommended
In [12]:
df['label'] = df.apply(lambda x: assign_label_recommended(x), axis=1)
In [13]:
df.head()
Out[13]:
   airline           review_score  traveller_type  cabin          seat_comfort  cabin_service  food_bev  entertainment  ground_service  value_for_money  ...  count_yoghurt  count_york  count_young  count_youre  count_yvr  count_yyz  count_zero  count_zone  count_zurich  label
0  Turkish Airlines  7             Business        Economy Class  4             5              4         4              2               4                ...  0              0           0            0            0          0          0           0           0             1
1  Turkish Airlines  2             Family Leisure  Economy Class  4             1              1         1              1               1                ...  0              0           0            0            0          0          0           0           0             0
2  Turkish Airlines  3             Business        Economy Class  1             4              1         3              1               2                ...  0              0           0            0            0          0          0           0           0             0
3  Turkish Airlines  10            Solo Leisure    Economy Class  4             5              5         5              5               5                ...  0              0           0            0            0          0          0           0           0             1
4  Turkish Airlines  1             Solo Leisure    Economy Class  1             1              1         1              1               1                ...  0              0           0            0            0          0          0           0           0             0

5 rows × 2467 columns

 

4.2 Convert Boolean features to numerical

In [14]:
df['has_layover_num'] = df['has_layover'].astype(int)
df['date_flown_day'] = df['date_flown_day'].astype(int)
df['date_flown_month'] = df['date_flown_month'].astype(int)
df['date_flown_year'] = df['date_flown_year'].astype(int)

df['seat_comfort'] = df['seat_comfort'].astype(int)
df['cabin_service'] = df['cabin_service'].astype(int)
df['ground_service'] = df['ground_service'].astype(int)
df['food_bev'] = df['food_bev'].astype(int)
df['value_for_money'] = df['value_for_money'].astype(int)
df['entertainment'] = df['entertainment'].astype(int)

for feat in num_feats:
    if 'polarity' not in feat:
        df[feat] = df[feat].astype(int)
In [15]:
df.head()
Out[15]:
   airline           review_score  traveller_type  cabin          seat_comfort  cabin_service  food_bev  entertainment  ground_service  value_for_money  ...  count_york  count_young  count_youre  count_yvr  count_yyz  count_zero  count_zone  count_zurich  label  has_layover_num
0  Turkish Airlines  7             Business        Economy Class  4             5              4         4              2               4                ...  0           0            0            0          0          0           0           0             1      1
1  Turkish Airlines  2             Family Leisure  Economy Class  4             1              1         1              1               1                ...  0           0            0            0          0          0           0           0             0      0
2  Turkish Airlines  3             Business        Economy Class  1             4              1         3              1               2                ...  0           0            0            0          0          0           0           0             0      1
3  Turkish Airlines  10            Solo Leisure    Economy Class  4             5              5         5              5               5                ...  0           0            0            0          0          0           0           0             1      0
4  Turkish Airlines  1             Solo Leisure    Economy Class  1             1              1         1              1               1                ...  0           0            0            0          0          0           0           0             0      1

5 rows × 2468 columns

 

4.3 Select features for training and labels for prediction

In [16]:
X = df[feats]
y = df['label'].values
 

4.4 Check for class imbalance

In [17]:
f_rec = (y[y==1].shape[0])/y.shape[0]
f_not_rec = (y[y==0].shape[0])/y.shape[0]
print('Fraction of customers that recommended the service: {:.2f}'.format(f_rec))
print('Fraction of customers that did not recommend the service: {:.2f}'.format(f_not_rec))
 
Fraction of customers that recommended the service: 0.48
Fraction of customers that did not recommend the service: 0.52
 

4.5 Scaling numerical features and encoding the categorical features

We might want to scale numerical features, so that they have values in a common range.

Here, we use the StandardScaler from the sklearn library to standardize the features, that is, to subtract their mean and divide by their standard deviation: x is transformed to z = (x - u) / s, where u is the mean and s is the standard deviation of the feature. The options with_mean=True/False and with_std=True/False control whether the mean is subtracted and whether we divide by the standard deviation. With both enabled (the default), every numerical feature ends up with zero mean and unit standard deviation.
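
To make the formula concrete, here is a minimal NumPy sketch of the same transformation. The helper name standardize and the toy ratings are assumptions made for illustration only; StandardScaler itself does more, for example it stores the fitted mean and scale so they can be reused on new data.

```python
import numpy as np

def standardize(x, with_mean=True, with_std=True):
    """Hypothetical helper mirroring z = (x - u) / s for a 1-D array."""
    x = np.asarray(x, dtype=float)
    u = x.mean() if with_mean else 0.0
    s = x.std() if with_std else 1.0  # population standard deviation (ddof=0)
    return (x - u) / s

ratings = [1, 2, 3, 4, 5]  # toy values, not taken from the dataset

z = standardize(ratings)                                 # centered and scaled
z_centered_only = standardize(ratings, with_std=False)   # centered, not scaled

print(z.round(3))
print(z_centered_only)
```

Switching off with_std keeps the original units and only shifts the values, which is why z_centered_only still spans four units while z has unit standard deviation.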

We also need to transform the categorical features. Two common options are one-hot encoding and label encoding.

  1. One-hot encoding represents a categorical feature as one-hot vectors: the feature is replaced by binary features, one for each category.

    For example, the categorical feature 'cabin' can have four possible values: Economy Class, Premium Economy, Business Class and First Class. One-hot encoding transforms this feature into four new features, called cabin_Economy Class, cabin_Premium Economy, cabin_Business Class and cabin_First Class, each of which is 0 or 1 depending on the value of the original feature. A record with cabin equal to Economy Class will have cabin_Economy Class equal to 1 and the other three features equal to 0. This can lead to sparse data (most of the elements in the dataset are 0) if a feature has many possible values.

  2. Label encoding represents a categorical feature as integers.

    For example, the categorical feature 'cabin' can be encoded as one feature with values 0, 1, 2 and 3.
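
As a rough illustration of the two encodings, the sketch below uses pandas shortcuts on a made-up cabin column. It is not the sklearn-based processing used in this notebook, and the toy values are assumptions; it only shows what the encoded columns look like.

```python
import pandas as pd

# Toy column; the values follow the four cabin classes named above.
cabin = pd.Series(['Economy Class', 'Business Class', 'Economy Class',
                   'First Class', 'Premium Economy'], name='cabin')

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(cabin, prefix='cabin').astype(int)

# Label encoding: a single integer column; categories are numbered in sorted order.
codes = cabin.astype('category').cat.codes

print(one_hot)
print(codes.tolist())
```

Each row of one_hot contains exactly one 1, while codes packs the same information into a single column at the cost of suggesting an ordering between categories that may not be meaningful.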

Here, we use a pipeline to define the data processing, so that we can repeat the same steps for the training and test datasets. In particular, the parameters of the data processing are defined based on the training dataset and are then applied to the test dataset.
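
The idea of learning the parameters on the training data and reusing them on the test data can be sketched by hand. The snippet below is a simplified stand-in for the mean imputation plus scaling pipeline, written with NumPy on made-up numbers; the variable names are hypothetical.

```python
import numpy as np

# Toy numeric column with a missing value in each split (made-up numbers).
x_train = np.array([1.0, np.nan, 3.0])
x_test = np.array([np.nan, 4.0])

# "Fit": learn the parameters from the training split only.
fill_value = np.nanmean(x_train)  # mean used for imputation
train_filled = np.where(np.isnan(x_train), fill_value, x_train)
mean, std = train_filled.mean(), train_filled.std()

# "Transform": reuse the training parameters on the test split.
test_filled = np.where(np.isnan(x_test), fill_value, x_test)
z_test = (test_filled - mean) / std

print(fill_value, mean, std)
print(z_test)
```

The test split never contributes to fill_value, mean or std, so no information from the test data leaks into the preprocessing.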

In [18]:
# Create a pipeline for numerical features and a pipeline for categorical features.
num_proc = make_pipeline(SimpleImputer(missing_values=np.nan, strategy='mean'), StandardScaler())
cat_proc = make_pipeline(SimpleImputer(strategy='constant', fill_value='missing'), OneHotEncoder(handle_unknown='ignore'))

# Create a preprocessing step for all features.
preprocessor = make_column_transformer((num_proc, num_feats),
                                       (cat_proc, cat_feats))
 

4.6 Dataset for training and testing

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)
 

4.6.1 Transform the data before training using the pipeline made in section 4.5

In [20]:
X_train_transformed = preprocessor.fit_transform(X_train)
In [21]:
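# Names of the one-hot columns produced by the fitted encoder.
# Note: scikit-learn >= 1.0 provides get_feature_names_out; older releases used get_feature_names.
cat_feats_one_hot = preprocessor.transformers_[1][1]['onehotencoder'].get_feature_names_out(cat_feats)
# print(cat_feats_one_hot)

all_feats = list(num_feats)+list(cat_feats_one_hot)
# print(all_feats)

# Map positional column indices to feature names.
dict_for_renaming_cols = dict(enumerate(all_feats))
# print(dict_for_renaming_cols)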
In [22]:
def to_named_frame(X_arr):
    """
    Wrap a transformed array (sparse or dense) in a DataFrame with named columns.
    """
    if scipy.sparse.issparse(X_arr):
        df_arr = pd.DataFrame.sparse.from_spmatrix(X_arr)
    else:
        df_arr = pd.DataFrame(X_arr)
    return df_arr.rename(columns=dict_for_renaming_cols)

X_train_transformed_2 = to_named_frame(X_train_transformed)

X_test_transformed = preprocessor.transform(X_test)
X_test_transformed_2 = to_named_frame(X_test_transformed)

X_transformed = preprocessor.transform(X)
X_transformed_2 = to_named_frame(X_transformed)
In [23]:
X_train.shape
Out[23]:
(15975, 2457)
In [24]:
X_train_transformed_2.shape
Out[24]:
(15975, 2543)
In [25]:
X_test.shape
Out[25]:
(6847, 2457)
In [26]:
X_test_transformed_2.shape
Out[26]:
(6847, 2543)
 

4.6.2 Label Encoding for categorical features

In [27]:
le = LabelEncoder()
In [28]:
# Make copies so that original aren't changed
X_label_enc = X.copy()
X_train_label_enc = X_train.copy()
X_test_label_enc = X_test.copy()
In [29]:
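for feat in cat_feats:
    print('Feature:', feat)
    # Fit once on the full column so that the full, training and test sets
    # all share the same category-to-integer mapping.
    le.fit(X[feat])
    X_label_enc[feat] = le.transform(X_label_enc[feat])
    X_train_label_enc[feat] = le.transform(X_train_label_enc[feat])
    X_test_label_enc[feat] = le.transform(X_test_label_enc[feat])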
 
Feature: airline
Feature: traveller_type
Feature: cabin
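
One thing to watch with label encoding is that the integer given to a category depends on which categories the encoder saw when it was fitted. The sketch below illustrates this with a hypothetical pure-Python stand-in, fit_label_mapping, which numbers the sorted unique values in the same way LabelEncoder orders its classes. The toy splits are made up; with a dataset as large as ours every category may well appear in both splits, in which case the issue does not show up.

```python
def fit_label_mapping(values):
    """Hypothetical stand-in for LabelEncoder.fit: number the sorted unique values."""
    return {category: code for code, category in enumerate(sorted(set(values)))}

# Toy splits in which each side is missing one category (made-up data).
train_cabin = ['Economy Class', 'First Class', 'Economy Class']
test_cabin = ['Business Class', 'Economy Class']

# Fitting separately: the same category can receive different integers.
train_only = fit_label_mapping(train_cabin)
test_only = fit_label_mapping(test_cabin)

# Fitting once on all values: one mapping shared by both splits.
shared = fit_label_mapping(train_cabin + test_cabin)

print(train_only['Economy Class'], test_only['Economy Class'])
print([shared[c] for c in train_cabin], [shared[c] for c in test_cabin])
```

With separate fits, Economy Class is 0 in the training split and 1 in the test split, so a model trained on one would misread the other; the shared mapping avoids that.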
In [30]:
X_label_enc[cat_feats].head()
Out[30]:
   airline  traveller_type  cabin
0  71       0               1
1  71       2               1
2  71       0               1
3  71       3               1
4  71       3               1
 

4.7 Model Training

 

4.7.1 Choosing the transformation configuration

In [31]:
if transform_dataset == 'scaling_and_one_hot_encoding':
    print('Method for data transformation: scaling and one hot encoding')
    X_train_for_model = X_train_transformed_2
    X_test_for_model = X_test_transformed_2
    X_for_model = X_transformed_2
    X_test_for_shap = X_test_transformed_2
    X_for_shap = X_transformed_2
elif transform_dataset == 'label_encoding':
    print('Method for data transformation: label encoding')
    X_train_for_model = X_train_label_enc
    X_test_for_model = X_test_label_enc
    X_for_model = X_label_enc
    X_test_for_shap = X_test_label_enc
    X_for_shap = X_label_enc
elif transform_dataset == 'no_transformation':
    print('Method for data transformation: no transformation')
    X_train_for_model = X_train
    X_test_for_model = X_test 
    X_for_model = X
    X_test_for_shap = X_test
    X_for_shap = X
 
Method for data transformation: label encoding
In [32]:
cat_feats
Out[32]:
['airline', 'traveller_type', 'cabin']
 

4.7.2 Converting the dataset into an lgb Dataset and setting the parameters for model training

LightGBM's training API works on its own dataset type. A regular pandas DataFrame can be converted into that type with the lgb.Dataset() function.

In [33]:
if transform_dataset == 'scaling_and_one_hot_encoding':
    train_data=lgb.Dataset(X_train_for_model,label=y_train)
    test_data=lgb.Dataset(X_test_for_model,label=y_test)
elif transform_dataset == 'label_encoding':    
    train_data=lgb.Dataset(X_train_for_model,label=y_train,categorical_feature=cat_feats)
    test_data=lgb.Dataset(X_test_for_model,label=y_test,categorical_feature=cat_feats)
elif transform_dataset == 'no_transformation':
    train_data=lgb.Dataset(X_train_for_model,label=y_train)
    test_data=lgb.Dataset(X_test_for_model,label=y_test)
else:
    train_data=lgb.Dataset(X_train_for_model,label=y_train)
    test_data=lgb.Dataset(X_test_for_model,label=y_test)
    
params = {'metric': 'binary_logloss', 
          'boosting_type': 'gbdt', 
          'objective': 'binary',
          'feature_fraction': 0.5,
          'num_leaves': 15,
          'max_depth': 10,
          'n_estimators': 200,  # alias of the number of boosting rounds; overrides the 2500 passed to lgb.train
          'min_data_in_leaf': 200, 
          'min_child_weight': 0.1,
          'reg_alpha': 2,
          'reg_lambda': 5,
          'subsample': 0.8,
          'verbose': -1,
}
 

4.7.3 Training and Predicting using the LGBM classifier

In [34]:
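lgbm = lgb.train(params,
                 train_data,
                 2500,
                 valid_sets=[test_data],
                 # LightGBM >= 4.0 takes these as callbacks; older releases accepted
                 # early_stopping_rounds=100 and verbose_eval=20 as keyword arguments.
                 callbacks=[lgb.early_stopping(stopping_rounds=100),
                            lgb.log_evaluation(period=20)]
                 )

# Note: X_for_model holds all rows (training and test), so the scores below
# are 'overall' scores rather than pure held-out scores.
y_prob = lgbm.predict(X_for_model)
y_pred = y_prob.round(0)

clf_roc_auc_score = roc_auc_score(y, y_prob)
clf_accuracy_score = accuracy_score(y, y_pred)

print('Model overall ROC AUC score: {:.3f}'.format(clf_roc_auc_score))
print('Model overall accuracy: {:.3f}'.format(clf_accuracy_score))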
 
/opt/conda/lib/python3.7/site-packages/lightgbm/engine.py:148: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py:1291: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')
 
Training until validation scores don't improve for 100 rounds
[20]	valid_0's binary_logloss: 0.199763
[40]	valid_0's binary_logloss: 0.144599
[60]	valid_0's binary_logloss: 0.134085
[80]	valid_0's binary_logloss: 0.131493
[100]	valid_0's binary_logloss: 0.130356
[120]	valid_0's binary_logloss: 0.130079
[140]	valid_0's binary_logloss: 0.129909
[160]	valid_0's binary_logloss: 0.129825
[180]	valid_0's binary_logloss: 0.12936
[200]	valid_0's binary_logloss: 0.12949
Did not meet early stopping. Best iteration is:
[192]	valid_0's binary_logloss: 0.129288
Model overall ROC AUC score: 0.995
Model overall accuracy: 0.965
In [35]:
# Check that the predictions and probabilities lie between 0 and 1
print('Min value of prediction: {:.3f}'.format(y_pred.min()))
print('Max value of prediction: {:.3f}'.format(y_pred.max()))
print('Min value of probability: {:.3f}'.format(y_prob.min()))
print('Max value of probability: {:.3f}'.format(y_prob.max()))
 
Min value of prediction: 0.000
Max value of prediction: 1.000
Min value of probability: 0.000
Max value of probability: 1.000
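
The class predictions come from rounding the probabilities, which behaves like a threshold at 0.5. A small sketch on made-up probabilities shows one subtlety: NumPy rounds exact halves to the nearest even number, so a probability of exactly 0.5 becomes 0. The names ending in _demo are hypothetical.

```python
import numpy as np

y_prob_demo = np.array([0.10, 0.50, 0.51, 0.90])  # made-up probabilities

# NumPy rounds exact halves to the nearest even number, so 0.5 becomes 0.
rounded = y_prob_demo.round(0)

# An explicit threshold makes the treatment of 0.5 a visible choice.
thresholded = (y_prob_demo >= 0.5).astype(int)

print(rounded, thresholded)
```

In practice a probability of exactly 0.5 is rare, but an explicit threshold is also easier to change if another operating point is preferred.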
 

5. Evaluation of trained model on different classification metrics

 

5.1 Getting Recall, Precision and Specificity scores

In [36]:
# Getting all the accuracy metrics
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()

sensitivity = tp / (tp+fn) # Recall.
specificity = tn / (tn+fp)
precision = tp / (tp+fp)

print('Sensitivity/Recall: %.2f' % sensitivity)
print('Specificity: %.2f' % specificity)
print('Precision: %.2f' % precision)
 
Sensitivity/Recall: 0.96
Specificity: 0.97
Precision: 0.97
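
To see how these three numbers follow from the confusion matrix counts, here is a small worked example on made-up labels, counted by hand with NumPy. The arrays and the _demo names are assumptions for illustration and are unrelated to the model output above.

```python
import numpy as np

y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# Count the four cells of the confusion matrix directly.
tp_demo = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))  # recommended, predicted recommended
fn_demo = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))  # recommended, predicted not
tn_demo = int(np.sum((y_true_demo == 0) & (y_pred_demo == 0)))  # not recommended, predicted not
fp_demo = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))  # not recommended, predicted recommended

sensitivity_demo = tp_demo / (tp_demo + fn_demo)  # share of actual positives that were found
specificity_demo = tn_demo / (tn_demo + fp_demo)  # share of actual negatives that were found
precision_demo = tp_demo / (tp_demo + fp_demo)    # share of predicted positives that are correct

print(tn_demo, fp_demo, fn_demo, tp_demo)
print(sensitivity_demo, specificity_demo, precision_demo)
```

Here 3 of the 4 positives are found (sensitivity 0.75), 4 of the 6 negatives are found (specificity about 0.67), and 3 of the 5 positive predictions are correct (precision 0.6).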
 

5.2 Plotting the confusion matrix

In [37]:
def plot_confusion_matrix(y, y_pred, normalize_str, figsize_w, figsize_h, filename):
    """
    Plot the confusion matrix of a classifier.
    """
    plt.figure(figsize=(figsize_w,figsize_h))
    plt.title('Confusion matrix')
    cm = confusion_matrix(y, y_pred, normalize=normalize_str)
    df_cm = pd.DataFrame(cm, columns=np.unique(y), index = np.unique(y))
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'
    sns.set(font_scale=1.4)
    sns.heatmap(df_cm, cmap="Blues", annot=True,annot_kws={"size": 16})
    plt.savefig(filename)
    plt.show()
    return
In [38]:
plot_confusion_matrix(y=y, y_pred=y_pred, normalize_str='true', figsize_w=4, figsize_h=4, filename='./ConfusionMatrix.png')
 
 

5.3 Plotting the ROC Curve

In [39]:
# True positive rate and false positive rate.
fpr, tpr, _ = roc_curve(y, y_prob)
In [40]:
def plot_roc_curve(fpr, tpr, clf_name, figsize_w, figsize_h, filename):
    """
    Plot the ROC curve of a classifier.
    """
    plt.figure(figsize=(figsize_w,figsize_h))
    sns.set(style="whitegrid")
    plt.plot([0, 1], [0, 1], 'k--', label='random')
    plt.plot(fpr, tpr, label=clf_name)
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')
    plt.legend(loc='best')
    plt.savefig(filename)
    plt.show()
    return
In [41]:
plot_roc_curve(fpr=fpr, tpr=tpr, clf_name='LightGBM', figsize_w=6, figsize_h=6, filename='./ROCCurve.png')
 
 

6. Saving the predictions in a new dataframe

In [42]:
# Saving results in a fresh dataframe
df_out = pd.DataFrame()
df_out['y_pred'] = y_pred
df_out['y_prob'] = y_prob
 

7. Plotting the recommendation probability

In [43]:
def plot_hist_sns(df,feat,bins,title,x_label,y_label,filename):
    """
    Plot the histogram of a given feature.
    """
    plt.figure(figsize=(6,6))
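    # sns.distplot is deprecated in recent seaborn releases; histplot (seaborn >= 0.11) draws the same histogram.
    sns.histplot(df[feat],bins=bins,kde=False)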
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.grid(False)
    plt.savefig(filename)
    plt.show()
    return
In [44]:
plot_hist_sns(df=df_out,
             feat='y_prob',
             bins=30,
             title='Distribution of model prediction',
             x_label='Predicted probability of being recommended',
             y_label='Entries / bin',
             filename='./HistModelPredictions.png')
 
 

8. Saving the predictions in an output file

In [45]:
df_out.to_csv(df_out_filename)
 

Thanks a lot for taking the time to read my work! Please feel free to leave comments and recommendations to help me improve it. If you liked the work, please feel free to press that "little upward arrow" button :) !!!

Cheers!!