
import numpy as np
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt
import missingno as msno
import gc
from sklearn.model_selection import cross_val_score,GridSearchCV,train_test_split
from sklearn.tree import DecisionTreeClassifier,ExtraTreeClassifier
from sklearn.metrics import roc_curve,roc_auc_score,classification_report,mean_squared_error,accuracy_score
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier,BaggingClassifier,VotingClassifier,AdaBoostClassifier
In this task, you will be asked to predict the chances of a user listening to a song repetitively after the first observable listening event within a time window was triggered. If there are recurring listening event(s) triggered within a month after the user's very first observable listening event, its target is marked 1, and 0 otherwise, in the training set. The same rule applies to the testing set.

KKBOX provides a training data set that consists of information on the first observable listening event for each unique user-song pair within a specific time duration. Metadata for each unique user and song pair is also provided. The use of public data to increase the accuracy of your prediction is encouraged.

The train and test data are selected from users' listening history in a given time period. Note that this time period is chosen to be before the WSDM-KKBox Churn Prediction time period. The train and test sets are split based on time, and the public/private split is based on unique user/song pairs.

Tables

train.csv
- msno: user id
- song_id: song id
- source_system_tab: the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile app functions. For example, tab "my library" contains functions to manipulate local storage, and tab "search" contains functions relating to search.
- source_screen_name: name of the layout a user sees.
- source_type: the entry point a user first plays music from on the mobile apps. An entry point could be album, online-playlist, song, etc.
- target: the target variable. target=1 means there are recurring listening event(s) triggered within a month after the user's very first observable listening event; target=0 otherwise.

test.csv
- id: row id (will be used for submission)
- msno: user id
- song_id: song id
- source_system_tab, source_screen_name, source_type: same definitions as in train.csv

sample_submission.csv
- sample submission file in the format expected for submission
- id: same as id in test.csv
- target: same definition as in train.csv

songs.csv (note that the data is in unicode)
- song_id
- song_length: in ms
- genre_ids: genre category. Some songs have multiple genres, separated by |
- artist_name
- composer
- lyricist
- language

members.csv (user information)
- msno
- city
- bd: age. Note: this column has outlier values; please use your judgement.
- gender
- registered_via: registration method
- registration_init_time: format %Y%m%d
- expiration_date: format %Y%m%d

song_extra_info.csv
- song_id
- name: the name of the song
- isrc: International Standard Recording Code, which can theoretically be used as the identity of a song. However, ISRCs generated by providers have not been officially verified, so the information encoded in the ISRC (such as country code and reference year) can be misleading or incorrect. Multiple songs can also share one ISRC, since a single recording can be re-published several times.
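
Since the ISRC field is flagged as potentially useful but noisy, here is a minimal sketch (not run in this notebook) of how the country code and reference year could be pulled out of it, assuming the standard ISRC layout (2-character country code, 3-character registrant code, 2-digit year, 5-digit designation); the 2018 century cutoff below is an assumption based on the data period, not something given by KKBOX.

import numpy as np
import pandas as pd

song_extra = pd.read_csv('../input/song_extra_info.csv')

def isrc_to_year(isrc):
    # Positions 5-6 hold the two-digit reference year; map e.g. "17" -> 2017, "98" -> 1998.
    if pd.isnull(isrc):
        return np.nan
    yy = int(isrc[5:7])
    return 2000 + yy if yy < 18 else 1900 + yy

song_extra['isrc_country'] = song_extra['isrc'].str[:2]   # first two characters: country code
song_extra['isrc_year'] = song_extra['isrc'].apply(isrc_to_year)
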
In [2]:
import lightgbm as lgb
from sklearn.metrics import precision_recall_curve,roc_auc_score,classification_report,roc_curve
from tqdm import tqdm
 

Now that we have imported the necessary modules, we can start with

EDA (Exploratory Data Analysis): wrangling and visualizing the data

to obtain the statistical summaries and insights we need.

The last steps will be data imputation, merging, cross-validation,

hyperparameter tuning, and a comparison of every algorithm we use,

its results, and the time each algorithm takes to produce them.
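
As a preview of that last step, here is a rough sketch of how the merged feature table built later in this notebook could be fed to LightGBM with a simple hold-out split. This is a sketch only: it assumes train_merged already exists, that its object columns have been label-encoded to numbers, and the parameter values are illustrative starting points rather than tuned results.

# Sketch (assumptions noted above) of the eventual modelling step.
X = train_merged.drop(['target', 'msno', 'song_id'], axis=1)
y = train_merged['target']
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

lgb_train = lgb.Dataset(X_tr, y_tr)
lgb_valid = lgb.Dataset(X_val, y_val, reference=lgb_train)
params = {'objective': 'binary', 'metric': 'auc', 'learning_rate': 0.1, 'num_leaves': 128}
model = lgb.train(params, lgb_train, num_boost_round=200, valid_sets=[lgb_valid])
print('validation AUC:', roc_auc_score(y_val, model.predict(X_val)))
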

In [3]:
from subprocess import check_output

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
songs = pd.read_csv('../input/songs.csv')
members = pd.read_csv('../input/members.csv')
sample = pd.read_csv('../input/sample_submission.csv')
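
A small optional variant, sketched here: since train.csv has several million rows, the categorical columns can be declared at read time to save memory. The dtype mapping below is an assumption based on the column descriptions above, not something done in this notebook.

read_dtypes = {'source_system_tab': 'category',
               'source_screen_name': 'category',
               'source_type': 'category',
               'target': 'int8'}
# train = pd.read_csv('../input/train.csv', dtype=read_dtypes)
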
In [4]:
train.head()
Out[4]:
   msno                                          song_id                                       source_system_tab  source_screen_name   source_type      target
0  FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=  BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=  explore            Explore              online-playlist  1
1  Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=  bhp/MpSNoqoxOIB+/l8WPqu6jldth4DIpCm3ayXnJqM=  my library         Local playlist more  local-playlist   1
2  Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=  JNWfrrC7zNN7BdMpsISKa4Mw+xVJYNnxXh3/Epw7QgY=  my library         Local playlist more  local-playlist   1
3  Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=  2A87tzfnJTSWqD7gIZHisolhe4DMdzkbd6LzO1KHjNs=  my library         Local playlist more  local-playlist   1
4  FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=  3qm6XTZ6MOCU11x8FIVbAGH5l5uMkT3/ZalWG1oo2Gc=  explore            Explore              online-playlist  1
In [5]:
test.head()
Out[5]:
   id  msno                                          song_id                                       source_system_tab  source_screen_name   source_type
0  0   V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM=  WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=  my library         Local playlist more  local-library
1  1   V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM=  y/rsZ9DC7FwK5F2PK2D5mj+aOBUJAjuu3dZ14NgE0vM=  my library         Local playlist more  local-library
2  2   /uQAlrAkaczV+nWCd2sPF2ekvXPRipV7q0l+gbLuxjw=  8eZLFOdGVdXBSqoAv5nsLigeH2BvKXzTQYtUM53I0k4=  discover           NaN                  song-based-playlist
3  3   1a6oo/iXKatxQx4eS9zTVD+KlSVaAFbTIqVvwLC1Y0k=  ztCf8thYsS4YN3GcIL/bvoxLm/T5mYBVKOO4C9NiVfQ=  radio              Radio                radio
4  4   1a6oo/iXKatxQx4eS9zTVD+KlSVaAFbTIqVvwLC1Y0k=  MKVMpslKcQhMaFEgcEQhEfi5+RZhMYlU3eRDpySrH8Y=  radio              Radio                radio
In [6]:
songs.head()
Out[6]:
   song_id                                       song_length  genre_ids  artist_name         composer                          lyricist     language
0  CXoTN1eb7AI+DntdU1vbcwGRV4SCIDxZu+YD8JP8r4E=  247640       465        張信哲 (Jeff Chang)   董貞                              何啟弘        3.0
1  o0kFgae9QtnYgRkVPqLJwa05zIhRlUjfF7O1tDw0ZDU=  197328       444        BLACKPINK           TEDDY| FUTURE BOUNCE| Bekuh BOOM  TEDDY        31.0
2  DwVvVurfpuz+XPuFvucclVQEyPqcpUkHR0ne1RQzPs0=  231781       465        SUPER JUNIOR        NaN                               NaN          31.0
3  dKMBWoZyScdxSkihKG+Vf47nc18N9q4m58+b4e7dSSE=  273554       465        S.H.E               湯小康                            徐世珍        3.0
4  W3bqWd3T+VeHFzHAUfARgW9AvVRaF4N5Yzm4Mr6Eo/o=  140329       726        貴族精選             Traditional                       Traditional  52.0
In [7]:
members.head()
Out[7]:
   msno                                          city  bd  gender  registered_via  registration_init_time  expiration_date
0  XQxgAYj3klVKjR3oxPPXYYFp4soD4TuBghkhMTD4oTw=  1     0   NaN     7               20110820                 20170920
1  UizsfmJb9mV54qE9hCYyU07Va97c0lCRLEQX3ae+ztM=  1     0   NaN     7               20150628                 20170622
2  D8nEhsIOBSoE6VthTaqDX8U6lqjJ7dLdr72mOyLya2A=  1     0   NaN     4               20160411                 20170712
3  mCuD+tZ1hERA/o5GPqk38e041J8ZsBaLcu7nGoIIvhI=  1     0   NaN     9               20150906                 20150907
4  q4HRBfVSssAFS9iRfxWrohxuk9kCYMKjHOEagUMV6rQ=  1     0   NaN     4               20170126                 20170613
In [8]:
sample.head()
members.shape
train.info()
print("\n")
songs.info()
print("\n")
members.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7377418 entries, 0 to 7377417
Data columns (total 6 columns):
msno                  object
song_id               object
source_system_tab     object
source_screen_name    object
source_type           object
target                int64
dtypes: int64(1), object(5)
memory usage: 337.7+ MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2296320 entries, 0 to 2296319
Data columns (total 7 columns):
song_id        object
song_length    int64
genre_ids      object
artist_name    object
composer       object
lyricist       object
language       float64
dtypes: float64(1), int64(1), object(5)
memory usage: 122.6+ MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34403 entries, 0 to 34402
Data columns (total 7 columns):
msno                      34403 non-null object
city                      34403 non-null int64
bd                        34403 non-null int64
gender                    14501 non-null object
registered_via            34403 non-null int64
registration_init_time    34403 non-null int64
expiration_date           34403 non-null int64
dtypes: int64(5), object(2)
memory usage: 1.8+ MB
In [9]:
plt.figure(figsize=(20,15))
sns.set(font_scale=2)
sns.countplot(x='source_type',hue='source_type',data=train)
sns.set(style="darkgrid")
plt.xlabel('source types',fontsize=30)
plt.ylabel('count',fontsize=30)
plt.xticks(rotation='45')
plt.title('Count plot source types for listening music',fontsize=30)
plt.tight_layout()
 
 

From the first visualization we can see that the local library is preferred over any other source type, followed by online playlists and local playlists; the remaining source types appear far less often. We cannot conclude much yet, though, since we have not yet dealt with cleaning, imputation, or the statistics.

Still, it already looks as if the answers for building this system will revolve mainly around the local library; let us see what the other results say.

In [10]:
plt.figure(figsize=(20,15))
sns.set(font_scale=2)
sns.countplot(y='source_screen_name',data=train,facecolor=(0,0,0,0),linewidth=5,edgecolor=sns.color_palette('dark',3))
sns.set(style="darkgrid")
plt.xlabel('source types',fontsize=30)
plt.ylabel('count',fontsize=30)
plt.xticks(rotation='45')
plt.title('Count plot for which  screen using ',fontsize=30)
plt.tight_layout()
 
 

The second visualization tells us that most users listen through the "Local playlist more" screen, i.e. they use the local library provided by the app itself; after that, most users come back to songs via online playlist sources.

Very little comes from the other screens, so most of the variance is concentrated in two areas: local libraries and online playlists.

In [11]:
plt.figure(figsize=(20,15))
sns.set(font_scale=2)
sns.countplot(x='source_system_tab',hue='source_system_tab',data=train)
sns.set(style="darkgrid")
plt.xlabel('source types',fontsize=30)
plt.ylabel('count',fontsize=30)
plt.xticks(rotation='45')
plt.title('Count plot for system tab there are using',fontsize=30)
plt.tight_layout()
 
 

So for anyone who has installed the KKBOX app, most users go back to their songs via the "my library" tab rather than by rediscovering them; there are several routes back to a song, but the most preferred one is "my library".

Now let us do some visualization on members.csv.

In [12]:
import matplotlib as mpl

mpl.rcParams['font.size'] = 40.0
labels = ['Male','Female']
plt.figure(figsize = (12, 12))
sizes = pd.value_counts(members.gender)
patches, texts, autotexts = plt.pie(sizes, 
                                    labels=labels, autopct='%.0f%%',
                                    shadow=False, radius=1,startangle=90)
for t in texts:
    t.set_size('smaller')
plt.legend()
plt.show()
 
 

As we can see, we have more male users. The next visualization should show, for each gender, which are the popular ways of going back to their playlists.

In [13]:
import matplotlib.pyplot as plt
mpl.rcParams['font.size'] = 40.0
plt.figure(figsize = (20, 20)) 
# Make data: I have 3 groups and 7 subgroups
group_names=['explore','my library','search','discover','radio','listen with','notification','settings']
group_size=pd.value_counts(train.source_system_tab)
print(group_size)
subgroup_names=['Male','Female']
subgroup_size=pd.value_counts(members.gender)
 
# Create colors
a, b, c,d,e,f,g,h=[plt.cm.autumn, plt.cm.GnBu, plt.cm.YlGn,plt.cm.Purples,plt.cm.cool,plt.cm.RdPu,plt.cm.BuPu,plt.cm.bone]
 
# First Ring (outside)
fig, ax = plt.subplots()
ax.axis('equal')
mypie, texts= ax.pie(group_size, radius=3.0,labels=group_names, colors=[a(0.6), b(0.6), c(0.6),d(0.6), e(0.6), f(0.6),g(0.6)])
plt.setp( mypie, width=0.3, edgecolor='white')
 
# Second Ring (Inside)
#mypie2, texts1 = ax.pie(subgroup_size, radius=3.0-0.3, labels=subgroup_names, labeldistance=0.7, colors=[h(0.5), b(0.4)])
#plt.setp( mypie2, width=0.3, edgecolor='white')
#plt.margins(0,0)
#for t in texts:
 #   t.set_size(25.0)
#for t in texts1:
 
    #t.set_size(25.0)    
plt.legend() 
# show it
plt.show()
 
my library      3684730
discover        2179252
search           623286
radio            476701
listen with      212266
explore          167949
notification       6185
settings           2200
Name: source_system_tab, dtype: int64
 
<matplotlib.figure.Figure at 0x7f84742910f0>
 
 

The inference we can draw from this chart is that men mostly stick to one method (exploration) of getting back to their music, whereas women use every possible way to return to the music of their choice; in the real world this is quite similar, with men tending to focus on one direction in depth and women covering many directions, but not in depth.

We are moving in the right direction for building an accurate system.

 

Now some statistical inferences.

We have numeric data in two of the CSV files and mostly categorical data in the rest: members.csv (a few numeric columns) and songs.csv (song_length and language).
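
A quick way to confirm which columns are numeric in each file (a small sketch using select_dtypes):

print(members.select_dtypes(include=[np.number]).columns.tolist())
print(songs.select_dtypes(include=[np.number]).columns.tolist())
print(train.select_dtypes(include=[np.number]).columns.tolist())
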

In [14]:
print(members.describe())
 
               city            bd  registered_via  registration_init_time  \
count  34403.000000  34403.000000    34403.000000            3.440300e+04   
mean       5.371276     12.280935        5.953376            2.013994e+07   
std        6.243929     18.170251        2.287534            2.954015e+04   
min        1.000000    -43.000000        3.000000            2.004033e+07   
25%        1.000000      0.000000        4.000000            2.012103e+07   
50%        1.000000      0.000000        7.000000            2.015090e+07   
75%       10.000000     25.000000        9.000000            2.016110e+07   
max       22.000000   1051.000000       16.000000            2.017023e+07   

       expiration_date  
count     3.440300e+04  
mean      2.016901e+07  
std       7.320925e+03  
min       1.970010e+07  
25%       2.017020e+07  
50%       2.017091e+07  
75%       2.017093e+07  
max       2.020102e+07  
In [15]:
print(songs.describe())
 
        song_length      language
count  2.296320e+06  2.296319e+06
mean   2.469935e+05  3.237800e+01
std    1.609200e+05  2.433241e+01
min    1.850000e+02 -1.000000e+00
25%    1.836000e+05 -1.000000e+00
50%    2.266270e+05  5.200000e+01
75%    2.772690e+05  5.200000e+01
max    1.217385e+07  5.900000e+01
 

Now let us explore the distribution of registration times in members.csv.

In [16]:
mpl.rcParams['font.size'] = 40.0
plt.figure(figsize = (20, 20)) 
sns.distplot(members.registration_init_time)
sns.set(font_scale=2)
plt.ylabel('ecdf',fontsize=50)
plt.xlabel('registration time ' ,fontsize=50)
Out[16]:
Text(0.5,0,'registration time ')
 
 

The inference we can draw from the two results above is that most registrations were made between 2012 and 2016, and the distribution is clearly skewed; one more thing, before using this feature we should normalize it.
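
A minimal sketch of that normalization step, assuming we first convert the %Y%m%d integers to datetimes and then min-max scale the registration year; the registration_year and reg_year_scaled columns added below are hypothetical helpers, not part of the original data.

reg = pd.to_datetime(members['registration_init_time'], format='%Y%m%d', errors='coerce')
members['registration_year'] = reg.dt.year
# min-max scale to [0, 1] so the feature does not dominate scale-sensitive models
year_min, year_max = members['registration_year'].min(), members['registration_year'].max()
members['reg_year_scaled'] = (members['registration_year'] - year_min) / (year_max - year_min)
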

In [17]:
members.describe()
Out[17]:
       city          bd            registered_via  registration_init_time  expiration_date
count  34403.000000  34403.000000  34403.000000    3.440300e+04            3.440300e+04
mean   5.371276      12.280935     5.953376        2.013994e+07            2.016901e+07
std    6.243929      18.170251     2.287534        2.954015e+04            7.320925e+03
min    1.000000      -43.000000    3.000000        2.004033e+07            1.970010e+07
25%    1.000000      0.000000      4.000000        2.012103e+07            2.017020e+07
50%    1.000000      0.000000      7.000000        2.015090e+07            2.017091e+07
75%    10.000000     25.000000     9.000000        2.016110e+07            2.017093e+07
max    22.000000     1051.000000   16.000000       2.017023e+07            2.020102e+07
In [18]:
songs.describe()
Out[18]:
       song_length   language
count  2.296320e+06  2.296319e+06
mean   2.469935e+05  3.237800e+01
std    1.609200e+05  2.433241e+01
min    1.850000e+02  -1.000000e+00
25%    1.836000e+05  -1.000000e+00
50%    2.266270e+05  5.200000e+01
75%    2.772690e+05  5.200000e+01
max    1.217385e+07  5.900000e+01
In [19]:
train.describe()
Out[19]:
       target
count  7.377418e+06
mean   5.035171e-01
std    4.999877e-01
min    0.000000e+00
25%    0.000000e+00
50%    1.000000e+00
75%    1.000000e+00
max    1.000000e+00
In [20]:
train.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7377418 entries, 0 to 7377417
Data columns (total 6 columns):
msno                  object
song_id               object
source_system_tab     object
source_screen_name    object
source_type           object
target                int64
dtypes: int64(1), object(5)
memory usage: 337.7+ MB
In [21]:
members.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34403 entries, 0 to 34402
Data columns (total 7 columns):
msno                      34403 non-null object
city                      34403 non-null int64
bd                        34403 non-null int64
gender                    14501 non-null object
registered_via            34403 non-null int64
registration_init_time    34403 non-null int64
expiration_date           34403 non-null int64
dtypes: int64(5), object(2)
memory usage: 1.8+ MB
 

We can see that in the members and songs CSV files there are large differences between the min and max values, which suggests that there are outliers that have to be treated before building the system.
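
One hedged way to treat the most obvious case, the bd (age) column already flagged in the data description, is sketched below; the 0 and 100 cutoffs are assumptions, not values given by KKBOX.

# Mark implausible ages as missing so they can be imputed later.
members['bd'] = members['bd'].mask((members['bd'] <= 0) | (members['bd'] > 100), np.nan)
print(members['bd'].describe())
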

 

Conversion of int, float, and categorical data types has to be done to reduce the data size, both for computation and for storage.
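
A sketch of that conversion for the string columns, to be applied after the merge performed below; the column list is an assumption covering the low-cardinality object columns.

low_card_cols = ['source_system_tab', 'source_screen_name', 'source_type', 'gender', 'genre_ids']
for col in low_card_cols:
    if col in train_merged.columns:
        train_merged[col] = train_merged[col].astype('category')  # pandas category dtype cuts memory
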

In [22]:
train_members = pd.merge(train, members, on='msno', how='inner')
train_merged = pd.merge(train_members, songs, on='song_id', how='outer')
print(train_merged.head())
 
                                           msno  \
0  FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=   
1  pouJqjNRmZOnRNzzMWWkamTKkIGHyvhl/jo4HgbncnM=   
2  xbodnNBaLMyqqI7uFJlvHOKMJaizuWo/BB/YHZICcKo=   
3  s0ndDsjI79amU0RBiullFN8HRz9HjE++34jGNa7zJ/s=   
4  Vw4Umh6/qlsJDC/XMslyAxVvRgFJGHr53yb/nrmY1DU=   

                                        song_id source_system_tab  \
0  BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=           explore   
1  BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=          discover   
2  BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=        my library   
3  BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=        my library   
4  BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=        my library   

     source_screen_name      source_type  target  city    bd  gender  \
0               Explore  online-playlist     1.0   1.0   0.0     NaN   
1  Online playlist more  online-playlist     0.0  15.0  18.0    male   
2   Local playlist more    local-library     1.0   1.0   0.0     NaN   
3   Local playlist more    local-library     1.0   5.0  21.0  female   
4   Local playlist more    local-library     0.0   6.0  33.0  female   

   registered_via  registration_init_time  expiration_date  song_length  \
0             7.0              20120102.0       20171005.0     206471.0   
1             4.0              20151220.0       20170930.0     206471.0   
2             7.0              20120804.0       20171004.0     206471.0   
3             9.0              20110808.0       20170917.0     206471.0   
4             9.0              20070323.0       20170915.0     206471.0   

  genre_ids artist_name              composer lyricist  language  
0       359    Bastille  Dan Smith| Mark Crew      NaN      52.0  
1       359    Bastille  Dan Smith| Mark Crew      NaN      52.0  
2       359    Bastille  Dan Smith| Mark Crew      NaN      52.0  
3       359    Bastille  Dan Smith| Mark Crew      NaN      52.0  
4       359    Bastille  Dan Smith| Mark Crew      NaN      52.0  
In [23]:
test_members = pd.merge(test, members, on='msno', how='inner')
test_merged = pd.merge(test_members, songs, on='song_id', how='outer')
print(test_merged.head())
print(len(test_merged.columns))
 
          id                                          msno  \
0        0.0  V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM=   
1  1035059.0  08rvvaaab7dM7h78GC4SphLkUCSXPxpu6sY+k8aLUO4=   
2    89968.0  1NvrMNDUcvfqOIjhim8BgdK23znMzGwAO84W+qKs6dw=   
3   972394.0  GfSXhTVP3oj7h0545L/5xh6jD+7edQ7AH0iprl7dYbc=   
4  2194574.0  HkWEvfQyrb5Lve8X3B7HkCEkDFW8qFy/9kWFb4QbM5k=   

                                        song_id source_system_tab  \
0  WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=        my library   
1  WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=        my library   
2  WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=        my library   
3  WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=        my library   
4  WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=          discover   

    source_screen_name          source_type  city    bd  gender  \
0  Local playlist more        local-library   1.0   0.0     NaN   
1  Local playlist more        local-library   5.0  29.0  female   
2  Local playlist more        local-library  14.0  20.0     NaN   
3  Local playlist more        local-library  22.0  22.0    male   
4     Discover Feature  song-based-playlist  15.0  26.0  female   

   registered_via  registration_init_time  expiration_date  song_length  \
0             7.0              20160219.0       20170918.0     224130.0   
1             7.0              20120105.0       20171113.0     224130.0   
2             3.0              20130908.0       20171003.0     224130.0   
3             7.0              20131011.0       20170911.0     224130.0   
4             9.0              20060616.0       20180516.0     224130.0   

  genre_ids         artist_name        composer lyricist  language  
0       458  梁文音 (Rachel Liang)  Qi Zheng Zhang      NaN       3.0  
1       458  梁文音 (Rachel Liang)  Qi Zheng Zhang      NaN       3.0  
2       458  梁文音 (Rachel Liang)  Qi Zheng Zhang      NaN       3.0  
3       458  梁文音 (Rachel Liang)  Qi Zheng Zhang      NaN       3.0  
4       458  梁文音 (Rachel Liang)  Qi Zheng Zhang      NaN       3.0  
18
In [24]:
del train_members
del test_members
In [25]:
ax = sns.countplot(y=train_merged.dtypes, data=train_merged)
 
In [26]:
print(train_merged.columns.to_series().groupby(train_merged.dtypes).groups)
print(test_merged.columns.to_series().groupby(test_merged.dtypes).groups)
 
{dtype('float64'): Index(['target', 'city', 'bd', 'registered_via', 'registration_init_time',
       'expiration_date', 'song_length', 'language'],
      dtype='object'), dtype('O'): Index(['msno', 'song_id', 'source_system_tab', 'source_screen_name',
       'source_type', 'gender', 'genre_ids', 'artist_name', 'composer',
       'lyricist'],
      dtype='object')}
{dtype('float64'): Index(['id', 'city', 'bd', 'registered_via', 'registration_init_time',
       'expiration_date', 'song_length', 'language'],
      dtype='object'), dtype('O'): Index(['msno', 'song_id', 'source_system_tab', 'source_screen_name',
       'source_type', 'gender', 'genre_ids', 'artist_name', 'composer',
       'lyricist'],
      dtype='object')}
 

Analysis on missing values

In [27]:
msno.heatmap(train_merged)
#msno.matrix(train_merged)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f84742623c8>
 
 

As we can see, a lot of missing values show up, and the common pattern is that most of them come from the members and songs tables.

The missing-value heatmap also shows that the missingness of gender is positively correlated with four variables from train.csv, and the remaining correlations are with the members.csv variables.

In [28]:
#msno.dendrogram(train_merged)
 

We can see a strong nullity correlation here:

song_id -> language, song_length, artist_name, genre_ids

composer -> lyricist

gender -> song_id

From the heatmap we can also say that if gender is missing, roughly 70% of the missing values will be in msno, target, city, etc.
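
The same nullity relationships can be read off numerically with a small sketch: correlate the boolean missingness indicators directly.

null_corr = train_merged.isnull().corr()
print(null_corr['gender'].sort_values(ascending=False).head(10))
print(null_corr.loc['composer', 'lyricist'])
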

 

Now let us check for missing values and replace them with a sentinel value.

In [29]:
#--- Function to check if missing values are present and if so print the columns having them ---
def check_missing_values(df):
    print(df.isnull().values.any())
    # Only look up the affected columns when missing values actually exist,
    # otherwise columns_with_Nan would be undefined below.
    if df.isnull().values.any():
        columns_with_Nan = df.columns[df.isnull().any()].tolist()
        print(columns_with_Nan)
        for col in columns_with_Nan:
            print("%s : %d" % (col, df[col].isnull().sum()))
    
check_missing_values(train_merged)
check_missing_values(test_merged)
 
True
['msno', 'source_system_tab', 'source_screen_name', 'source_type', 'target', 'city', 'bd', 'gender', 'registered_via', 'registration_init_time', 'expiration_date', 'song_length', 'genre_ids', 'artist_name', 'composer', 'lyricist', 'language']
msno : 1936406
source_system_tab : 1961255
source_screen_name : 2351210
source_type : 1957945
target : 1936406
city : 1936406
bd : 1936406
gender : 4897885
registered_via : 1936406
registration_init_time : 1936406
expiration_date : 1936406
song_length : 114
genre_ids : 205338
artist_name : 114
composer : 2591558
lyricist : 4855358
language : 150
True
['id', 'msno', 'source_system_tab', 'source_screen_name', 'source_type', 'city', 'bd', 'gender', 'registered_via', 'registration_init_time', 'expiration_date', 'song_length', 'genre_ids', 'artist_name', 'composer', 'lyricist', 'language']
id : 2071581
msno : 2071581
source_system_tab : 2080023
source_screen_name : 2234464
source_type : 2078878
city : 2071581
bd : 2071581
gender : 3123805
registered_via : 2071581
registration_init_time : 2071581
expiration_date : 2071581
song_length : 25
genre_ids : 132345
artist_name : 25
composer : 1595714
lyricist : 3008577
language : 42
In [30]:
#--- Function to replace Nan values in columns of type float with -5 ---
def replace_Nan_non_object(df):
    float_cols = list(df.select_dtypes(include=['float']).columns)
    for col in float_cols:
        df[col] = df[col].fillna(-5)
       
replace_Nan_non_object(train_merged) 
replace_Nan_non_object(test_merged)  
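
The helper above only touches float columns; here is a companion sketch for the object (string) columns, using a text sentinel instead of -5 (the 'unknown' label is an assumption, not something used later in this notebook).

def replace_Nan_object(df):
    object_cols = list(df.select_dtypes(include=['object']).columns)
    for col in object_cols:
        df[col] = df[col].fillna('unknown')

# replace_Nan_object(train_merged)
# replace_Nan_object(test_merged)
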
In [31]:
#--- memory consumed by train dataframe ---
mem = train_merged.memory_usage(index=True).sum()
print("Memory consumed by training set  :   {} MB" .format(mem/ 1024**2))
 
#--- memory consumed by test dataframe ---
mem = test_merged.memory_usage(index=True).sum()
print("Memory consumed by test set      :   {} MB" .format(mem/ 1024**2))
 
Memory consumed by training set  :   1350.117919921875 MB
Memory consumed by test set      :   670.9216995239258 MB
In [32]:
def change_datatype(df):
    # Downcast float columns to the smallest integer type that can hold their range.
    float_cols = list(df.select_dtypes(include=['float']).columns)
    for col in float_cols:
        if (np.max(df[col]) <= 127) and (np.min(df[col]) >= -128):
            df[col] = df[col].astype(np.int8)
        elif (np.max(df[col]) <= 32767) and (np.min(df[col]) >= -32768):
            df[col] = df[col].astype(np.int16)
        elif (np.max(df[col]) <= 2147483647) and (np.min(df[col]) >= -2147483648):
            df[col] = df[col].astype(np.int32)
        else:
            df[col] = df[col].astype(np.int64)

change_datatype(train_merged)
change_datatype(test_merged)
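
Repeating the measurement from In [31] after the downcast shows its effect (a sketch; the exact numbers will depend on the column mix).

mem = train_merged.memory_usage(index=True).sum()
print("Memory consumed by training set after downcast :   {} MB".format(mem / 1024**2))
mem = test_merged.memory_usage(index=True).sum()
print("Memory consumed by test set after downcast     :   {} MB".format(mem / 1024**2))
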
In [33]:
data = train_merged.groupby('target').aggregate({'msno':'count'}).reset_index()
a4_dims = (15, 8)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x='target', y='msno', data=data)
 
 

As we can see, new users number about 5,500 and old users about 15,000.

The -5 bars correspond to values that were originally empty (the sentinel filled in during imputation).

In [34]:
mpl.rcParams['font.size'] = 40.0
plt.figure(figsize = (20, 20)) 
data=train_merged.groupby('source_system_tab').aggregate({'msno':'count'}).reset_index()
sns.barplot(x='source_system_tab',y='msno',data=data)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8474172ac8>
 
In [35]:
data = train_merged.groupby('source_screen_name').aggregate({'msno':'count'}).reset_index()
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x='source_screen_name', y='msno', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
Out[35]:
[Text(0,0,'Album more'),
 Text(0,0,'Artist more'),
 Text(0,0,'Concert'),
 Text(0,0,'Discover Chart'),
 Text(0,0,'Discover Feature'),
 Text(0,0,'Discover Genre'),
 Text(0,0,'Discover New'),
 Text(0,0,'Explore'),
 Text(0,0,'Local playlist more'),
 Text(0,0,'My library'),
 Text(0,0,'My library_Search'),
 Text(0,0,'Online playlist more'),
 Text(0,0,'Others profile more'),
 Text(0,0,'Payment'),
 Text(0,0,'Radio'),
 Text(0,0,'Search'),
 Text(0,0,'Search Home'),
 Text(0,0,'Search Trends'),
 Text(0,0,'Self profile more'),
 Text(0,0,'Unknown')]
 
In [36]:
data = train_merged.groupby('source_type').aggregate({'msno':'count'}).reset_index()
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x='source_type', y='msno', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
Out[36]:
[Text(0,0,'album'),
 Text(0,0,'artist'),
 Text(0,0,'listen-with'),
 Text(0,0,'local-library'),
 Text(0,0,'local-playlist'),
 Text(0,0,'my-daily-playlist'),
 Text(0,0,'online-playlist'),
 Text(0,0,'radio'),
 Text(0,0,'song'),
 Text(0,0,'song-based-playlist'),
 Text(0,0,'top-hits-for-artist'),
 Text(0,0,'topic-article-playlist')]
 
In [37]:
data = train_merged.groupby('language').aggregate({'msno':'count'}).reset_index()
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x='language', y='msno', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
Out[37]:
[Text(0,0,'-5'),
 Text(0,0,'-1'),
 Text(0,0,'3'),
 Text(0,0,'10'),
 Text(0,0,'17'),
 Text(0,0,'24'),
 Text(0,0,'31'),
 Text(0,0,'38'),
 Text(0,0,'45'),
 Text(0,0,'52'),
 Text(0,0,'59')]
 
In [38]:
data = train_merged.groupby('registered_via').aggregate({'msno':'count'}).reset_index()
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x='registered_via', y='msno', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
Out[38]:
[Text(0,0,'-5'),
 Text(0,0,'3'),
 Text(0,0,'4'),
 Text(0,0,'7'),
 Text(0,0,'9'),
 Text(0,0,'13')]
 
 

Most users registered via methods 7 and 9.

In [39]:
print(train_merged.columns)
data = train_merged.groupby('city').aggregate({'msno':'count'}).reset_index()
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x='city', y='msno', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
 
Index(['msno', 'song_id', 'source_system_tab', 'source_screen_name',
       'source_type', 'target', 'city', 'bd', 'gender', 'registered_via',
       'registration_init_time', 'expiration_date', 'song_length', 'genre_ids',
       'artist_name', 'composer', 'lyricist', 'language'],
      dtype='object')
Out[39]:
[Text(0,0,'-5'),
 Text(0,0,'1'),
 Text(0,0,'3'),
 Text(0,0,'4'),
 Text(0,0,'5'),
 Text(0,0,'6'),
 Text(0,0,'7'),
 Text(0,0,'8'),
 Text(0,0,'9'),
 Text(0,0,'10'),
 Text(0,0,'11'),
 Text(0,0,'12'),
 Text(0,0,'13'),
 Text(0,0,'14'),
 Text(0,0,'15'),
 Text(0,0,'16'),
 Text(0,0,'17'),
 Text(0,0,'18'),
 Text(0,0,'19'),
 Text(0,0,'20'),
 Text(0,0,'21'),
 Text(0,0,'22')]
 
 

Cities 1, 13, and 5 contain the largest numbers of users.

In [40]:
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax=sns