Data preprocessing
- Import the data
```python
import pandas as pd
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

### Load data
data = pd.read_csv('data/loan/Train.csv', encoding="ISO-8859-1")

### Split the data into train and test sets (70/30, stratified on the target)
train, test = train_test_split(data, train_size=0.7, random_state=123, stratify=data['Disbursed'])

### Check the number of nulls in each feature column
nulls_per_column = train.isnull().sum()
print(nulls_per_column)
```
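The `stratify` argument keeps the share of positive `Disbursed` labels roughly equal in both splits, which matters when the target is imbalanced. A minimal sketch on made-up data (the `toy` frame and its 10% positive rate are assumptions for illustration, not values from the loan dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10% positives, standing in for an imbalanced 'Disbursed' target.
toy = pd.DataFrame({"feature": range(200), "Disbursed": [1] * 20 + [0] * 180})

toy_train, toy_test = train_test_split(
    toy, train_size=0.7, random_state=123, stratify=toy["Disbursed"]
)

# With stratification the positive rate is (almost) identical in both parts.
train_rate = toy_train["Disbursed"].mean()
test_rate = toy_test["Disbursed"].mean()
print(len(toy_train), len(toy_test), train_rate, test_rate)
```

Without `stratify`, a random split of a rare class can leave the test set with noticeably more or fewer positives than the training set.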
- Split the features into numerical and categorical
```python
### Drop the useless columns
train_1 = train.drop(['ID', 'Lead_Creation_Date', 'LoggedIn'], axis=1)

### Split the columns to numerical and categorical
category_cols = train_1.columns[train_1.dtypes == object].tolist()
category_cols.remove('DOB')
category_cols.append('Var4')
numeric_cols = list(set(train_1.columns) - set(category_cols))
```
- Analyze and process the categorical features
```python
### explore the categorical columns
for v in category_cols:
    print('Ratio of missing value for variable {0}: {1}'.format(v, nulls_per_column[v] / train_1.shape[0]))
print('-----------------------------------------------------------')
counts = dict()
for v in category_cols:
    print('\nFrequency count for variable %s' % v)
    counts[v] = train_1[v].value_counts()
    print(counts[v])

### merge the cities with counts < 200
merge_city = [c for c in counts['City'].index if counts['City'][c] < 200]
train_1['City'] = train_1['City'].apply(lambda x: 'others' if x in merge_city else x)

### merge the salary accounts with counts < 100
merge_sa = [c for c in counts['Salary_Account'].index if counts['Salary_Account'][c] < 100]
train_1['Salary_Account'] = train_1['Salary_Account'].apply(lambda x: 'others' if x in merge_sa else x)

### merge the sources with counts < 100
merge_sr = [c for c in counts['Source'].index if counts['Source'][c] < 100]
train_1['Source'] = train_1['Source'].apply(lambda x: 'others' if x in merge_sr else x)

### impute the missing values
### (assign the result back: fillna(inplace=True) on a selected column is unreliable in recent pandas)
train_1['City'] = train_1['City'].fillna('Missing')
train_1['Salary_Account'] = train_1['Salary_Account'].fillna('Missing')

### delete the column Employer_Name since it has too many categories
train_2 = train_1.drop('Employer_Name', axis=1)
```
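The same "merge rare levels, then fill missing values" pattern is applied to three columns, so it can be factored into one helper. The sketch below is an illustration only: `merge_rare_categories` is a hypothetical name that does not appear in the original post, and the toy city values are made up.

```python
import pandas as pd

def merge_rare_categories(series, min_count, other_label="others"):
    """Replace categories seen fewer than min_count times with other_label.

    Missing values are left untouched so they can be imputed separately.
    """
    counts = series.value_counts()            # NaN is excluded from the counts
    rare = set(counts.index[counts < min_count])
    # Keep values that are not rare; replace the rare ones.
    return series.where(~series.isin(rare), other_label)

# Toy example: 'Pune' occurs once and is merged, the missing value becomes 'Missing'.
city = pd.Series(["Delhi"] * 3 + ["Mumbai"] * 2 + ["Pune", None])
merged = merge_rare_categories(city, min_count=2).fillna("Missing")
print(merged.value_counts())
```

Using a set plus `isin` avoids scanning a Python list once per row, which the `apply`/`lambda` version does.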
- Analyze and process the numerical features
- One-Hot encoding
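The code for the one-hot encoding step is not shown here. One common way to do it, sketched under assumptions, is `pandas.get_dummies` on the training set followed by aligning the test set to the training columns. The toy frames and the `Income` column below are hypothetical and stand in for the processed loan data.

```python
import pandas as pd

# Toy frames standing in for the processed train/test sets (made-up values).
train_toy = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "others", "Missing"],
    "Income": [20000, 35000, 18000, 50000],
})
test_toy = pd.DataFrame({
    "City": ["Mumbai", "Pune"],          # 'Pune' never appears in train_toy
    "Income": [27000, 41000],
})

cat_cols = ["City"]

# One indicator column per category level seen in the training data.
train_encoded = pd.get_dummies(train_toy, columns=cat_cols)

# Encode the test set, then align it to the training columns:
# levels unseen in training are dropped, levels absent from test are filled with 0.
test_encoded = (
    pd.get_dummies(test_toy, columns=cat_cols)
    .reindex(columns=train_encoded.columns, fill_value=0)
)

print(train_encoded.columns.tolist())
print(test_encoded)
```

Aligning on the training columns keeps the feature matrix shape identical at training and prediction time, which a fitted model requires.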
Model tuning
- Build a baseline model and use early stopping to set the number of boosting rounds
- Tune max_depth and min_child_weight
- Tune gamma
- Tune subsample and colsample_bytree
- Tune reg_alpha
- Tune reg_lambda
- Reduce learning rate
Source: https://www.cnblogs.com/sunwq06/p/10793016.html