b

2019-04-29 21:39:10


Data preprocessing

  1. Load the data
    import pandas as pd
    import numpy as np
    ### sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
    from sklearn.model_selection import train_test_split
    ### Load data
    data = pd.read_csv('data/loan/Train.csv', encoding = "ISO-8859-1")
    ### Split the data into train and test sets, stratified on the target column
    train, test = train_test_split(data,train_size=0.7,random_state=123,stratify=data['Disbursed'])
    ### Check the number of nulls in each feature column
    nulls_per_column = train.isnull().sum()
    print(nulls_per_column)
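  2. Split the features into numerical and categorical
    ### Drop the useless columns
    train_1 = train.drop(['ID','Lead_Creation_Date','LoggedIn'],axis=1)
    ### Split the columns into numerical and categorical
    category_cols = train_1.columns[train_1.dtypes==object].tolist()
    ### DOB has object dtype but is not treated as a category
    category_cols.remove('DOB')
    ### Var4 is not picked up by the dtype test but is treated as categorical
    category_cols.append('Var4')
    ### Everything else is numerical (this leaves DOB and the target Disbursed in numeric_cols);
    ### a list comprehension keeps the original column order, unlike a set difference
    numeric_cols = [c for c in train_1.columns if c not in category_cols]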
  3. Explore and process the categorical features
    ### Explore the categorical columns
    for v in category_cols:
        print('Ratio of missing values for variable {0}: {1}'.format(v,nulls_per_column[v]/train_1.shape[0]))
    print('-----------------------------------------------------------')
    counts = dict()
    for v in category_cols:
        print('\nFrequency count for variable %s'%v)
        counts[v] = train_1[v].value_counts()
        print(counts[v])
    ### Merge the cities with counts<200 into 'others' (a set makes the membership test fast)
    merge_city = {c for c in counts['City'].index if counts['City'][c]<200}
    train_1['City'] = train_1['City'].apply(lambda x: 'others' if x in merge_city else x)
    ### Merge the salary accounts with counts<100 into 'others'
    merge_sa = {c for c in counts['Salary_Account'].index if counts['Salary_Account'][c]<100}
    train_1['Salary_Account'] = train_1['Salary_Account'].apply(lambda x: 'others' if x in merge_sa else x)
    ### Merge the sources with counts<100 into 'others'
    merge_sr = {c for c in counts['Source'].index if counts['Source'][c]<100}
    train_1['Source'] = train_1['Source'].apply(lambda x: 'others' if x in merge_sr else x)
    ### Impute the missing values (assign back instead of calling fillna(inplace=True) on a column)
    train_1['City'] = train_1['City'].fillna('Missing')
    train_1['Salary_Account'] = train_1['Salary_Account'].fillna('Missing')
    ### Drop the column Employer_Name because it has too many categories
    train_2 = train_1.drop('Employer_Name',axis=1)
  4. Explore and process the numerical features
  5. One-Hot encoding
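
The code for steps 4 and 5 did not survive on this page, so the following is only a minimal sketch of what such steps could look like, not the author's version. It assumes median imputation for the numerical columns (the original treatment is unknown) and uses pd.get_dummies for the one-hot encoding. The toy frame, its values, and the Monthly_Income column are hypothetical; City, Var4, and Disbursed are names taken from the code above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_2; values and the Monthly_Income column are made up.
toy = pd.DataFrame({
    'City': ['Delhi', 'others', 'Missing', 'Delhi'],
    'Var4': [1, 3, 1, 5],
    'Monthly_Income': [20000.0, np.nan, 35000.0, 18000.0],
    'Disbursed': [0, 0, 1, 0],
})
toy_category_cols = ['City', 'Var4']
toy_numeric_cols = ['Monthly_Income']

# Step 4 (assumed treatment): fill numerical gaps with each column's median.
medians = toy[toy_numeric_cols].median()
toy[toy_numeric_cols] = toy[toy_numeric_cols].fillna(medians)

# Step 5: replace each categorical column with one 0/1 indicator column per level.
encoded = pd.get_dummies(toy, columns=toy_category_cols, dtype=int)
print(encoded.columns.tolist())
```

If the test split is encoded the same way, its columns may differ when a level is absent from one split; reindexing the encoded test frame to the training columns with fill_value=0 is one way to align them.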

Model tuning

  1. Build a baseline model and use early stopping to set the number of boosting iterations
  2. Tune max_depth and min_child_weight
  3. Tune gamma
  4. Tune subsample and colsample_bytree
  5. Tune reg_alpha
  6. Tune reg_lambda
  7. Reduce learning rate
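
The tuning code is also missing from this page. The parameter names in the list match XGBoost's scikit-learn interface, which is assumed here, but the sketch below avoids depending on any library. It illustrates only the early-stopping rule from step 1: training stops once the validation metric has not improved for a given number of rounds, and the best round seen so far is kept. The function name best_round, the patience argument, and the score values are hypothetical.

```python
def best_round(val_scores, patience):
    """Return (index of best round, rounds actually run) for a higher-is-better metric."""
    best_idx = 0
    for i, score in enumerate(val_scores):
        if score > val_scores[best_idx]:
            best_idx = i                      # new best round
        elif i - best_idx >= patience:
            return best_idx, i + 1            # no improvement for `patience` rounds: stop
    return best_idx, len(val_scores)


# Made-up per-round validation scores, e.g. AUC on a held-out fold.
scores = [0.70, 0.74, 0.78, 0.77, 0.78, 0.76, 0.80]
best_idx, rounds_run = best_round(scores, patience=3)
print(best_idx, rounds_run)
```

Boosting libraries usually implement this rule themselves; the point of the sketch is only that the iteration count is read off the validation curve rather than fixed in advance. After the learning rate is lowered in step 7, more rounds are typically needed, so the count is usually chosen again in the same way.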

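Steps 2 to 6 tune one group of parameters at a time and carry the best values forward instead of searching all parameters jointly. A minimal, library-independent sketch of that staged search follows. The helper staged_search, the grids, and the toy scoring function are all hypothetical; in practice the score would presumably come from cross-validating the model with the candidate parameters.

```python
from itertools import product


def staged_search(score_fn, base_params, stages):
    """Tune parameter groups one stage at a time, keeping the best values found so far."""
    best_params = dict(base_params)
    best_score = score_fn(best_params)
    for grid in stages:
        names = list(grid)
        # Try every combination within this stage; earlier stages stay fixed.
        for values in product(*(grid[n] for n in names)):
            candidate = {**best_params, **dict(zip(names, values))}
            score = score_fn(candidate)
            if score > best_score:
                best_params, best_score = candidate, score
    return best_params, best_score


# Made-up optimum so the example runs without training a model.
toy_optimum = {'max_depth': 5, 'min_child_weight': 3, 'gamma': 0.1,
               'subsample': 0.8, 'colsample_bytree': 0.8,
               'reg_alpha': 0.01, 'reg_lambda': 1}


def toy_score(params):
    # Stand-in for a cross-validated metric (higher is better).
    return -sum((params[k] - v) ** 2 for k, v in toy_optimum.items())


base = {'max_depth': 3, 'min_child_weight': 1, 'gamma': 0.0,
        'subsample': 1.0, 'colsample_bytree': 1.0,
        'reg_alpha': 0, 'reg_lambda': 1}
stages = [
    {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]},
    {'gamma': [0.0, 0.1, 0.2]},
    {'subsample': [0.6, 0.8, 1.0], 'colsample_bytree': [0.6, 0.8, 1.0]},
    {'reg_alpha': [0, 0.01, 0.1]},
    {'reg_lambda': [0.1, 1, 10]},
]
best_params, best_score = staged_search(toy_score, base, stages)
print(best_params)
```

The staged order trades completeness for cost: it evaluates the sum of the grid sizes rather than their product, so it can miss interactions between parameters tuned in different stages.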



Source: https://www.cnblogs.com/sunwq06/p/10793016.html
