Grid Search

2021-02-02 07:33:32

Tags: neighbors, knn, Search, score, grid, train, Grid, test, best


The examples below classify the digits dataset with KNN.


Finding the best hyperparameters with plain Python loops

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
 
digits = datasets.load_digits() 
 
X = digits.data
y = digits.target
 
from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=2) 
 
from sklearn.neighbors import KNeighborsClassifier

Using k as the hyperparameter


best_score = 0.0
best_k = -1
for k in range(1,11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train) 
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_score = score
        best_k = k
    
print(best_k, best_score)
# 1 0.9861111111111112

# If the best value lies on the boundary of the range (e.g. 10), extend the search to values above 10.
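The boundary rule in the comment above can be automated. The following is a minimal sketch, assuming a hypothetical helper `search_k` and a caller-supplied `score_fn(k)` (neither is part of sklearn); it widens the range whenever the best k sits on the upper edge of what has been tried so far.

```python
def search_k(score_fn, start=1, stop=11, step=10, max_stop=101):
    """Return (best_k, best_score), widening the range while the best k is on its upper edge.

    score_fn, step and max_stop are illustrative assumptions, not sklearn API.
    """
    best_k, best_score = -1, float("-inf")
    k = start
    while k < stop:
        score = score_fn(k)
        if score > best_score:
            best_k, best_score = k, score
        k += 1
        # The best value is the last one tried: widen the range and keep going.
        if k == stop and best_k == stop - 1 and stop < max_stop:
            stop = min(stop + step, max_stop)
    return best_k, best_score


def toy_score(k):
    """Toy score that peaks at k = 14, outside the initial range 1..10."""
    return 1.0 - abs(k - 14) / 100


best_k, best_score = search_k(toy_score)
print(best_k, best_score)  # 14 1.0
```

With the digits example, `score_fn` could be a function that fits `KNeighborsClassifier(n_neighbors=k)` on `X_train` and returns its score on `X_test`.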

Adding the distance weighting hyperparameter: weights


best_score = 0.0
best_k = -1
best_method = ""

for method in ['uniform', 'distance']:

    for k in range(1,11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train) 
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_method = method
    
print(best_k, best_method, best_score)
# 1 uniform 0.9861111111111112
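The two `weights` options differ in how the k neighbors vote: `uniform` counts each neighbor once, while `distance` weights each vote by the inverse of its distance. Below is a minimal pure-Python sketch of that idea; the `vote` helper and the toy neighbors are illustrative assumptions, not sklearn's implementation, and the sketch assumes all distances are greater than zero.

```python
from collections import defaultdict


def vote(neighbors, weights="uniform"):
    """Pick a label from (distance, label) pairs of the k nearest points.

    Assumes every distance is > 0 (a zero distance would need special handling).
    """
    totals = defaultdict(float)
    for distance, label in neighbors:
        # 'uniform': one vote each; 'distance': closer neighbors count more.
        totals[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return max(totals, key=totals.get)


# One very close neighbor labelled 'a', two farther neighbors labelled 'b'.
neighbors = [(0.5, "a"), (2.0, "b"), (2.5, "b")]
print(vote(neighbors, "uniform"))   # b  (2 votes against 1)
print(vote(neighbors, "distance"))  # a  (1/0.5 = 2.0 against 1/2.0 + 1/2.5 = 0.9)
```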

Adding the distance norm hyperparameter: p

p defaults to 2, which gives the Euclidean distance.
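For reference, `p` is the exponent in the Minkowski distance, sum(|x - y|^p)^(1/p): p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. A small sketch with a hypothetical `minkowski` helper (written here for illustration only):

```python
def minkowski(x, y, p=2):
    """Minkowski distance between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


x, y = (0, 0), (3, 4)
print(minkowski(x, y, p=1))  # 7.0  (Manhattan: 3 + 4)
print(minkowski(x, y, p=2))  # 5.0  (Euclidean: sqrt(9 + 16))
```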

%%time
# Computing the distances involves taking roots, which is slow, so this cell is timed.
best_score = 0.0
best_k = -1
best_p = -1

for p in range(1, 6):

    for k in range(1,11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
        knn_clf.fit(X_train, y_train) 
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_p = p
    
print(best_k, best_p, best_score)
'''
1 2 0.9861111111111112
    CPU times: user 14.8 s, sys: 46.7 ms, total: 14.9 s
    Wall time: 14.9 s
'''

The search procedure above is known as grid search.
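The name reflects that every combination of candidate values is tried, so the candidates form a grid. The sketch below enumerates such a grid with `itertools.product`. The `expand_grid` helper is hypothetical; the grid is written as a list of dicts, each expanded separately, using the value ranges searched in this article.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per parameter combination; each dict in the list is expanded separately."""
    for grid in param_grid:
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            yield dict(zip(names, values))


param_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 11)), "p": list(range(1, 6))},
]
candidates = list(expand_grid(param_grid))
print(len(candidates))  # 60 = 10 + 10 * 5
print(candidates[0])    # {'n_neighbors': 1, 'weights': 'uniform'}
```

Each candidate dict can be unpacked into the classifier, for example `KNeighborsClassifier(**candidate)`.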


Using grid search in sklearn

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
 
digits = datasets.load_digits() 
 
X = digits.data
y = digits.target
 
from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=2) 
 
from sklearn.neighbors import KNeighborsClassifier
# Define the parameters to search

param_grid = [{'weights': ['uniform'],
               'n_neighbors': [i for i in range(1,11)]
              },
              
              {'weights': ['distance'],
               'n_neighbors': [i for i in range(1,11)],
               'p': [i for i in range(1,6)]
              }]
 
knn_clf = KNeighborsClassifier()

from sklearn.model_selection import GridSearchCV
# CV stands for Cross Validation.

grid_search = GridSearchCV(knn_clf, param_grid)

%%time
# Run this in its own notebook cell; the search takes a while.
grid_search.fit(X_train, y_train)

# CPU times: user 43.3 s, sys: 93.2 ms, total: 43.4 s
# Wall time: 43.5 s
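Cross validation means the training set is split into several folds; each candidate is fitted on all folds but one and scored on the held-out fold, and the scores are averaged. Below is a minimal sketch of the fold bookkeeping, using a hypothetical `k_fold_indices` helper. sklearn's own splitters also handle shuffling and stratification, which this sketch ignores.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=7, n_folds=3))
for train, validation in folds:
    print(train, validation)
# [3, 4, 5, 6] [0, 1, 2]
# [0, 1, 2, 5, 6] [3, 4]
# [0, 1, 2, 3, 4] [5, 6]
```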

'''
GridSearchCV(cv='warn', error_score='raise-deprecating',
                 estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                                metric='minkowski',
                                                metric_params=None, n_jobs=None,
                                                n_neighbors=5, p=2,
                                                weights='uniform'),
                 iid='warn', n_jobs=None,
                 param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                              'weights': ['uniform']},
                             {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                              'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
                 pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
                 scoring=None, verbose=0) 
'''

grid_search.best_estimator_  # The best classifier, shown with its parameters
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=1, p=2, weights='uniform')

 
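# Best accuracy (mean cross-validated score)
grid_search.best_score_
# 0.9846903270702854
 
# Best parameters
grid_search.best_params_  
# {'n_neighbors': 1, 'weights': 'uniform'}
 
# All of the attributes above end with an underscore. This follows a naming convention: values computed by the class itself, rather than passed in by the user, get a trailing underscore.

# Assign the best model to knn_clf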
knn_clf = grid_search.best_estimator_
 
knn_clf.predict(X_test) 
'''
array([4, 0, 9, 1, 8, 7, 1, 5, 1, 6, 6, 7, 6, 1, 5, 5, 7, 6, 2, 7, 4, 6, 1, 5, 2, 9, 5, 4, 6, 5, 6, 3, 4, 0, 9, 9, 8, 4, 6, 8, 8, 5, 7, ... 5, 7, 8, 0, 4, 1, 4, 5])
'''
 
knn_clf.score(X_test, y_test)
# 0.9861111111111112
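Note that `best_score_` (0.9847) and the final test score (0.9861) measure different things: the former is the mean cross-validated score on folds of the training set, while the latter is measured on the held-out test set. The sketch below shows where `best_score_` comes from; the reduced grid and the explicit `cv=3` are assumptions made to keep it quick.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=2)

# Reduced grid and explicit cv are assumptions made to keep the example quick.
grid_search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]}, cv=3)
grid_search.fit(X_train, y_train)

# Mean validation score of every candidate, averaged over the 3 folds.
mean_scores = grid_search.cv_results_["mean_test_score"]
print(mean_scores)
print(grid_search.best_score_)            # the largest entry of mean_scores
print(grid_search.score(X_test, y_test))  # accuracy on the held-out test set
```

Despite its name, `mean_test_score` refers to the validation folds inside the training set, not to `X_test`.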

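Improving efficiency


# The search above can run in parallel. n_jobs sets how many CPU cores are used: the default (None) means a single core, and -1 means all cores.
# verbose makes the search print progress, which helps track the state of a long search. Pass an integer; the larger it is, the more detailed the output.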
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
 
grid_search.fit(X_train, y_train)

''' 
    Fitting 3 folds for each of 60 candidates, totalling 180 fits
 
   ~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
      warnings.warn(CV_WARNING, FutureWarning)
    [Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
    [Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    2.3s
    [Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:    8.9s
    [Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:   11.1s finished 
    
    GridSearchCV(cv='warn', error_score='raise-deprecating',
                 estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                                metric='minkowski',
                                                metric_params=None, n_jobs=None,
                                                n_neighbors=1, p=2,
                                                weights='uniform'),
                 iid='warn', n_jobs=-1,
                 param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                              'weights': ['uniform']},
                             {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                              'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
                 pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
                 scoring=None, verbose=2)
'''

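About distances

Distances in machine learning

KNeighborsClassifier uses the Minkowski distance by default with p = 2 (the Euclidean distance); the metric parameter selects a different distance.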
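As a quick illustration of the `metric` parameter, the sketch below fits the same classifier with three metric identifiers. The choice of `n_neighbors=3` and of these three metrics is an assumption made for the example, and no particular scores are claimed.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=2)

scores = {}
# metric accepts an identifier string such as "manhattan" or "chebyshev".
for metric in ["euclidean", "manhattan", "chebyshev"]:
    knn_clf = KNeighborsClassifier(n_neighbors=3, metric=metric)
    knn_clf.fit(X_train, y_train)
    scores[metric] = knn_clf.score(X_test, y_test)
    print(metric, scores[metric])
```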

The sklearn documentation lists the available distances:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html

Metrics intended for real-valued vector spaces:

identifier      class name            args      distance function
"euclidean"     EuclideanDistance               sqrt(sum((x - y)^2))
"manhattan"     ManhattanDistance               sum(|x - y|)
"chebyshev"     ChebyshevDistance               max(|x - y|)
"minkowski"     MinkowskiDistance     p         sum(|x - y|^p)^(1/p)
"wminkowski"    WMinkowskiDistance    p, w      sum(|w * (x - y)|^p)^(1/p)
"seuclidean"    SEuclideanDistance    V         sqrt(sum((x - y)^2 / V))
"mahalanobis"   MahalanobisDistance   V or VI   sqrt((x - y)' V^-1 (x - y))

Metrics intended for two-dimensional vector spaces: Note that the haversine distance metric requires data in the form of [latitude, longitude] and both inputs and outputs are in units of radians.

identifier    class name          distance function
"haversine"   HaversineDistance   2 arcsin(sqrt(sin^2(0.5*dx) + cos(x1)cos(x2)sin^2(0.5*dy)))
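The haversine formula in the table can be checked by hand. Below is a minimal sketch with a hypothetical `haversine` helper that takes [latitude, longitude] pairs in radians and returns the distance in radians, as the note above describes; the Earth radius used for the conversion to kilometres is an added assumption.

```python
import math


def haversine(point1, point2):
    """Great-circle distance in radians between two [latitude, longitude] points in radians."""
    lat1, lon1 = point1
    lat2, lon2 = point2
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(0.5 * dlat) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(0.5 * dlon) ** 2
    return 2 * math.asin(math.sqrt(a))


# Two points on the equator, a quarter of the way around the globe from each other.
d = haversine([0.0, 0.0], [0.0, math.pi / 2])
print(d)         # about 1.5708 (pi / 2)
print(d * 6371)  # about 10007.5 km, assuming a mean Earth radius of 6371 km
```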

Metrics intended for integer-valued vector spaces: Though intended for integer-valued vectors, these are also valid metrics in the case of real-valued vectors.

identifier     class name           distance function
"hamming"      HammingDistance      N_unequal(x, y) / N_tot
"canberra"     CanberraDistance     sum(|x - y| / (|x| + |y|))
"braycurtis"   BrayCurtisDistance   sum(|x - y|) / (sum(|x|) + sum(|y|))

Metrics intended for boolean-valued vector spaces: Any nonzero entry is evaluated to “True”. In the listings below, the following abbreviations are used:

  • N : number of dimensions
  • NTT : number of dims in which both values are True
  • NTF : number of dims in which the first value is True, second is False
  • NFT : number of dims in which the first value is False, second is True
  • NFF : number of dims in which both values are False
  • NNEQ : number of non-equal dimensions, NNEQ = NTF + NFT
  • NNZ : number of nonzero dimensions, NNZ = NTF + NFT + NTT

identifier         class name               distance function
"jaccard"          JaccardDistance          NNEQ / NNZ
"matching"         MatchingDistance         NNEQ / N
"dice"             DiceDistance             NNEQ / (NTT + NNZ)
"kulsinski"        KulsinskiDistance        (NNEQ + N - NTT) / (NNEQ + N)
"rogerstanimoto"   RogersTanimotoDistance   2 * NNEQ / (N + NNEQ)
"russellrao"       RussellRaoDistance       NNZ / N
"sokalmichener"    SokalMichenerDistance    2 * NNEQ / (N + NNEQ)
"sokalsneath"      SokalSneathDistance      NNEQ / (NNEQ + 0.5 * NTT)
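To make the abbreviations concrete, the sketch below counts them for two small vectors and evaluates three of the listed formulas. The `boolean_counts` helper and the example vectors are illustrative assumptions.

```python
def boolean_counts(x, y):
    """Return (N, NTT, NTF, NFT, NFF) for two equal-length vectors; nonzero means True."""
    pairs = [(bool(a), bool(b)) for a, b in zip(x, y)]
    ntt = sum(a and b for a, b in pairs)
    ntf = sum(a and not b for a, b in pairs)
    nft = sum(not a and b for a, b in pairs)
    nff = sum(not a and not b for a, b in pairs)
    return len(pairs), ntt, ntf, nft, nff


x = [1, 1, 0, 0, 1]
y = [1, 0, 1, 0, 0]
n, ntt, ntf, nft, nff = boolean_counts(x, y)
nneq = ntf + nft        # dimensions where the vectors differ
nnz = ntf + nft + ntt   # dimensions where at least one value is True

print(nneq / nnz)          # jaccard:  3 / 4 = 0.75
print(nneq / n)            # matching: 3 / 5 = 0.6
print(nneq / (ntt + nnz))  # dice:     3 / 5 = 0.6
```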

User-defined distance:

identifier   class name       args
"pyfunc"     PyFuncDistance   func

Source: https://www.cnblogs.com/devwalks/p/14360138.html