ICode9

精准搜索请尝试: 精确搜索
首页 > 其他分享> 文章详细

机器学习sklearn(二十): 特征工程(十一)特征编码(五)类别特征编码(三)独热编码 OneHotEncoder

2021-06-19 19:35:22  阅读:273  来源: 互联网

标签:编码 enc 特征 OneHotEncoder drop feature uses array categories


另外一种将标称型特征转换为能够被scikit-learn中模型使用的编码是one-of-K, 又称为 独热码或dummy encoding。 这种编码类型已经在类OneHotEncoder中实现。该类把每一个具有n_categories个可能取值的categorical特征变换为长度为n_categories的二进制特征向量,里面只有一个地方是1,其余位置都是0。

继续我们上面的示例:

>>>
>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)  
OneHotEncoder(categorical_features=None, categories=None,
       dtype=<... 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)
>>> enc.transform([['female', 'from US', 'uses Safari'],
...                ['male', 'from Europe', 'uses Safari']]).toarray()
array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])

默认情况下,每个特征使用几维的数值可以从数据集自动推断。而且也可以在属性categories_中找到:

>>>
>>> enc.categories_
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]

可以使用参数categories_显式地指定这一点。我们的数据集中有两种性别、四种可能的大陆和四种web浏览器:

>>> genders = ['female', 'male']
>>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
>>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
>>> enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
>>> # Note that for there are missing categorical values for the 2nd and 3rd
>>> # feature
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categorical_features=None,
       categories=[...], drop=None,
       dtype=<... 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])

如果训练数据可能缺少分类特性,通常最好指定handle_unknown='ignore',而不是像上面那样手动设置类别。当指定handle_unknown='ignore',并且在转换过程中遇到未知类别时,不会产生错误,但是为该特性生成的一热编码列将全部为零(handle_unknown='ignore'只支持一热编码):

如果训练数据中可能含有缺失的标称型特征, 通过指定handle_unknown='ignore'比像上面代码那样手动设置categories更好。 当handle_unknown='ignore' 被指定并在变换过程中真的碰到了未知的 categories, 则不会抛出任何错误,但是由此产生的该特征的one-hot编码列将会全部变成 0 。(这个参数设置选项 handle_unknown='ignore' 仅仅在 one-hot encoding的时候有效):

>>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categorical_features=None, categories=None, drop=None,
       dtype=<... 'numpy.float64'>, handle_unknown='ignore',
       n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])

还可以使用drop参数将每个列编码为n_categories-1列,而不是n_categories列。此参数允许用户为要删除的每个特征指定类别。这对于避免某些分类器中输入矩阵的共线性是有用的。例如,当使用非正则化回归(线性回归)时,这种功能是有用的,因为共线性会导致协方差矩阵是不可逆的。当这个参数不是None时,handle_unknown必须设置为error:

>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
>>> drop_enc.transform(X).toarray()
array([[1., 1., 1.],
       [0., 0., 0.]])

标称型特征有时是用字典来表示的,而不是标量,具体请参阅从字典中加载特征

class sklearn.preprocessing.OneHotEncoder(*categories='auto'drop=Nonesparse=Truedtype=<class 'numpy.float64'>handle_unknown='error')

Encode categorical features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter)

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

Note: a one-hot encoding of y labels should use a LabelBinarizer instead.

Read more in the User Guide.

Parameters
categories‘auto’ or a list of array-like, default=’auto’

Categories (unique values) per feature:

  • ‘auto’ : Determine categories automatically from the training data.

  • list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

The used categories can be found in the categories_ attribute.

New in version 0.20.

drop{‘first’, ‘if_binary’} or a array-like of shape (n_features,), default=None

Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.

  • None : retain all features (the default).

  • ‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

  • ‘if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.

  • array : drop[i] is the category in feature X[:, i] that should be dropped.

New in version 0.21: The parameter drop was added in 0.21.

Changed in version 0.23: The option drop='if_binary' was added in 0.23.

sparsebool, default=True

Will return sparse matrix if set True else will return an array.

dtypenumber type, default=float

Desired dtype of output.

handle_unknown{‘error’, ‘ignore’}, default=’error’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

Attributes
categories_list of arrays

The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).

drop_idx_array of shape (n_features,)
  • drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature.

  • drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop='if_binary' and the feature isn’t binary.

  • drop_idx_ = None if all the transformed features will be retained.

Changed in version 0.23: Added the possibility to contain None values.

 

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder

One can discard categories not seen during fit:

>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'],
  dtype=object)

One can always drop the first column for each feature:

>>> drop_enc = OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 0., 0.],
       [1., 1., 0.]])

Or drop a column for feature only having 2 categories:

>>> drop_binary_enc = OneHotEncoder(drop='if_binary').fit(X)
>>> drop_binary_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 1., 0., 0.],
       [1., 0., 1., 0.]])

Methods

fit(X[, y])

Fit OneHotEncoder to X.

fit_transform(X[, y])

Fit OneHotEncoder to X, then transform X.

get_feature_names([input_features])

Return feature names for output features.

get_params([deep])

Get parameters for this estimator.

inverse_transform(X)

Convert the data back to the original representation.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform X using one-hot encoding.

 

标签:编码,enc,特征,OneHotEncoder,drop,feature,uses,array,categories
来源: https://www.cnblogs.com/qiu-hua/p/14904617.html

本站声明: 1. iCode9 技术分享网(下文简称本站)提供的所有内容,仅供技术学习、探讨和分享;
2. 关于本站的所有留言、评论、转载及引用,纯属内容发起人的个人观点,与本站观点和立场无关;
3. 关于本站的所有言论和文字,纯属内容发起人的个人观点,与本站观点和立场无关;
4. 本站文章均是网友提供,不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属;如您发现该文章侵犯了您的权益,可联系我们第一时间进行删除;
5. 本站为非盈利性的个人网站,所有内容不会用来进行牟利,也不会利用任何形式的广告来间接获益,纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

专注分享技术,共同学习,共同进步。侵权联系[81616952@qq.com]

Copyright (C)ICode9.com, All Rights Reserved.

ICode9版权所有