WangYu::Space

Study, think, create, and grow. Teach yourself and teach others.

Kaggle - Categorical Feature Encoding Challenge

Category: Machine Learning · Created: 2019-09-11 00:00:00


This is a Kaggle playground competition (the competition page is here) built around a dataset consisting entirely of categorical features. Working through it is a good way to learn how to handle categorical features, and how to store the data as sparse vectors when the dimensionality becomes very high.

"""
Data analysis
"""
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('../data/cat-in-the-dat/train.csv')
test = pd.read_csv('../data/cat-in-the-dat/test.csv')

train_copy, test_copy = train.copy(), test.copy()

Inspecting the data

train.shape, test.shape
((300000, 25), (200000, 24))
train.iloc[:,:12].head()
id bin_0 bin_1 bin_2 bin_3 bin_4 nom_0 nom_1 nom_2 nom_3 nom_4 nom_5
0 0 0 0 0 T Y Green Triangle Snake Finland Bassoon 50f116bcf
1 1 0 1 0 T Y Green Trapezoid Hamster Russia Piano b3b4d25d0
2 2 0 0 0 F Y Blue Trapezoid Lion Russia Theremin 3263bdce5
3 3 0 1 0 F Y Red Trapezoid Snake Canada Oboe f12246592
4 4 0 0 0 F N Red Trapezoid Lion Canada Oboe 5b0f5acd5
train.iloc[:,12:].head()
nom_6 nom_7 nom_8 nom_9 ord_0 ord_1 ord_2 ord_3 ord_4 ord_5 day month target
0 3ac1b8814 68f6ad3e9 c389000ab 2f4cb3d51 2 Grandmaster Cold h D kr 2 2 0
1 fbcb50fc1 3b6dd5612 4cd920251 f83c56c21 1 Grandmaster Hot a A bF 7 8 0
2 0922e3cb8 a6a36f527 de9c9f684 ae6800dd0 1 Expert Lava Hot h R Jc 7 2 0
3 50d7ad46a ec69236eb 4ade6ab69 8270f0d71 1 Grandmaster Boiling Hot i D kW 2 1 1
4 1fe17a1fd 04ddac2be cb43ab175 b164b72a7 1 Grandmaster Freezing a R qP 7 8 0
train.columns.values
array(['id', 'bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4', 'nom_0',
       'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7',
       'nom_8', 'nom_9', 'ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4',
       'ord_5', 'day', 'month', 'target'], dtype=object)

All columns prefixed bin_ are binary. Columns prefixed nom_ are nominal (unordered) categories. Columns prefixed ord_ are also categorical, but their values carry an order; for example, ord_2 describes temperature, and temperatures can be ranked from low to high.
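The ordering of a column like ord_2 can be made explicit with an ordered pandas Categorical; a minimal sketch on a hypothetical sample of its values:

```python
import pandas as pd

# A hypothetical sample of the ord_2 temperature column
ord_2 = pd.Categorical(
    ["Cold", "Hot", "Lava Hot", "Boiling Hot", "Freezing", "Warm"],
    categories=["Freezing", "Cold", "Warm", "Hot", "Boiling Hot", "Lava Hot"],
    ordered=True,
)

# .codes maps each level to its rank: Freezing -> 0, ..., Lava Hot -> 5
codes = ord_2.codes
print(list(codes))  # [1, 3, 5, 4, 0, 2]
```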

for i in range(10):
    count = train.loc[:, 'nom_{}'.format(i)].unique().shape[0]
    print("nom_{} has {} unique values".format(i, count))
nom_0 has 3 unique values
nom_1 has 6 unique values
nom_2 has 6 unique values
nom_3 has 6 unique values
nom_4 has 4 unique values
nom_5 has 222 unique values
nom_6 has 522 unique values
nom_7 has 1220 unique values
nom_8 has 2215 unique values
nom_9 has 11981 unique values
for i in range(6):
    count = train.loc[:, 'ord_{}'.format(i)].unique().shape[0]
    print("ord_{} has {} unique values".format(i, count))
ord_0 has 3 unique values
ord_1 has 5 unique values
ord_2 has 6 unique values
ord_3 has 15 unique values
ord_4 has 26 unique values
ord_5 has 192 unique values
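As a side note, the per-column cardinalities above can be computed in one call with DataFrame.nunique instead of the explicit loop; a sketch on a toy frame (hypothetical values):

```python
import pandas as pd

# A toy frame standing in for train (hypothetical values)
df = pd.DataFrame({
    "nom_0": ["Green", "Blue", "Red", "Green"],
    "nom_1": ["Circle", "Circle", "Square", "Star"],
    "ord_0": [1, 2, 3, 1],
})

# nunique() returns the number of distinct values per column,
# replacing the loop over unique().shape[0]
counts = df.filter(like="nom_").nunique()
print(counts)
```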
print(train.info())
print("-" * 40)
print(test.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 25 columns):
id        300000 non-null int64
bin_0     300000 non-null int64
bin_1     300000 non-null int64
bin_2     300000 non-null int64
bin_3     300000 non-null object
bin_4     300000 non-null object
nom_0     300000 non-null object
nom_1     300000 non-null object
nom_2     300000 non-null object
nom_3     300000 non-null object
nom_4     300000 non-null object
nom_5     300000 non-null object
nom_6     300000 non-null object
nom_7     300000 non-null object
nom_8     300000 non-null object
nom_9     300000 non-null object
ord_0     300000 non-null int64
ord_1     300000 non-null object
ord_2     300000 non-null object
ord_3     300000 non-null object
ord_4     300000 non-null object
ord_5     300000 non-null object
day       300000 non-null int64
month     300000 non-null int64
target    300000 non-null int64
dtypes: int64(8), object(17)
memory usage: 57.2+ MB
None
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 24 columns):
id       200000 non-null int64
bin_0    200000 non-null int64
bin_1    200000 non-null int64
bin_2    200000 non-null int64
bin_3    200000 non-null object
bin_4    200000 non-null object
nom_0    200000 non-null object
nom_1    200000 non-null object
nom_2    200000 non-null object
nom_3    200000 non-null object
nom_4    200000 non-null object
nom_5    200000 non-null object
nom_6    200000 non-null object
nom_7    200000 non-null object
nom_8    200000 non-null object
nom_9    200000 non-null object
ord_0    200000 non-null int64
ord_1    200000 non-null object
ord_2    200000 non-null object
ord_3    200000 non-null object
ord_4    200000 non-null object
ord_5    200000 non-null object
day      200000 non-null int64
month    200000 non-null int64
dtypes: int64(7), object(17)
memory usage: 36.6+ MB
None

The analysis shows that every feature is categorical and there are no missing values, so one-hot encoding everything should work. But since the ord_ features carry an order, it is also worth trying to map them into the [0, 1] range. Both approaches are tried below.

Approach 1

One-hot encode every feature.

Data preprocessing

One-hot encoding every feature turns each sample into a sparse vector of more than 16,000 dimensions (16,636 once the date feature below is added). Since day and month together identify a specific day of the year, a combined date feature can be constructed from them.
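The one-hot width is simply the total number of distinct values across the encoded columns, which is how the 16,636 dimensions arise; a toy illustration on hypothetical data:

```python
import pandas as pd

# Hypothetical miniature dataset: columns with 3 and 2 distinct values
df = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Red"],
    "flag":  ["T", "F", "T", "T"],
})

# get_dummies emits one column per distinct value, so the encoded
# width equals the sum of the per-column cardinalities (3 + 2 = 5)
width = int(df.nunique().sum())
encoded = pd.get_dummies(df, columns=list(df.columns))
print(encoded.shape[1])  # 5
```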

train = train_copy.copy()
test = test_copy.copy()

target = train['target']
train = train.drop(['target'], axis=1)
dataset = pd.concat([train, test], ignore_index=True)
dataset = dataset.drop(['id'], axis=1)
dataset['date'] = dataset['month'].astype(str) + '-' + dataset['day'].astype(str)
"""
Takes about a minute to run
"""
X = pd.get_dummies(dataset, columns=dataset.columns, sparse=True)

pandas.get_dummies with sparse=True returns a DataFrame, and the underlying sparse matrix can be extracted with to_coo (in recent pandas versions this lives on the sparse accessor: X.sparse.to_coo()). To split the result back into train and test sets, the COO matrix must first be converted to CSR, since only CSR supports row slicing.

X = X.to_coo().tocsr()
X_train = X[:train.shape[0]]
X_test = X[train.shape[0]:]
X_train.shape, X_test.shape
((300000, 16636), (200000, 16636))
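The reason for the CSR conversion shows up on a tiny example: COO stores unordered (row, col, value) triplets and cannot be sliced, while CSR keeps row pointers that make row slicing cheap (a sketch using scipy.sparse):

```python
import numpy as np
from scipy import sparse

# A small identity matrix stored in COO format
coo = sparse.coo_matrix(np.eye(4))

# COO keeps bare (row, col, value) triplets with no row index,
# so it does not support slicing at all
try:
    _ = coo[:2]
except TypeError:
    pass  # expected: COO is not subscriptable

# CSR stores row pointers, so row slicing is cheap -- which is
# exactly what the train/test split above needs
csr = coo.tocsr()
print(csr[:2].shape)  # (2, 4)
```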

Training the model

The dataset is large, but because it is stored as a sparse matrix, training a logistic regression is still fast.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

lr = LogisticRegression(solver='lbfgs', C=0.1)

cross_val_score(lr, X_train, target, cv=5, scoring='roc_auc', n_jobs=-1)
array([0.80145132, 0.80058547, 0.80766639, 0.80311128, 0.80372911])
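A quick summary of the folds (scores taken from the output above):

```python
import numpy as np

# The five cross-validation fold scores reported above
scores = np.array([0.80145132, 0.80058547, 0.80766639, 0.80311128, 0.80372911])
print(f"mean AUC = {scores.mean():.5f} +/- {scores.std():.5f}")  # mean AUC = 0.80331 +/- 0.00245
```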

Prediction

lr.fit(X_train, target)

y_proba = lr.predict_proba(X_test)

submission = pd.DataFrame({
    "id": test['id'],
    "target": y_proba[:, 0]
})

submission.to_csv("../data/cat-in-the-dat/submission.csv", index=False)

Submitting to Kaggle scored 0.80780, which placed 56/347 on the leaderboard.

Approach 2

One-hot encode everything except the ord_ features, which are instead mapped to values in [0, 1].
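The idea of mapping an ordered category into [0, 1] can be seen on a toy sample first (hypothetical values; the same OrdinalEncoder + MinMaxScaler combination used in the pipeline below):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# A hypothetical sample of ord_1; the category list fixes the intended order
levels = [["Novice", "Contributor", "Expert", "Master", "Grandmaster"]]
X = np.array([["Novice"], ["Expert"], ["Grandmaster"]])

pipe = Pipeline([
    ("ordinal", OrdinalEncoder(categories=levels)),  # strings -> ranks 0..4
    ("scale", MinMaxScaler()),                       # ranks -> [0, 1]
])
result = pipe.fit_transform(X).ravel()
print(result)  # [0.  0.5 1. ]
```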

train = train_copy.copy()
test = test_copy.copy()

target = train['target']
train = train.drop(['target'], axis=1)
dataset = pd.concat([train, test], ignore_index=True)
dataset['date'] = dataset['month'].astype(str) + '-' + dataset['day'].astype(str)
dataset.shape
(500000, 25)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler

def make_pipeline(estimator_list):
    return Pipeline([
        (estimator.__class__.__name__ + str(i), estimator)
        for i, estimator in enumerate(estimator_list)
    ])

full_pipeline = ColumnTransformer([
    ("nom_*", OneHotEncoder(), ['nom_' + str(i) for i in range(10)]),
    ("day/month", OneHotEncoder(), ['day', 'month']),
    # these ordinal columns happen to sort correctly in their natural order
    ("ord_lex", make_pipeline([
        OrdinalEncoder(),
        MinMaxScaler()
    ]), ["ord_0", "ord_3", "ord_4", "ord_5", "ord_6"]),
    ("ord_1", make_pipeline([
        OrdinalEncoder(categories=[['Novice','Contributor','Expert','Master','Grandmaster']]),
        MinMaxScaler()
    ]), ["ord_1"]),
    ("ord_2", make_pipeline([
        OrdinalEncoder(categories=[['Freezing','Cold','Warm', 'Hot', 'Boiling Hot', 'Lava Hot']]),
        MinMaxScaler()
    ]), ["ord_2"]),
], remainder='passthrough')

for df in [train, test]:
    df.drop(['id'], axis=1, inplace=True)
    # map the string-valued binary columns to 0/1
    df['bin_3'] = df['bin_3'].map({'F': 0, 'T': 1})
    df['bin_4'] = df['bin_4'].map({'N': 0, 'Y': 1})
    # ord_5 is a two-letter code: split it into two single-letter features
    df['ord_6'] = df['ord_5'].str[1]
    df['ord_5'] = df['ord_5'].str[0]

dataset = pd.concat([train, test], ignore_index=True)
    
full_pipeline.fit(dataset)

X_train = full_pipeline.transform(train)
X_test = full_pipeline.transform(test)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

lr = LogisticRegression(solver='lbfgs', C=0.2)

cross_val_score(lr, X_train, target, cv=5, scoring='roc_auc', n_jobs=-1)
array([0.80179762, 0.80051735, 0.8070874 , 0.80310585, 0.80400994])
"""
Prediction
"""

lr.fit(X_train, target)

y_proba = lr.predict_proba(X_test)

submission = pd.DataFrame({
    "id": test_copy['id'],
    "target": y_proba[:, 0]
})

submission.to_csv("../data/cat-in-the-dat/submission_order.csv", index=False)
Notebook
