Kaggle - Categorical Feature Encoding Challenge
Category: Machine Learning | Created: 2019-09-11
Categorical Feature Encoding Challenge
This is a Kaggle playground competition (link here) built around a dataset made up entirely of categorical features. Working through it is a good way to learn how to handle categorical features, and how to store the data as sparse vectors when the dimensionality gets very high.
"""
数据分析
"""
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

train = pd.read_csv('../data/cat-in-the-dat/train.csv')
test = pd.read_csv('../data/cat-in-the-dat/test.csv')
train_copy, test_copy = train.copy(), test.copy()

Inspect the data
train.shape, test.shape

((300000, 25), (200000, 24))
train.iloc[:,:12].head()

|   | id | bin_0 | bin_1 | bin_2 | bin_3 | bin_4 | nom_0 | nom_1 | nom_2 | nom_3 | nom_4 | nom_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | T | Y | Green | Triangle | Snake | Finland | Bassoon | 50f116bcf |
| 1 | 1 | 0 | 1 | 0 | T | Y | Green | Trapezoid | Hamster | Russia | Piano | b3b4d25d0 |
| 2 | 2 | 0 | 0 | 0 | F | Y | Blue | Trapezoid | Lion | Russia | Theremin | 3263bdce5 |
| 3 | 3 | 0 | 1 | 0 | F | Y | Red | Trapezoid | Snake | Canada | Oboe | f12246592 |
| 4 | 4 | 0 | 0 | 0 | F | N | Red | Trapezoid | Lion | Canada | Oboe | 5b0f5acd5 |
train.iloc[:,12:].head()

|   | nom_6 | nom_7 | nom_8 | nom_9 | ord_0 | ord_1 | ord_2 | ord_3 | ord_4 | ord_5 | day | month | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3ac1b8814 | 68f6ad3e9 | c389000ab | 2f4cb3d51 | 2 | Grandmaster | Cold | h | D | kr | 2 | 2 | 0 |
| 1 | fbcb50fc1 | 3b6dd5612 | 4cd920251 | f83c56c21 | 1 | Grandmaster | Hot | a | A | bF | 7 | 8 | 0 |
| 2 | 0922e3cb8 | a6a36f527 | de9c9f684 | ae6800dd0 | 1 | Expert | Lava Hot | h | R | Jc | 7 | 2 | 0 |
| 3 | 50d7ad46a | ec69236eb | 4ade6ab69 | 8270f0d71 | 1 | Grandmaster | Boiling Hot | i | D | kW | 2 | 1 | 1 |
| 4 | 1fe17a1fd | 04ddac2be | cb43ab175 | b164b72a7 | 1 | Grandmaster | Freezing | a | R | qP | 7 | 8 | 0 |
train.columns.values

array(['id', 'bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4', 'nom_0',
       'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7',
       'nom_8', 'nom_9', 'ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4',
       'ord_5', 'day', 'month', 'target'], dtype=object)

All columns starting with bin_ are binary. Columns starting with nom_ are nominal (unordered) categories. Columns starting with ord_ are also categorical, but carry an inherent order; for example, ord_2 describes temperature, and temperatures can be ranked.
for i in range(10):
    count = train.loc[:, 'nom_{}'.format(i)].unique().shape[0]
    print("nom_{} has {} unique values".format(i, count))

nom_0 has 3 unique values
nom_1 has 6 unique values
nom_2 has 6 unique values
nom_3 has 6 unique values
nom_4 has 4 unique values
nom_5 has 222 unique values
nom_6 has 522 unique values
nom_7 has 1220 unique values
nom_8 has 2215 unique values
nom_9 has 11981 unique values
for i in range(6):
    count = train.loc[:, 'ord_{}'.format(i)].unique().shape[0]
    print("ord_{} has {} unique values".format(i, count))

ord_0 has 3 unique values
ord_1 has 5 unique values
ord_2 has 6 unique values
ord_3 has 15 unique values
ord_4 has 26 unique values
ord_5 has 192 unique values
print(train.info())
print("-" * 40)
print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 25 columns):
id        300000 non-null int64
bin_0     300000 non-null int64
bin_1     300000 non-null int64
bin_2     300000 non-null int64
bin_3     300000 non-null object
bin_4     300000 non-null object
nom_0     300000 non-null object
nom_1     300000 non-null object
nom_2     300000 non-null object
nom_3     300000 non-null object
nom_4     300000 non-null object
nom_5     300000 non-null object
nom_6     300000 non-null object
nom_7     300000 non-null object
nom_8     300000 non-null object
nom_9     300000 non-null object
ord_0     300000 non-null int64
ord_1     300000 non-null object
ord_2     300000 non-null object
ord_3     300000 non-null object
ord_4     300000 non-null object
ord_5     300000 non-null object
day       300000 non-null int64
month     300000 non-null int64
target    300000 non-null int64
dtypes: int64(8), object(17)
memory usage: 57.2+ MB
None
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 24 columns):
id        200000 non-null int64
bin_0     200000 non-null int64
bin_1     200000 non-null int64
bin_2     200000 non-null int64
bin_3     200000 non-null object
bin_4     200000 non-null object
nom_0     200000 non-null object
nom_1     200000 non-null object
nom_2     200000 non-null object
nom_3     200000 non-null object
nom_4     200000 non-null object
nom_5     200000 non-null object
nom_6     200000 non-null object
nom_7     200000 non-null object
nom_8     200000 non-null object
nom_9     200000 non-null object
ord_0     200000 non-null int64
ord_1     200000 non-null object
ord_2     200000 non-null object
ord_3     200000 non-null object
ord_4     200000 non-null object
ord_5     200000 non-null object
day       200000 non-null int64
month     200000 non-null int64
dtypes: int64(7), object(17)
memory usage: 36.6+ MB
None
From this analysis: all features are categorical and there are no missing values, so simply one-hot encoding everything should work. However, since the ord_ features have an inherent order, it is also worth trying to map them onto values in [0, 1]. Both approaches are tried below.
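The ordinal-mapping idea can be sketched on ord_2, the temperature column. The category order below comes from the data; scaling the ranks to [0, 1] is a minimal illustration of the second approach:

```python
import pandas as pd

# Temperature levels in increasing order (as they appear in the dataset)
levels = ['Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot']
# Map each level to its rank, rescaled into [0, 1]
codes = {level: i / (len(levels) - 1) for i, level in enumerate(levels)}

s = pd.Series(['Cold', 'Lava Hot', 'Freezing'])
print(s.map(codes).tolist())  # [0.2, 1.0, 0.0]
```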
Approach 1
One-hot encode every feature.
Data Preprocessing
One-hot encode all features; each sample then becomes a 16636-dimensional sparse vector. Since day and month together pin down a specific day of the year, a combined date feature can also be constructed from them.
train = train_copy.copy()
test = test_copy.copy()
target = train['target']
train = train.drop(['target'], axis=1)

dataset = pd.concat([train, test], ignore_index=True)
dataset = dataset.drop(['id'], axis=1)
# np.str is deprecated in newer NumPy; plain str behaves the same here
dataset['date'] = dataset['month'].astype(str) + '-' + dataset['day'].astype(str)

"""
Takes about a minute to run
"""
X = pd.get_dummies(dataset, columns=dataset.columns, sparse=True)

pandas.get_dummies with sparse=True returns a DataFrame backed by sparse columns; the to_coo method converts it to a SciPy sparse matrix (on newer pandas this lives on the accessor, X.sparse.to_coo()). To split the training and test sets back apart, the COO matrix first has to be converted to CSR format, since only CSR supports row slicing.

X = X.to_coo().tocsr()
X_train = X[:train.shape[0]]
X_test = X[train.shape[0]:]

X_train.shape, X_test.shape

((300000, 16636), (200000, 16636))
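The same sparse round trip can be checked on a small frame. This sketch assumes a recent pandas, where the conversion is exposed through the .sparse accessor:

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "x"], "b": ["p", "p", "q"]})
dummies = pd.get_dummies(df, columns=df.columns, sparse=True)

# COO does not support row slicing; convert to CSR first
X = dummies.sparse.to_coo().tocsr()
print(X.shape)      # (3, 4): a_x, a_y, b_p, b_q
print(X[:2].shape)  # (2, 4): row slicing works on CSR
```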
Training the Model
Although the dataset is large, training a logistic regression is still fast because the features are stored as sparse vectors.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

lr = LogisticRegression(solver='lbfgs', C=0.1)
cross_val_score(lr, X_train, target, cv=5, scoring='roc_auc', n_jobs=-1)

array([0.80145132, 0.80058547, 0.80766639, 0.80311128, 0.80372911])
Prediction
lr.fit(X_train, target)
y_proba = lr.predict_proba(X_test)
submission = pd.DataFrame({
    "id": test['id'],
    "target": y_proba[:, 1]  # column 1 is the positive-class probability
})
submission.to_csv("../data/cat-in-the-dat/submission.csv", index=False)

After submitting to Kaggle this scores 0.80780, ranking 56/347 on the leaderboard.
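One easy mistake here is the column order of predict_proba: its columns follow lr.classes_, so for a 0/1 target the positive-class probability is column 1, not column 0. A minimal check on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny separable dataset: small x -> class 0, large x -> class 1
X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])

lr = LogisticRegression().fit(X, y)
print(lr.classes_)  # [0 1] -- predict_proba columns follow this order

proba = lr.predict_proba([[1.0]])
print(proba[0, 1] > proba[0, 0])  # True: x=1.0 leans toward class 1
```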
Approach 2
One-hot encode everything except the ord_ features, and map the ord_ features to values in [0, 1].
train = train_copy.copy()
test = test_copy.copy()
target = train['target']
train = train.drop(['target'], axis=1)

dataset = pd.concat([train, test], ignore_index=True)
dataset['date'] = dataset['month'].astype(str) + '-' + dataset['day'].astype(str)

dataset.shape

(500000, 25)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler

def make_pipeline(estimator_list):
    return Pipeline([
        (estimator.__class__.__name__ + str(i), estimator)
        for i, estimator in enumerate(estimator_list)
    ])

full_pipeline = ColumnTransformer([
    ("nom_*", OneHotEncoder(), ['nom_' + str(i) for i in range(0, 10)]),
    ("day/month", OneHotEncoder(), ['day', 'month']),
    # ord_0 is numeric, and ord_3, ord_4 and the two halves of ord_5 are single
    # letters, so OrdinalEncoder's default sorted ordering already matches their order
    ("ord_0", make_pipeline([
        OrdinalEncoder(),
        MinMaxScaler()
    ]), ["ord_0", "ord_3", "ord_4", "ord_5", "ord_6"]),
    ("ord_1", make_pipeline([
        OrdinalEncoder(categories=[['Novice', 'Contributor', 'Expert', 'Master', 'Grandmaster']]),
        MinMaxScaler()
    ]), ["ord_1"]),
    ("ord_2", make_pipeline([
        OrdinalEncoder(categories=[['Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot']]),
        MinMaxScaler()
    ]), ["ord_2"]),
], remainder='passthrough')
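The OrdinalEncoder + MinMaxScaler pair maps an ordered category list onto evenly spaced values in [0, 1]. A small check using the ord_2 levels from above:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler

levels = ['Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot']
pipe = Pipeline([
    ("ordinal", OrdinalEncoder(categories=[levels])),
    ("scale", MinMaxScaler()),
])

# One column containing every level, in scrambled order
X = [[lv] for lv in ['Hot', 'Freezing', 'Lava Hot', 'Cold', 'Warm', 'Boiling Hot']]
vals = pipe.fit_transform(X).ravel()
print(vals)  # evenly spaced over [0, 1]: Freezing -> 0.0, ..., Lava Hot -> 1.0
```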
for df in [train, test]:
    df.drop(['id'], axis=1, inplace=True)
    df['bin_3'] = df['bin_3'].map({'F': 0, 'T': 1})
    df['bin_4'] = df['bin_4'].map({'N': 0, 'Y': 1})
    # ord_5 has two characters; split it into two single-character features
    df['ord_6'] = df['ord_5'].str[1]
    df['ord_5'] = df['ord_5'].str[0]

dataset = pd.concat([train, test], ignore_index=True)
full_pipeline.fit(dataset)
X_train = full_pipeline.transform(train)
X_test = full_pipeline.transform(test)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import train_test_split, cross_val_score
lr = LogisticRegression(solver='lbfgs', C=0.2)
cross_val_score(lr, X_train, target, cv=5, scoring='roc_auc', n_jobs=-1)

array([0.80179762, 0.80051735, 0.8070874 , 0.80310585, 0.80400994])
"""
预测
"""
lr.fit(X_train, target)
y_proba = lr.predict_proba(X_test)
submission = pd.DataFrame({
    "id": test_copy['id'],
    "target": y_proba[:, 1]  # column 1 is the positive-class probability
})
submission.to_csv("../data/cat-in-the-dat/submission_order.csv", index=False)