Kaggle - Categorical Feature Encoding Challenge
Category: Machine Learning | Created: 2019-09-11
Categorical Feature Encoding Challenge
This is a Kaggle playground competition (link here) built around a dataset made up entirely of categorical features. Working through it is a good way to learn how to handle categorical features, and how to store the data as sparse vectors when the dimensionality gets very high.
"""
数据分析
"""
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

train = pd.read_csv('../data/cat-in-the-dat/train.csv')
test = pd.read_csv('../data/cat-in-the-dat/test.csv')
train_copy, test_copy = train.copy(), test.copy()

Inspect the data
train.shape, test.shape

((300000, 25), (200000, 24))
train.iloc[:,:12].head()

|   | id | bin_0 | bin_1 | bin_2 | bin_3 | bin_4 | nom_0 | nom_1 | nom_2 | nom_3 | nom_4 | nom_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | T | Y | Green | Triangle | Snake | Finland | Bassoon | 50f116bcf |
| 1 | 1 | 0 | 1 | 0 | T | Y | Green | Trapezoid | Hamster | Russia | Piano | b3b4d25d0 |
| 2 | 2 | 0 | 0 | 0 | F | Y | Blue | Trapezoid | Lion | Russia | Theremin | 3263bdce5 |
| 3 | 3 | 0 | 1 | 0 | F | Y | Red | Trapezoid | Snake | Canada | Oboe | f12246592 |
| 4 | 4 | 0 | 0 | 0 | F | N | Red | Trapezoid | Lion | Canada | Oboe | 5b0f5acd5 |
train.iloc[:,12:].head()

|   | nom_6 | nom_7 | nom_8 | nom_9 | ord_0 | ord_1 | ord_2 | ord_3 | ord_4 | ord_5 | day | month | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3ac1b8814 | 68f6ad3e9 | c389000ab | 2f4cb3d51 | 2 | Grandmaster | Cold | h | D | kr | 2 | 2 | 0 |
| 1 | fbcb50fc1 | 3b6dd5612 | 4cd920251 | f83c56c21 | 1 | Grandmaster | Hot | a | A | bF | 7 | 8 | 0 |
| 2 | 0922e3cb8 | a6a36f527 | de9c9f684 | ae6800dd0 | 1 | Expert | Lava Hot | h | R | Jc | 7 | 2 | 0 |
| 3 | 50d7ad46a | ec69236eb | 4ade6ab69 | 8270f0d71 | 1 | Grandmaster | Boiling Hot | i | D | kW | 2 | 1 | 1 |
| 4 | 1fe17a1fd | 04ddac2be | cb43ab175 | b164b72a7 | 1 | Grandmaster | Freezing | a | R | qP | 7 | 8 | 0 |
train.columns.values

array(['id', 'bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4', 'nom_0',
       'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7',
       'nom_8', 'nom_9', 'ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4',
       'ord_5', 'day', 'month', 'target'], dtype=object)

All columns starting with bin_ are binary. Columns starting with nom_ are nominal (unordered) categories. Columns starting with ord_ are also categorical, but carry an inherent order; for example, ord_2 describes temperature, and temperatures can be ranked.
for i in range(10):
    count = train.loc[:, 'nom_{}'.format(i)].unique().shape[0]
    print("nom_{} has {} unique values".format(i, count))

nom_0 has 3 unique values
nom_1 has 6 unique values
nom_2 has 6 unique values
nom_3 has 6 unique values
nom_4 has 4 unique values
nom_5 has 222 unique values
nom_6 has 522 unique values
nom_7 has 1220 unique values
nom_8 has 2215 unique values
nom_9 has 11981 unique values
for i in range(6):
    count = train.loc[:, 'ord_{}'.format(i)].unique().shape[0]
    print("ord_{} has {} unique values".format(i, count))

ord_0 has 3 unique values
ord_1 has 5 unique values
ord_2 has 6 unique values
ord_3 has 15 unique values
ord_4 has 26 unique values
ord_5 has 192 unique values
print(train.info())
print("-" * 40)
print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 25 columns):
id        300000 non-null int64
bin_0     300000 non-null int64
bin_1     300000 non-null int64
bin_2     300000 non-null int64
bin_3     300000 non-null object
bin_4     300000 non-null object
nom_0     300000 non-null object
nom_1     300000 non-null object
nom_2     300000 non-null object
nom_3     300000 non-null object
nom_4     300000 non-null object
nom_5     300000 non-null object
nom_6     300000 non-null object
nom_7     300000 non-null object
nom_8     300000 non-null object
nom_9     300000 non-null object
ord_0     300000 non-null int64
ord_1     300000 non-null object
ord_2     300000 non-null object
ord_3     300000 non-null object
ord_4     300000 non-null object
ord_5     300000 non-null object
day       300000 non-null int64
month     300000 non-null int64
target    300000 non-null int64
dtypes: int64(8), object(17)
memory usage: 57.2+ MB
None
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 24 columns):
id        200000 non-null int64
bin_0     200000 non-null int64
bin_1     200000 non-null int64
bin_2     200000 non-null int64
bin_3     200000 non-null object
bin_4     200000 non-null object
nom_0     200000 non-null object
nom_1     200000 non-null object
nom_2     200000 non-null object
nom_3     200000 non-null object
nom_4     200000 non-null object
nom_5     200000 non-null object
nom_6     200000 non-null object
nom_7     200000 non-null object
nom_8     200000 non-null object
nom_9     200000 non-null object
ord_0     200000 non-null int64
ord_1     200000 non-null object
ord_2     200000 non-null object
ord_3     200000 non-null object
ord_4     200000 non-null object
ord_5     200000 non-null object
day       200000 non-null int64
month     200000 non-null int64
dtypes: int64(7), object(17)
memory usage: 36.6+ MB
None
From this analysis: all features are categorical and there are no missing values, so simply one-hot encoding everything should work. However, since the ord_ features have an inherent order, it is also worth trying to map them onto values in [0, 1]. Both approaches are tried below.
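The ordinal-mapping idea can be sketched on ord_2, the temperature column. The category order below comes from the data; scaling the ranks to [0, 1] is a minimal illustration of the second approach:

```python
import pandas as pd

# Temperature levels in increasing order (as they appear in the dataset)
levels = ['Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot']
# Map each level to its rank, rescaled into [0, 1]
codes = {level: i / (len(levels) - 1) for i, level in enumerate(levels)}

s = pd.Series(['Cold', 'Lava Hot', 'Freezing'])
print(s.map(codes).tolist())  # [0.2, 1.0, 0.0]
```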
Approach 1
One-hot encode every feature.
Data Preprocessing
One-hot encode all features; each sample then becomes a 16636-dimensional sparse vector. Since day and month together pin down a specific day of the year, a combined date feature can also be constructed from them.
train = train_copy.copy()
test = test_copy.copy()
target = train['target']
train = train.drop(['target'], axis=1)

dataset = pd.concat([train, test], ignore_index=True)
dataset = dataset.drop(['id'], axis=1)
# np.str is deprecated in newer NumPy; plain str behaves the same here
dataset['date'] = dataset['month'].astype(str) + '-' + dataset['day'].astype(str)

"""
Takes about a minute to run
"""
X = pd.get_dummies(dataset, columns=dataset.columns, sparse=True)

pandas.get_dummies with sparse=True returns a DataFrame backed by sparse columns; the to_coo method converts it to a SciPy sparse matrix (on newer pandas this lives on the accessor, X.sparse.to_coo()). To split the training and test sets back apart, the COO matrix first has to be converted to CSR format, since only CSR supports row slicing.

X = X.to_coo().tocsr()
X_train = X[:train.shape[0]]
X_test = X[train.shape[0]:]

X_train.shape, X_test.shape

((300000, 16636), (200000, 16636))
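The same sparse round trip can be checked on a small frame. This sketch assumes a recent pandas, where the conversion is exposed through the .sparse accessor:

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "x"], "b": ["p", "p", "q"]})
dummies = pd.get_dummies(df, columns=df.columns, sparse=True)

# COO does not support row slicing; convert to CSR first
X = dummies.sparse.to_coo().tocsr()
print(X.shape)      # (3, 4): a_x, a_y, b_p, b_q
print(X[:2].shape)  # (2, 4): row slicing works on CSR
```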
Training the Model
Although the dataset is large, training a logistic regression is still fast because the features are stored as sparse vectors.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

lr = LogisticRegression(solver='lbfgs', C=0.1)
cross_val_score(lr, X_train, target, cv=5, scoring='roc_auc', n_jobs=-1)

array([0.80145132, 0.80058547, 0.80766639, 0.80311128, 0.80372911])
Prediction
lr.fit(X_train, target)
y_proba = lr.predict_proba(X_test)
submission = pd.DataFrame({
    "id": test['id'],
    "target": y_proba[:, 1]  # column 1 is the positive-class probability
})
submission.to_csv("../data/cat-in-the-dat/submission.csv", index=False)

After submitting to Kaggle this scores 0.80780, ranking 56/347 on the leaderboard.
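One easy mistake here is the column order of predict_proba: its columns follow lr.classes_, so for a 0/1 target the positive-class probability is column 1, not column 0. A minimal check on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny separable dataset: small x -> class 0, large x -> class 1
X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])

lr = LogisticRegression().fit(X, y)
print(lr.classes_)  # [0 1] -- predict_proba columns follow this order

proba = lr.predict_proba([[1.0]])
print(proba[0, 1] > proba[0, 0])  # True: x=1.0 leans toward class 1
```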
Approach 2
One-hot encode everything except the ord_ features, and map the ord_ features to values in [0, 1].
train = train_copy.copy()
test = test_copy.copy()
target = train['target']
train = train.drop(['target'], axis=1)

dataset = pd.concat([train, test], ignore_index=True)
dataset['date'] = dataset['month'].astype(str) + '-' + dataset['day'].astype(str)

dataset.shape

(500000, 25)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler

def make_pipeline(estimator_list):
    return Pipeline([
        (estimator.__class__.__name__ + str(i), estimator)
        for i, estimator in enumerate(estimator_list)
    ])

full_pipeline = ColumnTransformer([
    ("nom_*", OneHotEncoder(), ['nom_' + str(i) for i in range(0, 10)]),
    ("day/month", OneHotEncoder(), ['day', 'month']),
    # ord_0 is numeric, and ord_3, ord_4 and the two halves of ord_5 are single
    # letters, so OrdinalEncoder's default sorted ordering already matches their order
    ("ord_0", make_pipeline([
        OrdinalEncoder(),
        MinMaxScaler()
    ]), ["ord_0", "ord_3", "ord_4", "ord_5", "ord_6"]),
    ("ord_1", make_pipeline([
        OrdinalEncoder(categories=[['Novice', 'Contributor', 'Expert', 'Master', 'Grandmaster']]),
        MinMaxScaler()
    ]), ["ord_1"]),
    ("ord_2", make_pipeline([
        OrdinalEncoder(categories=[['Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot']]),
        MinMaxScaler()
    ]), ["ord_2"]),
], remainder='passthrough')
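The OrdinalEncoder + MinMaxScaler pair maps an ordered category list onto evenly spaced values in [0, 1]. A small check using the ord_2 levels from above:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler

levels = ['Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot']
pipe = Pipeline([
    ("ordinal", OrdinalEncoder(categories=[levels])),
    ("scale", MinMaxScaler()),
])

# One column containing every level, in scrambled order
X = [[lv] for lv in ['Hot', 'Freezing', 'Lava Hot', 'Cold', 'Warm', 'Boiling Hot']]
vals = pipe.fit_transform(X).ravel()
print(vals)  # evenly spaced over [0, 1]: Freezing -> 0.0, ..., Lava Hot -> 1.0
```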
for df in [train, test]:
    df.drop(['id'], axis=1, inplace=True)
    df['bin_3'] = df['bin_3'].map({'F': 0, 'T': 1})
    df['bin_4'] = df['bin_4'].map({'N': 0, 'Y': 1})
    # ord_5 has two characters; split it into two single-character features
    df['ord_6'] = df['ord_5'].str[1]
    df['ord_5'] = df['ord_5'].str[0]

dataset = pd.concat([train, test], ignore_index=True)
full_pipeline.fit(dataset)
X_train = full_pipeline.transform(train)
X_test = full_pipeline.transform(test)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import train_test_split, cross_val_score
lr = LogisticRegression(solver='lbfgs', C=0.2)
cross_val_score(lr, X_train, target, cv=5, scoring='roc_auc', n_jobs=-1)

array([0.80179762, 0.80051735, 0.8070874 , 0.80310585, 0.80400994])
"""
预测
"""
lr.fit(X_train, target)
y_proba = lr.predict_proba(X_test)
submission = pd.DataFrame({
    "id": test_copy['id'],
    "target": y_proba[:, 1]  # column 1 is the positive-class probability
})
submission.to_csv("../data/cat-in-the-dat/submission_order.csv", index=False)