After reading through several Kaggle solutions, I summarized the steps a machine learning project usually goes through:
1. Load the data and inspect its basic information
2. Data preprocessing, part 1 (data cleaning, outlier handling)
3. Feature engineering (mining new features from the existing data that are useful for the task)
4. Data processing, part 2 (joining and merging the processed tables, then splitting into training and test sets to form a dataset you can actually train on)
5. Model selection (trying out a few models and comparing their results)
6. Model comparison and blending (combining the better models with each other; a minimal blending sketch follows this list)
7. Model training and saving
8. Model loading and testing
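Step 6 (model comparison and blending) is not spelled out in the walkthrough below, so here is a minimal sketch of what blending two already-trained regressors can look like. The model names and the weight are purely illustrative assumptions, not part of the original code:

import numpy as np

def blend_predict(model_a, model_b, X, weight_a=0.5):
    # model_a / model_b are assumed to be already fitted regressors (for example
    # the XGBoost and LightGBM models trained later in this post).
    # weight_a can be tuned on a held-out validation set.
    pred_a = np.asarray(model_a.predict(X))
    pred_b = np.asarray(model_b.predict(X))
    # Weighted average of the two sets of predictions
    return weight_a * pred_a + (1 - weight_a) * pred_b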

Next, let me walk through these steps with a concrete example:

1 Project Introduction

The project lives at:

https://www.kaggle.com/c/competitive-data-science-predict-future-sales/data?select=test.csv

Dataset overview:
The page above hosts all the data files for the task.

[figure: list of data files on the competition's Data page]

Next to each file you can inspect its structure and basic information directly on the page; there is no need to open it in Excel, and you mostly can't anyway: the dataset is fairly large (about 100 MB in total), so Excel would almost certainly choke on it.

[figure: data preview on the competition page]
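If you still want a quick look at the raw file on your own machine without Excel, reading just the first few rows with pandas is enough. A minimal sketch, assuming the same Kaggle directory layout used in the code later in this post:

import pandas as pd

# Read only the first 5 rows instead of loading the full ~100 MB file
preview = pd.read_csv('./kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv', nrows=5)
print(preview)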

The descriptions of the files above are shown in the following figure:

[figure: file descriptions from the competition's Data page]

The meaning of each field in the six CSV files is also described below:

[figure: field descriptions for the six CSV files]

Approach:

  • The useful features are the monthly figures from past months. The raw data is daily, so the sales of every item in every shop first have to be summed by month.
  • The model can then use all previous months as features to predict the last month (see the toy sketch after this list).
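Before the real code, here is a toy illustration of both points on a hand-made DataFrame; the numbers are invented purely to show the groupby/pivot mechanics:

import pandas as pd

# Toy daily sales records for one shop and two items (made-up values)
toy = pd.DataFrame({
    'date': ['2013-01-02', '2013-01-15', '2013-02-03', '2013-02-20'],
    'shop_id': [0, 0, 0, 0],
    'item_id': [30, 30, 30, 31],
    'item_cnt_day': [2, 1, 4, 5],
})
toy['date'] = pd.to_datetime(toy['date']).dt.strftime('%Y-%m')  # keep only the year-month

# Sum the daily counts into monthly counts, then pivot the months into columns
monthly = (toy.groupby(['date', 'shop_id', 'item_id'])['item_cnt_day'].sum()
              .reset_index()
              .pivot_table(index=['shop_id', 'item_id'], columns='date',
                           values='item_cnt_day', fill_value=0))
print(monthly)

# The last month's column becomes the target y; the earlier months are the features X
y = monthly.iloc[:, -1]
X = monthly.iloc[:, :-1]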

Code walkthrough:
Load the data

import pandas as pd

df_train = pd.read_csv('./kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv')
df_shops = pd.read_csv('./kaggle/input/competitive-data-science-predict-future-sales/shops.csv')
df_items = pd.read_csv('./kaggle/input/competitive-data-science-predict-future-sales/items.csv')
df_item_categories = pd.read_csv('./kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv')
df_test = pd.read_csv('./kaggle/input/competitive-data-science-predict-future-sales/test.csv')

Inspect the basic information of the data

df_train.head()           # first rows of the training data
df_test.head()
df_train.info()           # column dtypes and non-null counts
df_train.describe()       # summary statistics
df_train.isnull().sum()   # missing values per column
df_test.isna().sum()

Data preprocessing

# Drop date_block_num and item_price from df_train; they are not used as features here
df_train.drop(['date_block_num','item_price'], axis=1, inplace=True)

# Parse the dd.mm.yyyy dates (day first) and keep only the year-month part
df_train['date'] = pd.to_datetime(df_train['date'], dayfirst=True)
df_train['date'] = df_train['date'].apply(lambda x: x.strftime('%Y-%m'))
df_train.head()

df = df_train.groupby(['date','shop_id','item_id']).sum()  # like a SQL GROUP BY
# A pivot table dynamically rearranges and aggregates data by category. Most people
# have used pivot tables in Excel and know how powerful they are; in pandas the same
# idea is called pivot_table.
# The index is shop_id first, then item_id within each shop; the columns are the months.
df = df.pivot_table(index=['shop_id','item_id'], columns='date', values='item_cnt_day', fill_value=0)
# Transposed preview of the first few rows (what print(df.head().T) below shows):
#                0     1     2     3     4
# shop_id        0     0     0     0     0
# item_id       30    31    32    33    35
# 2013-01        0     0     6     3     1
# 2013-02       31    11    10     3    14
# 2013-03        0     0     0     0     0
# 2013-04        0     0     0     0     0
# 2013-05        0     0     0     0     0
# For example, shop 0 sold 31 units of item 30 over February 2013.

# inplace=True makes reset_index modify df itself instead of returning a copy
df.reset_index(inplace=True)
print(df.head().T)

# Process the test set: left-join the monthly features onto each (shop_id, item_id) pair
df_test = pd.merge(df_test, df, on=['shop_id','item_id'], how='left')  # works like a SQL join
# Drop the ID column and the earliest month, so the test features cover 2013-02..2015-10,
# i.e. they are shifted one month forward relative to the training features
df_test.drop(['ID', '2013-01'], axis=1, inplace=True)
df_test = df_test.fillna(0)  # shop/item pairs never seen in training get 0 sales
print('Test data:', df_test.head().T)

Dataset split

# split into train and test sets
Y_train = df['2015-10'].values            # the last month is the target y
X_train = df.drop(['2015-10'], axis = 1)  # everything else is the feature matrix
X_test = df_test

print(X_train.shape, Y_train.shape)
print(X_test.shape)


# Hold out 20% of the training data for validation
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X_train, Y_train, test_size=0.20, random_state=1)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

Model selection and parameters

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

LR = LinearRegression()
LR.fit(x_train, y_train)

print('Train set mse:', mean_squared_error(y_train, LR.predict(x_train)))
print('Test set mse:', mean_squared_error(y_test, LR.predict(x_test)))
print('Test set score:', LR.score(x_test, y_test))

# %time
from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor(n_estimators=100)
RFR.fit(x_train, y_train)

print('Train set mse:', mean_squared_error(y_train, RFR.predict(x_train)))
print('Test set mse:', mean_squared_error(y_test, RFR.predict(x_test)))
print('Test set score:', RFR.score(x_test, y_test))

# %time
from xgboost import XGBRegressor

XGB = XGBRegressor(max_depth=16, n_estimators=200, seed=1)
XGB.fit(x_train, y_train)

print('Train set mse:', mean_squared_error(y_train, XGB.predict(x_train)))
print('Test set mse:', mean_squared_error(y_test, XGB.predict(x_test)))
print('Test set score:', XGB.score(x_test, y_test))

# %time
from lightgbm import LGBMRegressor

LGBM = LGBMRegressor(max_depth=16, n_estimators=200, seed=1)
LGBM.fit(x_train, y_train)

print('Train set mse:', mean_squared_error(y_train, LGBM.predict(x_train)))
print('Test set mse:', mean_squared_error(y_test, LGBM.predict(x_test)))
print('Test set score:', LGBM.score(x_test, y_test))

prediction = XGB.predict(X_test)

prediction = list(map(round, prediction))  # round to whole monthly counts
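One optional tweak before building the submission: the competition's data description states that the true target values are clipped into the [0, 20] range, so it is common to clip the predictions the same way. A small sketch, not part of the original code, reusing the XGB model and X_test from above:

import numpy as np

# Clip the raw predictions to the competition's [0, 20] target range, then round
prediction = np.clip(XGB.predict(X_test), 0, 20)
prediction = list(map(round, prediction))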

Submit the results

# sample_submission.csv provides the expected ID / item_cnt_month format
df_submission = pd.read_csv('./kaggle/input/competitive-data-science-predict-future-sales/sample_submission.csv')
df_submission['item_cnt_month'] = prediction
df_submission.to_csv('prediction.csv', index=False)
df_submission.head()

Here is the code of another, similar Kaggle kernel for the same competition:

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


# Import packages
import datetime as dt
import logging
import warnings
warnings.filterwarnings('ignore')
import os
import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import catboost
from catboost import CatBoostRegressor
from catboost import Pool, cv
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

### Read the data
items=pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/items.csv')
item_cat=pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv')
train_df=pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv')
shops=pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/shops.csv')
test_df=pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/test.csv')
submission=pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sample_submission.csv')


# Print the shape of each table
display(items.shape)
display(item_cat.shape)
display(train_df.shape)
display(shops.shape)
display(test_df.shape)
display(submission.shape)


# Print the first 5 rows of each table
print(items.head())
print(item_cat.head())
print(train_df.head())
print(shops.head())
print(test_df.head())
print(submission.head())


# Print summary statistics and look for anomalous values
train_df.describe()
#        date_block_num       shop_id       item_id    item_price  item_cnt_day
# count    2.935849e+06  2.935849e+06  2.935849e+06  2.935849e+06  2.935849e+06
# mean     1.456991e+01  3.300173e+01  1.019723e+04  8.908532e+02  1.242641e+00
# std      9.422988e+00  1.622697e+01  6.324297e+03  1.729800e+03  2.618834e+00
# min      0.000000e+00  0.000000e+00  0.000000e+00 -1.000000e+00 -2.200000e+01
# 25%      7.000000e+00  2.200000e+01  4.476000e+03  2.490000e+02  1.000000e+00
# 50%      1.400000e+01  3.100000e+01  9.343000e+03  3.990000e+02  1.000000e+00
# 75%      2.300000e+01  4.700000e+01  1.568400e+04  9.990000e+02  1.000000e+00
# max      3.300000e+01  5.900000e+01  2.216900e+04  3.079800e+05  2.169000e+03
train_df.isnull().sum()  # check for missing values
train_df.head()
## Drop columns we will not use
train_df.drop(['date_block_num','item_price'], axis=1, inplace=True)
### Convert the date column to datetime
# Note: the raw dates are in dd.mm.yyyy format. The first example above passes
# dayfirst=True; without it, pandas parses ambiguous dates month-first, which is
# why months such as 2015-11 and 2015-12 appear below even though the raw data
# ends in October 2015.
train_df['date'] = pd.to_datetime(train_df['date'])
train_df.head()
## Extract the month and year from the datetime column into new columns
train_df['month'] = train_df['date'].dt.month
train_df['year'] = train_df['date'].dt.year
#     date    shop_id    item_id    item_cnt_day    month    year
# 0    2013-02-01    59    22154    1.0    2    2013
# 1    2013-03-01    25    2552    1.0    3    2013
# 2    2013-05-01    25    2552    -1.0    5    2013
# 3    2013-06-01    25    2554    1.0    6    2013
# 4    2013-01-15    25    2555    1.0    1    2013
# ...    ...    ...    ...    ...    ...    ...
# 2935844    2015-10-10    25    7409    1.0    10    2015
# 2935845    2015-09-10    25    7460    1.0    9    2015
# 2935846    2015-10-14    25    7459    1.0    10    2015
# 2935847    2015-10-22    25    7440    1.0    10    2015
# 2935848    2015-03-10    25    7460    1.0    3    2015
# Collapse every date to its year-month so that the daily records can be summed per month
train_df['date'] = train_df['date'].apply(lambda x: x.strftime('%Y-%m'))
### Group by month -> shop -> item and aggregate
data = train_df.groupby(['date','shop_id','item_id']).sum()
# Pivot into the wide table the model expects: one row per (shop, item), one column per month
data = data.pivot_table(index=['shop_id','item_id'], columns='date', values='item_cnt_day', fill_value=0)
data.reset_index(inplace=True)
# date    shop_id    item_id    2013-01    2013-02    2013-03    2013-04    2013-05    2013-06    2013-07    2013-08    ...    2015-03    2015-04    2015-05    2015-06    2015-07    2015-08    2015-09    2015-10    2015-11    2015-12
# 0    0    30    0    31    0    0    0    0    0    0    ...    0    0    0    0    0    0    0    0    0    0
# 1    0    31    0    11    0    0    0    0    0    0    ...    0    0    0    0    0    0    0    0    0    0
# 2    0    32    4    6    2    0    1    1    0    0    ...    0    0    0    0    0    0    0    0    0    0
# 3    0    33    2    2    1    0    0    0    1    0    ...    0    0    0    0    0    0    0    0    0    0
# 4    0    35    2    6    0    1    0    2    0    2    ...    0    0    0    0    0    0    0    0    0    0
# ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
# 424119    59    22154    0    1    0    0    0    0    0    0    ...    0    0    0    0    0    0    0    0    0    0
# 424120    59    22155    0    0    0    0    0    0    0    0    ...    0    0    0    0    0    0    0    0    0    0
# 424121    59    22162    0    0    0    0    0    0    0    0    ...    2    1    1    0    1    0    1    0    1    0
# 424122    59    22164    0    0    0    0    0    0    0    0    ...    1    2    0    0    1    0    0    0    1    0
# 424123    59    22167    0    0    1    0    0    0    0    0    ...    0    0    0    0    0    0    0    0    0    0
display(test_df.shape)
# Left-join the monthly features onto the test set's (shop_id, item_id) pairs
test_df = pd.merge(test_df, data, on=['shop_id','item_id'], how='left')
display(test_df.shape)
test_df = test_df.fillna(0)  # pairs never seen in training get 0 sales


# Next, split the data into training and validation sets
# The last month is the label, everything else is the feature matrix
Y_train = data['2015-12'].values
X_train = data.drop(['2015-12'], axis = 1)
# Mirror the first example: drop the ID column and the earliest month so that the
# test features line up with the training features, shifted one month forward
X_test = test_df.drop(['ID', '2013-01'], axis=1)
print(X_train.shape, Y_train.shape)
print(X_test.shape)
x_train, x_test, y_train, y_test = train_test_split(X_train, Y_train, test_size=0.20, random_state=123)


# Print the train/validation shapes
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

## Prepare Pools for the training and validation sets
pool_train = Pool(x_train, y_train)
pool_val = Pool(x_test, y_test)

### Define a cv helper that runs CatBoost cross-validation to find the optimal number
### of iterations while keeping the other parameters fixed.
### Input: a params dict with default settings and a Pool with the training data.
def modelfit(params, poolX, useTrainCV=True, cv_folds=5, early_stopping_rounds=40):
    cvresult = None
    if useTrainCV:
        cvresult = cv(params=params, pool=poolX, nfold=cv_folds,
                      early_stopping_rounds=early_stopping_rounds, plot=True)
    return cvresult  # dataframe with one row per iteration up to the optimal iteration

## Prepare the cv parameters
params = {
    'loss_function': 'RMSE'
}

### Run cv to find the optimal number of trees to grow
n_est = modelfit(params, pool_train)

n_est.shape[0]  # best iteration = 664

### Fit the model with iterations=664
cboost1 = CatBoostRegressor(iterations=664, loss_function='RMSE', random_seed=123)
cboost1.fit(x_train, y_train)

# Predict on the training set
train_predictions = cboost1.predict(x_train)
# Print the model report
print("\nModel Report Train")
print("Root Mean Square Error : %.4g" % np.sqrt(metrics.mean_squared_error(y_train, train_predictions)))
print("R^2 Score (Train): %f" % metrics.r2_score(y_train, train_predictions))

# Predict on the validation set
test_predictions = cboost1.predict(x_test)
# Print the model report
print("\nModel Report Test")
print("Root Mean Square Error : %.4g" % np.sqrt(metrics.mean_squared_error(y_test, test_predictions)))
print("R^2 Score (Test): %f" % metrics.r2_score(y_test, test_predictions))

prediction = cboost1.predict(X_test)
prediction = list(map(round, prediction))
len(prediction)

submission  # preview the sample submission
submission['item_cnt_month'] = prediction
submission.to_csv('prediction.csv', index=False)
submission.tail()
submission.head(50)