19 | Decision Trees (Part 2): Predicting Titanic Passenger Survival
In this lesson I reimplement the example in Julia. For copyright reasons I won't paste the course content here,
but the data files are in the cloud drive linked from the post above.
Code
using DataFrames
import CSV
using MLJ
using Statistics  # needed for mean() in the cleaning step below
Data preparation
train_data = CSV.read("./Titanic_Data/train.csv", DataFrame)
# Kaggle's test.csv carries no Survived labels, so train.csv is loaded twice here;
# the "test" accuracy reported at the end is therefore training accuracy.
test_data = CSV.read("./Titanic_Data/train.csv", DataFrame)
Before cleaning the data, it is worth exploring it first.
Data exploration 1. Inspect the data (run this in a Jupyter notebook)
using TableView
showtable(train_data)
(output omitted)
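If TableView is not installed, plain DataFrames calls give a quick first look as well (a minimal alternative, not from the course):

```julia
first(train_data, 5)   # first five rows
describe(train_data)   # summary statistics for every column
```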
Data exploration 2. Check for missing values
describe(train_data, :nmissing)  # nmissing = number of missing values per column
12×2 DataFrame
│ Row │ variable │ nmissing │
│ │ Symbol │ Union… │
├─────┼─────────────┼──────────┤
│ 1 │ PassengerId │ │
│ 2 │ Survived │ │
│ 3 │ Pclass │ │
│ 4 │ Name │ │
│ 5 │ Sex │ │
│ 6 │ Age │ 177 │
│ 7 │ SibSp │ │
│ 8 │ Parch │ │
│ 9 │ Ticket │ │
│ 10 │ Fare │ │
│ 11 │ Cabin │ 687 │
│ 12 │ Embarked │ 2 │
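The same counts can be reproduced without `describe`, by counting `missing` values column by column (a small sketch):

```julia
for col in names(train_data)
    n = count(ismissing, train_data[!, col])
    n > 0 && println(col, " => ", n, " missing")
end
```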
Data exploration 3. Additional checks
This dataset is fairly simple. A messier dataset would also need checks for invalid and duplicate values; in short, the data has to make sense (honestly, I didn't dig further).
Data cleaning 1. Handle missing values
# drop Cabin: 687 of its values are missing
select!(train_data, Not(:Cabin))
select!(test_data, Not(:Cabin))
# fill missing Age values with the floored mean age
train_mean_value = Int(floor(mean(skipmissing(train_data[!,:Age]))))
train_data[!,:Age] = convert(Vector{Int},
floor.(replace(train_data[!,:Age], missing => train_mean_value)))
test_mean_value = Int(floor(mean(skipmissing(test_data[!,:Age]))))
test_data[!,:Age] = convert(Vector{Int},
floor.(replace(test_data[!,:Age], missing => test_mean_value)))
# fill missing Embarked values with "S", the most common port
train_data[!,:Embarked] = convert(Vector{String},
replace(train_data[!,:Embarked], missing => "S"))
test_data[!,:Embarked] = convert(Vector{String},
replace(test_data[!,:Embarked], missing => "S"))
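The Age and Embarked imputation above is duplicated for both frames; it could be factored into small helpers (a sketch of my own; `impute_age!` and `impute_embarked!` are names I chose, not from the course):

```julia
# fill missing :Age with the frame's own floored mean age
function impute_age!(df::DataFrame)
    m = Int(floor(mean(skipmissing(df[!, :Age]))))
    df[!, :Age] = convert(Vector{Int}, floor.(replace(df[!, :Age], missing => m)))
    return df
end

# fill missing :Embarked with a default port
function impute_embarked!(df::DataFrame, default::String = "S")
    df[!, :Embarked] = convert(Vector{String}, replace(df[!, :Embarked], missing => default))
    return df
end

foreach(impute_age!, (train_data, test_data))
foreach(impute_embarked!, (train_data, test_data))
```

Note that, like the original, this fills each frame with its own mean; computing the mean on the training set and reusing it for the test set would avoid leaking test-set information.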
Data cleaning 2. Coerce scientific types so that scitype(train_data) <: input_scitype(dtc) and scitype(labels) <: target_scitype(dtc)
# the :Survived column is the classification target
auto = autotype(train_data,(:string_to_multiclass, :discrete_to_continuous))
coerce!(coerce!(train_data, auto), :Sex => Continuous, :Embarked => Continuous, :Survived => Multiclass)
auto = autotype(test_data,(:string_to_multiclass, :discrete_to_continuous))
coerce!(coerce!(test_data, auto), :Sex => Continuous, :Embarked => Continuous, :Survived => Multiclass)
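To confirm the coercion did what we wanted, `schema` lists each column's element type and scientific type:

```julia
schema(train_data)  # the feature columns should now be Continuous, :Survived Multiclass
```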
Data cleaning 3. Feature selection
features = [:Pclass, :Sex, :Age, :SibSp, :Parch, :Fare, :Embarked]
train_features = select(train_data, features)
train_labels = train_data[!,:Survived]
test_features = select(test_data, features)
test_labels = test_data[!,:Survived]
Analysis 1. Load the model
@load DecisionTreeClassifier pkg=DecisionTree  # pkg disambiguates, as several packages provide this model
dtc = DecisionTreeClassifier()
mach = machine(dtc, train_features, train_labels)
Analysis 2. Train the model
fit!(mach)
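After fitting, the learned tree can be inspected. MLJ exposes the underlying DecisionTree.jl model through `fitted_params` and extra training output through `report` (the exact fields depend on the installed MLJ/DecisionTree versions):

```julia
fitted_params(mach)  # the raw fitted tree
report(mach)         # training metadata, e.g. the classes seen
```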
Analysis 3. Predict
predict_labels = predict_mode(mach, test_features)
Analysis 4. Evaluate the model
print(accuracy(predict_labels, test_labels))
# 0.9764309764309764 (training accuracy: test_data is the same file as train_data)
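Because test_data here is just the training file again, the printed number is training accuracy. A cross-validated estimate is more honest, and MLJ's `evaluate` can produce one (a sketch; the exact score will differ from run to run):

```julia
evaluate(dtc, train_features, train_labels,
         resampling = CV(nfolds = 5, shuffle = true),
         measure = accuracy)
```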