Kaggle Titanic: Data Modeling and Analysis
Source: cnblogs  Author: NeilZhang  Date: 2018/12/10 9:22:23

1. Data Visualization

Data description on Kaggle: https://www.kaggle.com/c/titanic/data

Data preview: [figure: first rows of the training table]

Read the data and print its summary:

```python
data_train = pd.read_csv("./data/train.csv")
data_train.info()  # info() prints its summary itself; no print() needed
```

The output:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
```

Column descriptions:

  1. PassengerId => passenger ID
  2. Survived => whether the passenger survived (present only in the training set, not the test set)
  3. Pclass => ticket class (1st/2nd/3rd)
  4. Name => passenger name
  5. Sex => sex
  6. Age => age
  7. SibSp => number of siblings/spouses aboard
  8. Parch => number of parents/children aboard
  9. Ticket => ticket number
  10. Fare => fare paid
  11. Cabin => cabin number
  12. Embarked => port of embarkation
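Before plotting, the gaps in the table above can be quantified directly. A minimal sketch, using a tiny hand-made frame in place of train.csv since only the pattern matters here:

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the Titanic training frame
data_train = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, np.nan],
    "Cabin":    [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Missing-value count per column; on the real data this reports
# Age (177 missing), Cabin (687) and Embarked (2) as incomplete
missing = data_train.isnull().sum()
print(missing)
```

On the real frame this is the same one-liner: `data_train.isnull().sum()`.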

 

1.1 Survived/Died Counts


```python
# Count survived / died passengers
def sur_die_analysis(data_train):
    fig = plt.figure()
    fig.set(alpha=0.2)  # set figure alpha
    data_train.Survived.value_counts().plot(kind='bar')  # bar chart
    plt.title(u"Survival counts (1 = survived)")
    plt.ylabel(u"Count")
    plt.show()
```

1.2 Pclass


```python
# Pclass
def pclass_analysis(data_train):
    fig = plt.figure()
    fig.set(alpha=0.2)  # set figure alpha
    sur_data = data_train.Pclass[data_train.Survived == 1].value_counts()
    die_data = data_train.Pclass[data_train.Survived == 0].value_counts()
    pd.DataFrame({'Survived': sur_data, 'Died': die_data}).plot(kind='bar')
    plt.ylabel(u"Count")
    plt.title(u"Passenger class distribution")
    plt.show()
```

The distribution makes it clear that passengers in Pclass 1 and 2 survived at a much higher rate than those in Pclass 3.
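The same conclusion can be read off numerically with a groupby: the mean of the 0/1 Survived column per class is exactly the per-class survival rate. A sketch on a small synthetic sample, not the real data:

```python
import pandas as pd

# Synthetic sample standing in for train.csv
data_train = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group == survival rate per class
rate = data_train.groupby("Pclass")["Survived"].mean()
print(rate)
```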

1.3 Sex


```python
# Sex
def sex_analysis(data_train):
    no_survived_g = data_train.Sex[data_train.Survived == 0].value_counts()
    survived_g = data_train.Sex[data_train.Survived == 1].value_counts()
    df_g = pd.DataFrame({'Survived': survived_g, 'Died': no_survived_g})
    df_g.plot(kind='bar', stacked=True)
    plt.title('Survival by sex')
    plt.xlabel('Sex')
    plt.ylabel('Count')
    plt.show()
```

Women survived at a markedly higher rate than men.

1.4 Age


```python
# Age: split ages into ten 10-year bands, counting survived / died in each
def age_analysis(data_train):
    data_series = pd.DataFrame(columns=['Survived', 'Died'])
    cloms = []
    for num in range(0, 10):
        clo = str(num * 10) + "-" + str((num + 1) * 10)
        cloms.append(clo)
        band = (data_train.Age >= 10 * num) & (data_train.Age < 10 * (num + 1))
        sur_df = data_train.Age[band & (data_train.Survived == 1)].shape[0]
        die_df = data_train.Age[band & (data_train.Survived == 0)].shape[0]
        data_series.loc[num] = [sur_df, die_df]
    data_series.index = cloms
    data_series.plot(kind='bar', stacked=True)
    plt.ylabel(u"Count")
    plt.grid(visible=True, which='major', axis='y')
    plt.title(u"Survival by age band")
    plt.show()
```

The younger age bands account for a visibly larger share of the survivors.
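The manual loop above can also be written with pandas' own binning, pd.cut, which builds the same ten-year bands in one call. A sketch on synthetic ages (note pd.cut's intervals are right-closed by default, so exact band edges fall slightly differently than the inequalities above):

```python
import pandas as pd

# Synthetic ages and outcomes standing in for train.csv
data_train = pd.DataFrame({
    "Age":      [4, 15, 25, 35, 62, 8],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Ten-year bands: (0, 10], (10, 20], ..., (90, 100]
data_train["AgeBand"] = pd.cut(data_train["Age"], bins=list(range(0, 101, 10)))

# Survived / died counts per band, as in age_analysis
print(pd.crosstab(data_train["AgeBand"], data_train["Survived"]))
```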

1.5 Family: SibSp + Parch

Define a Family feature as the number of family members aboard, discretized into three levels:

0: no family members

1: 1-4 members

2: more than 4 members


```python
# Family: SibSp + Parch, number of family members aboard
def family_analysis(data_train):
    data_train['Family'] = data_train['SibSp'] + data_train['Parch']
    # Discretize: 0 = none, 1 = 1-4 members, 2 = more than 4
    data_train.loc[(data_train.Family > 0) & (data_train.Family <= 4), 'Family'] = 1
    data_train.loc[data_train.Family > 4, 'Family'] = 2

    no_survived_g = data_train.Family[data_train.Survived == 0].value_counts()
    survived_g = data_train.Family[data_train.Survived == 1].value_counts()
    df_g = pd.DataFrame({'Survived': survived_g, 'Died': no_survived_g})
    df_g.plot(kind='bar', stacked=True)
    plt.title('Survival by family size')
    plt.xlabel('Level: 0 = none, 1 = 1-4, 2 = >4')
    plt.ylabel('Count')
    plt.show()
```

Because the distribution across levels is very imbalanced, to see how family size relates to the survival *rate* one can divide each level's counts by that level's total. Not elaborated here.
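The normalization just mentioned (divide each level's counts by its total, so groups of very different sizes become comparable fractions) can be sketched as follows; the counts are synthetic, not taken from the real data:

```python
import pandas as pd

# Synthetic counts: one row per Family level, columns = outcome
df_g = pd.DataFrame(
    {"Survived": [160, 170, 12], "Died": [374, 160, 15]},
    index=[0, 1, 2],
)

# Divide each row by its total; every level now sums to 1.0,
# so the 'Survived' column is that level's survival rate
rates = df_g.div(df_g.sum(axis=1), axis=0)
print(rates)
```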

1.6 Fare

Fare statistics: once the fare rises past a certain level, survivors come to outnumber those who died.

 

```python
# Fare
def fare_analysis(data_train):
    # Density view (uncomment to compare fare distributions):
    # data_train.Fare[data_train.Survived == 1].plot(kind='kde')
    # data_train.Fare[data_train.Survived == 0].plot(kind='kde')
    # data_train["Fare"].plot(kind='kde')
    # plt.legend(('survived', 'died', 'all'), loc='best')
    # plt.show()
    data_train['NewFare'] = data_train['Fare']
    data_train.loc[data_train.Fare < 50, 'NewFare'] = 0
    data_train.loc[(data_train.Fare >= 50) & (data_train.Fare < 100), 'NewFare'] = 1
    data_train.loc[(data_train.Fare >= 100) & (data_train.Fare < 150), 'NewFare'] = 2
    data_train.loc[(data_train.Fare >= 150) & (data_train.Fare < 200), 'NewFare'] = 3
    data_train.loc[data_train.Fare >= 200, 'NewFare'] = 4
    no_survived_g = data_train.NewFare[data_train.Survived == 0].value_counts()
    survived_g = data_train.NewFare[data_train.Survived == 1].value_counts()
    df_g = pd.DataFrame({'Survived': survived_g, 'Died': no_survived_g})
    df_g.plot(kind='bar', stacked=True)
    plt.title('Survival by fare band')
    plt.xlabel('Fare band')
    plt.ylabel('Count')
    plt.show()
```

It is clear that passengers in the higher fare bands survived at a much higher rate.

Refinement:

The five fare bands above were picked arbitrarily. How many groups actually fit the data best?

Clustering can answer that: search for the best number of groups, then map each fare to one of them:

```python
def fare_kmeans(data_train):
    for i in range(2, 10):
        clusters = KMeans(n_clusters=i)
        clusters.fit(data_train['Fare'].values.reshape(-1, 1))
        # inertia_ measures cluster tightness; larger means a worse fit
        print(str(i) + " " + str(clusters.inertia_))
```

Output:

```
2 846932.9762272763
3 399906.26606199215
4 195618.50643749788
5 104945.73652631264
6 52749.474696547695
7 35141.316334118805
8 26030.553497795216
9 19501.242236941747
```

The inertia necessarily keeps falling as the number of clusters grows; the drop levels off around 5 clusters (the "elbow"), so 5 is chosen here and every fare is mapped to one of these five groups.
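Since inertia can only decrease as k grows, a score that actually peaks at a good k is a useful cross-check. The silhouette coefficient (`sklearn.metrics.silhouette_score`, not used in the original code) is one such score; a hedged sketch on synthetic fares drawn from three well-separated bands:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic fares from three well-separated price bands
fares = np.concatenate([
    rng.uniform(0, 20, 200),
    rng.uniform(60, 80, 100),
    rng.uniform(180, 220, 50),
]).reshape(-1, 1)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(fares)
    scores[k] = silhouette_score(fares, labels)  # higher is better
    print(k, round(scores[k], 3))
```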

```python
# Cluster the fares; 5 clusters gave the best trade-off above
def fare_kmeans(data_train):
    clusters = KMeans(n_clusters=5)
    clusters.fit(data_train['Fare'].values.reshape(-1, 1))
    predict = clusters.predict(data_train['Fare'].values.reshape(-1, 1))
    print(predict)
    data_train['NewFare'] = predict
    print(data_train[['NewFare', 'Survived']].groupby(['NewFare'], as_index=False).mean())
    print(clusters.inertia_)
```

Survival rate per mapped fare level (clearly better separated than the arbitrary bands above):

```
   NewFare  Survived
0        0  0.319832
1        1  0.647059
2        2  0.606557
3        3  1.000000
4        4  0.757576
```
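One caveat with this table: KMeans label numbers carry no order, so NewFare level 1 is not necessarily a cheaper band than level 2, which is why the survival rates above are not monotone in the label. If an ordered fare level is preferred, the raw labels can be re-indexed by the sorted cluster centers; a sketch on synthetic fares:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
fares = rng.uniform(0, 300, 500).reshape(-1, 1)

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit(fares)
labels = clusters.predict(fares)

# Rank the cluster centers, then map each raw label to its rank:
# rank 0 = cheapest band, rank 4 = most expensive
order = np.argsort(clusters.cluster_centers_.ravel())
rank_of_label = np.argsort(order)
ordered = rank_of_label[labels]
```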

1.7 Embarked


```python
# Embarked: port of embarkation
def embarked_analysis(data_train):
    no_survived_g = data_train.Embarked[data_train.Survived == 0].value_counts()
    survived_g = data_train.Embarked[data_train.Survived == 1].value_counts()
    df_g = pd.DataFrame({'Survived': survived_g, 'Died': no_survived_g})
    df_g.plot(kind='bar', stacked=True)
    plt.title('Survival by port of embarkation')
    plt.xlabel('Embarked')
    plt.ylabel('Count')
    plt.show()
```

As for the port of embarkation, the three ports show no dramatic gap, though passengers embarking at C survived at a slightly higher rate than those at S and Q.

2. Data Preprocessing

As the summary at the top shows, several columns are partially missing: Age / Cabin / Embarked.

Missing values are handled here by simple imputation (the median, mean, or mode are all reasonable fills).

The fares are binned at the same time:

```python
def dataPreprocess(df):
    df.loc[df['Sex'] == 'male', 'Sex'] = 0
    df.loc[df['Sex'] == 'female', 'Sex'] = 1

    # Embarked has two missing values; fill them first
    df['Embarked'] = df['Embarked'].fillna('S')
    # Some ages are missing; fill with the median
    df['Age'] = df['Age'].fillna(df['Age'].median())

    df.loc[df['Embarked'] == 'S', 'Embarked'] = 0
    df.loc[df['Embarked'] == 'C', 'Embarked'] = 1
    df.loc[df['Embarked'] == 'Q', 'Embarked'] = 2

    df['FamilySize'] = df['SibSp'] + df['Parch']
    df['IsAlone'] = 0
    df.loc[df['FamilySize'] == 0, 'IsAlone'] = 1
    # drop() returns a copy: assign the result back, or nothing is removed
    df = df.drop(['FamilySize', 'Parch', 'SibSp'], axis=1)
    return fare_kmeans(df)

def fare_kmeans(data_train):
    clusters = KMeans(n_clusters=5)
    clusters.fit(data_train['Fare'].values.reshape(-1, 1))
    predict = clusters.predict(data_train['Fare'].values.reshape(-1, 1))
    data_train['NewFare'] = predict
    data_train = data_train.drop('Fare', axis=1)
    # print(data_train[['NewFare', 'Survived']].groupby(['NewFare'], as_index=False).mean())
    # print(clusters.inertia_)
    return data_train
```

Categorical features are encoded here with plain ordinal codes; one-hot encoding could be used instead, so that the distance between any two categories is equal.
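The one-hot alternative just mentioned is a one-liner with pd.get_dummies; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port instead of the ordinal codes 0/1/2,
# so no artificial ordering S < C < Q is imposed
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked")
df = pd.concat([df.drop("Embarked", axis=1), dummies], axis=1)
print(df.columns.tolist())
```

This would also make the commented-out EmbarkedS/EmbarkedC/EmbarkedQ predictor list in the linear-regression section usable.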

3. Feature Selection

The sections above gave an intuitive feel for how each feature relates to survival. sklearn also ships feature-scoring utilities that make each feature's importance easy to read off:

```python
predictors = ["Pclass", "Sex", "Age", "NewFare", "Embarked", "IsAlone"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(data_train[predictors], data_train["Survived"])

# Turn the raw p-values into scores (higher = more informative)
scores = -np.log10(selector.pvalues_)

# Plot the score of each candidate feature
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
```


The plot shows which of the six input features matter most.

4. Linear Regression Model

```python
def linearRegression(df):
    predictors = ['Pclass', 'Sex', 'Age', 'IsAlone', 'NewFare', 'Embarked']
    # predictors = ['Pclass', 'Sex', 'Age', 'IsAlone', 'NewFare', 'EmbarkedS', 'EmbarkedC', 'EmbarkedQ']

    alg = LinearRegression()
    X = df[predictors]
    Y = df['Survived']
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

    # Print the shapes of the train / test splits
    print(X_train.shape)
    print(Y_train.shape)
    print(X_test.shape)
    print(Y_test.shape)

    # Fit
    alg.fit(X_train, Y_train)

    print(alg.intercept_)
    print(alg.coef_)

    # Threshold the regression output at 0.5 to get a 0/1 prediction
    Y_predict = alg.predict(X_test)
    Y_predict[Y_predict >= 0.5] = 1
    Y_predict[Y_predict < 0.5] = 0
    acc = sum(Y_predict == Y_test) / len(Y_predict)
    return acc
```

Test-set prediction accuracy: 0.79.
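Thresholding a linear regression at 0.5 works, but logistic regression is the standard classifier for a 0/1 target and yields class probabilities directly. A sketch of the swap on synthetic features (not the article's data); the manual 0.5 cut disappears:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
# Synthetic 0/1 target driven mostly by the first feature
y = (X[:, 0] + 0.3 * rng.normal(size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
acc = clf.score(X_test, y_test)  # built-in accuracy; no manual threshold
print(acc)
```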

5. Random Forest Model

Train the model on the five most valuable features and validate its performance:

```python
def randomForest(data_train):
    # Pick only the five best features
    predictors = ["Pclass", "Sex", "NewFare", "Embarked", "IsAlone"]
    X_train, X_test, Y_train, Y_test = train_test_split(data_train[predictors], data_train['Survived'], test_size=0.2)
    alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
    alg.fit(X_train, Y_train)
    Y_predict = alg.predict(X_test)
    acc = sum(Y_predict == Y_test) / len(Y_predict)
    return acc
```

In testing, this model's accuracy is 0.811.

Preliminary analysis: Age is not among the five selected features; because a large share of its values are missing, Age may be limiting how much it can help prediction.
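An aside on comparing 0.79 with 0.811: a single train_test_split here scores on only about 180 held-out rows, so differences of a few points can be split luck. Cross-validation averages several splits and reports the spread; a sketch with cross_val_score on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic stand-in target

alg = RandomForestClassifier(random_state=1, n_estimators=50,
                             min_samples_split=8, min_samples_leaf=4)
# 5-fold CV: the mean (and std) is a steadier estimate than one split
scores = cross_val_score(alg, X, y, cv=5)
print(scores.mean(), scores.std())
```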

 

The code is on GitHub: https://github.com/lsfzlj/kaggle

Comments and corrections welcome.

References:

https://blog.csdn.net/han_xiaoyang/article/details/49797143

https://blog.csdn.net/CSDN_Black/article/details/80309542

https://www.kaggle.com/sinakhorami/titanic-best-working-classifier

 
