Quantitative and Qualitative Data
Source: cnblogs  Author: 无风听海  Date: 2023/11/20 8:56:36

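Quantitative data is numeric by nature: it measures the amount of something.
Qualitative data is categorical by nature: it describes the quality or kind of something.

All columns of the dataset are shown below; it contains both qualitative and quantitative columns: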

    import pandas as pd

    # Show every column and keep wide frames on a single line when printing
    pd.options.display.max_columns = None
    pd.set_option('display.expand_frame_repr', False)

    salary_ranges = pd.read_csv('./data/Salary_Ranges_by_Job_Classification.csv')
    print(salary_ranges.head())
    #    SetID  JobCode                Eff Date              SalEndDate  SalarySetID  SalPlan  Grade  Step  BiweeklyHighRate  BiweeklyLowRate  UnionCode  ExtendedStep  PayType
    # 0  COMMN      109  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM        COMMN      SFM      0     1             $0.00            $0.00        330             0        C
    # 1  COMMN      110  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM        COMMN      SFM      0     1            $15.00           $15.00        323             0        D
    # 2  COMMN      111  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM        COMMN      SFM      0     1            $25.00           $25.00        323             0        D
    # 3  COMMN      112  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM        COMMN      SFM      0     1            $50.00           $50.00        323             0        D
    # 4  COMMN      114  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM        COMMN      SFM      0     1           $100.00          $100.00        323             0        M

.info() reports each column's name and dtype, along with the number of non-null rows in that column:

    # info() prints its report itself and returns None, so no print() is needed
    salary_ranges.info()
    # <class 'pandas.core.frame.DataFrame'>
    # RangeIndex: 1356 entries, 0 to 1355
    # Data columns (total 13 columns):
    #  #   Column              Non-Null Count  Dtype
    # ---  ------              --------------  -----
    #  0   SetID               1356 non-null   object
    #  1   Job Code            1356 non-null   object
    #  2   Eff Date            1356 non-null   object
    #  3   Sal End Date        1356 non-null   object
    #  4   Salary SetID        1356 non-null   object
    #  5   Sal Plan            1356 non-null   object
    #  6   Grade               1356 non-null   object
    #  7   Step                1356 non-null   int64
    #  8   Biweekly High Rate  1356 non-null   object
    #  9   Biweekly Low Rate   1356 non-null   object
    #  10  Union Code          1356 non-null   int64
    #  11  Extended Step       1356 non-null   int64
    #  12  Pay Type            1356 non-null   object
    # dtypes: int64(3), object(10)
    # memory usage: 137.8+ KB
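
As a rough first pass, the dtypes reported by .info() can be used to split columns into "numeric" and "everything else". The sketch below uses a small made-up frame (the values are hypothetical; only the column names are borrowed from the dataset) and pandas' `select_dtypes`. The dtype is only a hint: a numeric column may still be qualitative, so the result needs a manual review.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration
toy = pd.DataFrame({
    "Grade": ["A", "B", "C"],
    "Step": [1, 2, 3],
    "Pay Type": ["C", "D", "M"],
})

# Columns stored with a numeric dtype: candidates for quantitative data
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# All remaining columns: candidates for qualitative data
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```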

The number of missing values in each column can also be counted directly:

    # isnull() marks missing cells as True; sum() counts them per column
    print(salary_ranges.isnull().sum())
    # SetID                 0
    # Job Code              0
    # Eff Date              0
    # Sal End Date          0
    # Salary SetID          0
    # Sal Plan              0
    # Grade                 0
    # Step                  0
    # Biweekly High Rate    0
    # Biweekly Low Rate     0
    # Union Code            0
    # Extended Step         0
    # Pay Type              0
    # dtype: int64
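
The salary data happens to have no missing values, so the counts above are all zero. To show what the same call reports when values are missing, here is a minimal sketch on a made-up frame (the data is hypothetical). It also uses `.mean()` on the boolean mask to get the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "Grade": ["A", None, "C", "D"],
    "Step": [1, 2, np.nan, 4],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()

# Fraction of missing values per column (mean of a True/False mask)
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```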

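The describe() method shows descriptive statistics for quantitative data. By default it summarizes only columns with a numeric dtype, so pandas treats just three columns as quantitative: Step, Union Code, and Extended Step. Setting Step and Extended Step aside, Union Code is clearly not quantitative. The column holds numbers, but those numbers do not measure an amount; each one is only the code that identifies a union.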

    print(salary_ranges.describe())
    #               Step   Union Code  Extended Step
    # count  1356.000000  1356.000000    1356.000000
    # mean      1.294985   392.676991       0.150442
    # std       1.045816   338.100562       1.006734
    # min       1.000000     1.000000       0.000000
    # 25%       1.000000    21.000000       0.000000
    # 50%       1.000000   351.000000       0.000000
    # 75%       1.000000   790.000000       0.000000
    # max       5.000000   990.000000      11.000000
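
One possible way to keep an identifier column such as Union Code out of the numeric summary is to store it as strings. The sketch below is an illustration on a made-up frame (the rows are hypothetical), not a step taken in the original walkthrough. Once the column is text, describe() on the frame skips it, and describe() on the column itself reports category-style statistics instead.

```python
import pandas as pd

# Hypothetical toy frame; Union Code holds identifiers, not amounts
toy = pd.DataFrame({
    "Step": [1, 1, 2, 5],
    "Union Code": [330, 323, 323, 790],
})

# Store the codes as text so they are no longer treated as numbers
toy["Union Code"] = toy["Union Code"].astype(str)

# For a frame with mixed dtypes, describe() summarizes numeric columns only
numeric_summary = toy.describe()

# For a text column, describe() reports count, unique, top and freq
code_summary = toy["Union Code"].describe()

print(numeric_summary)
print(code_summary)
```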

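The two features most worth looking at are one quantitative column, Biweekly High Rate (the highest biweekly pay), and one qualitative column, Grade (the job category):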

    # .copy() gives an independent frame, so later column assignments
    # do not trigger SettingWithCopyWarning
    salary_ranges = salary_ranges[['BiweeklyHighRate', 'Grade']].copy()
    print(salary_ranges.head())
    #   BiweeklyHighRate Grade
    # 0            $0.00     0
    # 1           $15.00     0
    # 2           $25.00     0
    # 3           $50.00     0
    # 4          $100.00     0

Check the dtypes of the two columns:

    salary_ranges.info()
    # <class 'pandas.core.frame.DataFrame'>
    # RangeIndex: 1356 entries, 0 to 1355
    # Data columns (total 2 columns):
    #  #   Column            Non-Null Count  Dtype
    # ---  ------            --------------  -----
    #  0   BiweeklyHighRate  1356 non-null   object
    #  1   Grade             1356 non-null   object
    # dtypes: object(2)
    # memory usage: 21.3+ KB

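Let's clean the data: remove the dollar sign in front of each pay value and make sure the data types are correct. Quantitative data is generally stored as integers or floats (preferably floats); qualitative data is generally stored as strings (Unicode str objects in Python 3).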

    # regex=False makes '$' a literal character rather than a regex anchor
    salary_ranges['BiweeklyHighRate'] = salary_ranges['BiweeklyHighRate'].str.replace('$', '', regex=False)
    print(salary_ranges.head())
    #   BiweeklyHighRate Grade
    # 0             0.00     0
    # 1            15.00     0
    # 2            25.00     0
    # 3            50.00     0
    # 4           100.00     0

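The dtype has not changed: removing the dollar sign leaves the values as strings, so the column is still object.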

    salary_ranges.info()
    # <class 'pandas.core.frame.DataFrame'>
    # RangeIndex: 1356 entries, 0 to 1355
    # Data columns (total 2 columns):
    #  #   Column            Non-Null Count  Dtype
    # ---  ------            --------------  -----
    #  0   BiweeklyHighRate  1356 non-null   object
    #  1   Grade             1356 non-null   object
    # dtypes: object(2)
    # memory usage: 21.3+ KB
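
The same observation can be checked in code rather than by reading the .info() report. This is a minimal sketch on a made-up Series (the values are hypothetical) using `pandas.api.types.is_numeric_dtype`: stripping the symbol leaves text, and only an explicit conversion produces numbers that support arithmetic.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical pay values stored as text
rates = pd.Series(["$0.00", "$15.00", "$25.00"])

# Removing the symbol changes the characters, not the dtype
stripped = rates.str.replace("$", "", regex=False)
still_text = not is_numeric_dtype(stripped)

# An explicit conversion is what makes the values numeric
as_float = stripped.astype(float)
total = as_float.sum()

print(still_text, total)
```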

Convert the BiweeklyHighRate column to floats and the Grade column to strings:

    salary_ranges['BiweeklyHighRate'] = salary_ranges['BiweeklyHighRate'].astype(float)
    salary_ranges['Grade'] = salary_ranges['Grade'].astype(str)
    salary_ranges.info()
    # <class 'pandas.core.frame.DataFrame'>
    # RangeIndex: 1356 entries, 0 to 1355
    # Data columns (total 2 columns):
    #  #   Column            Non-Null Count  Dtype
    # ---  ------            --------------  -----
    #  0   BiweeklyHighRate  1356 non-null   float64
    #  1   Grade             1356 non-null   object
    # dtypes: float64(1), object(1)
    # memory usage: 21.3+ KB
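
The two cleaning steps above (strip the symbol, then convert) can be wrapped in one small function. The helper below is a hypothetical sketch, not part of the original walkthrough: `parse_currency` is an invented name, and the handling of thousands separators and unparseable values is an assumption about what messier data might need. It uses `pd.to_numeric` with `errors="coerce"`, which turns values that cannot be parsed into NaN, whereas `astype(float)` would raise an error.

```python
import pandas as pd


def parse_currency(series: pd.Series) -> pd.Series:
    """Hypothetical helper: convert strings such as '$1,250.50' to floats."""
    cleaned = (
        series.astype(str)
        .str.replace("$", "", regex=False)   # drop the currency symbol
        .str.replace(",", "", regex=False)   # drop thousands separators
        .str.strip()                         # drop surrounding whitespace
    )
    # Values that still cannot be parsed become NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical input, including one value that is not a number
rates = pd.Series(["$0.00", "$15.00", "$1,250.50", "n/a"])
parsed = parse_currency(rates)
print(parsed)
```

Coercing bad values to NaN keeps the conversion from failing, but the resulting missing values should then be counted (for example with isnull().sum()) before the column is used.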

Original article: https://www.cnblogs.com/wufengtinghai/p/17842380.html
