Using the category Data Type in Pandas
Source: jb51  Date: 2021/6/28 19:12:20

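Creating a category

Creating from a Series

Passing dtype="category" when creating a Series produces a category. A category has two parts: the set of categories (the distinct literal values) and a flag that says whether those categories are ordered:

    # The examples assume: import pandas as pd and import numpy as np
    In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

    In [2]: s
    Out[2]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']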
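
To make the two parts concrete, the following is a minimal sketch (variable names are illustrative, not from the original article). It reads the distinct labels through s.cat.categories and the per-row integer codes through s.cat.codes, which is how pandas stores the data internally:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The distinct labels, stored only once.
categories = list(s.cat.categories)

# One small integer per row, indexing into `categories`.
codes = s.cat.codes.tolist()

print(categories)  # ['a', 'b', 'c']
print(codes)       # [0, 1, 2, 0]
```

Because each row holds only a small integer, a column with few distinct values repeated many times usually takes less memory as a category than as plain strings.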

A Series inside a DataFrame can also be converted to a category:

    In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

    In [4]: df["B"] = df["A"].astype("category")

    In [5]: df["B"]
    Out[5]:
    0    a
    1    b
    2    c
    3    a
    Name: B, dtype: category
    Categories (3, object): ['a', 'b', 'c']

You can also build a pandas.Categorical first and pass it to Series. Values that do not appear in the given categories become NaN:

    In [10]: raw_cat = pd.Categorical(
       ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
       ....: )
       ....:

    In [11]: s = pd.Series(raw_cat)

    In [12]: s
    Out[12]:
    0    NaN
    1      b
    2      c
    3    NaN
    dtype: category
    Categories (3, object): ['b', 'c', 'd']
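
The NaN rows above can be inspected directly. This short sketch (names are illustrative) checks which rows are missing and shows that pandas marks them with the code -1:

```python
import pandas as pd

raw_cat = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"])
s = pd.Series(raw_cat)

# "a" is not one of the declared categories, so rows 0 and 3 are missing.
missing = s.isna().tolist()

# Missing values carry the code -1; "b" and "c" map to positions 0 and 1.
codes = s.cat.codes.tolist()

print(missing)  # [True, False, False, True]
print(codes)    # [-1, 0, 1, -1]
```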

Creating from a DataFrame

dtype="category" can also be passed when constructing a DataFrame:

    In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

    In [18]: df.dtypes
    Out[18]:
    A    category
    B    category
    dtype: object

Both A and B are now categorical columns, each with categories inferred from its own data:

    In [19]: df["A"]
    Out[19]:
    0    a
    1    b
    2    c
    3    a
    Name: A, dtype: category
    Categories (3, object): ['a', 'b', 'c']

    In [20]: df["B"]
    Out[20]:
    0    b
    1    c
    2    c
    3    d
    Name: B, dtype: category
    Categories (3, object): ['b', 'c', 'd']

Alternatively, df.astype("category") converts every column of an existing DataFrame to category:

    In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

    In [22]: df_cat = df.astype("category")

    In [23]: df_cat.dtypes
    Out[23]:
    A    category
    B    category
    dtype: object
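
When only some columns should become categorical, astype also accepts a mapping from column name to dtype. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Convert column A only; column B keeps its original dtype.
df_partial = df.astype({"A": "category"})

a_is_cat = isinstance(df_partial["A"].dtype, pd.CategoricalDtype)
b_is_cat = isinstance(df_partial["B"].dtype, pd.CategoricalDtype)

print(a_is_cat, b_is_cat)  # True False
```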

Controlling creation

By default, a category created with dtype='category' has the following behavior:

1. The categories are inferred from the data.

2. The categories are unordered.

Both defaults can be overridden by explicitly creating a CategoricalDtype:

    In [26]: from pandas.api.types import CategoricalDtype

    In [27]: s = pd.Series(["a", "b", "c", "a"])

    In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)

    In [29]: s_cat = s.astype(cat_type)

    In [30]: s_cat
    Out[30]:
    0    NaN
    1      b
    2      c
    3    NaN
    dtype: category
    Categories (3, object): ['b' < 'c' < 'd']

The same CategoricalDtype can be applied to a DataFrame, in which case every column receives the same categories:

    In [31]: from pandas.api.types import CategoricalDtype

    In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

    In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

    In [34]: df_cat = df.astype(cat_type)

    In [35]: df_cat["A"]
    Out[35]:
    0    a
    1    b
    2    c
    3    a
    Name: A, dtype: category
    Categories (4, object): ['a' < 'b' < 'c' < 'd']

    In [36]: df_cat["B"]
    Out[36]:
    0    b
    1    c
    2    c
    3    d
    Name: B, dtype: category
    Categories (4, object): ['a' < 'b' < 'c' < 'd']
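
The difference between the inferred and the explicit dtype can be checked side by side. The sketch below (variable names are illustrative) converts the same DataFrame both ways and compares the resulting column dtypes:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Inferred: each column gets categories from its own values.
inferred = df.astype("category")
inferred_a = list(inferred["A"].cat.categories)
inferred_b = list(inferred["B"].cat.categories)

# Explicit: one dtype object, so both columns share the same categories.
shared = df.astype(CategoricalDtype(categories=list("abcd"), ordered=True))
same_dtype = shared["A"].dtype == shared["B"].dtype

print(inferred_a, inferred_b)  # ['a', 'b', 'c'] ['b', 'c', 'd']
print(same_dtype)              # True
```

Sharing one dtype matters when columns need to be compared or concatenated later, since those operations expect matching categories.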

Converting back to the original type

Use Series.astype(original_dtype) or np.asarray(categorical) to convert a category back to its original type:

    In [39]: s = pd.Series(["a", "b", "c", "a"])

    In [40]: s
    Out[40]:
    0    a
    1    b
    2    c
    3    a
    dtype: object

    In [41]: s2 = s.astype("category")

    In [42]: s2
    Out[42]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

    In [43]: s2.astype(str)
    Out[43]:
    0    a
    1    b
    2    c
    3    a
    dtype: object

    In [44]: np.asarray(s2)
    Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)
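
As a small round-trip sketch (names are illustrative), the original dtype can be remembered and reused instead of being written out by hand, which keeps the conversion correct for non-string data as well:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "a"])
s2 = s.astype("category")

# Reuse the dtype of the original Series for the conversion back.
restored = s2.astype(s.dtype)

# np.asarray returns a plain NumPy array of the underlying values.
arr = np.asarray(s2)

print(restored.tolist())  # ['a', 'b', 'c', 'a']
print(arr.tolist())       # ['a', 'b', 'c', 'a']
```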

Working with categories

Reading category attributes

Categorical data has two attributes, categories and ordered. On a Series they are available as s.cat.categories and s.cat.ordered:

    In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

    In [58]: s.cat.categories
    Out[58]: Index(['a', 'b', 'c'], dtype='object')

    In [59]: s.cat.ordered
    Out[59]: False

The order of the categories can be chosen at creation time by passing categories explicitly:

    In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

    In [61]: s.cat.categories
    Out[61]: Index(['c', 'b', 'a'], dtype='object')

    In [62]: s.cat.ordered
    Out[62]: False

Renaming categories

Assigning to s.cat.categories renames the categories. Note that this in-place assignment reflects the pandas version used when the article was written; it was later deprecated and is no longer allowed from pandas 2.0 onward, where rename_categories should be used instead:

    In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

    In [68]: s
    Out[68]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

    In [69]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]

    In [70]: s
    Out[70]:
    0    Group a
    1    Group b
    2    Group c
    3    Group a
    dtype: category
    Categories (3, object): ['Group a', 'Group b', 'Group c']

rename_categories achieves the same result and returns a new Series:

    In [71]: s = s.cat.rename_categories([1, 2, 3])

    In [72]: s
    Out[72]:
    0    1
    1    2
    2    3
    3    1
    dtype: category
    Categories (3, int64): [1, 2, 3]

A dict-like object can also be used to map old names to new ones:

    # You can also pass a dict-like object to map the renaming
    In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})

    In [74]: s
    Out[74]:
    0    x
    1    y
    2    z
    3    x
    dtype: category
    Categories (3, object): ['x', 'y', 'z']
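
rename_categories also accepts a callable that is applied to each existing category. The sketch below (the lambda and variable names are illustrative) reproduces the "Group ..." renaming without assigning to s.cat.categories:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The callable receives each old label and returns the new one.
renamed = s.cat.rename_categories(lambda label: "Group " + label)

print(list(renamed.cat.categories))  # ['Group a', 'Group b', 'Group c']
print(list(s.cat.categories))        # ['a', 'b', 'c'] (the original is unchanged)
```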

Adding categories with add_categories

add_categories appends new categories to the existing ones:

    In [77]: s = s.cat.add_categories([4])

    In [78]: s.cat.categories
    Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

    In [79]: s
    Out[79]:
    0    x
    1    y
    2    z
    3    x
    dtype: category
    Categories (4, object): ['x', 'y', 'z', 4]
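
A common reason to add a category is to make room for a value that is not yet allowed, such as a placeholder for missing data. The following is a minimal sketch; the "unknown" label and the variable names are assumptions chosen for illustration:

```python
import pandas as pd

s = pd.Series(["x", None, "z"], dtype="category")

# "unknown" must exist as a category before it can be used as a fill value.
filled = s.cat.add_categories(["unknown"]).fillna("unknown")

print(filled.tolist())               # ['x', 'unknown', 'z']
print(list(filled.cat.categories))   # ['x', 'z', 'unknown']
```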

Removing categories with remove_categories

    In [80]: s = s.cat.remove_categories([4])

    In [81]: s
    Out[81]:
    0    x
    1    y
    2    z
    3    x
    dtype: category
    Categories (3, object): ['x', 'y', 'z']
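
In the example above the removed category was unused, so no values changed. When a removed category is in use, the affected rows become NaN, as this small sketch (illustrative names) shows:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Rows holding the removed category "a" turn into missing values.
trimmed = s.cat.remove_categories(["a"])

print(trimmed.isna().tolist())        # [True, False, False, True]
print(list(trimmed.cat.categories))   # ['b', 'c']
```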

Removing unused categories

    In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

    In [83]: s
    Out[83]:
    0    a
    1    b
    2    a
    dtype: category
    Categories (4, object): ['a', 'b', 'c', 'd']

    In [84]: s.cat.remove_unused_categories()
    Out[84]:
    0    a
    1    b
    2    a
    dtype: category
    Categories (2, object): ['a', 'b']
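
Unused categories typically appear after filtering, because a filter removes rows but leaves the dtype untouched. A minimal sketch with illustrative names:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Filtering drops the "c" row but keeps "c" as a declared category.
subset = s[s != "c"]
before = list(subset.cat.categories)

# remove_unused_categories drops categories that no longer occur.
after = list(subset.cat.remove_unused_categories().cat.categories)

print(before)  # ['a', 'b', 'c']
print(after)   # ['a', 'b']
```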

Resetting categories

set_categories() adds and removes categories in a single step. Values whose category is dropped become NaN:

    In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")

    In [86]: s
    Out[86]:
    0     one
    1     two
    2    four
    3       -
    dtype: category
    Categories (4, object): ['-', 'four', 'one', 'two']

    In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])

    In [88]: s
    Out[88]:
    0     one
    1     two
    2    four
    3     NaN
    dtype: category
    Categories (4, object): ['one', 'two', 'three', 'four']
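
To see what set_categories combines, the sketch below (illustrative names) builds the same result step by step with remove_categories, add_categories and reorder_categories, then compares the two:

```python
import pandas as pd

s = pd.Series(["one", "two", "four", "-"], dtype="category")
target = ["one", "two", "three", "four"]

# One call: drop "-", add "three", and put the categories in the target order.
direct = s.cat.set_categories(target)

# The same result as three separate steps.
stepwise = (
    s.cat.remove_categories(["-"])
    .cat.add_categories(["three"])
    .cat.reorder_categories(target)
)

print(direct.dtype == stepwise.dtype)                          # True
print(direct.cat.codes.tolist() == stepwise.cat.codes.tolist())  # True
```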

Sorting categories

If a category is created with ordered=True, the order of its categories is meaningful: values sort by that order, and min() and max() are available:

    In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))

    In [92]: s.sort_values(inplace=True)

    In [93]: s
    Out[93]:
    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a' < 'b' < 'c']

    In [94]: s.min(), s.max()
    Out[94]: ('a', 'c')
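
The benefit is clearer when the desired order differs from alphabetical order. The sketch below uses hypothetical "low", "medium" and "high" labels (not from the original article) to show sorting by the declared order:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declared order: low < medium < high (alphabetical order would be wrong here).
level_type = CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
s = pd.Series(["high", "low", "medium", "low"]).astype(level_type)

sorted_values = s.sort_values().tolist()
smallest, largest = s.min(), s.max()

print(sorted_values)      # ['low', 'low', 'medium', 'high']
print(smallest, largest)  # low high
```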

as_ordered() and as_unordered() return a copy with the ordered flag switched on or off:

    In [95]: s.cat.as_ordered()
    Out[95]:
    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a' < 'b' < 'c']

    In [96]: s.cat.as_unordered()
    Out[96]:
    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
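
The practical effect of the flag shows up with min() and max(): on unordered data they raise a TypeError, while the ordered copy answers normally. A minimal sketch with illustrative names:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")  # unordered by default

try:
    s.min()
    unordered_min_failed = False
except TypeError:
    # Unordered categories have no defined smallest value.
    unordered_min_failed = True

# Switching the flag on makes the existing category order meaningful.
ordered_min = s.cat.as_ordered().min()

print(unordered_min_failed)  # True
print(ordered_min)           # a
```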

Reordering

Categorical.reorder_categories(), reached through s.cat on a Series, reorders the existing categories. The new list must contain exactly the same categories as the old one:

    In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")

    In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

    In [105]: s
    Out[105]:
    0    1
    1    2
    2    3
    3    1
    dtype: category
    Categories (3, int64): [2 < 3 < 1]
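
The "same categories" requirement can be demonstrated with a short sketch (illustrative names): a complete list is accepted, while a list that leaves out a category is rejected with a ValueError:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# Same set of categories in a new order: accepted, values are unchanged.
reordered = s.cat.reorder_categories([3, 2, 1], ordered=True)

try:
    s.cat.reorder_categories([3, 2])  # category 1 is missing
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(list(reordered.cat.categories))  # [3, 2, 1]
print(mismatch_rejected)               # True
```

To add or drop categories while reordering, set_categories() is the appropriate method.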

Sorting by multiple columns

sort_values supports sorting on several columns. A categorical column is sorted by its category order rather than alphabetically:

    In [109]: dfs = pd.DataFrame(
       .....:     {
       .....:         "A": pd.Categorical(
       .....:             list("bbeebbaa"),
       .....:             categories=["e", "a", "b"],
       .....:             ordered=True,
       .....:         ),
       .....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
       .....:     }
       .....: )
       .....:

    In [110]: dfs.sort_values(by=["A", "B"])
    Out[110]:
       A  B
    2  e  1
    3  e  2
    7  a  1
    6  a  2
    0  b  1
    5  b  1
    1  b  2
    4  b  2

Comparison operations

If ordered=True was set at creation, categorical data can be compared. The supported operators are ==, !=, >, >=, <, and <= (== and != also work on unordered data).

    In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

    In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

    In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

    In [119]: cat > cat_base
    Out[119]:
    0     True
    1    False
    2    False
    dtype: bool

    In [120]: cat > 2
    Out[120]:
    0     True
    1    False
    2    False
    dtype: bool
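
cat_base2 is defined above but never used. In the pandas documentation this kind of object illustrates that two categoricals can only be compared when their categories match. The sketch below (flag and variable names are illustrative) shows the TypeError and one way to compare the underlying values instead:

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
# Categories are inferred here ([2] only), so they differ from those of `cat`.
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

try:
    cat > cat_base2
    rejected = False
except TypeError:
    rejected = True

# Comparing plain values ignores the category order and uses numeric order.
plain = (np.asarray(cat) > np.asarray(cat_base2)).tolist()

print(rejected)  # True
print(plain)     # [False, False, True]
```

Note that the plain comparison gives a different answer from cat > cat_base earlier, because the categorical order 3 < 2 < 1 is no longer applied.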

Other operations

Categorical data is still held in a Series, so most Series operations work on it, for example Series.min(), Series.max() and Series.mode().

value_counts:

    In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

    In [132]: s.value_counts()
    Out[132]:
    c    2
    a    1
    b    1
    d    0
    dtype: int64
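
As the output shows, value_counts reports every declared category, including unused ones with a count of 0. A short sketch (illustrative names) reads that zero count and filters it out when only observed values are wanted:

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
counts = s.value_counts()

# "d" is declared but never occurs, so its count is 0 rather than absent.
unused_count = int(counts["d"])

# Keep only the categories that actually appear in the data.
observed_only = counts[counts > 0]

print(unused_count)               # 0
print(sorted(observed_only.index))  # ['a', 'b', 'c']
```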

DataFrame.sum():

    In [133]: columns = pd.Categorical(
       .....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
       .....: )
       .....:

    In [134]: df = pd.DataFrame(
       .....:     data=[[1, 2, 3], [4, 5, 6]],
       .....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
       .....: )
       .....:

    # Note: the level argument of sum() was deprecated in pandas 1.3 and
    # removed in 2.0, so this call needs an older pandas release.
    In [135]: df.sum(axis=1, level=1)
    Out[135]:
       One  Two  Three
    0    3    3      0
    1    9    6      0
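
On pandas versions where sum() no longer accepts level, the same totals can be obtained by grouping on the column level. This is a sketch of one equivalent, not the article's own code; the variable name totals is illustrative, and observed=False is passed explicitly so that the unused category "Three" is kept:

```python
import pandas as pd

columns = pd.Categorical(
    ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
)
df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6]],
    columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
)

# Transpose so the column levels become the index, group on the
# categorical level, sum each group, then transpose back.
totals = df.T.groupby(level=1, observed=False).sum().T

print(totals)
```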

Groupby:

    In [136]: cats = pd.Categorical(
       .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
       .....: )
       .....:

    In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

    In [138]: df.groupby("cats").mean()
    Out[138]:
          values
    cats
    a        1.0
    b        2.0
    c        4.0
    d        NaN

    In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

    In [140]: df2 = pd.DataFrame(
       .....:     {
       .....:         "cats": cats2,
       .....:         "B": ["c", "d", "c", "d"],
       .....:         "values": [1, 2, 3, 4],
       .....:     }
       .....: )
       .....:

    In [141]: df2.groupby(["cats", "B"]).mean()
    Out[141]:
            values
    cats B
    a    c     1.0
         d     2.0
    b    c     3.0
         d     4.0
    c    c     NaN
         d     NaN
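
Whether unused categories such as "d" appear in a groupby result is controlled by the observed parameter. Since its default has changed across pandas versions, passing it explicitly is the safer choice. A minimal sketch with illustrative names:

```python
import pandas as pd

cats = pd.Categorical(
    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# observed=False: one group per declared category, NaN for the empty "d".
all_groups = df.groupby("cats", observed=False)["values"].mean()

# observed=True: only categories that actually occur in the data.
seen_groups = df.groupby("cats", observed=True)["values"].mean()

print(list(all_groups.index))   # ['a', 'b', 'c', 'd']
print(list(seen_groups.index))  # ['a', 'b', 'c']
```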

Pivot tables:

    In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

    In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

    In [144]: pd.pivot_table(df, values="values", index=["A", "B"])
    Out[144]:
         values
    A B
    a c       1
      d       2
    b c       3
      d       4
