创建category
使用Series创建
在创建Series的同时添加dtype="category"就可以创建好category了。category分为两部分,一部分是order,一部分是字面量:
- In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
-
- In [2]: s
- Out[2]:
- 0 a
- 1 b
- 2 c
- 3 a
- dtype: category
- Categories (3, object): ['a', 'b', 'c']
可以将DF中的Series转换为category:
- In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
-
- In [4]: df["B"] = df["A"].astype("category")
-
- In [5]: df["B"]
- Out[32]:
- 0 a
- 1 b
- 2 c
- 3 a
- Name: B, dtype: category
- Categories (3, object): [a, b, c]
可以创建好一个pandas.Categorical
,将其作为参数传递给Series:
- In [10]: raw_cat = pd.Categorical(
- ....: ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
- ....: )
- ....:
-
- In [11]: s = pd.Series(raw_cat)
-
- In [12]: s
- Out[12]:
- 0 NaN
- 1 b
- 2 c
- 3 NaN
- dtype: category
- Categories (3, object): ['b', 'c', 'd']
使用DF创建
创建DataFrame的时候,也可以传入 dtype="category":
- In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")
-
- In [18]: df.dtypes
- Out[18]:
- A category
- B category
- dtype: object
DF中的A和B都是一个category:
- In [19]: df["A"]
- Out[19]:
- 0 a
- 1 b
- 2 c
- 3 a
- Name: A, dtype: category
- Categories (3, object): ['a', 'b', 'c']
-
- In [20]: df["B"]
- Out[20]:
- 0 b
- 1 c
- 2 c
- 3 d
- Name: B, dtype: category
- Categories (3, object): ['b', 'c', 'd']
或者使用df.astype("category")将DF中所有的Series转换为category:
- In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
-
- In [22]: df_cat = df.astype("category")
-
- In [23]: df_cat.dtypes
- Out[23]:
- A category
- B category
- dtype: object
创建控制
默认情况下传入dtype='category' 创建出来的category使用的是默认值:
1.Categories是从数据中推断出来的。
2.Categories是没有大小顺序的。
可以显示创建CategoricalDtype来修改上面的两个默认值:
- In [26]: from pandas.api.types import CategoricalDtype
-
- In [27]: s = pd.Series(["a", "b", "c", "a"])
-
- In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
-
- In [29]: s_cat = s.astype(cat_type)
-
- In [30]: s_cat
- Out[30]:
- 0 NaN
- 1 b
- 2 c
- 3 NaN
- dtype: category
- Categories (3, object): ['b' < 'c' < 'd']
同样的CategoricalDtype还可以用在DF中:
- In [31]: from pandas.api.types import CategoricalDtype
-
- In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
-
- In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
-
- In [34]: df_cat = df.astype(cat_type)
-
- In [35]: df_cat["A"]
- Out[35]:
- 0 a
- 1 b
- 2 c
- 3 a
- Name: A, dtype: category
- Categories (4, object): ['a' < 'b' < 'c' < 'd']
-
- In [36]: df_cat["B"]
- Out[36]:
- 0 b
- 1 c
- 2 c
- 3 d
- Name: B, dtype: category
- Categories (4, object): ['a' < 'b' < 'c' < 'd']
转换为原始类型
使用Series.astype(original_dtype)
或者 np.asarray(categorical)
可以将Category转换为原始类型:
- In [39]: s = pd.Series(["a", "b", "c", "a"])
-
- In [40]: s
- Out[40]:
- 0 a
- 1 b
- 2 c
- 3 a
- dtype: object
-
- In [41]: s2 = s.astype("category")
-
- In [42]: s2
- Out[42]:
- 0 a
- 1 b
- 2 c
- 3 a
- dtype: category
- Categories (3, object): ['a', 'b', 'c']
-
- In [43]: s2.astype(str)
- Out[43]:
- 0 a
- 1 b
- 2 c
- 3 a
- dtype: object
-
- In [44]: np.asarray(s2)
- Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)
categories的操作
获取category的属性
Categorical数据有 categories
和 ordered
两个属性。可以通过s.cat.categories
和 s.cat.ordered
来获取:
- In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
-
- In [58]: s.cat.categories
- Out[58]: Index(['a', 'b', 'c'], dtype='object')
-
- In [59]: s.cat.ordered
- Out[59]: False
重排category的顺序:
- In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))
-
- In [61]: s.cat.categories
- Out[61]: Index(['c', 'b', 'a'], dtype='object')
-
- In [62]: s.cat.ordered
- Out[62]: False
重命名categories
通过给s.cat.categories赋值可以重命名categories:
- In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
-
- In [68]: s
- Out[68]:
- 0 a
- 1 b
- 2 c
- 3 a
- dtype: category
- Categories (3, object): ['a', 'b', 'c']
-
- In [69]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]
-
- In [70]: s
- Out[70]:
- 0 Group a
- 1 Group b
- 2 Group c
- 3 Group a
- dtype: category
- Categories (3, object): ['Group a', 'Group b', 'Group c']
使用rename_categories可以达到同样的效果:
- In [71]: s = s.cat.rename_categories([1, 2, 3])
-
- In [72]: s
- Out[72]:
- 0 1
- 1 2
- 2 3
- 3 1
- dtype: category
- Categories (3, int64): [1, 2, 3]
或者使用字典对象:
- # You can also pass a dict-like object to map the renaming
- In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})
-
- In [74]: s
- Out[74]:
- 0 x
- 1 y
- 2 z
- 3 x
- dtype: category
- Categories (3, object): ['x', 'y', 'z']
使用add_categories添加category
可以使用add_categories来添加category:
- In [77]: s = s.cat.add_categories([4])
-
- In [78]: s.cat.categories
- Out[78]: Index(['x', 'y', 'z', 4], dtype='object')
-
- In [79]: s
- Out[79]:
- 0 x
- 1 y
- 2 z
- 3 x
- dtype: category
- Categories (4, object): ['x', 'y', 'z', 4]
使用remove_categories删除category
- In [80]: s = s.cat.remove_categories([4])
-
- In [81]: s
- Out[81]:
- 0 x
- 1 y
- 2 z
- 3 x
- dtype: category
- Categories (3, object): ['x', 'y', 'z']
删除未使用的cagtegory
- In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))
-
- In [83]: s
- Out[83]:
- 0 a
- 1 b
- 2 a
- dtype: category
- Categories (4, object): ['a', 'b', 'c', 'd']
-
- In [84]: s.cat.remove_unused_categories()
- Out[84]:
- 0 a
- 1 b
- 2 a
- dtype: category
- Categories (2, object): ['a', 'b']
重置cagtegory
使用set_categories()
可以同时进行添加和删除category操作:
- In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")
-
- In [86]: s
- Out[86]:
- 0 one
- 1 two
- 2 four
- 3 -
- dtype: category
- Categories (4, object): ['-', 'four', 'one', 'two']
-
- In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])
-
- In [88]: s
- Out[88]:
- 0 one
- 1 two
- 2 four
- 3 NaN
- dtype: category
- Categories (4, object): ['one', 'two', 'three', 'four']
category排序
如果category创建的时候带有 ordered=True , 那么可以对其进行排序操作:
- In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))
-
- In [92]: s.sort_values(inplace=True)
-
- In [93]: s
- Out[93]:
- 0 a
- 3 a
- 1 b
- 2 c
- dtype: category
- Categories (3, object): ['a' < 'b' < 'c']
-
- In [94]: s.min(), s.max()
- Out[94]: ('a', 'c')
可以使用 as_ordered() 或者 as_unordered() 来强制排序或者不排序:
- In [95]: s.cat.as_ordered()
- Out[95]:
- 0 a
- 3 a
- 1 b
- 2 c
- dtype: category
- Categories (3, object): ['a' < 'b' < 'c']
-
- In [96]: s.cat.as_unordered()
- Out[96]:
- 0 a
- 3 a
- 1 b
- 2 c
- dtype: category
- Categories (3, object): ['a', 'b', 'c']
重排序
使用Categorical.reorder_categories() 可以对现有的category进行重排序:
- In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")
-
- In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)
-
- In [105]: s
- Out[105]:
- 0 1
- 1 2
- 2 3
- 3 1
- dtype: category
- Categories (3, int64): [2 < 3 < 1]
多列排序
sort_values 支持多列进行排序:
- In [109]: dfs = pd.DataFrame(
- .....: {
- .....: "A": pd.Categorical(
- .....: list("bbeebbaa"),
- .....: categories=["e", "a", "b"],
- .....: ordered=True,
- .....: ),
- .....: "B": [1, 2, 1, 2, 2, 1, 2, 1],
- .....: }
- .....: )
- .....:
-
- In [110]: dfs.sort_values(by=["A", "B"])
- Out[110]:
- A B
- 2 e 1
- 3 e 2
- 7 a 1
- 6 a 2
- 0 b 1
- 5 b 1
- 1 b 2
- 4 b 2
比较操作
如果创建的时候设置了ordered==True ,那么category之间就可以进行比较操作。支持 ==
, !=
, >
, >=
, <
, 和 <=
这些操作符。
- In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
-
- In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))
-
- In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))
- In [119]: cat > cat_base
- Out[119]:
- 0 True
- 1 False
- 2 False
- dtype: bool
-
- In [120]: cat > 2
- Out[120]:
- 0 True
- 1 False
- 2 False
- dtype: bool
其他操作
Cagetory本质上来说还是一个Series,所以Series的操作category基本上都可以使用,比如: Series.min(), Series.max() 和 Series.mode()。
value_counts:
- In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
-
- In [132]: s.value_counts()
- Out[132]:
- c 2
- a 1
- b 1
- d 0
- dtype: int64
DataFrame.sum():
- In [133]: columns = pd.Categorical(
- .....: ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
- .....: )
- .....:
-
- In [134]: df = pd.DataFrame(
- .....: data=[[1, 2, 3], [4, 5, 6]],
- .....: columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
- .....: )
- .....:
-
- In [135]: df.sum(axis=1, level=1)
- Out[135]:
- One Two Three
- 0 3 3 0
- 1 9 6 0
Groupby:
- In [136]: cats = pd.Categorical(
- .....: ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
- .....: )
- .....:
-
- In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
-
- In [138]: df.groupby("cats").mean()
- Out[138]:
- values
- cats
- a 1.0
- b 2.0
- c 4.0
- d NaN
-
- In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
-
- In [140]: df2 = pd.DataFrame(
- .....: {
- .....: "cats": cats2,
- .....: "B": ["c", "d", "c", "d"],
- .....: "values": [1, 2, 3, 4],
- .....: }
- .....: )
- .....:
-
- In [141]: df2.groupby(["cats", "B"]).mean()
- Out[141]:
- values
- cats B
- a c 1.0
- d 2.0
- b c 3.0
- d 4.0
- c c NaN
- d NaN
Pivot tables:
- In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
-
- In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
-
- In [144]: pd.pivot_table(df, values="values", index=["A", "B"])
- Out[144]:
- values
- A B
- a c 1
- d 2
- b c 3
- d 4
到此这篇关于Pandas数据类型之category的用法的文章就介绍到这了,更多相关category的用法内容请搜索w3xue以前的文章或继续浏览下面的相关文章希望大家以后多多支持w3xue!