Pandas Introduction 2
indexes
keep the indexes:
When a NumPy function is applied to a pandas object, the result keeps the original indexes.
First create a Series and a DataFrame:
    import pandas as pd
e to the x:
    np.exp(ser)
or a more complex one:
    np.sin(df * np.pi / 4)
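A minimal sketch of index preservation. The seed, shapes, and column names here are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)  # assumed seed, only for reproducibility
ser = pd.Series(rng.randint(0, 10, 4))
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])

exp_ser = np.exp(ser)            # still a Series, same index as ser
sin_df = np.sin(df * np.pi / 4)  # still a DataFrame, same index and columns

print(exp_ser.index.equals(ser.index))
print(sin_df.columns.equals(df.columns))
```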
match the index:
When two Series are combined, their values are matched by index label, not by position.
    area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
the union of their indexes:
    area.index.union(population.index)
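A small sketch of label alignment between two Series. The third area entry and the whole population Series are illustrative assumptions; the only point is that the labels overlap partly.

```python
import pandas as pd

# Illustrative values: the two Series share only some of their labels.
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

# Division aligns on labels; labels present on one side only give NaN.
density = population / area

# Explicit set union of the two indexes.
all_labels = area.index.union(population.index)

print(density)
print(all_labels)
```

Alaska and New York each appear in only one Series, so their density is NaN.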
If a label has no value on one side, the result shows NaN there:
    A = pd.Series([2, 4, 6], index=[0, 1, 2])
With add() and fill_value, the values that don't exist are treated as 0:
    A.add(B, fill_value=0)
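A short sketch comparing the plain operator with add() and fill_value. The Series B is an assumption, chosen so that its index only partly overlaps with A.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])  # assumed second operand

plain = A + B                     # labels 0 and 3 exist on one side only
filled = A.add(B, fill_value=0)   # the missing side is treated as 0

print(plain.tolist())   # [nan, 5.0, 9.0, nan]
print(filled.tolist())  # [2.0, 5.0, 9.0, 5.0]
```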
The same applies to DataFrame:
    A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
Use the mean of all values in A to fill NaN:
    fill = A.stack().mean()
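A sketch of the DataFrame case. The seed, the column labels, and the frame B are assumptions; the mean is taken with to_numpy() here, which gives the same number as A.stack().mean() for an all-numeric frame without NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)  # assumed seed
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))  # assumed

plain = A + B                 # aligned on both rows and columns, NaN elsewhere
fill = A.to_numpy().mean()    # mean of every value in A
result = A.add(B, fill_value=fill)

print(plain)
print(result)
```

Because B covers every row and column of the result, no NaN is left after filling.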
Python operator | Pandas method(s) |
---|---|
+ | add() |
- | sub(), subtract() |
* | mul(), multiply() |
/ | truediv(), div(), divide() |
// | floordiv() |
% | mod() |
** | pow() |
calculation
    A = rng.randint(10, size=(3, 4))
DataFrame is the same:
    df = pd.DataFrame(A, columns=list('QRST'))
To operate along columns instead of rows, set axis=0:
    df.subtract(df['R'], axis=0)
    halfrow = df.iloc[0, ::2]
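A sketch of operations between a DataFrame and a Series, assuming a fixed seed. By default the Series is matched against the column labels and applied to every row; axis=0 matches it against the row index instead.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)  # assumed seed
A = rng.randint(10, size=(3, 4))
df = pd.DataFrame(A, columns=list('QRST'))

row_wise = df - df.iloc[0]               # subtract the first row from each row
col_wise = df.subtract(df['R'], axis=0)  # subtract column R from each column

halfrow = df.iloc[0, ::2]                # only columns Q and S of the first row
partial = df - halfrow                   # R and T have no partner, so NaN

print(row_wise)
print(col_wise)
print(partial)
```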
deal with NaN
what is NaN
    vals1 = np.array([1, None, 3, 4])
    vals1.sum()
None can't be added to integers, so the sum raises a TypeError.
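A minimal sketch of what happens with None in a NumPy array: the array falls back to dtype object, and the aggregation fails when it reaches the None. The flag variable is only there for illustration.

```python
import numpy as np

vals1 = np.array([1, None, 3, 4])
print(vals1.dtype)  # object

raised = False
try:
    vals1.sum()      # ends up evaluating an int + None addition
except TypeError:
    raised = True

print(raised)  # True
```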
NaN: not a number
It is a floating-point value, so it can be used in arithmetic with integers:
    1 + np.nan
The result will always be nan.
This includes the sum, maximum and minimum:
    vals2 = np.array([1, np.nan, 3, 4])
Ignore the nan with the nan-aware functions:
    np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
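A short sketch contrasting the ordinary aggregates with their nan-aware versions on the same array.

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])
print(vals2.dtype)  # float64

# Ordinary aggregates propagate the NaN.
total = vals2.sum()
print(np.isnan(total))  # True

# nan-aware versions skip the missing entry.
print(np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2))  # 8.0 1.0 4.0
```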
In pandas, None and NaN are both treated as missing values:
    pd.Series([1, np.nan, 2, None])
If a NaN is put into an integer Series, the type will be changed to float:
    x = pd.Series(range(2), dtype=int)
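A sketch of the integer-to-float upcast. Here the missing value is introduced with reindex (an assumption made for this example, so that it does not depend on how item assignment handles dtype changes). The nullable 'Int64' dtype is shown as an alternative that keeps integers.

```python
import pandas as pd

x = pd.Series(range(2), dtype=int)
print(x.dtype.kind)  # 'i' (integer)

# Label 2 does not exist in x, so reindexing introduces a NaN
# and the values are upcast to float64.
y = x.reindex([0, 1, 2])
print(y.dtype)  # float64

# The nullable integer dtype stores the gap as pd.NA and stays integer.
nullable = pd.Series([1, None, 3], dtype='Int64')
print(nullable.dtype)  # Int64
```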
Type | Conversion when storing missing values | NA sentinel value |
---|---|---|
floating | no change | np.nan |
object | no change | None or np.nan |
integer | cast to float64 | np.nan |
boolean | cast to object | None or np.nan |
detect NaN / delete NaN
.isnull():
    data = pd.Series([1, np.nan, 'hello', None])
.notnull():
    data[data.notnull()]
.dropna(): (in a DataFrame it drops every row that contains a NaN)
    data.dropna()
Change the direction in which a DataFrame drops values:
    df[3] = np.nan
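A sketch of dropna on a DataFrame, using a small frame whose values are assumptions. It shows the default row-wise drop, the column-wise drop, and the how and thresh parameters.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])  # assumed example values

rows_clean = df.dropna()                # drop every row containing a NaN
cols_clean = df.dropna(axis='columns')  # drop every column containing a NaN

df[3] = np.nan                          # add a column that is entirely NaN
no_empty_cols = df.dropna(axis='columns', how='all')  # drop all-NaN columns only
at_least_3 = df.dropna(axis='rows', thresh=3)         # keep rows with >= 3 values

print(rows_clean)
print(cols_clean)
print(no_empty_cols)
print(at_least_3)
```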
fill the NaN
    data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
.fillna():
    data.fillna(0)
It’s similar in DataFrame:
    df
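A sketch of the filling options: a constant with fillna, and forward or backward propagation with ffill and bfill. The DataFrame values are assumptions; for a DataFrame the axis parameter picks the direction of propagation.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

zero = data.fillna(0)  # constant fill
fwd = data.ffill()     # carry the previous value forward
bwd = data.bfill()     # carry the next value backward

print(zero.tolist())  # [1.0, 0.0, 2.0, 0.0, 3.0]
print(fwd.tolist())   # [1.0, 1.0, 2.0, 2.0, 3.0]
print(bwd.tolist())   # [1.0, 2.0, 2.0, 3.0, 3.0]

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])  # assumed example values
filled_df = df.ffill(axis=1)         # propagate along each row
print(filled_df)
```

A NaN with no earlier value to copy, such as the first entry of the last row, stays NaN.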
multilevel index
reindex
    index = [('California', 2000), ('California', 2010),
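A sketch of building a multilevel index from tuples and reindexing with it. The list of states and all population figures are illustrative assumptions.

```python
import pandas as pd

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
# Illustrative population figures.
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)

# Build an explicit MultiIndex and reindex the Series with it.
multi = pd.MultiIndex.from_tuples(index)
pop = pop.reindex(multi)

print(pop)
print(pop.loc[:, 2010])  # all states for the year 2010
```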
unstack and stack
    pop_df = pop.unstack()
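A sketch of the round trip between a multi-indexed Series and a DataFrame. The data are the same illustrative population figures, rebuilt here so the example runs on its own.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)  # illustrative

pop_df = pop.unstack()  # inner level (year) becomes the columns
back = pop_df.stack()   # columns go back into the inner index level

print(pop_df)
print(back)
```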
create
    df = pd.DataFrame(np.random.rand(4, 2),
    data = {('California', 2000): 33871648,
special ways:
    pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
*When creating a Series or DataFrame, these objects can be passed as the index argument, or the reindex method can be used to update the index of an existing Series or DataFrame.
    # create names for indexes
multilevel index on both rows and columns
    index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
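A sketch of a DataFrame with a MultiIndex on both axes. The column levels (subject and measurement type), the seed, and the mock values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
# Assumed column hierarchy: subject, then measurement type.
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

rng = np.random.RandomState(0)  # assumed seed
data = np.round(rng.randn(4, 6), 1)
data[:, ::2] *= 10              # spread out the HR columns
data += 37                      # shift everything to a plausible range

health_data = pd.DataFrame(data, index=index, columns=columns)
guido = health_data['Guido']    # selecting a top-level column drops that level

print(health_data)
print(guido)
```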
get the value
    pop
get a single value:
    pop['California', 2000]
partial indexing:
    pop['California']
slice
    pop.loc['California':'New York']
mask
    pop[pop > 22000000]
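A sketch of the access patterns on a multi-indexed Series, using the same illustrative population figures (an assumption) so the results can be checked.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)  # illustrative

single = pop['California', 2000]              # one value
partial = pop['California']                   # Series indexed by year
sliced = pop.loc['California':'New York']     # label slice on the outer level
year_2000 = pop.loc[:, 2000]                  # partial indexing on the inner level
masked = pop[pop > 22000000]                  # Boolean mask

print(single)
print(partial)
print(sliced)
print(year_2000)
print(masked)
```

The label slice relies on the index being sorted, which is the case here because from_product was given sorted inputs.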
DataFrame
    health_data
loc and iloc
    health_data.iloc[:2, :2]  # implicit (positional) indexing
IndexSlice
    idx = pd.IndexSlice
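A sketch of selecting from a DataFrame with multilevel rows and columns. The frame is rebuilt with assumed labels and random values so the example is self-contained. pd.IndexSlice lets slices be written inside the tuples passed to loc.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])  # assumed labels
rng = np.random.RandomState(0)  # assumed seed
health_data = pd.DataFrame(rng.randn(4, 6), index=index, columns=columns)

top_left = health_data.iloc[:2, :2]          # positional indexing
bob_hr = health_data.loc[:, ('Bob', 'HR')]   # one column, addressed by a tuple

idx = pd.IndexSlice
first_visit_hr = health_data.loc[idx[:, 1], idx[:, 'HR']]  # visit 1, HR only

print(top_left)
print(bob_hr)
print(first_visit_hr)
```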
index/column exchange
    index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
sort_index()
    data = data.sort_index()
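A sketch of why sorting matters. Label slicing on a MultiIndex expects the index to be lexically sorted, so on the unsorted index the slice is expected to fail, and it works after sort_index(). The values and the flag variable are assumptions for illustration.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.arange(6.0), index=index)  # assumed values 0..5

failed = False
try:
    data.loc['a':'b']        # unsorted outer level: slicing is not supported
except KeyError:             # UnsortedIndexError is a KeyError subclass
    failed = True

data = data.sort_index()     # now ordered a, b, c
sliced = data.loc['a':'b']

print(failed)
print(sliced)
```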
stack and unstack
    pop.unstack(level=0)
set and reset
    pop_flat = pop.reset_index(name='population')
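A sketch of moving index levels into columns and back. The level names and the population figures are illustrative assumptions.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)  # illustrative

by_state = pop.unstack(level=0)                  # states become the columns
pop_flat = pop.reset_index(name='population')    # index levels become columns
restored = pop_flat.set_index(['state', 'year']) # columns become the index again

print(by_state)
print(pop_flat)
print(restored)
```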
aggregate
    health_data
mean:
    data_mean = health_data.groupby(level='year').mean()
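A sketch of aggregating by index level with groupby(level=...). The frame is rebuilt with assumed labels and random values. Averaging over a column level is done here by transposing, grouping, and transposing back, which is one possible approach among several.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])  # assumed labels
rng = np.random.RandomState(0)  # assumed seed
health_data = pd.DataFrame(rng.randn(4, 6), index=index, columns=columns)

# Mean over the visits within each year.
data_mean = health_data.groupby(level='year').mean()

# Mean over the subjects for each measurement type.
by_type = data_mean.T.groupby(level='type').mean().T

print(data_mean)
print(by_type)
```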