pandas2

Pandas Introduction 2

indexes

keep the indexes:

when making difficult calculations,it will keep its indexes.

first create a series and a dataframe:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import pandas as pd 
import numpy as np

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4)) #select 4 numbers within 0 to 10
ser
'''
out:
0 6
1 3
2 7
3 4
dtype: int64
'''

df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
columns=['A', 'B', 'C', 'D'])
df
'''
out:
A B C D
0 6 9 2 6
1 7 4 3 7
2 7 2 5 4
'''

e to the x:

1
2
3
4
5
6
7
8
9
np.exp(ser) 
'''
Out:
0 403.428793
1 20.085537
2 1096.633158
3 54.598150
dtype: float64
'''

or more complex one:

1
2
3
4
5
6
7
8
np.sin(df * np.pi / 4) 
'''
Out[5]:
A B C D
0 -1.000000 7.071068e-01 1.000000 -1.000000e+00
1 -0.707107 1.224647e-16 0.707107 -7.071068e-01
2 -0.707107 1.000000e+00 -0.707107 1.224647e-16
'''

match the index:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 
'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127}, name='population')

#calculate the density:
population / area
'''
Out[7]:
Alaska NaN
California 90.413926
New York NaN
Texas 38.018740
dtype: float64
'''

their indexes:

1
2
area.index | population.index 
#Out: Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

if there is no value, it will show NaN

1
2
3
4
5
6
7
8
9
10
11
A = pd.Series([2, 4, 6], index=[0, 1, 2]) 
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
'''
Out:
0 NaN
1 5.0
2 9.0
3 NaN
dtype: float64
'''

using this, we can fill the values that don’t exist with 0

1
2
3
4
5
6
7
8
9
A.add(B, fill_value=0) 
'''
Out:
0 2.0
1 5.0
2 9.0
3 5.0
dtype: float64
'''

when comes to DataFrame:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), 
columns=list('AB'))
A
'''
Out:
A B
0 1 11
1 5 1
'''

B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
B
'''
Out:
B A C
0 4 0 9
1 5 8 0
2 9 2 6
'''

A + B
'''
Out:
A B C
0 1.0 15.0 NaN
1 13.0 6.0 NaN
2 NaN NaN NaN
'''

use the mean to fill NaN

1
2
3
4
5
6
7
8
9
fill = A.stack().mean() 
A.add(B, fill_value=fill)
'''
Out:
A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5
'''
Python运算符 Pandas方法
+ add()
- sub()、subtract()
* mul()、multiply()
/ truediv()、div()、divide()
// floordiv()
% mod()
** pow()

calculation

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
A = rng.randint(10, size=(3, 4)) 
A
'''
Out:
array([[3, 8, 2, 4],
[2, 6, 4, 8],
[6, 1, 3, 8]])
'''

A - A[0]
'''
Out:
array([[ 0, 0, 0, 0],
[-1, -2, 2, 4],
[ 3, -7, 1, 4]])
'''

DataFrame id the same:

1
2
3
4
5
6
7
8
9
df = pd.DataFrame(A, columns=list('QRST')) 
df - df.iloc[0]
'''
Out:
Q R S T
0 0 0 0 0
1 -1 -2 2 4
2 3 -7 1 4
'''

change to columns:

1
2
3
4
5
6
7
8
df.subtract(df['R'], axis=0) 
'''
Out:
Q R S T
0 -5 0 -6 -4
1 -4 0 -2 2
2 5 0 2 7
'''
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
halfrow = df.iloc[0, ::2] 
halfrow
'''
Out:
Q 3
S 2
Name: 0, dtype: int64
'''

df - halfrow
'''
Out:
Q R S T
0 0.0 NaN 0.0 NaN
1 -1.0 NaN 2.0 NaN
2 3.0 NaN 1.0 NaN
'''

deal with NaN

what is NaN

1
2
3
vals1 = np.array([1, None, 3, 4]) 
vals1
#Out: array([1, None, 3, 4], dtype=object)
1
2
3
4
5
6
7
8
9
10
11
12
13
vals1.sum() 
'''
TypeError Traceback (most recent call last)
<ipython-input-4-749fd8ae6030> in <module>()
----> 1 vals1.sum()
/Users/jakevdp/anaconda/lib/python3.5/site-packages/numpy/core/_methods.py ...
30
31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
---> 32 return umr_sum(a, axis, dtype, out, keepdims)
33
34 def _prod(a, axis=None, dtype=None, out=None, keepdims=False):
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
'''

None can’t be add to integers

NaN: not a number

It can be calculated with integer:

1
2
3
4
5
1 + np.nan 
#Out: nan

0 * np.nan
#Out: nan

The result will always be nan.

including the maximum and minimum:

1
2
3
vals2 = np.array([1, np.nan, 3, 4])
vals2.sum(), vals2.min(), vals2.max()
#Out: (nan, nan, nan)

delete the nan:

1
2
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2) 
#Out: (8.0, 1.0, 4.0)

In pandas, None=NaN:

1
2
3
4
5
6
7
8
9
pd.Series([1, np.nan, 2, None]) 
'''
Out:
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
'''

If there is a NaN the type will be changed to float:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
x = pd.Series(range(2), dtype=int) 
x
'''
Out:
0 0
1 1
dtype: int64
'''
x[0] = None
x
'''
Out:
0 NaN
1 1.0
dtype: float64
'''
类型 缺失值转换规则 NA标签值
floating 浮点型 无变化 np.nan
object 对象类型 无变化 None 或 np.nan
integer 整数类型 强制转换为 float64 np.nan
boolean 布尔类型 强制转换为 object None 或 np.nan

whether there is NaN/delete NaN

.isnull():

1
2
3
4
5
6
7
8
9
10
data = pd.Series([1, np.nan, 'hello', None]) 
data.isnull()
'''
Out:
0 False
1 True
2 False
3 True
dtype: bool
'''

.notnull():

1
2
3
4
5
6
7
data[data.notnull()] 
'''
Out:
0 1
2 hello
dtype: object
'''

.dropna(): (it will drop the whole index in DataFrame)

1
2
3
4
5
6
7
data.dropna() 
'''
Out:
0 1
2 hello
dtype: object
'''

change the direction it delete in DataFrame:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
df[3] = np.nan 
df
'''''
Out:
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
'''

df.dropna(axis='columns', how='all')
'''
Out:
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
'''

df.dropna(axis='rows', thresh=3)
'''
Out:
0 1 2 3
1 2.0 3.0 5 NaN
'''

fill the NaN

1
2
3
4
5
6
7
8
9
10
11
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde')) 
data
'''
Out:
a 1.0
b NaN
c 2.0
d NaN
e 3.0
dtype: float64
'''

.fillna():

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
data.fillna(0) 
'''
Out:
a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
dtype: float64
'''

data.fillna(method='ffill')
'''
Out:
a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
dtype: float64
'''

data.fillna(method='bfill')
'''
Out: a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
dtype: float64
'''

It’s similar in DataFrame:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
df 
'''
Out:
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
'''

df.fillna(method='ffill', axis=1)
'''
Out:
0 1 2 3
0 1.0 1.0 2.0 2.0
1 2.0 3.0 5.0 5.0
2 NaN 4.0 6.0 6.0
'''

multilevel index

reindex

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
index = [('California', 2000), ('California', 2010), 
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]

pop = pd.Series(populations, index=index)
index = pd.MultiIndex.from_tuples(index)
index
#Out: MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
# labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

pop = pop.reindex(index)
pop['California', 2000]
#Out: 33871648

unstack and stack

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
pop_df = pop.unstack() 
pop_df
'''
Out:
2000 2010
California 33871648 37253956
New York 18976457 19378102
Texas 20851820 25145561
'''

pop_df.stack()
'''
Out:
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64

create

1
2
3
4
5
6
7
8
9
10
11
12
df = pd.DataFrame(np.random.rand(4, 2), 
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])
df
'''
Out:
data1 data2
a 1 0.554233 0.356072
2 0.925244 0.219474
b 1 0.441759 0.610054
2 0.171495 0.886688
'''
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
data = {('California', 2000): 33871648, 
('California', 2010): 37253956,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561,
('New York', 2000): 18976457,
('New York', 2010): 19378102}
pd.Series(data)
'''
Out:
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
'''

special ways:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]]) 
#Out: MultiIndex(levels=[['a', 'b'], [1, 2]],
# labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
#Out: MultiIndex(levels=[['a', 'b'], [1, 2]],
# labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
#Out: MultiIndex(levels=[['a', 'b'], [1, 2]],
# labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
#Out: MultiIndex(levels=[['a', 'b'], [1, 2]],
# labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

*在创建 Series 或 DataFrame 时,可以将这些对象作为 index 参数,或者通过 reindex 方法更新 Series 或 DataFrame 的索引。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
#create names for indexes
pop.index.names = ['state', 'year']
pop
'''
Out:
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
'''

more than one multilevel index

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],names=['year', 'visit']) 
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
# 模拟数据
data = np.round(np.random.randn(4, 6), 1) #取小数点后一位
data[:, ::2] *= 10 #make the data more different 所有行,0、2、4列扩大10倍
data += 37
# 创建DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
'''
Out:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 31.0 38.7 32.0 36.7 35.0 37.2
2 44.0 37.7 50.0 35.0 29.0 36.7
2014 1 30.0 37.4 39.0 37.8 61.0 36.9
2 47.0 37.8 48.0 37.3 51.0 36.5
'''

health_data['Bob']['HR']#now it is series
'''
out:
year visit
2013 1 31.0
2 44.0
2014 1 30.0
2 47.0
'''

get the value

1
2
3
4
5
6
7
8
9
10
11
12
pop 
'''
Out:
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
'''

get only one:

1
2
pop['California', 2000] 
#Out: 33871648

partial indexing:

1
2
3
4
5
6
7
8
pop['California'] 
'''
Out:
year
2000 33871648
2010 37253956
dtype: int64
'''

slice

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
pop.loc['California':'New York'] 
'''
Out:
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102 dtype: int64
'''

#get the data in 2000
pop[:, 2000]
'''
Out:
state
California 33871648
New York 18976457
Texas 20851820
dtype: int64
'''

mask

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
pop[pop > 22000000] 
'''
Out:
state year
California 2000 33871648
2010 37253956
Texas 2010 25145561 dtype: int64
'''

pop[['California', 'Texas']]
'''
Out:
state year
California 2000 33871648
2010 37253956
Texas 2000 20851820
2010 25145561
dtype: int64
'''

DataFrame

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
health_data 
'''
Out:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 31.0 38.7 32.0 36.7 35.0 37.2
2 44.0 37.7 50.0 35.0 29.0 36.7
2014 1 30.0 37.4 39.0 37.8 61.0 36.9
2 47.0 37.8 48.0 37.3 51.0 36.5
'''

#the same method in series can be used
health_data['Guido', 'HR']
'''
Out:
year visit
2013 1 32.0
2 50.0
2014 1 39.0
2 48.0
Name: (Guido, HR), dtype: float64
'''

loc

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
health_data.iloc[:2, :2] #隐式索引
'''
Out:
subject Bob
type HR Temp
year visit
2013 1 31.0 38.7
2 44.0 37.7
'''

#more than one indexes can be used
health_data.loc[:, ('Bob', 'HR')]
'''
Out:
year visit
2013 1 31.0
2 44.0
2014 1 30.0
2 47.0
Name: (Bob, HR), dtype: float64
'''

IndexSlice

1
2
3
4
5
6
7
8
9
10
idx = pd.IndexSlice 
health_data.loc[idx[:, 1], idx[:, 'HR']]
'''
Out:
subject Bob Guido Sue
type HR HR HR
year visit
2013 1 31.0 32.0 35.0
2014 1 30.0 39.0 61.0
'''

index/column exchange

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]]) 
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data
'''
Out:
char int
a 1 0.003001
2 0.164974
c 1 0.741650
2 0.569264
b 1 0.001693
2 0.526226
dtype: float64
'''

#code will go wrong if the DataFrame is not sorted
try:
data['a':'b']
except KeyError as e:
print(type(e))
print(e)
#<class 'KeyError'>
#'Key length (1) was greater than MultiIndex lexsort depth (0)'

sort_index()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
data = data.sort_index() 
data
'''
Out:
char int
a 1 0.003001
2 0.164974
b 1 0.001693
2 0.526226
c 1 0.741650
2 0.569264
dtype: float64
'''

data['a':'b']
'''
Out:
char int
a 1 0.003001
2 0.164974
b 1 0.001693
2 0.526226
dtype: float64
'''

stack and unstack

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
pop.unstack(level=0) 
'''
Out:
state California New York Texas
year
2000 33871648 18976457 20851820
2010 37253956 19378102 25145561
'''

pop.unstack(level=1)
'''
Out:
year 2000 2010
state
California 33871648 37253956
New York 18976457 19378102
Texas 20851820 25145561
'''


pop.unstack().stack()
'''
Out:
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
'''

set and reset

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
pop_flat = pop.reset_index(name='population') 
pop_flat
'''
Out:
state year population
0 California 2000 33871648
1 California 2010 37253956
2 New York 2000 18976457
3 New York 2010 19378102
4 Texas 2000 20851820
5 Texas 2010 25145561
'''

pop_flat.set_index(['state', 'year'])
'''
Out:
population
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
'''

accumulate

1
2
3
4
5
6
7
8
9
10
11
health_data 
'''
Out:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 31.0 38.7 32.0 36.7 35.0 37.2
2 44.0 37.7 50.0 35.0 29.0 36.7
2014 1 30.0 37.4 39.0 37.8 61.0 36.9
2 47.0 37.8 48.0 37.3 51.0 36.5
'''

mean:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
data_mean = health_data.mean(level='year') 
data_mean
'''
Out:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year
2013 37.5 38.2 41.0 35.85 32.0 36.95
2014 38.5 37.6 43.5 37.55 56.0 36.70
'''

data_mean.mean(axis=1, level='type')
'''
Out:
type HR Temp
year
2013 36.833333 37.000000
2014 46.000000 37.283333
'''