pandas1

Pandas Introduction 1

Series

First, import pandas:

import pandas 
pandas.__version__
#Out: '0.18.1'

import pandas as pd

Create a Series. The general form is:

pd.Series(data, index=index)

data = pd.Series([0.25, 0.5, 0.75, 1.0]) 
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

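The data can also be a dictionary, in which case the keys become the index. In the pandas version used here the index comes out sorted by key (recent pandas versions keep the dictionary's insertion order instead):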

pd.Series({2:'a', 1:'b', 3:'c'})

1    b
2    a
3    c
dtype: object

pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object
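Because the default ordering of a dictionary-based Series depends on the pandas version, it can help to make the ordering explicit. The following is a minimal sketch, not part of the original walkthrough; the variable names `letters`, `letters_sorted` and `subset` are illustrative assumptions.

```python
import pandas as pd

# Keys become the index; whether they come out sorted depends on the pandas version.
letters = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# sort_index() makes the ordering explicit instead of relying on the default.
letters_sorted = letters.sort_index()
print(letters_sorted)

# Passing index= keeps only the requested keys, in the requested order.
subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
print(subset)
```

Sorting explicitly costs one extra call but gives the same result on old and new pandas versions.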

index

We can inspect the values (a NumPy array) and the index:

data.values 
#Out: array([ 0.25, 0.5 , 0.75, 1. ])

data.index
#Out: RangeIndex(start=0, stop=4, step=1)

Items can be accessed with the usual Python square-bracket notation:

data[1] 
#Out: 0.5

data[1:3]
#Out: 1 0.50
# 2 0.75
# dtype: float64

The index can also be set explicitly:

data = pd.Series([0.25, 0.5, 0.75, 1.0], 
index=['a', 'b', 'c', 'd'])
data

#Out: a 0.25
# b 0.50
# c 0.75
# d 1.00
# dtype: float64

# items can be accessed as before
data['b']
#Out: 0.5

The index can be whatever you like; it does not have to be sequential or sorted:

data = pd.Series([0.25, 0.5, 0.75, 1.0], 
index=[2, 5, 3, 7])
data
#Out: 2 0.25
# 5 0.50
# 3 0.75
# 7 1.00
# dtype: float64

data[5]
#Out: 0.5

A Series can also be thought of as a kind of dictionary:

population_dict = {'California': 38332521, 
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

Items can be selected by key, and, unlike a dictionary, a Series also supports slicing:

population['California'] 
#Out: 38332521

population['California':'Illinois']

California 38332521
Florida 19552860
Illinois 12882135
dtype: int64
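One detail worth spelling out is that a label-based slice includes its end label, while a position-based slice excludes its stop position. The sketch below illustrates this under the assumption that the index is sorted explicitly with `sort_index()`, so the result does not depend on the pandas version; the names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

population = pd.Series({
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135,
}).sort_index()  # explicit sort so the slices below are reproducible

# Label-based slice: both endpoints are included.
by_label = population['California':'Illinois']
print(by_label)

# Position-based slice: the stop position is excluded, as with Python lists.
by_position = population[0:2]
print(by_position)
```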

DataFrame

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
dtype: int64

Combine it with the population Series from before:

states = pd.DataFrame({'population': population,
                       'area': area})
states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193

A DataFrame has both an index and columns:

states.index 
#Out: Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

states.columns
#Out: Index(['area', 'population'], dtype='object')

It can also be treated as a dictionary of columns:

states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

A DataFrame can be created from a single Series:

pd.DataFrame(population, columns=['population'])

            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193

data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4

When the rows are built from dictionaries with different keys, the missing values are filled with NaN:

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
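The same NaN filling happens when a DataFrame is assembled from Series whose indexes only partly overlap. The following is a small sketch with made-up data; `price`, `stock` and `table` are hypothetical names, not from the original post.

```python
import pandas as pd

# Two Series whose indexes only partly overlap (hypothetical data).
price = pd.Series({'apple': 1.2, 'pear': 0.8})
stock = pd.Series({'pear': 30, 'plum': 12})

# The DataFrame index is the union of both indexes; gaps become NaN.
table = pd.DataFrame({'price': price, 'stock': stock})
print(table)

# isnull() marks the missing cells; a column containing NaN is stored as float.
print(table.isnull())
print(table.dtypes)
```

Note that the integer `stock` values are converted to floating point, because NaN is itself a floating-point value.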

It can also be built from a dictionary of Series objects:

pd.DataFrame({'population': population, 'area': area})

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193

A two-dimensional NumPy array can also be turned into a DataFrame:

import numpy as np  # needed for np.random

pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.865257  0.213169
b  0.442759  0.108267
c  0.047110  0.905718

The Index object

ind = pd.Index([2, 3, 5, 7, 11]) 
ind
#Out: Int64Index([2, 3, 5, 7, 11], dtype='int64')

ind[1]
#Out: 3

ind[::2]
#Out: Int64Index([2, 5, 11], dtype='int64')

print(ind.size, ind.shape, ind.ndim, ind.dtype)
#Out: 5 (5,) 1 int64

# but an Index is immutable: its entries cannot be changed
ind[1] = 0
'''
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-34-40e631c82e8a> in <module>()
----> 1 ind[1] = 0
/Users/jakevdp/anaconda/lib/python3.5/site-packages/pandas/indexes/base.py ...
1243
1244 def __setitem__(self, key, value):
-> 1245 raise TypeError("Index does not support mutable operations")
1246
1247 def __getitem__(self, key):
TypeError: Index does not support mutable operations
'''
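If a modified index is needed, one option is to build a new Index rather than trying to change the existing one. This is a minimal sketch of that idea; the names `assignment_failed`, `values` and `new_ind` are illustrative assumptions.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Item assignment is rejected, so the original Index is never modified.
try:
    ind[1] = 0
    assignment_failed = False
except TypeError:
    assignment_failed = True

# To "change" an entry, build a new Index from the modified values.
values = ind.tolist()
values[1] = 0
new_ind = pd.Index(values)

print(assignment_failed)
print(list(ind), list(new_ind))
```

Immutability is what makes it safe to share one Index between several Series and DataFrames.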

Set operations on two Index objects

indA = pd.Index([1, 3, 5, 7, 9]) 
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection
#Out: Int64Index([3, 5, 7], dtype='int64')

indA | indB # union
#Out: Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

indA ^ indB # symmetric difference (xor)
#Out: Int64Index([1, 2, 9, 11], dtype='int64')
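The operator forms above reflect the pandas version used in this post. In recent pandas releases the `&`, `|` and `^` operators on Index objects are no longer treated as set operations, so the named methods are the more portable choice. A short sketch, with `common`, `combined` and `exclusive` as illustrative names:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)              # values present in both
combined = indA.union(indB)                   # values present in either
exclusive = indA.symmetric_difference(indB)   # values present in exactly one

print(sorted(common))
print(sorted(combined))
print(sorted(exclusive))
```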

Selecting data

Series

data = pd.Series([0.25, 0.5, 0.75, 1.0], 
index=['a', 'b', 'c', 'd'])
data
'''
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
'''

data['b']
#Out: 0.5

'a' in data
#Out: True


# a Series supports dictionary-style methods
data.keys()
#Out: Index(['a', 'b', 'c', 'd'], dtype='object')

list(data.items())
#Out: [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

# add an item to the Series (same syntax as for a dictionary)
data['e'] = 1.25
data
'''
a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
'''

A Series can also be used like a one-dimensional array:

# slicing by explicit index
data['a':'c']
#Out: a 0.25
# b 0.50
# c 0.75
# dtype: float64

# slicing by implicit integer index
data[0:2]
#Out: a 0.25
# b 0.50
# dtype: float64

# masking
data[(data > 0.3) & (data < 0.8)]
#Out: b 0.50
# c 0.75
# dtype: float64

# fancy indexing
data[['a', 'e']]
#Out: a 0.25
# e 1.25
# dtype: float64

loc, iloc and ix

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5]) 
data
#Out: 1 a
# 3 b
# 5 c
# dtype: object


# indexing with a single value uses the explicit index (the labels we set)
data[1]
#Out: 'a'

# slicing uses the implicit, position-based index
data[1:3]
#Out: 3 b
# 5 c
# dtype: object

Because this difference is easy to forget, pandas provides explicit indexers. loc always selects by the explicit index, that is, the labels you set:

data.loc[1] 
#Out: 'a'

data.loc[1:3]
#Out: 1 a
# 3 b
# dtype: object

In contrast, iloc always selects by the implicit, position-based index:

data.iloc[1] 
#Out: 'b'

data.iloc[1:3]
#Out: 3 b
# 5 c
# dtype: object

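ix is a hybrid of the two and is shown in the DataFrame section below. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so it only works in older versions such as the 0.18.1 used here.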

DataFrame

treated as a dictionary

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
New York    141297  19651127
Texas       695662  26448193

dictionary-style and attribute-style access

data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

Attribute-style access with the column name returns the same object:

data.area is data['area'] 
#Out: True

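However, the data.area form does not work in every case: it fails when the column name is not a valid Python variable name (for example, it contains a space) or when it clashes with an existing DataFrame method or attribute.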
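The limits of attribute-style access can be seen with a small frame. This is a sketch with a hypothetical `frame`; the `'land area'` column is invented purely to show a name that is not a valid identifier, while `'pop'` happens to clash with the real `DataFrame.pop` method.

```python
import pandas as pd

# Hypothetical frame: 'pop' clashes with the DataFrame.pop method and
# 'land area' is not a valid Python identifier.
frame = pd.DataFrame({'area': [423967, 170312],
                      'pop': [38332521, 19552860],
                      'land area': [1, 2]})

same_values = frame.area.equals(frame['area'])  # attribute access works for 'area'
pop_is_method = callable(frame.pop)             # frame.pop is the method, not the column
pop_column = frame['pop']                       # brackets always return the column
spaced_column = frame['land area']              # only reachable with brackets

print(same_values, pop_is_method)
```

For this reason, bracket access is the safer default, especially when assigning to a column.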

adding a column with dictionary-style assignment

data['density'] = data['pop'] / data['area'] 
data
'''
Out:
              area       pop     density
California  423967  38332521   90.413926
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740
'''

data.values
#Out: array([[ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
# [ 1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
# [ 1.49995000e+05, 1.28821350e+07, 8.58837628e+01],
# [ 1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
# [ 6.95662000e+05, 2.64481930e+07, 3.80187404e+01]])
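The scientific notation in the array above comes from type promotion: a single NumPy array has one dtype, so integer and float columns are converted to a common floating-point type. The following sketch shows this on a reduced, two-row version of the data; `frame`, `int_block` and `mixed_block` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'area': [423967, 170312],
                      'pop': [38332521, 19552860]})
frame['density'] = frame['pop'] / frame['area']

# Inside the DataFrame each column keeps its own dtype.
print(frame.dtypes)

# Only integer columns: the resulting array stays integer.
int_block = frame[['area', 'pop']].values

# Integer and float columns together: everything is promoted to float.
mixed_block = frame.values

print(int_block.dtype, mixed_block.dtype)
```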

transpose

data.T

           California       Florida      Illinois      New York         Texas
area     4.239670e+05  1.703120e+05  1.499950e+05  1.412970e+05  6.956620e+05
pop      3.833252e+07  1.955286e+07  1.288214e+07  1.965113e+07  2.644819e+07
density  9.041393e+01  1.148061e+02  8.588376e+01  1.390767e+02  3.801874e+01

Get a whole row or column:

data.values[0] 
#Out: array([ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

loc, iloc and ix

These work the same way as for a Series:

data.iloc[:3, :2]

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135

data.loc[:'Illinois', :'pop']

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135

# ix can mix both styles, but readers may find it hard to tell which one is meant
data.ix[:3, :'pop']

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
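Since ix is not available in current pandas releases (it was removed in pandas 1.0), the same mixed selection can be written with loc or iloc alone. The following is one possible sketch, assuming the index is sorted with `sort_index()` to match the ordering shown in this post; `via_loc`, `via_iloc` and `stop` are illustrative names.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop}).sort_index()

# First three rows by position, columns up to and including 'pop' by label.
via_loc = data.loc[data.index[:3], :'pop']

# The same selection expressed purely by position.
stop = data.columns.get_loc('pop') + 1
via_iloc = data.iloc[:3, :stop]

print(via_loc)
```

Either form states explicitly whether a label or a position is meant, which is the ambiguity ix left open.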

combining a mask with fancy indexing

data.loc[data.density > 100, ['pop', 'density']]

               pop     density
Florida   19552860  114.806121
New York  19651127  139.076746

changing a value

data.iloc[0, 2] = 90
data

              area       pop     density
California  423967  38332521   90.000000
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740
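The same kind of update can be addressed by label instead of by position, which keeps working if rows or columns are later reordered. A minimal sketch on a reduced two-row frame (the reduced data set is an assumption made to keep the example short):

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 170312],
                     'pop': [38332521, 19552860]},
                    index=['California', 'Florida'])
data['density'] = data['pop'] / data['area']

# Same update as data.iloc[0, 2] = 90, but addressed by row and column label.
data.loc['California', 'density'] = 90.0

print(data)
```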

other indexing conventions

# slice rows by label
data['Florida':'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763

# slice rows by position
data[1:3]

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763

# mask rows with a boolean condition
data[data.density > 100]

            area       pop     density
Florida   170312  19552860  114.806121
New York  141297  19651127  139.076746
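These shorthand forms all select rows, whereas a plain column name in brackets selects a column, which can be confusing. Each shorthand has an explicit loc or iloc equivalent; the sketch below shows them, assuming the index is sorted with `sort_index()` to match the ordering used in this post. The names `by_label`, `by_position` and `dense` are illustrative.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop}).sort_index()
data['density'] = data['pop'] / data['area']

by_label = data.loc['Florida':'Illinois']   # rows by label, end label included
by_position = data.iloc[1:3]                # rows by position, stop excluded
dense = data.loc[data['density'] > 100]     # rows where the condition holds

print(by_label)
print(dense)
```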