Pandas Introduction 1

Series

First, import pandas:

import pandas 
pandas.__version__ 
#Out: '0.18.1'

import pandas as pd

Create a Series:

pd.Series(data, index=index)

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64

pd.Series(5, index=[100, 200, 300])

100 5
200 5
300 5
dtype: int64

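The data can also be a dictionary, in which case the keys become the index. In the pandas version used here the index is sorted by key; newer pandas versions keep the dictionary's insertion order instead.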

pd.Series({2:'a', 1:'b', 3:'c'})

1 b
2 a
3 c
dtype: object

pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3 c
2 a
dtype: object
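
As a minimal sketch of how an explicit index interacts with dictionary data: only the requested keys are kept, in the requested order, and a label with no matching key is filled with NaN. The label 4 below is a hypothetical key added for illustration; it is not part of the example above.

```python
import pandas as pd

# Dictionary data with an explicit index: keys 3 and 2 are kept in the
# requested order; label 4 (hypothetical) has no data and becomes NaN.
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])

print(s)
print(s.isna().tolist())
```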

index

We can inspect the values (a NumPy array) and the index:

data.values 
#Out: array([ 0.25, 0.5 , 0.75, 1. ])

data.index 
#Out: RangeIndex(start=0, stop=4, step=1)

Elements can be accessed with Python-style indexing and slicing:

data[1] 
#Out: 0.5 

data[1:3] 
#Out: 1 0.50 
# 	  2 0.75 
# 	  dtype: float64

The index can be set explicitly:

data = pd.Series([0.25, 0.5, 0.75, 1.0], 
                 index=['a', 'b', 'c', 'd']) 
data 

#Out: a 0.25 
#	  b 0.50 
#	  c 0.75 
#	  d 1.00 
#	  dtype: float64

# values can be accessed by label, as before
data['b'] 
#Out: 0.5

The index can hold any values you like; it does not have to be contiguous or sequential:

data = pd.Series([0.25, 0.5, 0.75, 1.0], 
                 index=[2, 5, 3, 7]) 
data 
#Out: 2 0.25 
#	  5 0.50 
#	  3 0.75 
#	  7 1.00 
#	  dtype: float64 

data[5] 
#Out: 0.5

A Series can also be thought of as a specialized dictionary:

population_dict = {'California': 38332521, 
                   'Texas': 26448193, 
                   'New York': 19651127, 
                   'Florida': 19552860, 
                   'Illinois': 12882135} 
population = pd.Series(population_dict) 
population

California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
dtype: int64

Items can be selected by key and, unlike a dictionary, also by a slice of keys:

population['California'] 
#Out: 38332521

population['California':'Illinois']

California 38332521
Florida 19552860
Illinois 12882135
dtype: int64
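
A small sketch of the difference between slicing by label and slicing by position, using the population figures from above. The index is sorted first (an assumption made here so that the result matches the output shown regardless of how the pandas version orders dictionary keys).

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135}).sort_index()

# Label slice: both 'California' and 'Illinois' are included.
by_label = population['California':'Illinois']

# Positional slice: the stop position is excluded, as in plain Python.
by_position = population.iloc[0:2]

print(by_label)
print(by_position)
```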

DataFrame

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 
             'Florida': 170312, 'Illinois': 149995} 
area = pd.Series(area_dict) 
area

California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
dtype: int64

Combine it with the population Series from above:

states = pd.DataFrame({'population': population, 
                       'area': area}) 
states

area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193
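
When a DataFrame is built from several Series, the rows are matched by index label rather than by position. The following is a minimal sketch of that alignment, assuming a reduced version of the data above in which the two Series are ordered differently and `area` has no entry for Florida.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662})
# Different order, plus one label ('Florida') that area lacks.
population = pd.Series({'Florida': 19552860, 'Texas': 26448193,
                        'California': 38332521})

states = pd.DataFrame({'population': population, 'area': area})

# Values are paired by label; the missing area for Florida becomes NaN.
print(states)
```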

Index and columns:

states.index 
#Out: Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

states.columns 
#Out: Index(['area', 'population'], dtype='object')

A DataFrame can also be treated as a dictionary that maps a column name to a Series:

states['area']

California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64

Create a DataFrame from a single Series:

pd.DataFrame(population, columns=['population'])

population
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193

data = [{'a': i, 'b': 2 * i} 
        for i in range(3)] 
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4

When the dictionaries do not share the same keys, the missing values are filled with NaN:

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
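
A short sketch of what the NaN filling implies in practice: columns with a gap are upcast to float, and the gaps can be located or replaced afterwards. The default value 0 passed to `fillna` is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns with a missing entry are upcast to float so they can hold NaN.
print(df.dtypes)

# Locate the gaps, or replace them with a default value (0 is arbitrary).
missing = df.isna()
filled = df.fillna(0)

print(missing)
print(filled)
```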

A DataFrame can also be built from a dictionary of Series:

pd.DataFrame({'population': population, 'area': area})

area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193

A two-dimensional NumPy array can also be turned into a DataFrame, with column and index names of your choice:

import numpy as np

pd.DataFrame(np.random.rand(3, 2), 
             columns=['foo', 'bar'], 
             index=['a', 'b', 'c'])

foo bar
a 0.865257 0.213169
b 0.442759 0.108267
c 0.047110 0.905718

The Index object

ind = pd.Index([2, 3, 5, 7, 11]) 
ind 
#Out: Int64Index([2, 3, 5, 7, 11], dtype='int64')

ind[1] 
#Out: 3 

ind[::2] 
#Out: Int64Index([2, 5, 11], dtype='int64')

print(ind.size, ind.shape, ind.ndim, ind.dtype) 
#Out: 5 (5,) 1 int64

# but an Index is immutable: it cannot be modified in place
ind[1] = 0 
'''
--------------------------------------------------------------------------- 
TypeError Traceback (most recent call last) 
<ipython-input-34-40e631c82e8a> in <module>() 
----> 1 ind[1] = 0 
/Users/jakevdp/anaconda/lib/python3.5/site-packages/pandas/indexes/base.py ... 
 1243 
 1244 def __setitem__(self, key, value): 
-> 1245 raise TypeError("Index does not support mutable operations") 
 1246 
 1247 def __getitem__(self, key): 
TypeError: Index does not support mutable operations
'''
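
A minimal sketch of how the immutability can be handled in code: the assignment is caught as a `TypeError`, and a modified copy is built instead. Using `delete` followed by `insert` is just one possible way to get such a copy, chosen here for illustration.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

raised = False
try:
    ind[1] = 0              # in-place modification is rejected
except TypeError as err:
    raised = True
    print("TypeError:", err)

# Build a new Index instead; the original is left unchanged.
new_ind = ind.delete(1).insert(1, 0)
print(list(new_ind))
```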

Index set operations (used when joining datasets)

indA = pd.Index([1, 3, 5, 7, 9]) 
indB = pd.Index([2, 3, 5, 7, 11]) 
indA & indB # intersection
#Out: Int64Index([3, 5, 7], dtype='int64') 

indA | indB # union
#Out: Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64') 

indA ^ indB # symmetric difference (xor)
#Out: Int64Index([1, 2, 9, 11], dtype='int64')
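
The same set operations are also available as methods. This may be the safer spelling, since the operator forms behave this way in the pandas version used here but have changed meaning in newer releases. A minimal sketch:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # labels present in both
either = indA.union(indB)                   # labels present in at least one
only_one = indA.symmetric_difference(indB)  # labels present in exactly one

print(list(common), list(either), list(only_one))
```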

Selecting data

Series

data = pd.Series([0.25, 0.5, 0.75, 1.0], 
                 index=['a', 'b', 'c', 'd']) 
data 
'''
 a 0.25 
 b 0.50 
 c 0.75 
 d 1.00 
 dtype: float64 
'''

data['b'] 
#Out: 0.5

'a' in data 
#Out: True 


# a Series supports dictionary-style access to its keys and items
data.keys() 
#Out: Index(['a', 'b', 'c', 'd'], dtype='object') 

list(data.items()) 
#Out: [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

# add an item to a Series by assigning to a new label (the same as a dictionary)
data['e'] = 1.25 
data 
'''
 a 0.25 
 b 0.50 
 c 0.75 
 d 1.00 
 e 1.25 
 dtype: float64
'''

A Series also supports array-style selection:

# slicing by explicit index
data['a':'c'] 
#Out: a 0.25 
#     b 0.50 
#	  c 0.75 
#	  dtype: float64 

# slicing by implicit integer index
data[0:2] 
#Out: a 0.25 
#	  b 0.50 
#	  dtype: float64 

# masking
data[(data > 0.3) & (data < 0.8)] 
#Out: b 0.50 
#	  c 0.75 
#	  dtype: float64 

# fancy indexing
data[['a', 'e']] 
#Out: a 0.25 
#	  e 1.25 
#	  dtype: float64
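
To make the masking step less opaque, here is a small sketch that builds the boolean mask separately before applying it. The parentheses around each comparison are needed because `&` binds more tightly than `>` and `<` in Python. The variable names are chosen here for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd', 'e'])

# A boolean Series with the same index as data.
mask = (data > 0.3) & (data < 0.8)
print(mask.tolist())

selected = data[mask]        # keeps the entries where mask is True
picked = data[['e', 'a']]    # fancy indexing follows the order of the list

print(selected)
print(picked)
```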

`loc`, `iloc` and `ix`

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5]) 
data 
#Out: 1 a 
#     3 b 
#	  5 c 
#	  dtype: object


# indexing a single item uses the explicit index (the labels we set)
data[1] 
#Out: 'a'

# slicing uses the implicit index (integer positions)
data[1:3] 
#Out: 3 b 
#	  5 c 
#	  dtype: object

This difference is hard to remember when the index holds integers, so pandas provides dedicated indexers. loc always selects by the explicit index (the labels you set):

data.loc[1] 
#Out: 'a' 

data.loc[1:3] 
#Out: 1 a 
#	  3 b 
#	  dtype: object

In contrast, iloc always selects by the implicit index (integer positions):

data.iloc[1] 
#Out: 'b' 

data.iloc[1:3] 
#Out: 3 b 
#	  5 c 
#	  dtype: object
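
The two indexers can be compared side by side. This sketch reuses the Series from above; note that a `loc` slice includes its end label while an `iloc` slice excludes its end position.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1:3]       # labels 1 through 3, end label included
by_position = data.iloc[1:3]   # positions 1 and 2, end position excluded

print(by_label.tolist())
print(by_position.tolist())
```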

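ix is a hybrid of the two and is shown later for DataFrames. Note that ix works in the pandas version used here (0.18.1), but it was deprecated in pandas 0.20 and removed in 1.0.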

DataFrame

DataFrame as a dictionary

area = pd.Series({'California': 423967, 'Texas': 695662, 
                  'New York': 141297, 'Florida': 170312, 
                  'Illinois': 149995}) 
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 
                 'New York': 19651127, 'Florida': 19552860, 
                 'Illinois': 12882135}) 
data = pd.DataFrame({'area':area, 'pop':pop}) 
data

area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193

dictionary-style and attribute-style access

data['area']

California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64

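Attribute-style access returns the same object:

data.area is data['area']
#Out: True

However, attribute-style access does not always work: it fails when the column name is not a valid Python identifier (for example, if it contains a space) or when it conflicts with the name of a DataFrame method.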
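
A minimal sketch of both failure cases. The `'land use'` column and its values are hypothetical placeholders added for illustration; they are not part of the dataset above.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193],
                     'land use': [1, 2]},   # hypothetical column and values
                    index=['California', 'Texas'])

# 'pop' is also a DataFrame method, so data.pop is the method, not the column.
print(callable(data.pop))

# 'land use' contains a space, so it cannot be written as an attribute at all.
# Dictionary-style access works in both cases.
pop_column = data['pop']
land_use = data['land use']
print(pop_column.tolist(), land_use.tolist())
```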

add a computed column with dictionary-style assignment

data['density'] = data['pop'] / data['area'] 
data 
'''
Out:              area  pop     density 
	 California 423967 38332521 90.413926 
	 Florida    170312 19552860 114.806121 
	 Illinois   149995 12882135 85.883763 
	 New York   141297 19651127 139.076746 
	 Texas      695662 26448193 38.018740
'''

data.values 
#Out: array([[ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01], 
# 			 [ 1.70312000e+05, 1.95528600e+07, 1.14806121e+02], 
# 			 [ 1.49995000e+05, 1.28821350e+07, 8.58837628e+01], 
# 			 [ 1.41297000e+05, 1.96511270e+07, 1.39076746e+02], 
# 			 [ 6.95662000e+05, 2.64481930e+07, 3.80187404e+01]])

transpose (swap rows and columns)

data.T

California Florida Illinois New York Texas
area 4.239670e+05 1.703120e+05 1.499950e+05 1.412970e+05 6.956620e+05
pop 3.833252e+07 1.955286e+07 1.288214e+07 1.965113e+07 2.644819e+07
density 9.041393e+01 1.148061e+02 8.588376e+01 1.390767e+02 3.801874e+01

get a whole row or column:

data.values[0] 
#Out: array([ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

data['area']

California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64

loc, iloc and ix

These work the same way as for a Series:

data.iloc[:3, :2]

area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135

data.loc[:'Illinois', :'pop']

area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135

# ix can mix both styles, but readers may not be able to tell which one is meant
data.ix[:3, :'pop']

area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
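
Since `ix` is not available in current pandas releases, the same mixed selection can be written with `iloc` and `loc` only. The following is a sketch of two possible spellings, assuming the index is sorted as in the output above (hence the explicit `sort_index()`).

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop}).sort_index()
data['density'] = data['pop'] / data['area']

# First three rows by position, then columns up to 'pop' by label.
subset = data.iloc[:3].loc[:, :'pop']

# Alternative: turn the positions into labels first, then use loc alone.
subset_alt = data.loc[data.index[:3], :'pop']

print(subset)
```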

combining masking and fancy indexing

data.loc[data.density > 100, ['pop', 'density']]

pop density
Florida 19552860 114.806121
New York 19651127 139.076746

change a value

data.iloc[0, 2] = 90
data

area pop density
California 423967 38332521 90.000000
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
New York 141297 19651127 139.076746
Texas 695662 26448193 38.018740

other indexing conventions

# a slice selects rows
data['Florida':'Illinois']

area pop density
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763

# an integer slice also selects rows, by position
data[1:3]

area pop density
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763

# a boolean mask selects rows
data[data.density > 100]

area pop density
Florida 170312 19552860 114.806121
New York 141297 19651127 139.076746
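
These conventions mean that plain square brackets behave differently depending on what is passed in: a single label selects a column, while a slice or a boolean mask selects rows. A minimal sketch, again assuming a sorted index so that the results match the outputs above:

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop}).sort_index()
data['density'] = data['pop'] / data['area']

column = data['area']                  # single label: a column
rows = data['Florida':'Illinois']      # slice: rows, end label included
dense = data[data['density'] > 100]    # boolean mask: rows

print(rows)
print(dense)
```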