Pandas Introduction 2

indexes

keep the indexes:

when making difficult calculations,it will keep its indexes.

first create a series and a dataframe:

import pandas as pd 
import numpy as np 

rng = np.random.RandomState(42) 
ser = pd.Series(rng.randint(0, 10, 4)) #select 4 numbers within 0 to 10
ser 
'''
out:
 0 6 
 1 3 
 2 7 
 3 4 
 dtype: int64 
'''

df = pd.DataFrame(rng.randint(0, 10, (3, 4)), 
columns=['A', 'B', 'C', 'D']) 
df 
'''
out: 
   A B C D 
 0 6 9 2 6 
 1 7 4 3 7 
 2 7 2 5 4
'''

e to the x:

np.exp(ser) 
'''
Out:
 0 403.428793 
 1 20.085537 
 2 1096.633158 
 3 54.598150 
 dtype: float64
'''

or more complex one:

np.sin(df * np.pi / 4) 
'''
Out[5]: 
	A 			B			 C 			D 
 0 -1.000000 7.071068e-01 1.000000 -1.000000e+00 
 1 -0.707107 1.224647e-16 0.707107 -7.071068e-01 
 2 -0.707107 1.000000e+00 -0.707107 1.224647e-16
'''

match the index:

area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 
                  'California': 423967}, name='area') 
population = pd.Series({'California': 38332521, 'Texas': 26448193, 
                        'New York': 19651127}, name='population')

#calculate the density:
population / area 
'''
Out[7]: 
 Alaska		 NaN 
 California  90.413926 
 New York 	 NaN 
 Texas 		 38.018740 
 dtype: float64
'''

their indexes:

1 2	area.index \| population.index #Out: Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

if there is no value, it will show NaN

A = pd.Series([2, 4, 6], index=[0, 1, 2]) 
B = pd.Series([1, 3, 5], index=[1, 2, 3]) 
A + B 
'''
Out: 
 0 NaN 
 1 5.0 
 2 9.0 
 3 NaN 
 dtype: float64
'''

using this, we can fill the values that don’t exist with 0

A.add(B, fill_value=0) 
'''
Out: 
 0 2.0 
 1 5.0 
 2 9.0 
 3 5.0 
 dtype: float64
'''

when comes to DataFrame:

A = pd.DataFrame(rng.randint(0, 20, (2, 2)), 
columns=list('AB')) 
A 
'''
Out: 
   A B 
 0 1 11 
 1 5 1 
'''

B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC')) 
B 
'''
Out: 
   B A C 
 0 4 0 9 
 1 5 8 0 
 2 9 2 6 
'''

A + B 
'''
Out:
    A   B   C 
 0 1.0  15.0 NaN 
 1 13.0 6.0  NaN 
 2 NaN  NaN  NaN
'''

use the mean to fill NaN

fill = A.stack().mean() 
A.add(B, fill_value=fill) 
'''
Out: 
    A   B   C 
 0 1.0 15.0 13.5 
 1 13.0 6.0 4.5 
 2 6.5 13.5 10.5
'''

Python运算符	Pandas方法
+	add()
-	sub()、subtract()
*	mul()、multiply()
/	truediv()、div()、divide()
//	floordiv()
%	mod()
**	pow()

calculation

A = rng.randint(10, size=(3, 4)) 
A 
'''
Out: 
array([[3, 8, 2, 4], 
	  [2, 6, 4, 8], 
	  [6, 1, 3, 8]]) 
'''

A - A[0] 
'''
Out: 
array([[ 0, 0, 0, 0], 
	  [-1, -2, 2, 4], 
	  [ 3, -7, 1, 4]])
'''

DataFrame id the same:

df = pd.DataFrame(A, columns=list('QRST')) 
df - df.iloc[0] 
'''
Out:
    Q R S T 
 0  0  0 0 0 
 1 -1 -2 2 4 
 2  3 -7 1 4
'''

change to columns:

df.subtract(df['R'], axis=0) 
'''
Out:
    Q R S T 
 0 -5 0 -6 -4 
 1 -4 0 -2  2 
 2  5 0  2  7
'''

halfrow = df.iloc[0, ::2] 
halfrow 
'''
Out:
 Q 3 
 S 2 
Name: 0, dtype: int64 
'''

df - halfrow 
'''
Out: 
    Q   R   S   T   
 0 0.0 NaN 0.0 NaN 
 1 -1.0 NaN 2.0 NaN 
 2 3.0 NaN 1.0 NaN
'''

deal with NaN

what is NaN

1
2
3

vals1 = np.array([1, None, 3, 4]) 
vals1 
#Out: array([1, None, 3, 4], dtype=object)

vals1.sum() 
'''
TypeError Traceback (most recent call last) 
<ipython-input-4-749fd8ae6030> in <module>() 
----> 1 vals1.sum() 
/Users/jakevdp/anaconda/lib/python3.5/site-packages/numpy/core/_methods.py ... 
 30 
 31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False): 
---> 32 return umr_sum(a, axis, dtype, out, keepdims) 
 33 
 34 def _prod(a, axis=None, dtype=None, out=None, keepdims=False): 
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
'''

None can’t be add to integers

NaN: not a number

It can be calculated with integer:

1 + np.nan 
#Out: nan 

0 * np.nan 
#Out: nan

The result will always be nan.

including the maximum and minimum:

1
2
3

vals2 = np.array([1, np.nan, 3, 4])
vals2.sum(), vals2.min(), vals2.max() 
#Out: (nan, nan, nan)

delete the nan:

1 2	np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2) #Out: (8.0, 1.0, 4.0)

In pandas, None=NaN:

pd.Series([1, np.nan, 2, None]) 
'''
Out:
 0 1.0 
 1 NaN 
 2 2.0 
 3 NaN 
 dtype: float64
'''

If there is a NaN the type will be changed to float:

x = pd.Series(range(2), dtype=int) 
x 
'''
Out:
 0 0 
 1 1 
 dtype: int64 
'''
x[0] = None 
x 
'''
Out:
 0 NaN 
 1 1.0 
 dtype: float64
'''

类型	缺失值转换规则	NA标签值
floating 浮点型	无变化	np.nan
object 对象类型	无变化	None 或 np.nan
integer 整数类型	强制转换为 float64	np.nan
boolean 布尔类型	强制转换为 object	None 或 np.nan

whether there is NaN/delete NaN

.isnull():

data = pd.Series([1, np.nan, 'hello', None]) 
data.isnull()
'''
Out: 
 0 False 
 1 True 
 2 False 
 3 True 
 dtype: bool
'''

.notnull():

data[data.notnull()] 
'''
Out: 
 0 1 
 2 hello 
 dtype: object
'''

.dropna(): (it will drop the whole index in DataFrame)

data.dropna() 
'''
Out: 
 0 1 
 2 hello 
 dtype: object
'''

change the direction it delete in DataFrame:

df[3] = np.nan 
df 
'''''
Out:
    0   1  2  3 
 0 1.0 NaN 2 NaN 
 1 2.0 3.0 5 NaN 
 2 NaN 4.0 6 NaN 
'''

df.dropna(axis='columns', how='all') 
'''
Out:
    0   1  2 
 0 1.0 NaN 2 
 1 2.0 3.0 5 
 2 NaN 4.0 6
'''

df.dropna(axis='rows', thresh=3) 
'''
Out: 
     0 1   2 3 
 1 2.0 3.0 5 NaN
'''

fill the NaN

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde')) 
data 
'''
Out: 
 a 1.0 
 b NaN 
 c 2.0 
 d NaN 
 e 3.0 
 dtype: float64
'''

.fillna():

data.fillna(0) 
'''
Out: 
 a 1.0 
 b 0.0 
 c 2.0 
 d 0.0 
 e 3.0 
 dtype: float64
'''

data.fillna(method='ffill') 
'''
Out:
 a 1.0 
 b 1.0 
 c 2.0 
 d 2.0 
 e 3.0 
 dtype: float64
'''

data.fillna(method='bfill') 
'''
Out: a 1.0 
 b 2.0 
 c 2.0 
 d 3.0 
 e 3.0 
 dtype: float64
'''

It’s similar in DataFrame:

df 
'''
Out:
     0  1  2  3 
 0 1.0 NaN 2 NaN 
 1 2.0 3.0 5 NaN 
 2 NaN 4.0 6 NaN 
'''

df.fillna(method='ffill', axis=1) 
'''
Out: 
    0  1   2   3 
 0 1.0 1.0 2.0 2.0 
 1 2.0 3.0 5.0 5.0 
 2 NaN 4.0 6.0 6.0
'''

multilevel index

reindex

index = [('California', 2000), ('California', 2010), 
         ('New York', 2000), ('New York', 2010), 
         ('Texas', 2000), ('Texas', 2010)] 
populations = [33871648, 37253956, 
               18976457, 19378102, 
               20851820, 25145561] 

pop = pd.Series(populations, index=index)
index = pd.MultiIndex.from_tuples(index) 
index 
#Out: MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]], 
#                labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

pop = pop.reindex(index)
pop['California', 2000] 
#Out: 33871648

unstack and stack

pop_df = pop.unstack() 
pop_df 
'''
Out: 
             2000    2010 
California 33871648 37253956 
New York   18976457 19378102 
Texas      20851820 25145561
'''

pop_df.stack() 
'''
Out: 
California 2000 33871648 
 		   2010 37253956 
New York   2000 18976457 
 		   2010 19378102 
Texas 	   2000 20851820 
 		   2010 25145561 
dtype: int64

create

df = pd.DataFrame(np.random.rand(4, 2), 
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], 
columns=['data1', 'data2']) 
df 
'''
Out: 
    data1       data2 
 a 1 0.554233   0.356072 
   2 0.925244   0.219474 
 b 1 0.441759   0.610054 
   2 0.171495   0.886688
'''

data = {('California', 2000): 33871648, 
 ('California', 2010): 37253956, 
 ('Texas', 2000): 20851820, 
 ('Texas', 2010): 25145561, 
 ('New York', 2000): 18976457, 
 ('New York', 2010): 19378102} 
pd.Series(data) 
'''
Out: 
California 2000 33871648 
 		   2010 37253956 
New York   2000 18976457 
 		   2010 19378102 
Texas 	   2000 20851820 
  		   2010 25145561 
dtype: int64
'''

special ways:

pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]]) 
#Out: MultiIndex(levels=[['a', 'b'], [1, 2]], 
#                labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)]) 
#Out: MultiIndex(levels=[['a', 'b'], [1, 2]], 
#                labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

pd.MultiIndex.from_product([['a', 'b'], [1, 2]]) 
#Out: MultiIndex(levels=[['a', 'b'], [1, 2]], 
#                labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

pd.MultiIndex(levels=[['a', 'b'], [1, 2]], 
              labels=[[0, 0, 1, 1], [0, 1, 0, 1]]) 
#Out: MultiIndex(levels=[['a', 'b'], [1, 2]], 
#                labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

*在创建 Series 或 DataFrame 时，可以将这些对象作为 index 参数，或者通过 reindex 方法更新 Series 或 DataFrame 的索引。

#create names for indexes
pop.index.names = ['state', 'year'] 
pop
'''
Out: 
	state   year 
 California 2000 33871648 
 			2010 37253956 
 New York   2000 18976457 
 			2010 19378102 
 Texas      2000 20851820 
 			2010 25145561 
 dtype: int64
'''

more than one multilevel index

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],names=['year', 'visit']) 
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], 
                                     names=['subject', 'type']) 
# 模拟数据
data = np.round(np.random.randn(4, 6), 1) #取小数点后一位
data[:, ::2] *= 10  #make the data more different 所有行，0、2、4列扩大10倍
data += 37 
# 创建DataFrame 
health_data = pd.DataFrame(data, index=index, columns=columns) 
health_data 
'''
Out: 
 subject    Bob       Guido     Sue 
 type       HR  Temp  HR  Temp  HR  Temp 
 year visit 
 2013 1     31.0 38.7 32.0 36.7 35.0 37.2 
      2 	44.0 37.7 50.0 35.0 29.0 36.7 
 2014 1 	30.0 37.4 39.0 37.8 61.0 36.9 
      2 	47.0 37.8 48.0 37.3 51.0 36.5
'''

health_data['Bob']['HR']#now it is series
'''
out:
year  visit  
2013  1     31.0
      2     44.0
2014  1     30.0
      2     47.0
'''

get the value

pop 
'''
Out: 
state year 
California 2000 33871648 
 		   2010 37253956 
New York   2000 18976457 
 		   2010 19378102 
Texas      2000 20851820 
           2010 25145561 
dtype: int64
'''

get only one:

1 2	pop['California', 2000] #Out: 33871648

partial indexing:

pop['California'] 
'''
Out: 
year 
2000 33871648 
2010 37253956 
dtype: int64
'''

slice

pop.loc['California':'New York'] 
'''
Out: 
state year 
California 2000 33871648 
 		   2010 37253956 
New York   2000 18976457 
 		   2010 19378102 dtype: int64
'''

#get the data in 2000
pop[:, 2000] 
'''
Out: 
state 
California  33871648 
New York    18976457 
Texas 		20851820 
dtype: int64
'''

mask

pop[pop > 22000000] 
'''
Out: 
state      year 
California 2000 33871648 
 		   2010 37253956 
Texas      2010 25145561 dtype: int64
'''

pop[['California', 'Texas']] 
'''
Out: 
state      year 
California 2000 33871648 
 		   2010 37253956 
Texas      2000 20851820 
           2010 25145561 
dtype: int64
'''

DataFrame

health_data 
'''
Out: 
 subject    Bob       Guido     Sue 
 type       HR  Temp  HR  Temp  HR  Temp 
 year visit 
 2013 1     31.0 38.7 32.0 36.7 35.0 37.2 
      2 	44.0 37.7 50.0 35.0 29.0 36.7 
 2014 1 	30.0 37.4 39.0 37.8 61.0 36.9 
      2 	47.0 37.8 48.0 37.3 51.0 36.5
'''

#the same method in series can be used
health_data['Guido', 'HR'] 
'''
Out: 
year visit 
2013 1 		32.0 
 	 2 		50.0 
2014 1 		39.0 
 	 2 		48.0 
Name: (Guido, HR), dtype: float64
'''

loc

health_data.iloc[:2, :2] #隐式索引
'''
Out: 
subject    Bob 
type       HR   Temp 
year visit 
2013  1    31.0 38.7 
      2    44.0 37.7
'''

#more than one indexes can be used
health_data.loc[:, ('Bob', 'HR')] 
'''
Out: 
year visit 
2013   1   31.0 
       2   44.0 
2014   1   30.0 
       2   47.0 
Name: (Bob, HR), dtype: float64
'''

IndexSlice

idx = pd.IndexSlice 
health_data.loc[idx[:, 1], idx[:, 'HR']] 
'''
Out: 
subject     Bob Guido Sue 
type        HR  HR    HR 
year visit 
2013 1     31.0 32.0 35.0 
2014 1     30.0 39.0 61.0
'''

index/column exchange

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]]) 
data = pd.Series(np.random.rand(6), index=index) 
data.index.names = ['char', 'int'] 
data 
'''
Out: 
  char int 
 a 1 0.003001 
 2 0.164974 
 c 1 0.741650 
 2 0.569264 
 b 1 0.001693 
 2 0.526226 
 dtype: float64
'''

#code will go wrong if the DataFrame is not sorted
try: 
    data['a':'b'] 
except KeyError as e: 
    print(type(e)) 
    print(e) 
#<class 'KeyError'> 
#'Key length (1) was greater than MultiIndex lexsort depth (0)'

sort_index()

data = data.sort_index() 
data 
'''
Out: 
char int 
 a    1 0.003001 
      2 0.164974 
 b    1 0.001693 
      2 0.526226 
 c    1 0.741650 
      2 0.569264 
 dtype: float64
'''

data['a':'b'] 
'''
Out: 
char int 
 a    1 0.003001 
      2 0.164974 
 b    1 0.001693 
      2 0.526226 
 dtype: float64
'''

stack and unstack

pop.unstack(level=0) 
'''
Out: 
 state California New York Texas 
 year 
 2000 33871648 18976457 20851820 
 2010 37253956 19378102 25145561 
'''

pop.unstack(level=1) 
'''
Out: 
        year  2000   2010 
 state 
 California 33871648 37253956 
 New York   18976457 19378102 
 Texas      20851820 25145561
'''


pop.unstack().stack() 
'''
Out: 
 state      year 
 California 2000 33871648 
            2010 37253956 
 New York   2000 18976457 
            2010 19378102 
 Texas      2000 20851820 
            2010 25145561 
 dtype: int64
'''

set and reset

pop_flat = pop.reset_index(name='population') 
pop_flat 
'''
Out:
    state     year population 
 0 California 2000 33871648 
 1 California 2010 37253956 
 2 New York   2000 18976457 
 3 New York   2010 19378102 
 4 Texas      2000 20851820 
 5 Texas      2010 25145561
'''

pop_flat.set_index(['state', 'year']) 
'''
Out: 
                 population 
 state      year 
 California 2000 33871648 
            2010 37253956 
 New York   2000 18976457 
            2010 19378102 
 Texas      2000 20851820 
            2010 25145561
'''

accumulate

health_data 
'''
Out: 
 subject   Bob     Guido   Sue 
 type      HR Temp HR Temp HR Temp 
 year visit 
 2013 1    31.0 38.7 32.0 36.7 35.0 37.2 
      2    44.0 37.7 50.0 35.0 29.0 36.7 
 2014 1    30.0 37.4 39.0 37.8 61.0 36.9 
      2    47.0 37.8 48.0 37.3 51.0 36.5
'''

mean:

data_mean = health_data.mean(level='year') 
data_mean 
'''
Out: 
 subject Bob       Guido      Sue 
 type    HR Temp   HR Temp    HR Temp 
 year 
 2013    37.5 38.2 41.0 35.85 32.0 36.95 
 2014    38.5 37.6 43.5 37.55 56.0 36.70
'''

data_mean.mean(axis=1, level='type') 
'''
Out: 
 type HR        Temp 
 year 
 2013 36.833333 37.000000 
 2014 46.000000 37.283333
'''