Pandas Introduction 1
Series
First, import pandas (and NumPy, which a later example uses):
import pandas as pd
import numpy as np
Create a Series:
pd.Series(data, index=index)
data = pd.Series([0.25, 0.5, 0.75, 1.0])
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
pd.Series(5, index=[100, 200, 300])
100 5
200 5
300 5
dtype: int64
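The data can also be a dictionary, in which case the keys become the index. Older pandas versions sort the keys, as in the output below, while recent versions keep the dictionary's insertion order:
pd.Series({2:'a', 1:'b', 3:'c'})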
1 b
2 a
3 c
dtype: object
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
3 c
2 a
dtype: object
index
We can inspect the values (a NumPy array) and the index:
data.values
Elements can also be accessed with Python-style square brackets:
data[1]
The index can be set explicitly:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
The index labels can be whatever you like:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
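The index arguments of the two snippets above are not shown, so the following is only a sketch with assumed labels: the string labels 'a' to 'd' and the integer labels 2, 5, 3, 7 are hypothetical choices, not necessarily the ones the post used.

```python
import pandas as pd

# Hypothetical string labels (assumed for illustration).
labelled = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Hypothetical non-contiguous, unordered integer labels (assumed).
sparse = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# Square brackets look values up by label, not by position.
value_b = labelled['b']
value_5 = sparse[5]
print(value_b, value_5)
```

Note that sparse[5] looks up the label 5, not the sixth element; the Series has only four elements.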
A Series can also be thought of as a specialized dictionary that maps index labels to values:
population_dict = {'California': 38332521,
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
dtype: int64
It supports dictionary-style item access and, unlike a dictionary, slicing by label:
population['California']
California 38332521
Florida 19552860
Illinois 12882135
dtype: int64
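To make the dictionary analogy concrete, here is a small sketch that rebuilds the population Series from the values printed above and then selects from it. The keys are written in alphabetical order so the result matches the output shown regardless of pandas version.

```python
import pandas as pd

population = pd.Series({
    'California': 38332521,
    'Florida': 19552860,
    'Illinois': 12882135,
    'New York': 19651127,
    'Texas': 26448193,
})

# Dictionary-style lookup by key.
california = population['California']

# Label-based slicing; unlike Python list slicing, the end label is included.
subset = population['California':'Illinois']

print(california)
print(subset)
```

The inclusive end point is the main difference from ordinary positional slicing.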
DataFrame
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
dtype: int64
Combine it with the earlier population Series:
states = pd.DataFrame({'population': population,
area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193
index and columns
states.index
A DataFrame can also be thought of as a dictionary that maps column names to Series:
states['area']
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
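The steps above can be sketched end to end as follows, using the figures printed in the outputs. The area Series is deliberately written in a different key order here (my own choice) to show that pandas matches rows by label rather than by position.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Florida': 19552860,
                        'Illinois': 12882135, 'New York': 19651127,
                        'Texas': 26448193})

# Same states, different order: rows are aligned on the index labels.
area = pd.Series({'Texas': 695662, 'New York': 141297, 'California': 423967,
                  'Illinois': 149995, 'Florida': 170312})

states = pd.DataFrame({'population': population, 'area': area})

print(states.index)    # row labels
print(states.columns)  # column labels
print(states['area'])  # dictionary-style access returns a Series
```

Because alignment is by label, each state ends up with its own area and population even though the two inputs were ordered differently.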
Create a DataFrame from a single Series:
pd.DataFrame(population, columns=['population'])
population
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
data = [{'a': i, 'b': 2 * i}
a b
0 0 0
1 1 2
2 2 4
When the dictionaries do not all share the same keys, pandas fills the missing entries with NaN:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
a b c
0 1.0 2 NaN
1 NaN 3 4.0
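A short sketch of what the NaN filling implies in practice: columns that receive a NaN are stored as floats, which is why the output above prints 1.0 and 4.0 while column b stays an integer. The name df is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Which entries are missing?
missing = df.isna()

# Columns containing NaN are upcast to float64; complete columns keep int64.
print(df.dtypes)
print(missing)
```

isna() returns a boolean DataFrame of the same shape, which is a convenient starting point for counting or filling the gaps.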
A DataFrame can also be built from a dictionary of Series:
pd.DataFrame({'population': population, 'area': area})
area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193
A two-dimensional NumPy array can also be turned into a DataFrame:
pd.DataFrame(np.random.rand(3, 2),
foo bar
a 0.865257 0.213169
b 0.442759 0.108267
c 0.047110 0.905718
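The rest of the call above is cut off, so here is a hedged sketch of one way to write it. The column names foo, bar and the row labels a, b, c are taken from the printed output; the seeded generator is my own addition so the example is reproducible, and its numbers will differ from those shown above.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption, not in the original) for repeatable values.
rng = np.random.default_rng(0)
values = rng.random((3, 2))

frame = pd.DataFrame(values,
                     columns=['foo', 'bar'],
                     index=['a', 'b', 'c'])
print(frame)
```

If columns and index are omitted, pandas falls back to integer labels 0, 1, 2, and so on for both.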
The Index object
ind = pd.Index([2, 3, 5, 7, 11])
Set operations on two Index objects
indA = pd.Index([1, 3, 5, 7, 9])
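Only the first line of this snippet survives, so the following is a sketch of how two Index objects can be combined. indA is from the post; indB is an assumed second index (it reuses the values of ind above), and the method-based spelling is my choice because it behaves the same across pandas versions.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])  # assumed second index, for illustration

common = indA.intersection(indB)           # labels present in both
combined = indA.union(indB)                # labels present in either
only_one = indA.symmetric_difference(indB) # labels present in exactly one

print(common.tolist(), combined.tolist(), only_one.tolist())

# An Index is immutable: item assignment raises TypeError.
try:
    indA[1] = 0
    mutated = True
except TypeError:
    mutated = False
```

Immutability is what makes it safe to share one Index between several Series and DataFrames.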
select
series
data = pd.Series([0.25, 0.5, 0.75, 1.0],
A Series can also be used like a one-dimensional array:
# slicing by explicit index
loc, iloc and ix
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
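With an integer index like this one, plain indexing such as data[1] uses the explicit labels, while slicing such as data[1:3] uses implicit positions. That is hard to remember, so we use loc:
loc always selects by the explicit index labels you set.
data.loc[1]
In contrast, iloc always selects by implicit integer position (0, 1, 2, and so on).
data.iloc[1]
ix was a hybrid of the two. It was deprecated in pandas 0.20 and removed in pandas 1.0, so prefer loc and iloc; it is mentioned again later.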
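The difference between loc and iloc is easiest to see side by side on the Series defined above. This is a minimal sketch; the variable names on the left are only illustrative.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1]       # label 1     -> first element
by_position = data.iloc[1]   # position 1  -> second element

# loc slices by label and includes the end label.
label_slice = data.loc[1:3]
# iloc slices by position and excludes the end position.
position_slice = data.iloc[1:3]

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

The same numbers 1:3 pick different elements depending on the accessor, which is exactly why being explicit helps.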
DataFrame
considered as dictionary
area = pd.Series({'California': 423967, 'Texas': 695662,
area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193
attribute-style
data['area']
area
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
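Both forms return the same column:
data.area is data['area']
However, the attribute form data.area does not work in every case: the column name must be a valid Python variable name, and it must not clash with an existing DataFrame method or attribute.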
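A concrete case of the clash just described, sketched with the column names area and pop that the later snippets use: DataFrame already has a pop() method, so data.pop refers to that method, not to the column.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 170312, 149995],
     'pop': [38332521, 19552860, 12882135]},
    index=['California', 'Florida', 'Illinois'],
)

# Attribute access works for 'area' and gives the same values as data['area'].
same_values = data.area.equals(data['area'])

# 'pop' collides with the DataFrame.pop method, so the attribute is a method...
attribute_is_method = callable(data.pop)
# ...while dictionary-style access still returns the column.
pop_column = data['pop']

print(same_values, attribute_is_method)
```

For this reason dictionary-style access is the safer default, especially when assigning to a column.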
calculate a new column, dictionary-style
data['density'] = data['pop'] / data['area']
transpose
data.T
California Florida Illinois New York Texas
area 4.239670e+05 1.703120e+05 1.499950e+05 1.412970e+05 6.956620e+05
pop 3.833252e+07 1.955286e+07 1.288214e+07 1.965113e+07 2.644819e+07
density 9.041393e+01 1.148061e+02 8.588376e+01 1.390767e+02 3.801874e+01
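The two steps above (adding a derived column, then transposing) can be reproduced with the figures from the earlier outputs. This is a sketch that assumes the columns are named area and pop, as in the snippets.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'],
)

# Element-wise division of two columns, stored as a new column.
data['density'] = data['pop'] / data['area']

# Transpose: rows become columns and columns become rows.
transposed = data.T

print(data)
print(transposed.shape)
```

After transposing, every column mixes integers and floats, so pandas stores them all as floats; that is why the output above switches to scientific notation.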
Get a whole row or column:
data.values[0]
area
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
loc, iloc and ix
These work the same way as for a Series:
data.iloc[:3, :2]
area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
data.loc[:'Illinois', :'pop']
area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
# ix can mix both styles, but readers may not be able to tell which one is meant (ix was removed in pandas 1.0)
area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
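Since ix no longer exists in current pandas, a mixed selection (rows by position, columns by label) has to be written with loc or iloc. The two spellings below are one possible way to do it, not the post's own code; both should return the same three rows and two columns as the outputs above.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'],
)
data['density'] = data['pop'] / data['area']

by_position = data.iloc[:3, :2]
by_label = data.loc[:'Illinois', :'pop']

# Mixed: first three rows by position, columns up to 'pop' by label.
# Option 1: turn positions into labels, then use loc.
mixed_loc = data.loc[data.index[:3], :'pop']
# Option 2: turn the column label into a position, then use iloc.
mixed_iloc = data.iloc[:3, :data.columns.get_loc('pop') + 1]

print(mixed_loc)
```

Either way the reader can tell at a glance whether a given argument is a label or a position.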
advanced selection: a boolean mask combined with a list of columns
data.loc[data.density > 100, ['pop', 'density']]
pop density
Florida 19552860 114.806121
New York 19651127 139.076746
change a value
data.iloc[0, 2] = 90
area pop density
California 423967 38332521 90.000000
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
New York 141297 19651127 139.076746
Texas 695662 26448193 38.018740
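The assignment above addresses the cell by position (row 0, column 2). As a small sketch, the same cell can also be addressed by label with loc, which reads more clearly when the column order might change. The second DataFrame exists only to compare the two spellings.

```python
import pandas as pd

def make_data():
    # Rebuild the table from the figures shown above.
    df = pd.DataFrame(
        {'area': [423967, 170312, 149995, 141297, 695662],
         'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
        index=['California', 'Florida', 'Illinois', 'New York', 'Texas'],
    )
    df['density'] = df['pop'] / df['area']
    return df

by_position = make_data()
by_position.iloc[0, 2] = 90               # row 0, column 2

by_label = make_data()
by_label.loc['California', 'density'] = 90  # same cell, addressed by label

print(by_position.loc['California', 'density'])
```

Both calls modify the DataFrame in place and leave every other cell untouched.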
other ways
# slice
area pop density
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
# use the index
area pop density
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
# mask
area pop density
Florida 170312 19552860 114.806121
New York 141297 19651127 139.076746
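The code for these three shortcuts is cut off after the comments, so the following is a sketch of expressions that would produce the outputs shown, not necessarily the exact ones the post used. With plain square brackets, a slice or a boolean mask selects rows, whereas a single name selects a column.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'],
)
data['density'] = data['pop'] / data['area']

# Slice rows by label (end label included).
label_rows = data['Florida':'Illinois']

# Slice rows by position (end position excluded).
position_rows = data[1:3]

# Boolean mask: keep rows where the condition is True.
dense_rows = data[data.density > 100]

print(label_rows)
print(dense_rows)
```

These shortcuts are convenient interactively, while loc and iloc state the intent more explicitly in longer code.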