首页>文档>Python>【Python 入门(四)】 pandas 库

【Python 入门(四)】 pandas 库

pandas 是基于 Numpy 构建的一个,让以 Numpy 为中心的应用变得更加简单。

pandas 主要包括三类数据结构,分别是:

Series:一维数组,与 Numpy 中的一维 array 类似。二者与 Python 基本的数据结构 List 也很相近,其区别是:List 中的元素可以是不同的数据类型,而 Array 和 Series 中则只允许存储相同的数据类型,这样可以更有效的使用内存,提高运算效率。

DataFrame:二维的表格型数据结构。很多功能与 R 中的 data.frame 类似。可以将 DataFrame 理解为 Series 的容器。以下的内容主要以 DataFrame 为主。

Panel :三维的数组,可以理解为 DataFrame 的容器。

Pandas 官网,更多功能请参考http://pandas-docs.github.io/pandas-docs-travis/index.html

In [2]: # 首先导入库
        import pandas as pd
        import numpy as np

#Series

由一组数据(各种 Numpy 数据类型),以及一组与之相关的标签数据(即索引)组成。仅由一组数据即可产生最简单的 Series,可以通过传递一个 list 对象来创建一个 Series

In [3]: s = pd.Series([1,3,5,np.nan,6,8])
        s
Out[3]: 0    1.0
        1    3.0
        2    5.0
        3    NaN
        4    6.0
        5    8.0
        dtype: float64

获取 Series 的索引:

In [4]: s.index
Out[4]: RangeIndex(start=0, stop=6, step=1)

#DataFrame

DataFrame 是一个表格型的数据结构,它含有一组有序的列,每一列的数据结构都是相同的,而不同的列之间则可以是不同的数据结构(数值、字符、布尔值等)。或者以数据库进行类比,DataFrame 中的每一行是一个记录,名称为 Index 的一个元素,而每一列则为一个字段,是这个记录的一个属性。DataFrame 既有行索引也有列索引,可以被看做由 Series 组成的字典(共用同一个索引)。

创建一个 DataFrame,包括一个 numpy array, 时间索引和列名字:

In [5]: dates = pd.date_range('20130101', periods=6)
        dates
Out[5]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                       '2013-01-05', '2013-01-06'],
                       dtype='datetime64[ns]', freq='D'
In [6]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
        df
Out[6]: 	        A	        B	        C	        D
        2013-01-01	0.190720	0.114154	-0.644789	-0.672361
        2013-01-02	0.370210	-0.938678	-0.362772	-0.769552
        2013-01-03	-0.677601	0.838778	-0.222894	0.300474
        2013-01-04	0.160924	-0.926747	1.396041	-0.556695
        2013-01-05	0.466604	-0.380008	-0.616786	-0.517109
        2013-01-06	-0.960533	0.565752	0.234522	0.892747

通过传递一个能够被转换成类似序列结构的字典对象来创建一个 DataFrame:

In [7]: df2 = pd.DataFrame({'A':1.0,
                            'B':pd.Timestamp('20130102'),
                            'C':pd.Series(1, index=list(range(4)), dtype='float32'),
                            'D':np.array([3] * 4, dtype='int32'),
                            'E':pd.Categorical(['test', 'train', 'test', 'train']),
                            'F':'foo'
                            })
        df2
Out[7]: 	A	B	        C	D	E	F
        0	1.0	2013-01-02	1.0	3	test	foo
        1	1.0	2013-01-02	1.0	3	train	foo
        2	1.0	2013-01-02	1.0	3	test	foo
        3	1.0	2013-01-02	1.0	3	train	foo

查看不同列的数据类型:

In [8]: df2.dtypes
Out[8]: A           float64
        B    datetime64[ns]
        C           float32
        D             int32
        E          category
        F            object
        dtype: object

使用 Tab 自动补全功能会自动识别所有的属性以及自定义的列

#查看数据

我们以平台获取的数据为例进行讲解:

In [9]: # 获取平安银行近几个工作日的开盘价、最高价、最低价、收盘价。
        df = get_price('000001.XSHE', '2017-06-01', '2017-06-14', '1d', ['open', 'high', 'low', 'close'])
        df
Out[9]: 	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	8.9985	9.0379	8.8605	8.8999
        2017-06-06	8.8802	8.9295	8.8605	8.9098
        2017-06-07	8.8901	9.0182	8.8802	8.9985
        2017-06-08	8.9788	9.0182	8.9492	8.9985
        2017-06-09	9.0182	9.0872	8.9886	9.0182
        2017-06-12	9.0182	9.0576	8.9689	8.9788
        2017-06-13	8.9788	9.0084	8.9197	8.9886
        2017-06-14	8.9886	8.9985	8.9098	8.9492

查看前几条数据:

In [10]:df.head()
Out[10]: 	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	8.9985	9.0379	8.8605	8.8999
        2017-06-06	8.8802	8.9295	8.8605	8.9098
        2017-06-07	8.8901	9.0182	8.8802	8.9985

查看后几条数据:

In [11]:df.tail()
Out[11]: 	        open	high	low	close
        2017-06-08	8.9788	9.0182	8.9492	8.9985
        2017-06-09	9.0182	9.0872	8.9886	9.0182
        2017-06-12	9.0182	9.0576	8.9689	8.9788
        2017-06-13	8.9788	9.0084	8.9197	8.9886
        2017-06-14	8.9886	8.9985	8.9098	8.9492

显示索引、列和底层的 numpy 数据:

In [12]:df.index
Out[12]:DatetimeIndex(['2017-06-01', '2017-06-02', '2017-06-05', '2017-06-06',
                       '2017-06-07', '2017-06-08', '2017-06-09', '2017-06-12',
                       '2017-06-13', '2017-06-14'],
                       dtype='datetime64[ns]', freq=None)

查看 DataFrame 的列名

In [13]:df.columns
Out[13]:Index(['open', 'high', 'low', 'close'], dtype='object')

查看 DataFrame 的值

In [14]:df.values
Out[14]:array([[ 9.0675,  9.0971,  8.9886,  9.0576],
               [ 9.0478,  9.1562,  9.0084,  9.0379],
               [ 8.9985,  9.0379,  8.8605,  8.8999],
               [ 8.8802,  8.9295,  8.8605,  8.9098],
               [ 8.8901,  9.0182,  8.8802,  8.9985],
               [ 8.9788,  9.0182,  8.9492,  8.9985],
               [ 9.0182,  9.0872,  8.9886,  9.0182],
               [ 9.0182,  9.0576,  8.9689,  8.9788],
               [ 8.9788,  9.0084,  8.9197,  8.9886],
               [ 8.9886,  8.9985,  8.9098,  8.9492]])

describe()函数对于数据的快速统计汇总:

In [15]:df.describe()
Out[15]:	open	        high	        low	        close
        count	10.000000	10.00000	10.000000	10.000000
        mean	8.986670	9.04088	        8.933440	8.983700
        std	0.060725	0.06252	        0.055209	0.051369
        min	8.880200	8.92950	        8.860500	8.899900
        25%	8.978800	9.01085	        8.887600	8.956600
        50%	8.993550	9.02805	        8.934450	8.993550
        75%	9.018200	9.07980	        8.983675	9.013275
        max	9.067500	9.15620	        9.008400	9.057600

对数据的转置(tranverse):

In [16]:df.T
Out[16]:        2017-06-01 2017-06-02 2017-06-05 2017-06-06 2017-06-07 2017-06-08 2017-06-09 2017-06-12 2017-06-13 2017-06-14
        open	9.0675	    9.0478	8.9985	    8.8802	8.8901	8.9788	    9.0182	9.0182	    8.9788	8.9886
        high	9.0971	    9.1562	9.0379	    8.9295	9.0182	9.0182	    9.0872	9.0576	    9.0084	8.9985
        low	8.9886	    9.0084	8.8605	    8.8605	8.8802	8.9492	    8.9886	8.9689	    8.9197	8.9098
        close	9.0576	    9.0379	8.8999	    8.9098	8.9985	8.9985	    9.0182	8.9788	    8.9886	8.9492

按轴进行排序

In [17]:df.sort_index(axis=1, ascending=False)
Out[17]:	        open	low	high	close
        2017-06-01	9.0675	8.9886	9.0971	9.0576
        2017-06-02	9.0478	9.0084	9.1562	9.0379
        2017-06-05	8.9985	8.8605	9.0379	8.8999
        2017-06-06	8.8802	8.8605	8.9295	8.9098
        2017-06-07	8.8901	8.8802	9.0182	8.9985
        2017-06-08	8.9788	8.9492	9.0182	8.9985
        2017-06-09	9.0182	8.9886	9.0872	9.0182
        2017-06-12	9.0182	8.9689	9.0576	8.9788
        2017-06-13	8.9788	8.9197	9.0084	8.9886
        2017-06-14	8.9886	8.9098	8.9985	8.9492

按值进行排序

In [18]:df.sort_value(by='open', ascending=False)
Out[18]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-09	9.0182	9.0872	8.9886	9.0182
        2017-06-12	9.0182	9.0576	8.9689	8.9788
        2017-06-05	8.9985	9.0379	8.8605	8.8999
        2017-06-14	8.9886	8.9985	8.9098	8.9492
        2017-06-08	8.9788	9.0182	8.9492	8.9985
        2017-06-13	8.9788	9.0084	8.9197	8.9886
        2017-06-07	8.8901	9.0182	8.8802	8.9985
        2017-06-06	8.8802	8.9295	8.8605	8.9098

#选择数据

通过下标选取数据

df[‘open’],df.open 以上两个语句是等效的,都是返回 df 名称为 open 列的数据,返回的为一个 Series。 df[0:3], df[‘2017-06-01′:’2017-06-05’] 下标索引选取的是 DataFrame 的记录,与 List 相同 DataFrame 的下标也是从 0 开始,区间索引的话,为一个左闭右开的区间,即[0:3]选取的为 0-2 三条记录。 与此等价,还可以用起始的索引名称和结束索引名称选取数据,如:df[‘a’:’b’]。有一点需要注意的是使用起始索引名称和结束索引名称时,也会包含结束索引的数据。具体看下方示例: 以上两种方式返回的都是 DataFrame。

选择一列数据:

In [19]:df['open']
Out[19]:2017-06-01    9.0675
        2017-06-02    9.0478
        2017-06-05    8.9985
        2017-06-06    8.8802
        2017-06-07    8.8901
        2017-06-08    8.9788
        2017-06-09    9.0182
        2017-06-12    9.0182
        2017-06-13    8.9788
        2017-06-14    8.9886
        Name: open, dtype: float64

选择多列数据:

In [20]:df['open', 'close']
Out[20]:	        open	close
        2017-06-01	9.0675	9.0576
        2017-06-02	9.0478	9.0379
        2017-06-05	8.9985	8.8999
        2017-06-06	8.8802	8.9098
        2017-06-07	8.8901	8.9985
        2017-06-08	8.9788	8.9985
        2017-06-09	9.0182	9.0182
        2017-06-12	9.0182	8.9788
        2017-06-13	8.9788	8.9886
        2017-06-14	8.9886	8.9492

选择多行:

In [21]:df[0:3]
Out[21]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	8.9985	9.0379	8.8605	8.8999
In [22]:df['2017-06-01':'2017-06-05']
Out[22]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	8.9985	9.0379	8.8605	8.8999

使用标签选取数据:
df.loc[行标签,列标签]

df.loc[‘a’:’b’] #选取 ab 两行数据

df.loc[:,’open’] #选取 open 列的数据

df.loc 的第一个参数是行标签,第二个参数为列标签(可选参数,默认为所有列标签),两个参数既可以是列表也可以是单个字符,如果两个参数都为列表则返回的是 DataFrame,否则,则为 Series。

PS:loc 为 location 的缩写。

In [23]:df.loc['2017-06-01', 'open']
Out[23]:9.0675000000000008
In [24]:df['2017-06-01':'2017-06-06']
Out[24]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	8.9985	9.0379	8.8605	8.8999
        2017-06-06	8.8802	8.9295	8.8605	8.9098
In [25]:df.loc[:, 'open']
Out[25]:2017-06-01    9.0675
        2017-06-02    9.0478
        2017-06-05    8.9985
        2017-06-06    8.8802
        2017-06-07    8.8901
        2017-06-08    8.9788
        2017-06-09    9.0182
        2017-06-12    9.0182
        2017-06-13    8.9788
        2017-06-14    8.9886
        Name: open, dtype: float64
In [26]:df.loc['2017-06-01':'2017-06-06','open']
Out[26]:2017-06-01    9.0675
        2017-06-02    9.0478
        2017-06-05    8.9985
        2017-06-06    8.8802
        Name: open, dtype: float64

使用位置选取数据:
df.iloc[行位置,列位置]

df.iloc[1,1] #选取第二行,第二列的值,返回的为单个值

df.iloc[[0,2],:] #选取第一行及第三行的数据

df.iloc[0:2,:] #选取第一行到第三行(不包含)的数据

df.iloc[:,1] #选取所有记录的第二列的值,返回的为一个 Series

df.iloc[1,:] #选取第一行数据,返回的为一个 Series

PS:iloc 则为 integer & location 的缩写

In [27]:df.iloc[1,1]    # 选取第二行,第二列的值,返回的为单个值
Out[27]:9.1562000000000001
In [28]:df.iloc[[0,2], :]    # 选取第一行及第三行的数据
Out[28]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-05	8.9985	9.0379	8.8605	8.8999
In [29]:df.iloc[0:2, :]    # 选取第一行到第三行(不包含)的数据
Out[29]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
In [30]:df.iloc[:, 1]    # 选取所有记录的第一列的值,返回的为一个Series
Out[30]:2017-06-01    9.0971
        2017-06-02    9.1562
        2017-06-05    9.0379
        2017-06-06    8.9295
        2017-06-07    9.0182
        2017-06-08    9.0182
        2017-06-09    9.0872
        2017-06-12    9.0576
        2017-06-13    9.0084
        2017-06-14    8.9985
        Name: high, dtype: float64
In [31]:df.iloc[1, :]    # 选取第一行数据,返回的为一个Series
Out[31]:open     9.0478
        high     9.1562
        low      9.0084
        close    9.0379
        Name: 2017-06-02 00:00:00, dtype: float64

更广义的切片方式是使用.ix,它自动根据给到的索引类型判断是使用位置还是标签进行切片

In [32]:df.ix[1,1]
Out[32]:9.1562000000000001
In [33]:df.ix['2017-06-01', '2017-06-05']
Out[33]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	8.9985	9.0379	8.8605	8.8999
In [34]:df.ix['2017-06-05','open']
Out[34]:8.9984999999999999
In [35]:df.ix[1, 'open']
Out[35]:9.0478000000000005
In [36]:df.ix['2017-06-01', 0]
Out[36]:9.0675000000000008

通过逻辑指针进行数据切片:
df[逻辑条件]

df[df.one >= 2] #单个逻辑条件

df[(df.one >=1 ) & (df.one < 3) ] #多个逻辑条件组合

In [37]:# 筛选出 open 大于8.8的数据
        df[df.open > 8.8]
Out[37]:	open	high	low	close
In [38]:# 筛选出 open 大于9.12的数据,并且 close 小于9.17的数据
        df[(df.open > 9.12) & (df.close < 9.17)]
Out[38]:	open	high	low	close

使用条件过来更改数据

In [39]:df[df>10]
Out[39]:	        open	high	low	close
        2017-06-01	NaN	NaN	NaN	NaN
        2017-06-02	NaN	NaN	NaN	NaN
        2017-06-05	NaN	NaN	NaN	NaN
        2017-06-06	NaN	NaN	NaN	NaN
        2017-06-07	NaN	NaN	NaN	NaN
        2017-06-08	NaN	NaN	NaN	NaN
        2017-06-09	NaN	NaN	NaN	NaN
        2017-06-12	NaN	NaN	NaN	NaN
        2017-06-13	NaN	NaN	NaN	NaN
        2017-06-14	NaN	NaN	NaN	NaN

观察可以发现,df 中小于 9 的数都变为 NaN。

下面我们就把大于 9 的数赋值为 0.

In [40]:df[df > 9.12] = 0
In [41]:df
Out[41]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	0.0000	9.0084	9.0379
        2017-06-05	8.9985	9.0379	8.8605	8.8999
        2017-06-06	8.8802	8.9295	8.8605	8.9098
        2017-06-07	8.8901	9.0182	8.8802	8.9985
        2017-06-08	8.9788	9.0182	8.9492	8.9985
        2017-06-09	9.0182	9.0872	8.9886	9.0182
        2017-06-12	9.0182	9.0576	8.9689	8.9788
        2017-06-13	8.9788	9.0084	8.9197	8.9886
        2017-06-14	8.9886	8.9985	8.9098	8.9492

发现大于 9.12 的数都被替换为了 0.

使用 isin()方法来过滤在指定列中的数据:

In [42]:# 选取 high 列中数为0和9的数
        df[df['high'].isin([0.00,9.00])]
Out[42]:	        open	high	low	close
        2017-06-02	9.0478	0.0	9.0084	9.0379

#Panel

平台 get_price,如果是多支股票, 则返回 pandas.Panel 对象。

可通过 panel[列标,行标,股票代码] 获取数据.

In [43]:panel = get_price(['000001.XSHE', '000002.XSHE'], '2017-06-01', '2017-06-14', '1d', fields=['open', 'high', 'low', 'close'])
        panel
Out[43]:<class 'rqcommons.pandas_patch.HybridDataPanel'>
        Dimensions: 4 (items) x 10 (major_axis) x 2 (minor_axis)
        Items axis: open to close
        Major_axis axis: 2017-06-01 00:00:00 to 2017-06-14 00:00:00
        Minor_axis axis: 000001.XSHE to 000002.XSHE

由打印的结果可以看出:

列标(Items axis: close to open)

行标(Major_axis axis: 2017-06-01 00:00:00 to 2017-06-14 00:00:00)

股票代码(Minor_axis axis: 000001.XSHE to 000002.XSHE)

In [44]:# 取出'open'的数据
        panel['open', :, :]
Out[44]:	    000001.XSHE	000002.XSHE
        2017-06-01	9.0675	20.4058
        2017-06-02	9.0478	20.3864
        2017-06-05	8.9985	20.0673
        2017-06-06	8.8802	20.3574
        2017-06-07	8.8901	20.1833
        2017-06-08	8.9788	20.1833
        2017-06-09	9.0182	20.1833
        2017-06-12	9.0182	20.7733
        2017-06-13	8.9788	21.1891
        2017-06-14	8.9886	21.1794
In [45]:# 取出'2017-06-01'的数据
        panel[:, '2017-06-01', :]
Out[45]:                open	high	low	close
        000001.XSHE	9.0675	9.0971	8.9886	9.0576
        000002.XSHE	20.4058	20.6669	20.2413	20.3961
In [46]:# 取出000001的 DataFrame 数据
        panel[:, :, '000001.XSHE']
Out[46]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	8.9985	9.0379	8.8605	8.8999
        2017-06-06	8.8802	8.9295	8.8605	8.9098
        2017-06-07	8.8901	9.0182	8.8802	8.9985
        2017-06-08	8.9788	9.0182	8.9492	8.9985
        2017-06-09	9.0182	9.0872	8.9886	9.0182
        2017-06-12	9.0182	9.0576	8.9689	8.9788
        2017-06-13	8.9788	9.0084	8.9197	8.9886
        2017-06-14	8.9886	8.9985	8.9098	8.9492

Panel 操作与 DataFrame 基本相同,下节会为大家讲解 DataFrame 的数据处理与规整。

#缺失数据处理

去掉包含缺失值的行

In [47]:# 获取平安银行近几个工作日的开盘价、最高价、最低价、收盘价
        df = get_price('000001.XSHE', '2017-06-01', '2017-06-14', '1d', fields=['open', 'high', 'low', 'close'])
        df = df[df > 9.0]
        df
Out[47]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	NaN	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	NaN	9.0379	NaN	NaN
        2017-06-06	NaN	NaN	NaN	NaN
        2017-06-07	NaN	9.0182	NaN	NaN
        2017-06-08	NaN	9.0182	NaN	NaN
        2017-06-09	9.0182	9.0872	NaN	9.0182
        2017-06-12	9.0182	9.0576	NaN	NaN
        2017-06-13	NaN	9.0084	NaN	NaN
        2017-06-14	NaN	NaN	NaN	NaN
In [48]:df.dropna()
Out[48]:	        open	high	low	close
        2017-06-02	9.0478	9.1562	9.0084	9.0379

对缺失值进行填充

In [49]:df.fillna(value=0)
Out[49]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	0.0000	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	0.0000	9.0379	0.0000	0.0000
        2017-06-06	0.0000	0.0000	0.0000	0.0000
        2017-06-07	0.0000	9.0182	0.0000	0.0000
        2017-06-08	0.0000	9.0182	0.0000	0.0000
        2017-06-09	9.0182	9.0872	0.0000	9.0182
        2017-06-12	9.0182	9.0576	0.0000	0.0000
        2017-06-13	0.0000	9.0084	0.0000	0.0000
        2017-06-14	0.0000	0.0000	0.0000	0.0000

判断数据是否为 nan,并进行布尔填充

In [50]:df.isnull(df)
Out[50]:	        open	high	low	close
        2017-06-01	False	False	True	False
        2017-06-02	False	False	False	False
        2017-06-05	True	False	True	True
        2017-06-06	True	True	True	True
        2017-06-07	True	False	True	True
        2017-06-08	True	False	True	True
        2017-06-09	False	False	True	False
        2017-06-12	False	False	True	True
        2017-06-13	True	False	True	True
        2017-06-14	True	True	True	True

#函数的应用和映射

In [51]:df.mean()  # 列计算平均值
Out[51]:open     9.037925
        high     9.060100
        low      9.008400
        close    9.037900
        dtype: float64
In [52]:df.mean(1)  # 行计算平均值
Out[52]:2017-06-01    9.074067
        2017-06-02    9.062575
        2017-06-05    9.037900
        2017-06-06         NaN
        2017-06-07    9.018200
        2017-06-08    9.018200
        2017-06-09    9.041200
        2017-06-12    9.037900
        2017-06-13    9.008400
        2017-06-14         NaN
        dtype: float64
In [53]:df.mean(axis=1, skipna = False) # skipna参数默认是True表示排除缺失值
Out[53]:2017-06-01         NaN
        2017-06-02    9.062575
        2017-06-05         NaN
        2017-06-06         NaN
        2017-06-07         NaN
        2017-06-08         NaN
        2017-06-09         NaN
        2017-06-12         NaN
        2017-06-13         NaN
        2017-06-14         NaN
        dtype: float64

常用的方法如上所介绍们,还要其他许多,可自行学习,下面罗列了一些,可供参考:

  • count 非na值的数量
  • describe 针对Series或个DataFrame列计算汇总统计
  • min、max 计算最小值和最大值
  • argmin、argmax 计算能够获取到最大值和最小值得索引位置(整数)
  • idxmin、idxmax 计算能够获取到最大值和最小值得索引值
  • quantile 计算样本的分位数(0到1)
  • sum 值的总和
  • mean 值得平均数
  • median 值得算术中位数(50%分位数)
  • mad 根据平均值计算平均绝对离差
  • var 样本值的方差
  • std 样本值的标准差
  • skew 样本值得偏度(三阶矩)
  • kurt 样本值得峰度(四阶矩)
  • cumsum 样本值得累计和
  • cummin,cummax 样本值得累计最大值和累计最小值
  • cumprod 样本值得累计积
  • diff 计算一阶差分(对时间序列很有用)
  • pct_change 计算百分数变化

#数据规整

Pandas 提供了大量的方法能够轻松的对 Series,DataFrame 和 Panel 对象进行各种符合各种逻辑关系的合并操作

  • concat 可以沿一条轴将多个对象堆叠到一起。
  • append 将一行连接到一个DataFrame上
  • duplicated 移除重复数据

concat

In [54]:df1 = get_price('000001.XSHE', '2017-06-01', '2017-06-14', '1d', ['open', 'high', 'low', 'close'])
        df1
Out[54]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	8.9985	9.0379	8.8605	8.8999
        2017-06-06	8.8802	8.9295	8.8605	8.9098
        2017-06-07	8.8901	9.0182	8.8802	8.9985
        2017-06-08	8.9788	9.0182	8.9492	8.9985
        2017-06-09	9.0182	9.0872	8.9886	9.0182
        2017-06-12	9.0182	9.0576	8.9689	8.9788
        2017-06-13	8.9788	9.0084	8.9197	8.9886
        2017-06-14	8.9886	8.9985	8.9098	8.9492
In [55]:df2 = get_price('000001.XSHE', '2017-05-15', '2017-05-31', '1d', ['open', 'high', 'low', 'close'])
        df2
Out[55]:	        open	high	low	close
        2017-05-15	8.7620	8.8211	8.6831	8.7324
        2017-05-16	8.7127	8.7225	8.6043	8.7127
        2017-05-17	8.6831	8.6831	8.6240	8.6437
        2017-05-18	8.5944	8.6437	8.5846	8.6043
        2017-05-19	8.6141	8.6338	8.5550	8.5648
        2017-05-22	8.5550	8.6043	8.4860	8.5550
        2017-05-23	8.5451	8.7127	8.5156	8.6634
        2017-05-24	8.6535	8.7028	8.5550	8.6831
        2017-05-25	8.6634	9.0084	8.6535	8.9689
        2017-05-26	8.9492	8.9985	8.9098	8.9689
        2017-05-31	8.9689	9.0971	8.9295	9.0675
In [56]:pd.concat([df1,df2], axis=0)
Out[56]:              open    high     low   close
        2017-06-01  9.0675  9.0971  8.9886  9.0576
        2017-06-02  9.0478  9.1562  9.0084  9.0379
        2017-06-05  8.9985  9.0379  8.8605  8.8999
        2017-06-06  8.8802  8.9295  8.8605  8.9098
        2017-06-07  8.8901  9.0182  8.8802  8.9985
        2017-06-08  8.9788  9.0182  8.9492  8.9985
        2017-06-09  9.0182  9.0872  8.9886  9.0182
        2017-06-12  9.0182  9.0576  8.9689  8.9788
        2017-06-13  8.9788  9.0084  8.9197  8.9886
        2017-06-14  8.9886  8.9985  8.9098  8.9492
        2017-05-15  8.7620  8.8211  8.6831  8.7324
        2017-05-16  8.7127  8.7225  8.6043  8.7127
        2017-05-17  8.6831  8.6831  8.6240  8.6437
        2017-05-18  8.5944  8.6437  8.5846  8.6043
        2017-05-19  8.6141  8.6338  8.5550  8.5648
        2017-05-22  8.5550  8.6043  8.4860  8.5550
        2017-05-23  8.5451  8.7127  8.5156  8.6634
        2017-05-24  8.6535  8.7028  8.5550  8.6831
        2017-05-25  8.6634  9.0084  8.6535  8.9689
        2017-05-26  8.9492  8.9985  8.9098  8.9689
        2017-05-31  8.9689  9.0971  8.9295  9.0675

横向拼接,index 对不上的会用 NaN 填充:

In [57]:pd.concat([df1,df2], axis=1)
Out[57]:              open    high     low   close    open    high     low   close
        2017-05-15     NaN     NaN     NaN     NaN  8.7620  8.8211  8.6831  8.7324
        2017-05-16     NaN     NaN     NaN     NaN  8.7127  8.7225  8.6043  8.7127
        2017-05-17     NaN     NaN     NaN     NaN  8.6831  8.6831  8.6240  8.6437
        2017-05-18     NaN     NaN     NaN     NaN  8.5944  8.6437  8.5846  8.6043
        2017-05-19     NaN     NaN     NaN     NaN  8.6141  8.6338  8.5550  8.5648
        2017-05-22     NaN     NaN     NaN     NaN  8.5550  8.6043  8.4860  8.5550
        2017-05-23     NaN     NaN     NaN     NaN  8.5451  8.7127  8.5156  8.6634
        2017-05-24     NaN     NaN     NaN     NaN  8.6535  8.7028  8.5550  8.6831
        2017-05-25     NaN     NaN     NaN     NaN  8.6634  9.0084  8.6535  8.9689
        2017-05-26     NaN     NaN     NaN     NaN  8.9492  8.9985  8.9098  8.9689
        2017-05-31     NaN     NaN     NaN     NaN  8.9689  9.0971  8.9295  9.0675
        2017-06-01  9.0675  9.0971  8.9886  9.0576     NaN     NaN     NaN     NaN
        2017-06-02  9.0478  9.1562  9.0084  9.0379     NaN     NaN     NaN     NaN
        2017-06-05  8.9985  9.0379  8.8605  8.8999     NaN     NaN     NaN     NaN
        2017-06-06  8.8802  8.9295  8.8605  8.9098     NaN     NaN     NaN     NaN
        2017-06-07  8.8901  9.0182  8.8802  8.9985     NaN     NaN     NaN     NaN
        2017-06-08  8.9788  9.0182  8.9492  8.9985     NaN     NaN     NaN     NaN
        2017-06-09  9.0182  9.0872  8.9886  9.0182     NaN     NaN     NaN     NaN
        2017-06-12  9.0182  9.0576  8.9689  8.9788     NaN     NaN     NaN     NaN
        2017-06-13  8.9788  9.0084  8.9197  8.9886     NaN     NaN     NaN     NaN
        2017-06-14  8.9886  8.9985  8.9098  8.9492     NaN     NaN     NaN     NaN

append

In [58]:df1
Out[58]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	8.9985	9.0379	8.8605	8.8999
        2017-06-06	8.8802	8.9295	8.8605	8.9098
        2017-06-07	8.8901	9.0182	8.8802	8.9985
        2017-06-08	8.9788	9.0182	8.9492	8.9985
        2017-06-09	9.0182	9.0872	8.9886	9.0182
        2017-06-12	9.0182	9.0576	8.9689	8.9788
        2017-06-13	8.9788	9.0084	8.9197	8.9886
        2017-06-14	8.9886	8.9985	8.9098	8.9492
In [59]:s = df1.iloc[0]
        s
Out[59]:open     9.0675
        high     9.0971
        low      8.9886
        close    9.0576
        Name: 2017-06-01 00:00:00, dtype: float64
In [60]:df1.append(s, ignore_index=False)  #ignore_index=False表示索引不变
Out[60]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	8.9985	9.0379	8.8605	8.8999
        2017-06-06	8.8802	8.9295	8.8605	8.9098
        2017-06-07	8.8901	9.0182	8.8802	8.9985
        2017-06-08	8.9788	9.0182	8.9492	8.9985
        2017-06-09	9.0182	9.0872	8.9886	9.0182
        2017-06-12	9.0182	9.0576	8.9689	8.9788
        2017-06-13	8.9788	9.0084	8.9197	8.9886
        2017-06-14	8.9886	8.9985	8.9098	8.9492
        2017-06-01	9.0675	9.0971	8.9886	9.0576
In [61]:df1.append(s, ignore_index=True)  #ignore_index=True表示索引重置
Out[61]:	open	high	low	close
        0	9.0675	9.0971	8.9886	9.0576
        1	9.0478	9.1562	9.0084	9.0379
        2	8.9985	9.0379	8.8605	8.8999
        3	8.8802	8.9295	8.8605	8.9098
        4	8.8901	9.0182	8.8802	8.9985
        5	8.9788	9.0182	8.9492	8.9985
        6	9.0182	9.0872	8.9886	9.0182
        7	9.0182	9.0576	8.9689	8.9788
        8	8.9788	9.0084	8.9197	8.9886
        9	8.9886	8.9985	8.9098	8.9492
        10	9.0675	9.0971	8.9886	9.0576

移除重复数据 duplicated

In [62]:z = df1.append(s, ignore_index=False)
        z
Out[62]:	        open	high	low	close
        2017-06-01	9.0675	9.0971	8.9886	9.0576
        2017-06-02	9.0478	9.1562	9.0084	9.0379
        2017-06-05	8.9985	9.0379	8.8605	8.8999
        2017-06-06	8.8802	8.9295	8.8605	8.9098
        2017-06-07	8.8901	9.0182	8.8802	8.9985
        2017-06-08	8.9788	9.0182	8.9492	8.9985
        2017-06-09	9.0182	9.0872	8.9886	9.0182
        2017-06-12	9.0182	9.0576	8.9689	8.9788
        2017-06-13	8.9788	9.0084	8.9197	8.9886
        2017-06-14	8.9886	8.9985	8.9098	8.9492
        2017-06-01	9.0675	9.0971	8.9886	9.0576

查看重复数据:

In [63]:z.duplicated()
Out[63]:2017-06-01    False
        2017-06-02    False
        2017-06-05    False
        2017-06-06    False
        2017-06-07    False
        2017-06-08    False
        2017-06-09    False
        2017-06-12    False
        2017-06-13    False
        2017-06-14    False
        2017-06-01     True
        dtype: bool
0 条回复 A文章作者 M管理员
    暂无讨论,说说你的看法吧