pandas 是基于 Numpy 构建的一个,让以 Numpy 为中心的应用变得更加简单。
pandas 主要包括三类数据结构,分别是:
Series:一维数组,与 Numpy 中的一维 array 类似。二者与 Python 基本的数据结构 List 也很相近,其区别是:List 中的元素可以是不同的数据类型,而 Array 和 Series 中则只允许存储相同的数据类型,这样可以更有效的使用内存,提高运算效率。
DataFrame:二维的表格型数据结构。很多功能与 R 中的 data.frame 类似。可以将 DataFrame 理解为 Series 的容器。以下的内容主要以 DataFrame 为主。
Panel :三维的数组,可以理解为 DataFrame 的容器。
Pandas 官网,更多功能请参考http://pandas-docs.github.io/pandas-docs-travis/index.html
In [2]: # 首先导入库
import pandas as pd
import numpy as np
#Series
由一组数据(各种 Numpy 数据类型),以及一组与之相关的标签数据(即索引)组成。仅由一组数据即可产生最简单的 Series,可以通过传递一个 list 对象来创建一个 Series
In [3]: s = pd.Series([1,3,5,np.nan,6,8])
s
Out[3]: 0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
获取 Series 的索引:
In [4]: s.index
Out[4]: RangeIndex(start=0, stop=6, step=1)
#DataFrame
DataFrame 是一个表格型的数据结构,它含有一组有序的列,每一列的数据结构都是相同的,而不同的列之间则可以是不同的数据结构(数值、字符、布尔值等)。或者以数据库进行类比,DataFrame 中的每一行是一个记录,名称为 Index 的一个元素,而每一列则为一个字段,是这个记录的一个属性。DataFrame 既有行索引也有列索引,可以被看做由 Series 组成的字典(共用同一个索引)。
创建一个 DataFrame,包括一个 numpy array, 时间索引和列名字:
In [5]: dates = pd.date_range('20130101', periods=6)
dates
Out[5]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D'
In [6]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
Out[6]: A B C D
2013-01-01 0.190720 0.114154 -0.644789 -0.672361
2013-01-02 0.370210 -0.938678 -0.362772 -0.769552
2013-01-03 -0.677601 0.838778 -0.222894 0.300474
2013-01-04 0.160924 -0.926747 1.396041 -0.556695
2013-01-05 0.466604 -0.380008 -0.616786 -0.517109
2013-01-06 -0.960533 0.565752 0.234522 0.892747
通过传递一个能够被转换成类似序列结构的字典对象来创建一个 DataFrame:
In [7]: df2 = pd.DataFrame({'A':1.0,
'B':pd.Timestamp('20130102'),
'C':pd.Series(1, index=list(range(4)), dtype='float32'),
'D':np.array([3] * 4, dtype='int32'),
'E':pd.Categorical(['test', 'train', 'test', 'train']),
'F':'foo'
})
df2
Out[7]: A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
查看不同列的数据类型:
In [8]: df2.dtypes
Out[8]: A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
使用 Tab 自动补全功能会自动识别所有的属性以及自定义的列
#查看数据
我们以平台获取的数据为例进行讲解:
In [9]: # 获取平安银行近几个工作日的开盘价、最高价、最低价、收盘价。
df = get_price('000001.XSHE', '2017-06-01', '2017-06-14', '1d', ['open', 'high', 'low', 'close'])
df
Out[9]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
2017-06-06 8.8802 8.9295 8.8605 8.9098
2017-06-07 8.8901 9.0182 8.8802 8.9985
2017-06-08 8.9788 9.0182 8.9492 8.9985
2017-06-09 9.0182 9.0872 8.9886 9.0182
2017-06-12 9.0182 9.0576 8.9689 8.9788
2017-06-13 8.9788 9.0084 8.9197 8.9886
2017-06-14 8.9886 8.9985 8.9098 8.9492
查看前几条数据:
In [10]:df.head()
Out[10]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
2017-06-06 8.8802 8.9295 8.8605 8.9098
2017-06-07 8.8901 9.0182 8.8802 8.9985
查看后几条数据:
In [11]:df.tail()
Out[11]: open high low close
2017-06-08 8.9788 9.0182 8.9492 8.9985
2017-06-09 9.0182 9.0872 8.9886 9.0182
2017-06-12 9.0182 9.0576 8.9689 8.9788
2017-06-13 8.9788 9.0084 8.9197 8.9886
2017-06-14 8.9886 8.9985 8.9098 8.9492
显示索引、列和底层的 numpy 数据:
In [12]:df.index
Out[12]:DatetimeIndex(['2017-06-01', '2017-06-02', '2017-06-05', '2017-06-06',
'2017-06-07', '2017-06-08', '2017-06-09', '2017-06-12',
'2017-06-13', '2017-06-14'],
dtype='datetime64[ns]', freq=None)
查看 DataFrame 的列名
In [13]:df.columns
Out[13]:Index(['open', 'high', 'low', 'close'], dtype='object')
查看 DataFrame 的值
In [14]:df.values
Out[14]:array([[ 9.0675, 9.0971, 8.9886, 9.0576],
[ 9.0478, 9.1562, 9.0084, 9.0379],
[ 8.9985, 9.0379, 8.8605, 8.8999],
[ 8.8802, 8.9295, 8.8605, 8.9098],
[ 8.8901, 9.0182, 8.8802, 8.9985],
[ 8.9788, 9.0182, 8.9492, 8.9985],
[ 9.0182, 9.0872, 8.9886, 9.0182],
[ 9.0182, 9.0576, 8.9689, 8.9788],
[ 8.9788, 9.0084, 8.9197, 8.9886],
[ 8.9886, 8.9985, 8.9098, 8.9492]])
describe()函数对于数据的快速统计汇总:
In [15]:df.describe()
Out[15]: open high low close
count 10.000000 10.00000 10.000000 10.000000
mean 8.986670 9.04088 8.933440 8.983700
std 0.060725 0.06252 0.055209 0.051369
min 8.880200 8.92950 8.860500 8.899900
25% 8.978800 9.01085 8.887600 8.956600
50% 8.993550 9.02805 8.934450 8.993550
75% 9.018200 9.07980 8.983675 9.013275
max 9.067500 9.15620 9.008400 9.057600
对数据的转置(tranverse):
In [16]:df.T
Out[16]: 2017-06-01 2017-06-02 2017-06-05 2017-06-06 2017-06-07 2017-06-08 2017-06-09 2017-06-12 2017-06-13 2017-06-14
open 9.0675 9.0478 8.9985 8.8802 8.8901 8.9788 9.0182 9.0182 8.9788 8.9886
high 9.0971 9.1562 9.0379 8.9295 9.0182 9.0182 9.0872 9.0576 9.0084 8.9985
low 8.9886 9.0084 8.8605 8.8605 8.8802 8.9492 8.9886 8.9689 8.9197 8.9098
close 9.0576 9.0379 8.8999 8.9098 8.9985 8.9985 9.0182 8.9788 8.9886 8.9492
按轴进行排序
In [17]:df.sort_index(axis=1, ascending=False)
Out[17]: open low high close
2017-06-01 9.0675 8.9886 9.0971 9.0576
2017-06-02 9.0478 9.0084 9.1562 9.0379
2017-06-05 8.9985 8.8605 9.0379 8.8999
2017-06-06 8.8802 8.8605 8.9295 8.9098
2017-06-07 8.8901 8.8802 9.0182 8.9985
2017-06-08 8.9788 8.9492 9.0182 8.9985
2017-06-09 9.0182 8.9886 9.0872 9.0182
2017-06-12 9.0182 8.9689 9.0576 8.9788
2017-06-13 8.9788 8.9197 9.0084 8.9886
2017-06-14 8.9886 8.9098 8.9985 8.9492
按值进行排序
In [18]:df.sort_value(by='open', ascending=False)
Out[18]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-09 9.0182 9.0872 8.9886 9.0182
2017-06-12 9.0182 9.0576 8.9689 8.9788
2017-06-05 8.9985 9.0379 8.8605 8.8999
2017-06-14 8.9886 8.9985 8.9098 8.9492
2017-06-08 8.9788 9.0182 8.9492 8.9985
2017-06-13 8.9788 9.0084 8.9197 8.9886
2017-06-07 8.8901 9.0182 8.8802 8.9985
2017-06-06 8.8802 8.9295 8.8605 8.9098
#选择数据
通过下标选取数据
df[‘open’],df.open 以上两个语句是等效的,都是返回 df 名称为 open 列的数据,返回的为一个 Series。 df[0:3], df[‘2017-06-01′:’2017-06-05’] 下标索引选取的是 DataFrame 的记录,与 List 相同 DataFrame 的下标也是从 0 开始,区间索引的话,为一个左闭右开的区间,即[0:3]选取的为 0-2 三条记录。 与此等价,还可以用起始的索引名称和结束索引名称选取数据,如:df[‘a’:’b’]。有一点需要注意的是使用起始索引名称和结束索引名称时,也会包含结束索引的数据。具体看下方示例: 以上两种方式返回的都是 DataFrame。
选择一列数据:
In [19]:df['open']
Out[19]:2017-06-01 9.0675
2017-06-02 9.0478
2017-06-05 8.9985
2017-06-06 8.8802
2017-06-07 8.8901
2017-06-08 8.9788
2017-06-09 9.0182
2017-06-12 9.0182
2017-06-13 8.9788
2017-06-14 8.9886
Name: open, dtype: float64
选择多列数据:
In [20]:df['open', 'close']
Out[20]: open close
2017-06-01 9.0675 9.0576
2017-06-02 9.0478 9.0379
2017-06-05 8.9985 8.8999
2017-06-06 8.8802 8.9098
2017-06-07 8.8901 8.9985
2017-06-08 8.9788 8.9985
2017-06-09 9.0182 9.0182
2017-06-12 9.0182 8.9788
2017-06-13 8.9788 8.9886
2017-06-14 8.9886 8.9492
选择多行:
In [21]:df[0:3]
Out[21]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
In [22]:df['2017-06-01':'2017-06-05']
Out[22]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
使用标签选取数据:
df.loc[行标签,列标签]
df.loc[‘a’:’b’] #选取 ab 两行数据
df.loc[:,’open’] #选取 open 列的数据
df.loc 的第一个参数是行标签,第二个参数为列标签(可选参数,默认为所有列标签),两个参数既可以是列表也可以是单个字符,如果两个参数都为列表则返回的是 DataFrame,否则,则为 Series。
PS:loc 为 location 的缩写。
In [23]:df.loc['2017-06-01', 'open']
Out[23]:9.0675000000000008
In [24]:df['2017-06-01':'2017-06-06']
Out[24]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
2017-06-06 8.8802 8.9295 8.8605 8.9098
In [25]:df.loc[:, 'open']
Out[25]:2017-06-01 9.0675
2017-06-02 9.0478
2017-06-05 8.9985
2017-06-06 8.8802
2017-06-07 8.8901
2017-06-08 8.9788
2017-06-09 9.0182
2017-06-12 9.0182
2017-06-13 8.9788
2017-06-14 8.9886
Name: open, dtype: float64
In [26]:df.loc['2017-06-01':'2017-06-06','open']
Out[26]:2017-06-01 9.0675
2017-06-02 9.0478
2017-06-05 8.9985
2017-06-06 8.8802
Name: open, dtype: float64
使用位置选取数据:
df.iloc[行位置,列位置]
df.iloc[1,1] #选取第二行,第二列的值,返回的为单个值
df.iloc[[0,2],:] #选取第一行及第三行的数据
df.iloc[0:2,:] #选取第一行到第三行(不包含)的数据
df.iloc[:,1] #选取所有记录的第二列的值,返回的为一个 Series
df.iloc[1,:] #选取第一行数据,返回的为一个 Series
PS:iloc 则为 integer & location 的缩写
In [27]:df.iloc[1,1] # 选取第二行,第二列的值,返回的为单个值
Out[27]:9.1562000000000001
In [28]:df.iloc[[0,2], :] # 选取第一行及第三行的数据
Out[28]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-05 8.9985 9.0379 8.8605 8.8999
In [29]:df.iloc[0:2, :] # 选取第一行到第三行(不包含)的数据
Out[29]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
In [30]:df.iloc[:, 1] # 选取所有记录的第一列的值,返回的为一个Series
Out[30]:2017-06-01 9.0971
2017-06-02 9.1562
2017-06-05 9.0379
2017-06-06 8.9295
2017-06-07 9.0182
2017-06-08 9.0182
2017-06-09 9.0872
2017-06-12 9.0576
2017-06-13 9.0084
2017-06-14 8.9985
Name: high, dtype: float64
In [31]:df.iloc[1, :] # 选取第一行数据,返回的为一个Series
Out[31]:open 9.0478
high 9.1562
low 9.0084
close 9.0379
Name: 2017-06-02 00:00:00, dtype: float64
更广义的切片方式是使用.ix,它自动根据给到的索引类型判断是使用位置还是标签进行切片
In [32]:df.ix[1,1]
Out[32]:9.1562000000000001
In [33]:df.ix['2017-06-01', '2017-06-05']
Out[33]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
In [34]:df.ix['2017-06-05','open']
Out[34]:8.9984999999999999
In [35]:df.ix[1, 'open']
Out[35]:9.0478000000000005
In [36]:df.ix['2017-06-01', 0]
Out[36]:9.0675000000000008
通过逻辑指针进行数据切片:
df[逻辑条件]
df[df.one >= 2] #单个逻辑条件
df[(df.one >=1 ) & (df.one < 3) ] #多个逻辑条件组合
In [37]:# 筛选出 open 大于8.8的数据
df[df.open > 8.8]
Out[37]: open high low close
In [38]:# 筛选出 open 大于9.12的数据,并且 close 小于9.17的数据
df[(df.open > 9.12) & (df.close < 9.17)]
Out[38]: open high low close
使用条件过来更改数据
In [39]:df[df>10]
Out[39]: open high low close
2017-06-01 NaN NaN NaN NaN
2017-06-02 NaN NaN NaN NaN
2017-06-05 NaN NaN NaN NaN
2017-06-06 NaN NaN NaN NaN
2017-06-07 NaN NaN NaN NaN
2017-06-08 NaN NaN NaN NaN
2017-06-09 NaN NaN NaN NaN
2017-06-12 NaN NaN NaN NaN
2017-06-13 NaN NaN NaN NaN
2017-06-14 NaN NaN NaN NaN
观察可以发现,df 中小于 9 的数都变为 NaN。
下面我们就把大于 9 的数赋值为 0.
In [40]:df[df > 9.12] = 0
In [41]:df
Out[41]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 0.0000 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
2017-06-06 8.8802 8.9295 8.8605 8.9098
2017-06-07 8.8901 9.0182 8.8802 8.9985
2017-06-08 8.9788 9.0182 8.9492 8.9985
2017-06-09 9.0182 9.0872 8.9886 9.0182
2017-06-12 9.0182 9.0576 8.9689 8.9788
2017-06-13 8.9788 9.0084 8.9197 8.9886
2017-06-14 8.9886 8.9985 8.9098 8.9492
发现大于 9.12 的数都被替换为了 0.
使用 isin()方法来过滤在指定列中的数据:
In [42]:# 选取 high 列中数为0和9的数
df[df['high'].isin([0.00,9.00])]
Out[42]: open high low close
2017-06-02 9.0478 0.0 9.0084 9.0379
#Panel
平台 get_price,如果是多支股票, 则返回 pandas.Panel 对象。
可通过 panel[列标,行标,股票代码] 获取数据.
In [43]:panel = get_price(['000001.XSHE', '000002.XSHE'], '2017-06-01', '2017-06-14', '1d', fields=['open', 'high', 'low', 'close'])
panel
Out[43]:<class 'rqcommons.pandas_patch.HybridDataPanel'>
Dimensions: 4 (items) x 10 (major_axis) x 2 (minor_axis)
Items axis: open to close
Major_axis axis: 2017-06-01 00:00:00 to 2017-06-14 00:00:00
Minor_axis axis: 000001.XSHE to 000002.XSHE
由打印的结果可以看出:
列标(Items axis: close to open)
行标(Major_axis axis: 2017-06-01 00:00:00 to 2017-06-14 00:00:00)
股票代码(Minor_axis axis: 000001.XSHE to 000002.XSHE)
In [44]:# 取出'open'的数据
panel['open', :, :]
Out[44]: 000001.XSHE 000002.XSHE
2017-06-01 9.0675 20.4058
2017-06-02 9.0478 20.3864
2017-06-05 8.9985 20.0673
2017-06-06 8.8802 20.3574
2017-06-07 8.8901 20.1833
2017-06-08 8.9788 20.1833
2017-06-09 9.0182 20.1833
2017-06-12 9.0182 20.7733
2017-06-13 8.9788 21.1891
2017-06-14 8.9886 21.1794
In [45]:# 取出'2017-06-01'的数据
panel[:, '2017-06-01', :]
Out[45]: open high low close
000001.XSHE 9.0675 9.0971 8.9886 9.0576
000002.XSHE 20.4058 20.6669 20.2413 20.3961
In [46]:# 取出000001的 DataFrame 数据
panel[:, :, '000001.XSHE']
Out[46]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
2017-06-06 8.8802 8.9295 8.8605 8.9098
2017-06-07 8.8901 9.0182 8.8802 8.9985
2017-06-08 8.9788 9.0182 8.9492 8.9985
2017-06-09 9.0182 9.0872 8.9886 9.0182
2017-06-12 9.0182 9.0576 8.9689 8.9788
2017-06-13 8.9788 9.0084 8.9197 8.9886
2017-06-14 8.9886 8.9985 8.9098 8.9492
Panel 操作与 DataFrame 基本相同,下节会为大家讲解 DataFrame 的数据处理与规整。
#缺失数据处理
去掉包含缺失值的行
In [47]:# 获取平安银行近几个工作日的开盘价、最高价、最低价、收盘价
df = get_price('000001.XSHE', '2017-06-01', '2017-06-14', '1d', fields=['open', 'high', 'low', 'close'])
df = df[df > 9.0]
df
Out[47]: open high low close
2017-06-01 9.0675 9.0971 NaN 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 NaN 9.0379 NaN NaN
2017-06-06 NaN NaN NaN NaN
2017-06-07 NaN 9.0182 NaN NaN
2017-06-08 NaN 9.0182 NaN NaN
2017-06-09 9.0182 9.0872 NaN 9.0182
2017-06-12 9.0182 9.0576 NaN NaN
2017-06-13 NaN 9.0084 NaN NaN
2017-06-14 NaN NaN NaN NaN
In [48]:df.dropna()
Out[48]: open high low close
2017-06-02 9.0478 9.1562 9.0084 9.0379
对缺失值进行填充
In [49]:df.fillna(value=0)
Out[49]: open high low close
2017-06-01 9.0675 9.0971 0.0000 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 0.0000 9.0379 0.0000 0.0000
2017-06-06 0.0000 0.0000 0.0000 0.0000
2017-06-07 0.0000 9.0182 0.0000 0.0000
2017-06-08 0.0000 9.0182 0.0000 0.0000
2017-06-09 9.0182 9.0872 0.0000 9.0182
2017-06-12 9.0182 9.0576 0.0000 0.0000
2017-06-13 0.0000 9.0084 0.0000 0.0000
2017-06-14 0.0000 0.0000 0.0000 0.0000
判断数据是否为 nan,并进行布尔填充
In [50]:df.isnull(df)
Out[50]: open high low close
2017-06-01 False False True False
2017-06-02 False False False False
2017-06-05 True False True True
2017-06-06 True True True True
2017-06-07 True False True True
2017-06-08 True False True True
2017-06-09 False False True False
2017-06-12 False False True True
2017-06-13 True False True True
2017-06-14 True True True True
#函数的应用和映射
In [51]:df.mean() # 列计算平均值
Out[51]:open 9.037925
high 9.060100
low 9.008400
close 9.037900
dtype: float64
In [52]:df.mean(1) # 行计算平均值
Out[52]:2017-06-01 9.074067
2017-06-02 9.062575
2017-06-05 9.037900
2017-06-06 NaN
2017-06-07 9.018200
2017-06-08 9.018200
2017-06-09 9.041200
2017-06-12 9.037900
2017-06-13 9.008400
2017-06-14 NaN
dtype: float64
In [53]:df.mean(axis=1, skipna = False) # skipna参数默认是True表示排除缺失值
Out[53]:2017-06-01 NaN
2017-06-02 9.062575
2017-06-05 NaN
2017-06-06 NaN
2017-06-07 NaN
2017-06-08 NaN
2017-06-09 NaN
2017-06-12 NaN
2017-06-13 NaN
2017-06-14 NaN
dtype: float64
常用的方法如上所介绍们,还要其他许多,可自行学习,下面罗列了一些,可供参考:
- count 非na值的数量
- describe 针对Series或个DataFrame列计算汇总统计
- min、max 计算最小值和最大值
- argmin、argmax 计算能够获取到最大值和最小值得索引位置(整数)
- idxmin、idxmax 计算能够获取到最大值和最小值得索引值
- quantile 计算样本的分位数(0到1)
- sum 值的总和
- mean 值得平均数
- median 值得算术中位数(50%分位数)
- mad 根据平均值计算平均绝对离差
- var 样本值的方差
- std 样本值的标准差
- skew 样本值得偏度(三阶矩)
- kurt 样本值得峰度(四阶矩)
- cumsum 样本值得累计和
- cummin,cummax 样本值得累计最大值和累计最小值
- cumprod 样本值得累计积
- diff 计算一阶差分(对时间序列很有用)
- pct_change 计算百分数变化
#数据规整
Pandas 提供了大量的方法能够轻松的对 Series,DataFrame 和 Panel 对象进行各种符合各种逻辑关系的合并操作
- concat 可以沿一条轴将多个对象堆叠到一起。
- append 将一行连接到一个DataFrame上
- duplicated 移除重复数据
concat
In [54]:df1 = get_price('000001.XSHE', '2017-06-01', '2017-06-14', '1d', ['open', 'high', 'low', 'close'])
df1
Out[54]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
2017-06-06 8.8802 8.9295 8.8605 8.9098
2017-06-07 8.8901 9.0182 8.8802 8.9985
2017-06-08 8.9788 9.0182 8.9492 8.9985
2017-06-09 9.0182 9.0872 8.9886 9.0182
2017-06-12 9.0182 9.0576 8.9689 8.9788
2017-06-13 8.9788 9.0084 8.9197 8.9886
2017-06-14 8.9886 8.9985 8.9098 8.9492
In [55]:df2 = get_price('000001.XSHE', '2017-05-15', '2017-05-31', '1d', ['open', 'high', 'low', 'close'])
df2
Out[55]: open high low close
2017-05-15 8.7620 8.8211 8.6831 8.7324
2017-05-16 8.7127 8.7225 8.6043 8.7127
2017-05-17 8.6831 8.6831 8.6240 8.6437
2017-05-18 8.5944 8.6437 8.5846 8.6043
2017-05-19 8.6141 8.6338 8.5550 8.5648
2017-05-22 8.5550 8.6043 8.4860 8.5550
2017-05-23 8.5451 8.7127 8.5156 8.6634
2017-05-24 8.6535 8.7028 8.5550 8.6831
2017-05-25 8.6634 9.0084 8.6535 8.9689
2017-05-26 8.9492 8.9985 8.9098 8.9689
2017-05-31 8.9689 9.0971 8.9295 9.0675
In [56]:pd.concat([df1,df2], axis=0)
Out[56]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
2017-06-06 8.8802 8.9295 8.8605 8.9098
2017-06-07 8.8901 9.0182 8.8802 8.9985
2017-06-08 8.9788 9.0182 8.9492 8.9985
2017-06-09 9.0182 9.0872 8.9886 9.0182
2017-06-12 9.0182 9.0576 8.9689 8.9788
2017-06-13 8.9788 9.0084 8.9197 8.9886
2017-06-14 8.9886 8.9985 8.9098 8.9492
2017-05-15 8.7620 8.8211 8.6831 8.7324
2017-05-16 8.7127 8.7225 8.6043 8.7127
2017-05-17 8.6831 8.6831 8.6240 8.6437
2017-05-18 8.5944 8.6437 8.5846 8.6043
2017-05-19 8.6141 8.6338 8.5550 8.5648
2017-05-22 8.5550 8.6043 8.4860 8.5550
2017-05-23 8.5451 8.7127 8.5156 8.6634
2017-05-24 8.6535 8.7028 8.5550 8.6831
2017-05-25 8.6634 9.0084 8.6535 8.9689
2017-05-26 8.9492 8.9985 8.9098 8.9689
2017-05-31 8.9689 9.0971 8.9295 9.0675
横向拼接,index 对不上的会用 NaN 填充:
In [57]:pd.concat([df1,df2], axis=1)
Out[57]: open high low close open high low close
2017-05-15 NaN NaN NaN NaN 8.7620 8.8211 8.6831 8.7324
2017-05-16 NaN NaN NaN NaN 8.7127 8.7225 8.6043 8.7127
2017-05-17 NaN NaN NaN NaN 8.6831 8.6831 8.6240 8.6437
2017-05-18 NaN NaN NaN NaN 8.5944 8.6437 8.5846 8.6043
2017-05-19 NaN NaN NaN NaN 8.6141 8.6338 8.5550 8.5648
2017-05-22 NaN NaN NaN NaN 8.5550 8.6043 8.4860 8.5550
2017-05-23 NaN NaN NaN NaN 8.5451 8.7127 8.5156 8.6634
2017-05-24 NaN NaN NaN NaN 8.6535 8.7028 8.5550 8.6831
2017-05-25 NaN NaN NaN NaN 8.6634 9.0084 8.6535 8.9689
2017-05-26 NaN NaN NaN NaN 8.9492 8.9985 8.9098 8.9689
2017-05-31 NaN NaN NaN NaN 8.9689 9.0971 8.9295 9.0675
2017-06-01 9.0675 9.0971 8.9886 9.0576 NaN NaN NaN NaN
2017-06-02 9.0478 9.1562 9.0084 9.0379 NaN NaN NaN NaN
2017-06-05 8.9985 9.0379 8.8605 8.8999 NaN NaN NaN NaN
2017-06-06 8.8802 8.9295 8.8605 8.9098 NaN NaN NaN NaN
2017-06-07 8.8901 9.0182 8.8802 8.9985 NaN NaN NaN NaN
2017-06-08 8.9788 9.0182 8.9492 8.9985 NaN NaN NaN NaN
2017-06-09 9.0182 9.0872 8.9886 9.0182 NaN NaN NaN NaN
2017-06-12 9.0182 9.0576 8.9689 8.9788 NaN NaN NaN NaN
2017-06-13 8.9788 9.0084 8.9197 8.9886 NaN NaN NaN NaN
2017-06-14 8.9886 8.9985 8.9098 8.9492 NaN NaN NaN NaN
append
In [58]:df1
Out[58]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
2017-06-06 8.8802 8.9295 8.8605 8.9098
2017-06-07 8.8901 9.0182 8.8802 8.9985
2017-06-08 8.9788 9.0182 8.9492 8.9985
2017-06-09 9.0182 9.0872 8.9886 9.0182
2017-06-12 9.0182 9.0576 8.9689 8.9788
2017-06-13 8.9788 9.0084 8.9197 8.9886
2017-06-14 8.9886 8.9985 8.9098 8.9492
In [59]:s = df1.iloc[0]
s
Out[59]:open 9.0675
high 9.0971
low 8.9886
close 9.0576
Name: 2017-06-01 00:00:00, dtype: float64
In [60]:df1.append(s, ignore_index=False) #ignore_index=False表示索引不变
Out[60]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
2017-06-06 8.8802 8.9295 8.8605 8.9098
2017-06-07 8.8901 9.0182 8.8802 8.9985
2017-06-08 8.9788 9.0182 8.9492 8.9985
2017-06-09 9.0182 9.0872 8.9886 9.0182
2017-06-12 9.0182 9.0576 8.9689 8.9788
2017-06-13 8.9788 9.0084 8.9197 8.9886
2017-06-14 8.9886 8.9985 8.9098 8.9492
2017-06-01 9.0675 9.0971 8.9886 9.0576
In [61]:df1.append(s, ignore_index=True) #ignore_index=True表示索引重置
Out[61]: open high low close
0 9.0675 9.0971 8.9886 9.0576
1 9.0478 9.1562 9.0084 9.0379
2 8.9985 9.0379 8.8605 8.8999
3 8.8802 8.9295 8.8605 8.9098
4 8.8901 9.0182 8.8802 8.9985
5 8.9788 9.0182 8.9492 8.9985
6 9.0182 9.0872 8.9886 9.0182
7 9.0182 9.0576 8.9689 8.9788
8 8.9788 9.0084 8.9197 8.9886
9 8.9886 8.9985 8.9098 8.9492
10 9.0675 9.0971 8.9886 9.0576
移除重复数据 duplicated
In [62]:z = df1.append(s, ignore_index=False)
z
Out[62]: open high low close
2017-06-01 9.0675 9.0971 8.9886 9.0576
2017-06-02 9.0478 9.1562 9.0084 9.0379
2017-06-05 8.9985 9.0379 8.8605 8.8999
2017-06-06 8.8802 8.9295 8.8605 8.9098
2017-06-07 8.8901 9.0182 8.8802 8.9985
2017-06-08 8.9788 9.0182 8.9492 8.9985
2017-06-09 9.0182 9.0872 8.9886 9.0182
2017-06-12 9.0182 9.0576 8.9689 8.9788
2017-06-13 8.9788 9.0084 8.9197 8.9886
2017-06-14 8.9886 8.9985 8.9098 8.9492
2017-06-01 9.0675 9.0971 8.9886 9.0576
查看重复数据:
In [63]:z.duplicated()
Out[63]:2017-06-01 False
2017-06-02 False
2017-06-05 False
2017-06-06 False
2017-06-07 False
2017-06-08 False
2017-06-09 False
2017-06-12 False
2017-06-13 False
2017-06-14 False
2017-06-01 True
dtype: bool