Study Notes on "Python for Data Analysis" (2nd Edition), Chapter 11: Time Series

Posted: 2021-06-26 19:34:11

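Contents

1. Date and Time Data Types and Tools
- Converting Between Strings and datetime
2. Time Series Basics
- Indexing, Selection, Subsetting
- Time Series with Duplicate Indices
3. Date Ranges, Frequencies, and Shifting
4. Time Zone Handling
5. Periods and Period Arithmetic
6. Resampling and Frequency Conversion
7. Moving Window Functions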

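1. Date and Time Data Types and Tools

The Python standard library modules for dates, times, and calendar data are datetime, time, and calendar.

The datetime module provides the following types (available via from datetime import *):

date	Stores a calendar date (year, month, day) using the Gregorian calendar
time	Stores a time of day as hours, minutes, seconds, and microseconds
datetime	Stores both date and time
timedelta	Represents the difference between two datetime values (as days, seconds, and microseconds)
tzinfo	Base type for storing time zone information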

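datetime format specification (ISO C89 compatible)

%Y	Four-digit year
%y	Two-digit year
%m	Two-digit month [01, 12]
%d	Two-digit day of the month [01, 31]
%H	Hour, 24-hour clock [00, 23]
%I	Hour, 12-hour clock [01, 12]
%M	Two-digit minute [00, 59]
%S	Second [00, 61] (60 and 61 account for leap seconds)
%w	Weekday as an integer [0 (Sunday), 6]
%U	Week number of the year [00, 53]. Sunday is the first day of the week; days before the first Sunday of the year are "week 0"
%W	Week number of the year [00, 53]. Monday is the first day of the week; days before the first Monday of the year are "week 0"
%z	UTC time zone offset as +HHMM or -HHMM; empty if the object is time zone naive
%F	Shortcut for %Y-%m-%d (e.g., 2021-06-24)
%D	Shortcut for %m/%d/%y (e.g., 06/24/21)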

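Locale-specific date formatting options for datetime objects

%a	Abbreviated weekday name
%A	Full weekday name
%b	Abbreviated month name
%B	Full month name
%c	Full date and time (e.g., 'Tue 01 May 2012 04:20:57 PM')
%p	Locale equivalent of AM or PM
%x	Locale-appropriate formatted date (e.g., in the United States, May 1, 2012 yields '05/01/2012')
%X	Locale-appropriate time (e.g., '04:24:12 PM')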
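As a quick check of the format codes listed above, the following minimal sketch formats a fixed datetime and parses it back. It sticks to locale-independent codes, and the variable names are illustrative only.

```python
from datetime import datetime

dt = datetime(2021, 6, 24, 16, 47, 28)

# Format with explicit, locale-independent codes
text = dt.strftime('%Y-%m-%d %H:%M:%S')

# %w: weekday (0 = Sunday); %U / %W: week of year (Sunday-first / Monday-first)
week_info = dt.strftime('%w %U %W')

# strptime with the same format string reverses strftime
parsed = datetime.strptime(text, '%Y-%m-%d %H:%M:%S')

print(text)
print(week_info)
print(parsed == dt)
```

Using the same format string in both directions is the simplest way to guarantee a lossless round trip for values without sub-second precision.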

import numpy as np
import pandas as pd
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)

from datetime import datetime
now = datetime.now()
now

'''datetime.datetime(2021, 6, 24, 16, 47, 28, 879956)'''

now.year, now.month, now.day

'''(2021, 6, 24)'''

# timedelta represents the time difference between two datetime objects
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
delta

'''datetime.timedelta(days=926, seconds=56700)'''

delta.days

'''926'''

delta.seconds

'''56700'''

# Adding or subtracting a timedelta to or from a datetime yields a new datetime object
from datetime import timedelta
start = datetime(2011, 1, 7)
start + timedelta(12)

'''datetime.datetime(2011, 1, 19, 0, 0)'''

start - 2 * timedelta(12)

'''datetime.datetime(2010, 12, 14, 0, 0)'''

Converting Between Strings and datetime

# Format datetime objects and pandas Timestamp objects with str() or by passing a format to strftime()
stamp = datetime(2011, 1, 3)
str(stamp)

''''2011-01-03 00:00:00''''

stamp.strftime('%Y-%m-%d')

''''2011-01-03''''

# Use datetime.strptime() with format codes to convert strings to dates
# datetime.strptime() is a good way to parse a date when the format is known
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')

'''datetime.datetime(2011, 1, 3, 0, 0)'''

datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]

'''[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]'''

# For common date formats, use the parser.parse method from the dateutil package
# dateutil can parse most human-intelligible date representations
from dateutil.parser import parse
parse('2011-01-03')

'''datetime.datetime(2011, 1, 3, 0, 0)'''

parse('Jan 31, 1997 10:45 PM')

'''datetime.datetime(1997, 1, 31, 22, 45)'''

# Pass dayfirst=True to parse data in which the day appears before the month
parse('6/12/2011', dayfirst=True)

'''datetime.datetime(2011, 12, 6, 0, 0)'''

# pd.to_datetime() parses many different kinds of date representations
datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']
pd.to_datetime(datestrs)

'''DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)'''

# pd.to_datetime() also handles values that should be considered missing (None, empty string, etc.)
idx = pd.to_datetime(datestrs + [None])
idx

'''DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)'''

idx[2]  # NaT (Not a Time) is the pandas null value for timestamp data

'''NaT'''

pd.isnull(idx)

'''array([False, False,  True])'''
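Real data often contains strings that are not dates at all. The sketch below, which is an addition to the notes rather than part of the book's example, uses the errors='coerce' option of pd.to_datetime() so that unparsable entries become NaT instead of raising an exception.

```python
import pandas as pd

raw = ['2011-07-06', 'not a date', None]

# errors='coerce' turns anything that cannot be parsed into NaT
idx = pd.to_datetime(raw, errors='coerce')

# pd.isnull() flags both the coerced entry and the original None
mask = pd.isnull(idx)

print(idx)
print(mask)
```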

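2. Time Series Basics

The basic kind of time series object in pandas is a Series indexed by timestamps;

outside of pandas, those timestamps are usually represented as Python strings or datetime objects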

from datetime import datetime
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8),
         datetime(2011, 1, 10), datetime(2011, 1, 12)]

# The basic pandas time series: a Series indexed by timestamps
ts = pd.Series(np.random.randn(6), index=dates)
ts

'''
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64
'''

# Under the hood, these datetime objects are stored in a DatetimeIndex
ts.index

'''
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
'''

# As with any Series, arithmetic between differently indexed time series automatically aligns on the dates
ts + ts[::2]

'''
2011-01-02   -0.409415
2011-01-05         NaN
2011-01-07   -1.038877
2011-01-08         NaN
2011-01-10    3.931561
2011-01-12         NaN
dtype: float64
'''

# pandas stores timestamps using NumPy's datetime64 data type at nanosecond (ns) resolution
ts.index.dtype

'''dtype('<M8[ns]')'''

# Scalar values from a DatetimeIndex are pandas Timestamp objects
stamp = ts.index[0]
stamp

'''Timestamp('2011-01-02 00:00:00')'''

Indexing, Selection, Subsetting

# Time series behave like any other pandas.Series when indexing and selecting by label
stamp = ts.index[2]
ts[stamp]

'''-0.5194387150567381'''

# You can also pass a string that is interpretable as a date; '20110110' works too
ts['1/10/2011']

'''1.9657805725027142'''

# For longer time series, pass a string interpretable as a year, or a year and month, to select slices of data
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2000', periods=1000))
longer_ts

'''
2000-01-01    0.092908
2000-01-02    0.281746
2000-01-03    0.769023
2000-01-04    1.246435
2000-01-05    1.007189
                ...   
2002-09-22    0.930944
2002-09-23   -0.811676
2002-09-24   -1.830156
2002-09-25   -0.138730
2002-09-26    0.334088
Freq: D, Length: 1000, dtype: float64
'''

longer_ts['2001']

'''
2001-01-01    1.599534
2001-01-02    0.474071
2001-01-03    0.151326
2001-01-04   -0.542173
2001-01-05   -0.475496
                ...   
2001-12-27    0.057874
2001-12-28   -0.433739
2001-12-29    0.092698
2001-12-30   -1.397820
2001-12-31    1.457823
Freq: D, Length: 365, dtype: float64
'''

longer_ts['2001-05']

'''
2001-05-01   -0.622547
2001-05-02    0.936289
2001-05-03    0.750018
2001-05-04   -0.056715
2001-05-05    2.300675
                ...   
2001-05-27    0.235477
2001-05-28    0.111835
2001-05-29   -1.251504
2001-05-30   -2.949343
2001-05-31    0.634634
Freq: D, Length: 31, dtype: float64
'''

ts

'''
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64
'''

# Slicing with a datetime object
ts[datetime(2011, 1, 7):]

'''
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64
'''

# Slicing with timestamps that are not in the time series performs a range query
ts['1/6/2011':'1/11/2011']

'''
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
dtype: float64
'''

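Slicing with a date-parsable string, a datetime object, or a Timestamp produces a view on the source time series; that is, no data is copied, and modifications to the slice are reflected in the original data. (This describes the pandas version used in these notes; when Copy-on-Write is enabled, modifying the slice no longer changes the original.)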

# The equivalent instance method truncate slices a Series between two dates
ts.truncate(before='1/5/2011', after='1/10/2011')

'''
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
dtype: float64
'''

# All of these operations also work on DataFrame
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
long_df = pd.DataFrame(np.random.randn(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas',
                                'New York', 'Ohio'])
long_df.loc['5-2001']

Time Series with Duplicate Indices

# Multiple observations may fall on the same timestamp, i.e., the time series has duplicate index values
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                          '1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts

'''
2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32
'''

# The index.is_unique attribute tells whether the index is unique
dup_ts.index.is_unique

'''False'''

# Indexing into a time series with duplicates yields either a scalar or a Series slice, depending on whether the timestamp is duplicated
dup_ts['1/3/2000']  # not duplicated

'''4'''

dup_ts['1/2/2000']  # duplicated

'''
2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32
'''

# To aggregate data with duplicate index values, pass level=0 to groupby
grouped = dup_ts.groupby(level=0)
grouped.mean()

'''
2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32
'''

grouped.count()

'''
2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
'''
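Aggregating is one way to deal with duplicate timestamps; another common option is to keep only one observation per timestamp. The sketch below is an addition to the notes and uses Index.duplicated() to keep the first observation, alongside the groupby approach shown above for comparison.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_ts = pd.Series(np.arange(5), index=dates)

# Boolean mask: True for every repeat after the first occurrence
is_repeat = dup_ts.index.duplicated(keep='first')

# Keep only the first observation for each timestamp
first_only = dup_ts[~is_repeat]

# Alternative: collapse duplicates by averaging them
daily_mean = dup_ts.groupby(level=0).mean()

print(first_only)
print(daily_mean)
```

Which option is appropriate depends on the data: keeping the first value discards information, while averaging changes the values.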

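3. Date Ranges, Frequencies, and Shifting

Generating Date Ranges

# pd.date_range(): generates a DatetimeIndex of the indicated length according to a particular frequency
# By default, daily timestamps are generated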
index = pd.date_range('2012-04-01', '2012-06-01')
index

'''
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')
'''

# If only a start or an end date is passed, you must also pass the number of periods to generate
pd.date_range(start='2012-04-01', periods=20)

'''
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
              dtype='datetime64[ns]', freq='D')
'''

pd.date_range(end='2012-06-01', periods=20)

'''
DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')
'''

# Passing a time series frequency as freq generates a DatetimeIndex with only the dates in the range that fall on that frequency
pd.date_range('2000-01-01', '2000-12-01', freq='BM')

'''
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')
'''

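# By default, pd.date_range() preserves the time of day of the start or end timestamp
# Passing normalize=True normalizes the timestamps to midnight, i.e., drops the time of day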
pd.date_range('2012-05-02 12:56:31', periods=5)

'''
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')
'''

pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)

'''
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')
'''

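Base time series frequencies (aliases as used in these notes; pandas 2.2 and later rename several of them, e.g., 'M' to 'ME' and 'T' to 'min')
Alias	Offset type	Description
D	Day	Calendar daily
B	BusinessDay	Business daily
H	Hour	Hourly
T or min	Minute	Once a minute
S	Second	Once a second
L or ms	Milli	Millisecond (1/1,000 of a second)
U	Micro	Microsecond (1/1,000,000 of a second)
M	MonthEnd	Last calendar day of the month
BM	BusinessMonthEnd	Last business day (weekday) of the month
MS	MonthBegin	First calendar day of the month
BMS	BusinessMonthBegin	First weekday of the month
W-MON, W-TUE, ...	Week	Weekly on the given day of the week (MON, TUE, WED, THU, FRI, SAT, or SUN)
WOM-1MON, WOM-2MON, ...	WeekOfMonth	Weekly dates in the first, second, third, or fourth week of the month (e.g., WOM-3FRI is the third Friday of each month)
Q-JAN, Q-FEB, ...	QuarterEnd	Quarterly dates anchored on the last calendar day of each quarter-ending month, for a year ending in the indicated month (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, or DEC)
BQ-JAN, BQ-FEB, ...	BusinessQuarterEnd	Quarterly dates anchored on the last weekday of each quarter-ending month, for a year ending in the indicated month
QS-JAN, QS-FEB, ...	QuarterBegin	Quarterly dates anchored on the first calendar day of each quarter-starting month, for a year ending in the indicated month
BQS-JAN, BQS-FEB, ...	BusinessQuarterBegin	Quarterly dates anchored on the first weekday of each quarter-starting month, for a year ending in the indicated month
A-JAN, A-FEB, ...	YearEnd	Annual dates anchored on the last calendar day of the given month (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, or DEC)
BA-JAN, BA-FEB, ...	BusinessYearEnd	Annual dates anchored on the last weekday of the given month
AS-JAN, AS-FEB, ...	YearBegin	Annual dates anchored on the first calendar day of the given month
BAS-JAN, BAS-FEB, ...	BusinessYearBegin	Annual dates anchored on the first weekday of the given month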

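Frequencies and Date Offsets

# For each base frequency there is an object that can be used to define a date offset
# e.g., hourly frequency can be represented with the Hour class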
from pandas.tseries.offsets import Hour, Minute
hour = Hour()
hour

'''<Hour>'''

# Pass an integer to define a multiple of the offset
four_hours = Hour(4)
four_hours

'''<4 * Hours>'''

pd.date_range('2000-01-01', '2000-01-03 23:59', freq='4h')

'''
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')
'''

# Multiple offsets can be combined by addition
Hour(2) + Minute(30)

'''<150 * Minutes>'''

pd.date_range('2000-01-01', periods=10, freq='1h30min')

'''
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')
'''

# Passing freq='WOM-3FRI' yields the third Friday of each month within the date range
rng = pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')
list(rng)

'''
[Timestamp('2012-01-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-02-17 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-03-16 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-04-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-05-18 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-06-15 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-07-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-08-17 00:00:00', freq='WOM-3FRI')]
'''
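Every frequency alias corresponds to an offset object, so the same range can be built without a string alias. The following sketch, an illustrative addition, constructs the "third Friday" rule with the WeekOfMonth class and compares it against the 'WOM-3FRI' alias. Note that week is zero-based and weekday=4 means Friday.

```python
import pandas as pd
from pandas.tseries.offsets import WeekOfMonth

# week=2 is the third week (zero-based); weekday=4 is Friday (Monday=0)
third_friday = WeekOfMonth(week=2, weekday=4)

by_offset = pd.date_range('2012-01-01', '2012-09-01', freq=third_friday)
by_alias = pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')

print(by_offset)
print(by_offset.equals(by_alias))
```

Offset objects are convenient when the rule is computed at runtime, since no alias string has to be assembled.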

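Shifting (Leading and Lagging) Data

# Shifting moves the data in a time series forward or backward through time
# It is done with the shift() method of Series and DataFrame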
ts = pd.Series(np.random.randn(4),
               index=pd.date_range('1/1/2000', periods=4, freq='M'))
ts

'''
2000-01-31   -0.066748
2000-02-29    0.838639
2000-03-31   -0.117388
2000-04-30   -0.517795
Freq: M, dtype: float64
'''

# Shifting the data forward introduces missing values at the start
ts.shift(2)

'''
2000-01-31         NaN
2000-02-29         NaN
2000-03-31   -0.066748
2000-04-30    0.838639
Freq: M, dtype: float64
'''

# Shifting the data backward introduces missing values at the end
ts.shift(-2)

'''
2000-01-31   -0.117388
2000-02-29   -0.517795
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64
'''

shift() is commonly used to compute percent changes in a time series, or in multiple time series stored as DataFrame columns, i.e., ts / ts.shift(1) - 1
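The percent-change formula can be checked on a tiny series. This sketch uses made-up prices and compares the manual shift-based computation with the built-in pct_change() method.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0],
                   index=pd.date_range('2000-01-01', periods=3, freq='D'))

# Manual percent change: current value over previous value, minus 1
manual = prices / prices.shift(1) - 1

# Built-in equivalent
builtin = prices.pct_change()

print(manual)
print(np.allclose(manual.iloc[1:], builtin.iloc[1:]))
```

The first element is NaN in both results because there is no previous observation to compare against.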

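# Passing freq='M' to shift moves the timestamps rather than the data
# Here the timestamps are advanced by two months (+2 months)
# Note the difference between shifting the data forward and shifting the timestamps forward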
ts.shift(2, freq='M')

'''
2000-03-31   -0.066748
2000-04-30    0.838639
2000-05-31   -0.117388
2000-06-30   -0.517795
Freq: M, dtype: float64
'''

# Passing freq='D' advances every timestamp by 3 days (+3 days)
ts.shift(3, freq='D')

'''
2000-02-03   -0.066748
2000-03-03    0.838639
2000-04-03   -0.117388
2000-05-03   -0.517795
dtype: float64
'''

# Passing freq='90T' advances every timestamp by 90 minutes
ts.shift(1, freq='90T')

'''
2000-01-31 01:30:00   -0.066748
2000-02-29 01:30:00    0.838639
2000-03-31 01:30:00   -0.117388
2000-04-30 01:30:00   -0.517795
dtype: float64
'''

Shifting Dates with Offsets

# pandas date offsets can also be used with datetime or Timestamp objects
from pandas.tseries.offsets import Day, MonthEnd
now = datetime(2011, 11, 17)
now + 3 * Day()

'''Timestamp('2011-11-20 00:00:00')'''

# Adding an anchored offset such as MonthEnd or BusinessMonthEnd rolls the date forward to the next date that matches the frequency rule
now + MonthEnd()

'''Timestamp('2011-11-30 00:00:00')'''

now + MonthEnd(2)

'''Timestamp('2011-12-31 00:00:00')'''

# Anchored offsets can explicitly roll dates forward or backward with rollforward and rollback
offset = MonthEnd()
offset.rollforward(now)

'''Timestamp('2011-11-30 00:00:00')'''

offset.rollback(now)

'''Timestamp('2011-10-31 00:00:00')'''
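The difference between adding an anchored offset and rolling with it is easiest to see when the date already sits on the anchor. The sketch below is an illustrative addition that reuses the same date as above.

```python
from datetime import datetime

import pandas as pd
from pandas.tseries.offsets import MonthEnd

offset = MonthEnd()
now = datetime(2011, 11, 17)

forward = offset.rollforward(now)   # next month end on or after the date
back = offset.rollback(now)         # previous month end on or before the date

# A date that is already a month end is left unchanged by rollforward ...
on_anchor = offset.rollforward(datetime(2011, 11, 30))
# ... whereas adding the offset always moves to the following month end
added = pd.Timestamp('2011-11-30') + offset

print(forward, back, on_anchor, added)
```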

# Using these rolling methods together with groupby
ts = pd.Series(np.random.randn(20),
               index=pd.date_range('1/15/2000', periods=20, freq='4d'))
ts

'''
2000-01-15   -0.116696
2000-01-19    2.389645
2000-01-23   -0.932454
2000-01-27   -0.229331
2000-01-31   -1.140330
2000-02-04    0.439920
2000-02-08   -0.823758
2000-02-12   -0.520930
2000-02-16    0.350282
2000-02-20    0.204395
2000-02-24    0.133445
2000-02-28    0.327905
2000-03-03    0.072153
2000-03-07    0.131678
2000-03-11   -1.297459
2000-03-15    0.997747
2000-03-19    0.870955
2000-03-23   -0.991253
2000-03-27    0.151699
2000-03-31    1.266151
Freq: 4D, dtype: float64
'''

# offset.rollforward is applied to each timestamp in the index of ts
ts.groupby(offset.rollforward).mean()

'''
2000-01-31   -0.005833
2000-02-29    0.015894
2000-03-31    0.150209
dtype: float64
'''

# The resample() method achieves the same result
ts.resample('M').mean()

'''
2000-01-31   -0.005833
2000-02-29    0.015894
2000-03-31    0.150209
Freq: M, dtype: float64
'''

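4. Time Zone Handling

# Time zones are usually expressed as offsets from UTC; time zone information comes from the third-party library pytz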
import pytz
pytz.common_timezones[-5:]

'''['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']'''

# pytz.timezone() returns a pytz time zone object
tz = pytz.timezone('America/New_York')
tz

'''<DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>'''

Time Zone Localization and Conversion

rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

'''
2012-03-09 09:30:00   -0.202469
2012-03-10 09:30:00    0.050718
2012-03-11 09:30:00    0.639869
2012-03-12 09:30:00    0.597594
2012-03-13 09:30:00   -0.797246
2012-03-14 09:30:00    0.472879
Freq: D, dtype: float64
'''

# The tz attribute returns the time zone of the index (None for a naive index)
print(ts.index.tz)

'''None'''

# Date ranges can be generated with a time zone set
pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')

'''
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00',
               '2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')
'''

ts

'''
2012-03-09 09:30:00   -0.202469
2012-03-10 09:30:00    0.050718
2012-03-11 09:30:00    0.639869
2012-03-12 09:30:00    0.597594
2012-03-13 09:30:00   -0.797246
2012-03-14 09:30:00    0.472879
Freq: D, dtype: float64
'''

# tz_localize() converts a naive time series into a localized (time zone-aware) one
ts_utc = ts.tz_localize('UTC')
ts_utc

'''
2012-03-09 09:30:00+00:00   -0.202469
2012-03-10 09:30:00+00:00    0.050718
2012-03-11 09:30:00+00:00    0.639869
2012-03-12 09:30:00+00:00    0.597594
2012-03-13 09:30:00+00:00   -0.797246
2012-03-14 09:30:00+00:00    0.472879
Freq: D, dtype: float64
'''

ts_utc.index

'''
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')
'''

# A localized time series can be converted to another time zone with tz_convert()
ts_utc.tz_convert('America/New_York')

'''
2012-03-09 04:30:00-05:00   -0.202469
2012-03-10 04:30:00-05:00    0.050718
2012-03-11 05:30:00-04:00    0.639869
2012-03-12 05:30:00-04:00    0.597594
2012-03-13 05:30:00-04:00   -0.797246
2012-03-14 05:30:00-04:00    0.472879
Freq: D, dtype: float64
'''

# Localize to the New York time zone with tz_localize(), then convert to UTC
ts_eastern = ts.tz_localize('America/New_York')
ts_eastern.tz_convert('UTC')

'''
2012-03-09 14:30:00+00:00   -0.202469
2012-03-10 14:30:00+00:00    0.050718
2012-03-11 13:30:00+00:00    0.639869
2012-03-12 13:30:00+00:00    0.597594
2012-03-13 13:30:00+00:00   -0.797246
2012-03-14 13:30:00+00:00    0.472879
dtype: float64
'''

ts_eastern.tz_convert('Europe/Berlin')

'''
2012-03-09 15:30:00+01:00   -0.202469
2012-03-10 15:30:00+01:00    0.050718
2012-03-11 14:30:00+01:00    0.639869
2012-03-12 14:30:00+01:00    0.597594
2012-03-13 14:30:00+01:00   -0.797246
2012-03-14 14:30:00+01:00    0.472879
dtype: float64
'''

# tz_localize() and tz_convert() are also instance methods on DatetimeIndex
ts.index.tz_localize('Asia/Shanghai')

'''
DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00',
               '2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00',
               '2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],
              dtype='datetime64[ns, Asia/Shanghai]', freq=None)
'''

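Operations with Time Zone-Aware Timestamp Objects

# Individual Timestamp objects can likewise be localized from naive to time zone-aware and converted between time zones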
stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('America/New_York')

'''Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')'''

# A time zone can also be passed when creating the Timestamp
stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')
stamp_moscow

'''Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')'''

# A time zone-aware Timestamp internally stores a UTC timestamp value as nanoseconds since the Unix epoch (1970-01-01)
stamp_utc.value

'''1299902400000000000'''

# This UTC nanosecond value is invariant across time zone conversions
stamp_utc.tz_convert('America/New_York').value

'''1299902400000000000'''
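Because only the time zone label changes during conversion, two Timestamp objects in different zones compare equal when they denote the same instant. The following sketch is an illustrative addition showing this for the UTC and New York values used above.

```python
import pandas as pd

stamp_utc = pd.Timestamp('2011-03-12 04:00', tz='UTC')
stamp_ny = stamp_utc.tz_convert('America/New_York')

# The wall-clock fields differ between the two zones ...
print(stamp_utc.hour, stamp_ny.hour)

# ... but both objects refer to the same instant
same_instant = stamp_utc == stamp_ny
print(same_instant)
```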

from pandas.tseries.offsets import Hour
stamp = pd.Timestamp('2012-03-12 01:30', tz='US/Eastern')
stamp

'''Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')'''

# Adding Hour() advances the timestamp by one hour of absolute (UTC) time
stamp + Hour()

'''Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')'''

stamp = pd.Timestamp('2012-11-04 00:30', tz='US/Eastern')
stamp

'''Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')'''

stamp + 2 * Hour()

'''Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')'''

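Operations Between Different Time Zones

# When two time series with different time zones are combined, the timestamps are aligned in UTC and the result is in UTC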
rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

'''
2012-03-07 09:30:00    0.522356
2012-03-08 09:30:00   -0.546348
2012-03-09 09:30:00   -0.733537
2012-03-12 09:30:00    1.302736
2012-03-13 09:30:00    0.022199
2012-03-14 09:30:00    0.364287
2012-03-15 09:30:00   -0.922839
2012-03-16 09:30:00    0.312656
2012-03-19 09:30:00   -1.128497
2012-03-20 09:30:00   -0.333488
Freq: B, dtype: float64
'''

ts1 = ts[:7].tz_localize('Europe/London')
ts2 = ts1[2:].tz_convert('Europe/Moscow')
result = ts1 + ts2
result.index

'''
DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
               '2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
'''

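5. Periods and Period Arithmetic

The Period class represents a time span (a period), such as days, months, quarters, or years

# This Period object represents the full span from January 1, 2007 through December 31, 2007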
p = pd.Period(2007, freq='A-DEC')
p

'''Period('2007', 'A-DEC')'''

# Adding or subtracting integers shifts the period by its frequency (freq='A-DEC')
p + 5

'''Period('2012', 'A-DEC')'''

p - 2

'''Period('2005', 'A-DEC')'''

# The difference between two periods with the same frequency is a multiple of that frequency
pd.Period('2014', freq='A-DEC') - p

'''<7 * YearEnds: month=12>'''

# pd.period_range() constructs a regular range of periods
rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')
rng

'''PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')'''

# The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure
pd.Series(np.random.randn(6), index=rng)

'''
2000-01   -0.514551
2000-02   -0.559782
2000-03   -0.783408
2000-04   -1.797685
2000-05   -0.172670
2000-06    0.680215
Freq: M, dtype: float64
'''

# pd.PeriodIndex() builds a PeriodIndex from an array of strings
values = ['2001Q3', '2002Q2', '2003Q1']
index = pd.PeriodIndex(values, freq='Q-DEC')
index

'''PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')'''

Period Frequency Conversion

# Create an annual Period whose year ends in December
p = pd.Period('2007', freq='A-DEC')
p

'''Period('2007', 'A-DEC')'''

# asfreq() converts Period and PeriodIndex objects to another frequency
p.asfreq('M', how='start')

'''Period('2007-01', 'M')'''

p.asfreq('M', how='end')

'''Period('2007-12', 'M')'''

# Create an annual Period whose fiscal year ends in June
# In this case, July 2006 through June 2007 counts as fiscal year 2007
p = pd.Period('2007', freq='A-JUN')
p

'''Period('2007', 'A-JUN')'''

# With how='start', the first month of fiscal year 2007 is returned
# Because the fiscal year ends in June, that first month is 2006-07
p.asfreq('M', 'start')

'''Period('2006-07', 'M')'''

p.asfreq('M', 'end')

'''Period('2007-06', 'M')'''

# The examples above convert from low to high frequency; converting from high to low frequency (e.g., month to year) also works
p = pd.Period('Aug-2007', 'M')
p.asfreq('A-JUN')
# With 'A-JUN' the fiscal year ends in June, so 2007-08 belongs to fiscal year 2008

'''Period('2008', 'A-JUN')'''

p = pd.Period('Aug-2007', 'M')
p.asfreq('A-SEP')
# With 'A-SEP' the fiscal year ends in September, so 2007-08 belongs to fiscal year 2007

'''Period('2007', 'A-SEP')'''

# Whole PeriodIndex objects and time series can be converted in the same way
rng = pd.period_range('2006', '2009', freq='A-DEC')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

'''
2006    1.607578
2007    0.200381
2008   -0.834068
2009   -0.302988
Freq: A-DEC, dtype: float64
'''

ts.asfreq('M', how='start')

'''
2006-01    1.607578
2007-01    0.200381
2008-01   -0.834068
2009-01   -0.302988
Freq: M, dtype: float64
'''

ts.asfreq('B', how='end')

'''
2006-12-29    1.607578
2007-12-31    0.200381
2008-12-31   -0.834068
2009-12-31   -0.302988
Freq: B, dtype: float64
'''

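Quarterly Period Frequencies

# Quarterly data is usually reported relative to a fiscal year end, so a label like 2012Q4 can mean different things
# With freq='Q-JAN', the fiscal year ends in January
# so 2012Q4 covers the three months from November 2011 through January 2012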
p = pd.Period('2012Q4', freq='Q-JAN')
p

'''Period('2012Q4', 'Q-JAN')'''

p.asfreq('D', 'start')

'''Period('2011-11-01', 'D')'''

p.asfreq('D', 'end')

'''Period('2012-01-31', 'D')'''
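To confirm which months a fiscal quarter spans, the quarter can be converted to monthly frequency at both ends and enumerated. This is a small illustrative sketch; the variable names are not from the book.

```python
import pandas as pd

p = pd.Period('2012Q4', freq='Q-JAN')

# First and last month covered by the quarter
first_month = p.asfreq('M', how='start')
last_month = p.asfreq('M', how='end')

# Enumerate every month in between
months = pd.period_range(first_month, last_month, freq='M')

print(months)
print(len(months))
```

Repeating this with a different anchor month, such as 'Q-DEC', shows how the same label maps to different calendar months.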

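# Get the timestamp at 4 PM on the second-to-last business day of the quarter
# p.asfreq('B', 'e') - 1 gives the second-to-last business day of the quarter
# (p.asfreq('B', 'e') - 1).asfreq('T', 's') converts that day to minute frequency, at midnight (the start of the day)
# '+ 16 * 60' then adds the number of minutes up to 4 PM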
p4pm = (p.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
p4pm

'''Period('2012-01-30 16:00', 'T')'''

# Convert to a timestamp with to_timestamp()
p4pm.to_timestamp()

'''Timestamp('2012-01-30 16:00:00')'''

# pd.period_range() generates a quarterly range
rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')
ts = pd.Series(np.arange(len(rng)), index=rng)
ts

'''
2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32
'''

# Get the timestamp at 4 PM on the second-to-last business day of each quarter
new_rng = (rng.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
ts.index = new_rng.to_timestamp()
ts

'''
2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int32
'''

Converting Timestamps to Periods (and Back)

rng = pd.date_range('2000-01-01', periods=3, freq='M')
ts = pd.Series(np.random.randn(3), index=rng)
ts

'''
2000-01-31    1.663261
2000-02-29   -0.996206
2000-03-31    1.521760
Freq: M, dtype: float64
'''

# Series and DataFrame objects indexed by timestamps can be converted to periods with to_period()
pts = ts.to_period()
pts

'''
2000-01    1.663261
2000-02   -0.996206
2000-03    1.521760
Freq: M, dtype: float64
'''

rng = pd.date_range('1/29/2000', periods=6, freq='D')
ts2 = pd.Series(np.random.randn(6), index=rng)
ts2

'''
2000-01-29    0.244175
2000-01-30    0.423331
2000-01-31   -0.654040
2000-02-01    2.089154
2000-02-02   -0.060220
2000-02-03   -0.167933
Freq: D, dtype: float64
'''

# Duplicate periods in the index after conversion are allowed
ts2.to_period('M')

'''
2000-01    0.244175
2000-01    0.423331
2000-01   -0.654040
2000-02    2.089154
2000-02   -0.060220
2000-02   -0.167933
Freq: M, dtype: float64
'''

# Create quarterly periods for a fiscal year that ends in January
rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')
ts3 = pd.Series(np.arange(len(rng)), index=rng)
ts3

'''
2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32
'''

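# to_timestamp() converts periods back to timestamps
# how defaults to 'start', so midnight on the first day of each quarter is used as the index; how='end' is also available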
ts3.to_timestamp()

'''
2010-08-01    0
2010-11-01    1
2011-02-01    2
2011-05-01    3
2011-08-01    4
2011-11-01    5
Freq: QS-NOV, dtype: int32
'''

Creating a PeriodIndex from Arrays

data = pd.read_csv('examples/macrodata.csv')
data.head(5)

data.year.head()

'''
0    1959.0
1    1959.0
2    1959.0
3    1959.0
4    1960.0
Name: year, dtype: float64
'''

data.quarter.head()

'''
0    1.0
1    2.0
2    3.0
3    4.0
4    1.0
Name: quarter, dtype: float64
'''

index = pd.PeriodIndex(year=data.year, quarter=data.quarter,
                       freq='Q-DEC')
index

'''
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', length=203, freq='Q-DEC')
'''

data.index = index
data.head()

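6. Resampling and Frequency Conversion

Resampling: the process of converting a time series from one frequency to another

Downsampling: aggregating higher-frequency data to a lower frequency

Upsampling: converting lower-frequency data to a higher frequency

Not all resampling falls into these two categories; for example, converting W-WED to W-FRI is neither

The resample() method of pandas objects performs frequency conversion and works much like groupby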

rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

'''
2000-01-01    0.631634
2000-01-02   -1.594313
2000-01-03   -1.519937
2000-01-04    1.108752
2000-01-05    1.255853
                ...   
2000-04-05   -0.423776
2000-04-06    0.789740
2000-04-07    0.937568
2000-04-08   -2.253294
2000-04-09   -1.772919
Freq: D, Length: 100, dtype: float64
'''

ts.resample('M').mean()

'''
2000-01-31   -0.165893
2000-02-29    0.078606
2000-03-31    0.223811
2000-04-30   -0.063643
Freq: M, dtype: float64
'''

# kind='period' makes the result index a PeriodIndex
ts.resample('M', kind='period').mean()

'''
2000-01   -0.165893
2000-02    0.078606
2000-03    0.223811
2000-04   -0.063643
Freq: M, dtype: float64
'''

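Parameters of the resample() method (as listed in the book; newer pandas releases have removed fill_method, limit, and loffset)
freq	Resampling frequency, as a string or DateOffset object (e.g., '5min', 'M', or Second(1))
axis	Axis to resample on; default axis=0
fill_method	How to interpolate when upsampling; default is no interpolation; options are 'ffill' and 'bfill'
closed	In downsampling, which end of each interval is closed (inclusive): 'right' or 'left'
label	In downsampling, whether to label the aggregated result with the 'right' or 'left' bin edge (e.g., the five-minute interval from 9:30 to 9:35 can be labeled 9:30 or 9:35)
loffset	Time adjustment to the bin labels (e.g., '-1s' / Second(-1) shifts the aggregate labels one second earlier)
limit	When forward or backward filling, the maximum number of periods to fill
kind	Aggregate to periods ('period') or timestamps ('timestamp'); defaults to the index type of the time series
convention	When resampling periods, the convention for converting a low-frequency period to high frequency; default 'start', or 'end'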

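Downsampling

Things to consider when downsampling with resample():

- Which side of each interval is closed
- Whether to label each aggregated bin with the start or the end of the interval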

rng = pd.date_range('2000-01-01', periods=12, freq='T')
ts = pd.Series(np.arange(12), index=rng)
ts

'''
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32
'''

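# By default closed='left', which splits the index into the intervals [00:00, 00:05), [00:05, 00:10), ...
# The data is then aggregated over those intervals
# By default label='left', so the left edge of each interval is used as the row label in the result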
ts.resample('5min', closed='left').sum()

'''
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32
'''

ts.resample('5min', closed='right').sum()

'''
1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32
'''

ts.resample('5min', closed='right', label='right').sum()

'''
2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32
'''

# Setting loffset='1s' shifts the result index one second to the right, i.e., adds 1 second
ts.resample('5min', closed='right',
            label='right', loffset='1s').sum()

'''
2000-01-01 00:00:01     0
2000-01-01 00:05:01    15
2000-01-01 00:10:01    40
2000-01-01 00:15:01    11
Freq: 5T, dtype: int32
'''
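The loffset argument is not accepted by resample() in newer pandas releases (to my knowledge it was removed in pandas 2.0). A minimal sketch of an equivalent, assuming only that the goal is to shift the bin labels by one second, is to add a Timedelta to the index of the aggregated result:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Aggregate with right-closed, right-labeled five-minute bins
result = ts.resample('5min', closed='right', label='right').sum()

# Shift the labels by one second after aggregation
result.index = result.index + pd.Timedelta(seconds=1)

print(result)
```

Shifting the labels afterwards does not change which observations fall into which bin; only the displayed index moves.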

Open-High-Low-Close (OHLC) Resampling

# ohlc() computes the first (open), maximum (high), minimum (low), and last (close) values within each interval
ts.resample('5min').ohlc()
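Since the rendered output table is not reproduced in these notes, the following self-contained sketch rebuilds the same twelve-minute series and shows what ohlc() returns for it.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# One row per five-minute bin, with columns open, high, low, close
bars = ts.resample('5min').ohlc()

print(bars)
```

Because the series increases monotonically, open equals low and close equals high in every bin; the last bin contains only two observations.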

Upsampling and Interpolation

# Upsampling requires no aggregation; use asfreq() for the frequency conversion
frame = pd.DataFrame(np.random.randn(2, 4),
                     index=pd.date_range('1/1/2000', periods=2,
                                         freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame

# Converting with asfreq() from a lower to a higher frequency introduces missing values
df_daily = frame.resample('D').asfreq()
df_daily

# To avoid missing values, fill them forward with ffill()
frame.resample('D').ffill()

# Passing limit=2 to ffill() forward fills at most two rows
frame.resample('D').ffill(limit=2)
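The effect of the three upsampling variants can be compared by counting the non-missing values each one produces. This sketch uses a two-point weekly Series with made-up values instead of the random DataFrame above.

```python
import pandas as pd

# Two Wednesdays: 2000-01-05 and 2000-01-12
weekly = pd.Series([1.0, 2.0],
                   index=pd.date_range('2000-01-05', periods=2, freq='W-WED'))

daily = weekly.resample('D').asfreq()          # gaps become NaN
filled = weekly.resample('D').ffill()          # every gap filled forward
limited = weekly.resample('D').ffill(limit=2)  # at most two rows filled per gap

print(daily.notna().sum(), filled.notna().sum(), limited.notna().sum())
```

With limit=2, only the two days following each observation are filled, so the remaining days of the week stay missing.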

frame.resample('W-THU').ffill()

Resampling with Periods

frame = pd.DataFrame(np.random.randn(24, 4),
                     index=pd.period_range('1-2000', '12-2001',
                                           freq='M'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame[:5]

# Downsample to annual frequency 'A-DEC', with December as the last month of the year
annual_frame = frame.resample('A-DEC').mean()
annual_frame

# Upsample and forward fill with ffill(); the default is convention='start'
# Q-DEC: quarterly, with December as the last month of the last quarter
annual_frame.resample('Q-DEC').ffill()

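# Upsample and forward fill with ffill()
# Passing convention='end' places each annual value in the last quarter of its year, so the result runs from the last quarter of the first year to the last quarter of the final year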
annual_frame.resample('Q-DEC', convention='end').ffill()

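When downsampling, each period of the source frequency must be a subperiod of a period in the result frequency (e.g., month to year)
When upsampling, each period of the source frequency must be a superperiod of the periods in the result frequency (e.g., year to month)
Study the following operations carefully; drawing the periods on a timeline helps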

annual_frame.resample('Q-MAR').ffill()

annual_frame.resample('Q-MAR', convention='end').ffill()

annual_frame.resample('Q-APR').ffill()

annual_frame.resample('Q-APR', convention='end').ffill()

7. Moving Window Functions

Like other statistical functions, moving window functions automatically exclude missing data

close_px_all = pd.read_csv('examples/stock_px_2.csv',
                           parse_dates=True, index_col=0)
close_px = close_px_all[['AAPL', 'MSFT', 'XOM']]
close_px

# Resample to business-day frequency
close_px = close_px.resample('B').ffill()
close_px

%matplotlib inline
close_px.AAPL.plot()
# rolling() can be called on a Series or DataFrame along with a window (the number 250 passed below)
# Here the data is grouped over a sliding window of 250 observations
close_px.AAPL.rolling(250).mean().plot()

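# With min_periods=10, the standard deviation is first computed once 10 observations are available, then over the first 11, and so on
# Once 250 observations are available, a true sliding 250-observation window is used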
appl_std250 = close_px.AAPL.rolling(250, min_periods=10).std()
appl_std250[5:12]

'''
2003-01-09         NaN
2003-01-10         NaN
2003-01-13         NaN
2003-01-14         NaN
2003-01-15    0.077496
2003-01-16    0.074760
2003-01-17    0.112368
Freq: B, Name: AAPL, dtype: float64
'''

appl_std250.plot()

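# expanding() lets the window grow: the left edge stays at the start while the right edge moves forward; here we take the mean over the expanding window
# e.g., the value at 2003-01-16 is the mean of the original values at 01-15 and 01-16
# e.g., the value at 2003-01-17 is the mean of the original values from 01-15 through 01-17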
expanding_mean = appl_std250.expanding().mean()
expanding_mean[5:12]

'''
2003-01-09         NaN
2003-01-10         NaN
2003-01-13         NaN
2003-01-14         NaN
2003-01-15    0.077496
2003-01-16    0.076128
2003-01-17    0.088208
Freq: B, Name: AAPL, dtype: float64
'''

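# Passing logy=True plots the y axis on a log scale
close_px.rolling(60).mean().plot(logy=True)

# Passing '20D' uses a fixed window of 20 calendar days; the integer 250 passed earlier means 250 rows of the index
close_px.rolling('20D').mean().head()

# Passing '3D' takes the moving average over a fixed window of 3 calendar days
# so the mean at 2003-01-06 equals that day's own value, because there is no data for the 4th and 5th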
close_px.rolling('3D').mean().head()

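Exponentially Weighted Functions

# span=30 sets the decay factor alpha = 2 / (span + 1) = 2/31, so the relative weights of successively older observations are 2/31, 2/31 * 29/31, ...
aapl_px = close_px.AAPL['2006':'2007']
# Simple moving average
ma60 = aapl_px.rolling(30, min_periods=20).mean()
# Exponentially weighted moving average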
ewma60 = aapl_px.ewm(span=30).mean()
ma60.plot(style='k--', label='Simple MA')
ewma60.plot(style='k-', label='EW MA')
plt.legend()
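The relationship between span and the weights can be verified numerically. The sketch below assumes the default adjust=True behavior of ewm(), in which the weights (1 - alpha)^i are normalized by their sum; the data values are made up.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 4.0, 8.0])
span = 3
alpha = 2 / (span + 1)  # decay factor implied by the span

# Weight (1 - alpha)^i for the observation i steps in the past;
# reversed so that the newest observation gets weight 1
weights = (1 - alpha) ** np.arange(len(x))[::-1]

manual_last = (weights * x.to_numpy()).sum() / weights.sum()
pandas_last = x.ewm(span=span).mean().iloc[-1]

print(manual_last, pandas_last)
```

A larger span gives a smaller alpha, so older observations decay more slowly and the average reacts less to recent changes.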

Binary Moving Window Functions

These operate on two time series at once using the same moving window size, e.g., corr() or covariance

spx_px = close_px_all['SPX']
spx_rets = spx_px.pct_change()
spx_rets

'''
2003-01-02         NaN
2003-01-03   -0.000484
2003-01-06    0.022474
2003-01-07   -0.006545
2003-01-08   -0.014086
                ...   
2011-10-10    0.034125
2011-10-11    0.000544
2011-10-12    0.009795
2011-10-13   -0.002974
2011-10-14    0.017380
Name: SPX, Length: 2214, dtype: float64
'''

returns = close_px.pct_change()
returns

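# Note that returns.AAPL has 2292 rows while spx_rets has 2214 rows,
# because close_px was resampled to business-day frequency above while spx_px was not
# so their row indexes are not identical when the moving window is computed
# corr() computes the rolling correlation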
corr = returns.AAPL.rolling(125, min_periods=100).corr(spx_rets)
corr.plot()


corr = returns.rolling(125, min_periods=100).corr(spx_rets)
corr.plot()

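User-Defined Moving Window Functions

# apply() on rolling() and related methods lets you use your own array function over a moving window
from scipy.stats import percentileofscore
# Percentile rank of 0.02 within the window, i.e., what percentage of the samples x fall below 0.02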
score_at_2percent = lambda x: percentileofscore(x, 0.02)
result = returns.AAPL.rolling(250).apply(score_at_2percent)
result.plot()
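The same apply() mechanism works with any function that reduces a window to a single number. The following sketch avoids the SciPy dependency and uses a hypothetical value_range helper on made-up data; raw=True passes each window as a NumPy array.

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])

def value_range(window):
    """Spread between the largest and smallest value in the window."""
    return window.max() - window.min()

# raw=True hands each window to the function as a NumPy array
result = s.rolling(3).apply(value_range, raw=True)

print(result.tolist())
```

The only requirement on the function is that it returns a single value for each window.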

Source: https://blog.csdn.net/KikuWong/article/details/118253111
