1. Model Free
1.1 Monte Carlo
1.1.1 Value Iteration
SARSA
1. Derive an ε-greedy policy from the current Q
2. Sample trajectories $(s_1, a_1, r_1, s_2, a_2, r_2, \dots)$ and compute first-visit MC returns
3. Update $Q(s,a) = \frac{1}{N(s,a)}\sum_{i} G_i^t(s,a)$
4. Improve the policy based on the updated Q values
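A minimal sketch of this first-visit MC control loop, on a hypothetical 5-state chain (the environment, `step`, and all hyperparameters here are illustrative assumptions, not from the original notes):

```python
import random
from collections import defaultdict

# Toy chain environment (assumption): states 0..4, actions 0 (left) / 1 (right);
# reward 1 only on reaching state 4, which ends the episode.
GOAL = 4

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def eps_greedy(Q, s, eps):
    if random.random() < eps:
        return random.choice([0, 1])
    # random tie-break so an untrained Q does not lock onto one action
    return max([0, 1], key=lambda a: (Q[(s, a)], random.random()))

def mc_control(episodes=2000, eps=0.1, gamma=0.9):
    Q = defaultdict(float)
    N = defaultdict(int)                     # visit counts N(s, a)
    for _ in range(episodes):
        s, done, traj = 0, False, []
        while not done:                      # step 2: sample a full trajectory
            a = eps_greedy(Q, s, eps)        # step 1: ε-greedy w.r.t. current Q
            s2, r, done = step(s, a)
            traj.append((s, a, r))
            s = s2
        G, returns = 0.0, []
        for (s, a, r) in reversed(traj):     # compute returns backwards
            G = r + gamma * G
            returns.append((s, a, G))
        seen = set()
        for (s, a, G) in reversed(returns):  # first visit: earliest (s, a) wins
            if (s, a) not in seen:
                seen.add((s, a))
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]  # step 3: incremental mean
    return Q

Q = mc_control()
```

Step 4 is implicit: the ε-greedy policy automatically improves as Q changes between episodes.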
1.1.2 Policy Iteration
1.1.3 Policy Gradient
1.2 TD
1.2.1 Value Iteration
(can be done in non-episodic environments)
SARSA
1. Works in the non-episodic setting; needs the tuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$
2. Update $Q(s_t,a_t) = Q(s_t,a_t) + \alpha\,(r_t + \gamma\, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t))$
3. Improve the policy based on the updated Q values
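The TD update above can be sketched as follows, reusing a hypothetical 5-state chain (the environment and hyperparameters are illustrative assumptions); note that the next action $a_{t+1}$ is drawn from the same ε-greedy policy, which is what makes SARSA on-policy:

```python
import random
from collections import defaultdict

# Toy chain environment (assumption): states 0..4, actions 0 (left) / 1 (right);
# reward 1 on reaching state 4 (terminal).
GOAL = 4

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def eps_greedy(Q, s, eps=0.1):
    if random.random() < eps:
        return random.choice([0, 1])
    # random tie-break so an untrained Q does not lock onto one action
    return max([0, 1], key=lambda a: (Q[(s, a)], random.random()))

def sarsa(episodes=500, alpha=0.5, gamma=0.9, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = 0, False
        a = eps_greedy(Q, s, eps)
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(Q, s2, eps)     # on-policy: a_{t+1} from the same policy
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # TD update per transition
            s, a = s2, a2
    return Q

Q = sarsa()
```

Unlike the MC version, the update happens after every single transition, so no full episode is needed.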
Q-learning: off-policy learning
1. Works in the non-episodic setting; needs the tuple $(s_t, a_t, r_t, s_{t+1})$
2. Update $Q(s_t,a_t) = Q(s_t,a_t) + \alpha\,(r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t))$
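A sketch of this update on the same kind of hypothetical 5-state chain (environment and hyperparameters are illustrative assumptions); the only change from SARSA is the $\max_{a'}$ in the target, which makes it off-policy:

```python
import random
from collections import defaultdict

# Toy chain environment (assumption): states 0..4, actions 0 (left) / 1 (right);
# reward 1 on reaching state 4 (terminal).
GOAL = 4

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def eps_greedy(Q, s, eps=0.1):
    if random.random() < eps:
        return random.choice([0, 1])
    # random tie-break so an untrained Q does not lock onto one action
    return max([0, 1], key=lambda a: (Q[(s, a)], random.random()))

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = eps_greedy(Q, s, eps)              # behavior policy: ε-greedy
            s2, r, done = step(s, a)
            best_next = max(Q[(s2, 0)], Q[(s2, 1)])  # target uses max over a'
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
```

Because the target maximizes over $a'$ regardless of which action is actually taken next, the learned Q estimates the greedy policy's values even while behaving ε-greedily.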
DQN: approximate Q(s, a) with a neural network, trained with experience replay and a periodically synced target network
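DQN proper trains a neural network; as a structural sketch only, the loop below keeps the two ingredients DQN adds on top of Q-learning, an experience replay buffer and a periodically synced target network, while a plain dict stands in for the network so the example stays dependency-free (the environment and every hyperparameter are illustrative assumptions):

```python
import random
from collections import defaultdict, deque

# Toy chain environment (assumption): states 0..4, actions 0 (left) / 1 (right);
# reward 1 on reaching state 4 (terminal).
GOAL = 4

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def dqn_style(episodes=300, alpha=0.5, gamma=0.9, eps=0.2,
              batch=16, sync_every=50):
    Q = defaultdict(float)            # stand-in for the online network
    target_Q = defaultdict(float)     # stand-in for the target network
    buffer = deque(maxlen=1000)       # experience replay buffer
    updates = 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if random.random() < eps:
                a = random.choice([0, 1])
            else:                     # random tie-break for untrained Q
                a = max([0, 1], key=lambda x: (Q[(s, x)], random.random()))
            s2, r, done = step(s, a)
            buffer.append((s, a, r, s2, done))
            if len(buffer) >= batch:  # learn from a random minibatch of old transitions
                for (bs, ba, br, bs2, bd) in random.sample(list(buffer), batch):
                    tgt = br if bd else br + gamma * max(target_Q[(bs2, 0)],
                                                         target_Q[(bs2, 1)])
                    Q[(bs, ba)] += alpha * (tgt - Q[(bs, ba)])
                updates += 1
                if updates % sync_every == 0:
                    target_Q = defaultdict(float, Q)  # sync target <- online
            s = s2
    return Q

Q = dqn_style()
```

Replay breaks the correlation between consecutive samples, and the frozen target network keeps the regression target stable between syncs; both tricks matter far more once a real network replaces the dict.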
1.2.2 Policy Gradient
2. Model Based
Source: https://blog.csdn.net/huoxingshu12345/article/details/111528597