Analyzing CPU Instruction Data of Cryptocurrency Mining with Intel VTune Profiler

2022-08-03 00:33:34

Tags: LOAD VTune RETIRED MEM Profiler UOPS CYCLES MISS mining


Monero mining command:

Collection and Platform Info
    Application Command Line:    D:\share\xmrig-6.18.0-msvc-win64\xmrig-6.18.0\xmrig.exe -o fr.minexmr.com:443 -u 4971qQbWrJRUGDvEUUvqsw29MNz68Cus7d6DAsmTmGoZd4o9AL9FAJiFSvo5uZK1ezguR46n689Rk3zApMZTcB3gQfDMULX -p x --tls
    Operating System:    Microsoft Windows 10
    Computer Name:    DESKTOP-ALRVTLS
    Result Size:    1.7 GB (size of the full collected data set)
    Collection start time:    15:29:48 02/08/2022 UTC
    Collection stop time:    15:32:55 02/08/2022 UTC
    Collector Type:    Event-based sampling driver
    Finalization mode: Fast. If the number of collected samples exceeds the threshold, this mode limits the number of processed samples to speed up post-processing.
    CPU
        Name:    Intel(R) microarchitecture code named Rocketlake
        Frequency:    2.6 GHz
        Logical CPU Count:    12
        Cache Allocation Technology
            Level 2 capability:    not detected
            Level 3 capability:    not detected

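Analysis type:

Screenshot of the run:

After running for nearly 2 minutes, let's look at the results:

The full collection came to 1.7 GB, which is a bit frightening.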

Overall results:

Judging purely by performance, the bottleneck is in the back end.
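The back-end claim can be cross-checked against the raw counters. The sketch below is a rough check, not VTune's own calculation: it divides the TOPDOWN.BACKEND_BOUND_SLOTS and UOPS_RETIRED.SLOTS counts from the hardware event export in this post by TOPDOWN.SLOTS. Treating these simple ratios as the Back-End Bound and Retiring shares is an assumption, and the variable names are only illustrative.

```python
# Counts copied from the hardware event export in this post.
TOPDOWN_SLOTS = 13_658_454_097_535
TOPDOWN_BACKEND_BOUND_SLOTS = 9_234_752_770_425
UOPS_RETIRED_SLOTS = 3_905_945_858_910


def slot_fraction(slots: int, total_slots: int) -> float:
    """Share of all pipeline slots attributed to one top-down category."""
    return slots / total_slots


backend_bound = slot_fraction(TOPDOWN_BACKEND_BOUND_SLOTS, TOPDOWN_SLOTS)
retiring = slot_fraction(UOPS_RETIRED_SLOTS, TOPDOWN_SLOTS)
# Whatever is left over is shared by Front-End Bound and Bad Speculation.
remainder = 1.0 - backend_bound - retiring

print(f"Back-End Bound: {backend_bound:.1%}")
print(f"Retiring:       {retiring:.1%}")
print(f"Front-End Bound + Bad Speculation (remainder): {remainder:.1%}")
```

With these counts, roughly two thirds of all slots land in the back-end category, which is consistent with the bottleneck being there.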

 

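Drilling into Retiring to see what the retired instructions are mostly doing:

Floating-point (FP) arithmetic makes up a fairly large share, about 13%.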
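One way to sanity-check the 13% figure is to compare the only FP_ARITH_INST_RETIRED event present in the export (128B_PACKED_DOUBLE) against the uop and instruction totals. Which denominator VTune actually uses for its FP metric is not shown in this post, so both ratios below are assumptions offered for comparison only.

```python
# Counts copied from the hardware event export in this post.
FP_ARITH_128B_PACKED_DOUBLE = 563_478_403_845
UOPS_EXECUTED_THREAD = 4_300_326_450_480
INST_RETIRED_ANY = 3_769_987_000_000

# Assumption: relate packed-double FP instructions to executed uops.
fp_per_executed_uop = FP_ARITH_128B_PACKED_DOUBLE / UOPS_EXECUTED_THREAD
# Alternative view: relate them to all retired instructions.
fp_per_instruction = FP_ARITH_128B_PACKED_DOUBLE / INST_RETIRED_ANY

print(f"FP packed double / executed uops:        {fp_per_executed_uop:.1%}")
print(f"FP packed double / retired instructions: {fp_per_instruction:.1%}")
```

The first ratio comes out close to 13% and the second close to 15%, so the reported figure is at least in the same range as the raw counts.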

 

On the front end, items such as cache misses and branch mispredictions account for very little:
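Two quick ratios from the exported counters point the same way. This is a minimal sketch under stated assumptions: the share of uops delivered from the decoded uop cache (DSB) is taken as IDQ.DSB_UOPS over the sum of the three IDQ uop sources, and the misprediction rate as BR_MISP_RETIRED.ALL_BRANCHES over BR_INST_RETIRED.ALL_BRANCHES. Neither is claimed to be the formula VTune displays.

```python
# Counts copied from the hardware event export in this post.
IDQ_DSB_UOPS = 3_580_955_371_425
IDQ_MITE_UOPS = 335_040_502_560
IDQ_MS_UOPS = 4_468_634_055
BR_INST_RETIRED_ALL_BRANCHES = 179_656_042_170
BR_MISP_RETIRED_ALL_BRANCHES = 695_542_005

# Share of uops supplied by the decoded uop cache rather than the
# legacy decoders (MITE) or the microcode sequencer (MS).
total_idq_uops = IDQ_DSB_UOPS + IDQ_MITE_UOPS + IDQ_MS_UOPS
dsb_share = IDQ_DSB_UOPS / total_idq_uops

# Fraction of retired branches that were mispredicted.
mispredict_rate = BR_MISP_RETIRED_ALL_BRANCHES / BR_INST_RETIRED_ALL_BRANCHES

print(f"Uops delivered from the DSB: {dsb_share:.1%}")
print(f"Branch misprediction rate:   {mispredict_rate:.2%}")
```

A high DSB share together with a misprediction rate well below 1% fits a workload whose hot loop is small and predictable.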

 

 

 

On the back end:

Long-latency operations like divides and memory operations can cause this, as can too many operations being directed to a single execution port (for example, more multiply operations arriving in the back-end per cycle than the execution unit can support).

Judging from the description, the L2 cache appears to be what is holding things back: L1 shows 100% while L2 is very low, or at least that is how it reads.
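The exported CYCLE_ACTIVITY counters allow a rough split of memory stall cycles by cache level. The sketch below uses simple differencing of adjacent STALLS_* counters and divides by CPU_CLK_UNHALTED.THREAD. This is a simplified approximation, not the formula VTune uses for its L1/L2/L3/DRAM Bound metrics, and the level labels are illustrative.

```python
# Counts copied from the hardware event export in this post.
CPU_CLK_UNHALTED_THREAD = 3_122_854_800_000
STALLS_MEM_ANY = 1_551_274_653_810
STALLS_L1D_MISS = 1_527_559_582_665
STALLS_L2_MISS = 226_650_679_950
STALLS_L3_MISS = 162_225_486_675
MEM_LOAD_RETIRED_L1_MISS = 60_759_911_385
MEM_LOAD_RETIRED_L2_HIT = 54_858_822_870

# Each STALLS_Lx_MISS counter covers stalls while a miss at level x
# (or deeper) is outstanding, so adjacent differences isolate one level.
stall_cycles = {
    "L1": STALLS_MEM_ANY - STALLS_L1D_MISS,
    "L2": STALLS_L1D_MISS - STALLS_L2_MISS,
    "L3": STALLS_L2_MISS - STALLS_L3_MISS,
    "beyond L3": STALLS_L3_MISS,
}
shares = {
    level: cycles / CPU_CLK_UNHALTED_THREAD
    for level, cycles in stall_cycles.items()
}

# Of the retired loads that missed L1, how many were served by L2.
l2_hit_ratio = MEM_LOAD_RETIRED_L2_HIT / MEM_LOAD_RETIRED_L1_MISS

for level, share in shares.items():
    print(f"Stalled waiting on {level}: {share:.1%} of cycles")
print(f"L1-missing loads that hit L2: {l2_hit_ratio:.1%}")
```

Under this approximation the L2 row is by far the largest, at roughly 42% of cycles, even though about 90% of the loads that miss L1 do hit in L2. That is in line with the reading that most of the waiting happens at the L2 level.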

 

 

 

Looking at the call stack, a single module accounts for most of the time.

Now let's look at the event counts:

Exporting the hardware event types:

Hardware Events
    Hardware Event Type    Hardware Event Count
    ARITH.DIVIDER_ACTIVE    571,366,714,095
    BACLEARS.ANY    24,000,720
    BR_INST_RETIRED.ALL_BRANCHES    179,656,042,170
    BR_MISP_RETIRED.ALL_BRANCHES    695,542,005
    CPU_CLK_UNHALTED.DISTRIBUTED    2,762,526,000,000
    CPU_CLK_UNHALTED.REF_TSC    2,522,358,800,000
    CPU_CLK_UNHALTED.THREAD    3,122,854,800,000
    CPU_CLK_UNHALTED.THREAD_P    3,103,054,654,575
    CYCLE_ACTIVITY.CYCLES_L1D_MISS    2,207,076,621,210
    CYCLE_ACTIVITY.CYCLES_MEM_ANY    2,970,053,910,135
    CYCLE_ACTIVITY.STALLS_L1D_MISS    1,527,559,582,665
    CYCLE_ACTIVITY.STALLS_L2_MISS    226,650,679,950
    CYCLE_ACTIVITY.STALLS_L3_MISS    162,225,486,675
    CYCLE_ACTIVITY.STALLS_MEM_ANY    1,551,274,653,810
    CYCLE_ACTIVITY.STALLS_TOTAL    1,592,284,776,840
    DSB2MITE_SWITCHES.PENALTY_CYCLES    1,669,550,085
    DTLB_LOAD_MISSES.STLB_HIT:cmask=1    5,694,170,820
    DTLB_LOAD_MISSES.WALK_ACTIVE    84,254,527,560
    DTLB_STORE_MISSES.STLB_HIT:cmask=1    292,508,775
    DTLB_STORE_MISSES.WALK_ACTIVE    370,511,115
    EXE_ACTIVITY.1_PORTS_UTIL    273,300,409,950
    EXE_ACTIVITY.2_PORTS_UTIL    390,990,586,485
    EXE_ACTIVITY.BOUND_ON_STORES    195,000,585
    FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE    563,478,403,845
    FRONTEND_RETIRED.ANY_DSB_MISS    24,163,691,340
    FRONTEND_RETIRED.DSB_MISS    660,046,200
    FRONTEND_RETIRED.L2_MISS    24,001,680
    FRONTEND_RETIRED.LATENCY_GE_16    45,003,150
    FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1    25,053,253,605
    FRONTEND_RETIRED.LATENCY_GE_4    232,516,275
    ICACHE_16B.IFDATA_STALL    2,205,039,690
    ICACHE_64B.IFTAG_STALL    1,176,017,640
    IDQ.DSB_CYCLES_ANY    710,761,066,140
    IDQ.DSB_CYCLES_OK    619,500,929,250
    IDQ.DSB_UOPS    3,580,955,371,425
    IDQ.MITE_CYCLES_ANY    92,280,138,420
    IDQ.MITE_CYCLES_OK    67,200,100,800
    IDQ.MITE_UOPS    335,040,502,560
    IDQ.MS_SWITCHES    657,019,710
    IDQ.MS_UOPS    4,468,634,055
    IDQ_UOPS_NOT_DELIVERED.CORE    351,316,053,945
    IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE    38,835,116,505
    ILD_STALL.LCP    7,500,135
    INST_RETIRED.ANY    3,769,987,000,000
    INST_RETIRED.NOP    90,000,135
    INT_MISC.CLEAR_RESTEER_CYCLES    7,215,129,870
    INT_MISC.RECOVERY_CYCLES:cmask=1:e=yes    975,017,550
    INT_MISC.UOP_DROPPING    16,350,049,050
    L1D_PEND_MISS.FB_FULL    3,135,009,405
    L1D_PEND_MISS.FB_FULL_PERIODS    180,000,540
    L1D_PEND_MISS.L2_STALL    2,910,008,730
    L1D_PEND_MISS.PENDING    2,753,288,259,840
    L2_RQSTS.ALL_RFO    37,389,560,835
    L2_RQSTS.RFO_HIT    24,540,368,100
    LD_BLOCKS.STORE_FORWARD    3,000,090
    LD_BLOCKS_PARTIAL.ADDRESS_ALIAS    7,704,231,120
    MACHINE_CLEARS.COUNT    85,502,565
    MEM_INST_RETIRED.ALL_STORES    200,160,600,480
    MEM_INST_RETIRED.ANY    732,047,196,135
    MEM_INST_RETIRED.LOCK_LOADS    15,001,050
    MEM_INST_RETIRED.SPLIT_LOADS    9,000,270
    MEM_INST_RETIRED.SPLIT_STORES    12,000,360
    MEM_INST_RETIRED.STLB_MISS_LOADS    1,413,042,390
    MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT    600,330
    MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM    2,401,320
    MEM_LOAD_RETIRED.FB_HIT    136,277,038,725
    MEM_LOAD_RETIRED.L1_HIT    336,031,008,090
    MEM_LOAD_RETIRED.L1_MISS    60,759,911,385
    MEM_LOAD_RETIRED.L2_HIT    54,858,822,870
    MEM_LOAD_RETIRED.L3_HIT    4,997,549,265
    MEM_LOAD_RETIRED.L3_MISS    456,191,520
    OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD:cmask=4    9,735,029,205
    OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD    2,673,818,021,430
    OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO    1,002,168,006,495
    RESOURCE_STALLS.SCOREBOARD    5,067,152,010
    TOPDOWN.BACKEND_BOUND_SLOTS    9,234,752,770,425
    TOPDOWN.SLOTS    13,658,454,097,535
    UOPS_DECODED.DEC0    33,000,099,000
    UOPS_DECODED.DEC0:cmask=1    17,385,052,155
    UOPS_DISPATCHED.PORT_0    910,771,366,155
    UOPS_DISPATCHED.PORT_1    994,651,491,975
    UOPS_DISPATCHED.PORT_2_3    534,780,802,170
    UOPS_DISPATCHED.PORT_4_9    223,530,335,295
    UOPS_DISPATCHED.PORT_5    850,201,275,300
    UOPS_DISPATCHED.PORT_6    899,491,349,235
    UOPS_DISPATCHED.PORT_7_8    207,810,311,715
    UOPS_EXECUTED.CYCLES_GE_3    855,031,282,545
    UOPS_EXECUTED.THREAD    4,300,326,450,480
    UOPS_ISSUED.ANY    4,063,476,095,205
    UOPS_RETIRED.SLOTS    3,905,945,858,910

Notes on some of these events:

ARITH.DIVIDER_ACTIVE ==> Cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations. Division and square-root operations: this fits the profile of mining!

BACLEARS.ANY ==> The BACLEARS event counts the number of times the front end is resteered, mainly when the Branch Prediction Unit cannot provide a correct prediction and this is corrected by the Branch Address Calculator at the front end. The BACLEARS.ANY event counts the number of baclears for any type of branch. So this one belongs on the branch-prediction-miss side.

BR_INST_RETIRED.ALL_BRANCHES ==> Counts the number of retired branch instructions of any type. Branch prediction predicts the branch target and lets the processor start executing instructions long before the branch's true execution path is known. All branches use the Branch Prediction Unit (BPU) for prediction. The unit predicts the target address based not only on the EIP of the branch but also on the execution path through which execution reached that EIP. The BPU can efficiently predict the following branch types: conditional branches, direct calls and jumps, indirect calls and jumps, and returns.

CPU_CLK_UNHALTED.DISTRIBUTED ==> This event distributes cycle counts between active hyperthreads (that is, hyperthreads in C0). A hyperthread becomes inactive when it executes the HLT or MWAIT instruction. If all other hyperthreads are inactive (or disabled or not present), all counts are attributed to this hyperthread. To obtain the full count while the core is active, sum the counts from each hyperthread.

CYCLE_ACTIVITY.CYCLES_L1D_MISS ==> Cycles while L1 cache miss demand load is outstanding.
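To put the ARITH.DIVIDER_ACTIVE count in perspective, it helps to express it as a fraction of unhalted core cycles, next to the overall instructions-per-cycle figure. This is a minimal sketch using counts from the export. Choosing CPU_CLK_UNHALTED.THREAD as the denominator is an assumption, and the result is only a rough indicator, not a mining detector.

```python
# Counts copied from the hardware event export in this post.
ARITH_DIVIDER_ACTIVE = 571_366_714_095
CPU_CLK_UNHALTED_THREAD = 3_122_854_800_000
INST_RETIRED_ANY = 3_769_987_000_000

# Fraction of unhalted cycles during which the divide unit was busy.
divider_busy = ARITH_DIVIDER_ACTIVE / CPU_CLK_UNHALTED_THREAD
# Instructions retired per unhalted cycle.
ipc = INST_RETIRED_ANY / CPU_CLK_UNHALTED_THREAD

print(f"Divider busy: {divider_busy:.1%} of cycles")
print(f"IPC:          {ipc:.2f}")
```

With these counts the divide unit is busy for roughly 18% of cycles, which is unusually high for general-purpose software and supports treating divide and square-root activity as a candidate signal.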

 

That is far too many to read as is. Let's write a small program to sort them, then analyze.
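The original sorting program is not included in the post. A minimal sketch of one way to do it is shown here, assuming each exported line has the shape "EVENT_NAME COUNT" with thousands separators in the count. The sample text, the regular expression, and the function names are illustrative, and only a handful of events are embedded to keep it short.

```python
import re

# A few lines in the same "NAME COUNT" shape as the VTune export.
SAMPLE_EXPORT = """\
Hardware Event Type    Hardware Event Count
ARITH.DIVIDER_ACTIVE    571,366,714,095
BACLEARS.ANY    24,000,720
DTLB_LOAD_MISSES.STLB_HIT:cmask=1    5,694,170,820
TOPDOWN.SLOTS    13,658,454,097,535
UOPS_DECODED.DEC0    33,000,099,000
"""

# Event name (no spaces), whitespace, then a count with optional commas.
LINE_RE = re.compile(r"^\s*(\S+)\s+(\d[\d,]*)\s*$")


def parse_events(text):
    """Return (name, count) pairs; lines that do not match are skipped."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            name, raw_count = match.groups()
            events.append((name, int(raw_count.replace(",", ""))))
    return events


def sort_by_count(events):
    """Sort events from the largest count to the smallest."""
    return sorted(events, key=lambda event: event[1], reverse=True)


ranked = sort_by_count(parse_events(SAMPLE_EXPORT))
for name, count in ranked:
    print(name, count)
```

The header row is skipped automatically because its last token is not a number. Pasting the full export into SAMPLE_EXPORT should produce a listing in the same "NAME COUNT" form as the sorted one that follows.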

TOPDOWN.SLOTS 13658454097535   ==> skip; presumably only used for the top-down analysis
TOPDOWN.BACKEND_BOUND_SLOTS 9234752770425 ==> same as above
UOPS_EXECUTED.THREAD 4300326450480  ==> Number of uops to be executed per-thread each cycle. Probably of little use for mining detection.
UOPS_ISSUED.ANY 4063476095205   ==> Uops that Resource Allocation Table (RAT) issues to Reservation Station (RS). Probably of little use for mining detection.
UOPS_RETIRED.SLOTS 3905945858910
INST_RETIRED.ANY 3769987000000
IDQ.DSB_UOPS 3580955371425
CPU_CLK_UNHALTED.THREAD 3122854800000
CPU_CLK_UNHALTED.THREAD_P 3103054654575
CYCLE_ACTIVITY.CYCLES_MEM_ANY 2970053910135
CPU_CLK_UNHALTED.DISTRIBUTED 2762526000000
L1D_PEND_MISS.PENDING 2753288259840
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD 2673818021430
CPU_CLK_UNHALTED.REF_TSC 2522358800000
CYCLE_ACTIVITY.CYCLES_L1D_MISS 2207076621210
CYCLE_ACTIVITY.STALLS_TOTAL 1592284776840
CYCLE_ACTIVITY.STALLS_MEM_ANY 1551274653810
CYCLE_ACTIVITY.STALLS_L1D_MISS 1527559582665
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO 1002168006495
UOPS_DISPATCHED.PORT_1 994651491975
UOPS_DISPATCHED.PORT_0 910771366155
UOPS_DISPATCHED.PORT_6 899491349235
UOPS_EXECUTED.CYCLES_GE_3 855031282545
UOPS_DISPATCHED.PORT_5 850201275300
MEM_INST_RETIRED.ANY 732047196135
IDQ.DSB_CYCLES_ANY 710761066140
IDQ.DSB_CYCLES_OK 619500929250
ARITH.DIVIDER_ACTIVE 571366714095
FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE 563478403845
UOPS_DISPATCHED.PORT_2_3 534780802170
EXE_ACTIVITY.2_PORTS_UTIL 390990586485
IDQ_UOPS_NOT_DELIVERED.CORE 351316053945
MEM_LOAD_RETIRED.L1_HIT 336031008090
IDQ.MITE_UOPS 335040502560
EXE_ACTIVITY.1_PORTS_UTIL 273300409950
CYCLE_ACTIVITY.STALLS_L2_MISS 226650679950
UOPS_DISPATCHED.PORT_4_9 223530335295
UOPS_DISPATCHED.PORT_7_8 207810311715
MEM_INST_RETIRED.ALL_STORES 200160600480
BR_INST_RETIRED.ALL_BRANCHES 179656042170
CYCLE_ACTIVITY.STALLS_L3_MISS 162225486675
MEM_LOAD_RETIRED.FB_HIT 136277038725
IDQ.MITE_CYCLES_ANY 92280138420
DTLB_LOAD_MISSES.WALK_ACTIVE 84254527560
IDQ.MITE_CYCLES_OK 67200100800
MEM_LOAD_RETIRED.L1_MISS 60759911385
MEM_LOAD_RETIRED.L2_HIT 54858822870
IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE 38835116505
L2_RQSTS.ALL_RFO 37389560835
UOPS_DECODED.DEC0 33000099000
FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1 25053253605
L2_RQSTS.RFO_HIT 24540368100
FRONTEND_RETIRED.ANY_DSB_MISS 24163691340
UOPS_DECODED.DEC0:cmask=1 17385052155
INT_MISC.UOP_DROPPING 16350049050
OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD:cmask=4 9735029205
LD_BLOCKS_PARTIAL.ADDRESS_ALIAS 7704231120
INT_MISC.CLEAR_RESTEER_CYCLES 7215129870
DTLB_LOAD_MISSES.STLB_HIT:cmask=1 5694170820
RESOURCE_STALLS.SCOREBOARD 5067152010
MEM_LOAD_RETIRED.L3_HIT 4997549265
IDQ.MS_UOPS 4468634055
L1D_PEND_MISS.FB_FULL 3135009405
L1D_PEND_MISS.L2_STALL 2910008730
ICACHE_16B.IFDATA_STALL 2205039690
DSB2MITE_SWITCHES.PENALTY_CYCLES 1669550085
MEM_INST_RETIRED.STLB_MISS_LOADS 1413042390
ICACHE_64B.IFTAG_STALL 1176017640
INT_MISC.RECOVERY_CYCLES:cmask=1:e=yes 975017550
BR_MISP_RETIRED.ALL_BRANCHES 695542005
FRONTEND_RETIRED.DSB_MISS 660046200
IDQ.MS_SWITCHES 657019710
MEM_LOAD_RETIRED.L3_MISS 456191520
DTLB_STORE_MISSES.WALK_ACTIVE 370511115
DTLB_STORE_MISSES.STLB_HIT:cmask=1 292508775
FRONTEND_RETIRED.LATENCY_GE_4 232516275
EXE_ACTIVITY.BOUND_ON_STORES 195000585
L1D_PEND_MISS.FB_FULL_PERIODS 180000540
INST_RETIRED.NOP 90000135
MACHINE_CLEARS.COUNT 85502565
FRONTEND_RETIRED.LATENCY_GE_16 45003150
FRONTEND_RETIRED.L2_MISS 24001680
BACLEARS.ANY 24000720
MEM_INST_RETIRED.LOCK_LOADS 15001050
MEM_INST_RETIRED.SPLIT_STORES 12000360
MEM_INST_RETIRED.SPLIT_LOADS 9000270
ILD_STALL.LCP 7500135
LD_BLOCKS.STORE_FORWARD 3000090
MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM 2401320
MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT 600330

 

I'll continue the analysis tomorrow; I can barely keep my eyes open.

 

Source: https://www.cnblogs.com/bonelee/p/16545623.html
