

The Overhead of C++ std::function

Date: 2024-02-15

You often see discussions claiming that std::function carries a large overhead and should be used with caution.

So where exactly does that overhead come from? I found an article that profiles std::function. Its English is fairly simple, so I won't translate it; readers who find English difficult can also skip straight to the numbers below:

Popular folklore demands that you avoid std::function if you care about performance.

But is it really true? How bad is it?

Nanobenchmarking std::function

Benchmarking is hard. Microbenchmarking is a dark art. Many people insist that nanobenchmarking is out of reach for us mortals.

But that won’t stop us: let’s benchmark the overhead of creating and calling a std::function.

We have to tread extra carefully here. Modern desktop CPUs are insanely complex, often with deep pipelines, out-of-order execution, sophisticated branch prediction, prefetching, multiple levels of caches, hyperthreading, and many more arcane performance-enhancing features.

The other enemy is the compiler.

Any sufficiently advanced optimizing compiler is indistinguishable from magic.

We’ll have to make sure that our code-to-be-benchmarked is not being optimized away. Luckily, volatile is still not fully deprecated and can be (ab)used to prevent many optimizations. In this post we will only measure throughput (how long does it take to call the same function 1000000 times?). We’re going to use the following scaffold:

#include <chrono>
#include <iostream>

template <class F>
void benchmark(F&& f, float a_in = 0.0f, float b_in = 0.0f)
{
    auto constexpr count = 1'000'000;

    // volatile inputs/output keep the compiler from folding or removing the work
    volatile float a = a_in;
    volatile float b = b_in;
    volatile float r;

    auto const t_start = std::chrono::high_resolution_clock::now();
    for (auto i = 0; i < count; ++i)
        r = f(a, b);
    auto const t_end = std::chrono::high_resolution_clock::now();

    auto const dt = std::chrono::duration<double>(t_end - t_start).count();
    std::cout << dt / count * 1e9 << " ns / op" << std::endl;
}

Double-checking with godbolt, we can verify that the compiler does not optimize away the function body, even though we only compute 0.0f + 0.0f in a loop. The loop itself has some overhead, and sometimes the compiler will unroll parts of the loop.
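As an aside (my addition, not part of the original scaffold): besides volatile, another common way to keep a value alive is an empty inline-asm statement, similar in spirit to the "DoNotOptimize" helpers found in benchmarking libraries. A minimal sketch, assuming a GCC/Clang-compatible compiler:

// Optimization-barrier sketch (not from the original post): the empty asm
// statement forces the compiler to assume `value` is read, so the computation
// that produced it cannot be eliminated. Requires GCC/Clang extended asm.
template <class T>
inline void do_not_optimize(T const& value)
{
    asm volatile("" : : "r,m"(value) : "memory");
}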

Baseline

Our test system in the following benchmarks is an Intel Core i9-9900K running at 4.8 GHz (a modern high-end consumer CPU at the time of writing). The code is compiled with clang-7 and the libstdc++ standard library using -O2 and -march=native.

We start with a few basic tests:

benchmark([](float, float) { return 0.0f; });       // 0.21 ns / op (1 cycle  / op)
benchmark([](float a, float b) { return a + b; });  // 0.22 ns / op (1 cycle  / op)
benchmark([](float a, float b) { return a / b; });  // 0.62 ns / op (3 cycles / op)

The baseline is about 1 cycle per operation, and the a / b test verifies that we can reproduce the throughput of basic operations (a good reference is AsmGrid, X86 Perf on the upper right). (I repeated all benchmarks multiple times and chose the mode of the distribution.)
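To make "repeated multiple times" concrete, here is a minimal sketch (my addition; the helper name run_repeated is made up, and it assumes a variant of benchmark that returns ns / op instead of printing it). It keeps the fastest run, a common alternative to taking the mode:

#include <algorithm>

// Rerun a measurement several times and keep the fastest run, to filter out
// scheduling and frequency noise. `measure` is any callable returning ns / op.
template <class Measure>
double run_repeated(Measure&& measure, int runs = 20)
{
    double best = 1e300;
    for (int i = 0; i < runs; ++i)
        best = std::min(best, measure());
    return best;
}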

Calling Functions

The first thing we want to know: How expensive is a function call?

using fun_t = float(float, float);

// inlineable direct call
float funA(float a, float b) { return a + b; }

// non-inlined direct call
__attribute__((noinline)) float funB(float a, float b) { return a + b; }

// non-inlined indirect call
fun_t* funC; // set externally to funA

// visible lambda
auto funD = [](float a, float b) { return a + b; };

// std::function with visible function
auto funE = std::function<fun_t>(funA);

// std::function with non-inlined function
auto funF = std::function<fun_t>(funB);

// std::function with function pointer
auto funG = std::function<fun_t>(funC);

// std::function with visible lambda
auto funH = std::function<fun_t>(funD);

// std::function with direct lambda
auto funI = std::function<fun_t>([](float a, float b) { return a + b; });

The results:

benchmark(funA); // 0.22 ns / op (1 cycle  / op)
benchmark(funB); // 1.04 ns / op (5 cycles / op)
benchmark(funC); // 1.04 ns / op (5 cycles / op)
benchmark(funD); // 0.22 ns / op (1 cycle  / op)
benchmark(funE); // 1.67 ns / op (8 cycles / op)
benchmark(funF); // 1.67 ns / op (8 cycles / op)
benchmark(funG); // 1.67 ns / op (8 cycles / op)
benchmark(funH); // 1.25 ns / op (6 cycles / op)
benchmark(funI); // 1.25 ns / op (6 cycles / op)

This suggests that only A and D are inlined and that there is some additional optimization possible when using std::function with a lambda.
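To see where the extra cycles come from, it helps to remember that std::function is a type-erasing wrapper: every call goes through at least one indirect call to a stored trampoline. The following stripped-down sketch (my addition; an illustration, not libstdc++'s actual implementation, and it neither owns nor copies the callable) shows the shape of that call path:

#include <utility>

template <class Sig> struct tiny_function;

template <class R, class... Args>
struct tiny_function<R(Args...)>
{
    using invoke_t = R (*)(void*, Args...);

    void*    obj    = nullptr; // type-erased pointer to the callable
    invoke_t invoke = nullptr; // trampoline that casts back and calls it

    template <class F>
    tiny_function(F& f)
        : obj(&f)
        , invoke([](void* p, Args... args) -> R {
              return (*static_cast<F*>(p))(std::forward<Args>(args)...);
          })
    {
    }

    // The call is an indirect call through `invoke`; unlike funA/funD the
    // compiler generally cannot inline it at the call site.
    R operator()(Args... args) const { return invoke(obj, std::forward<Args>(args)...); }
};

// usage: auto add = [](float a, float b) { return a + b; };
//        tiny_function<float(float, float)> g(add);
//        float r = g(1.0f, 2.0f); // indirect call through g.invoke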

Constructing std::function

We can also measure how long it takes to construct or copy a std::function:

std::function<float(float, float)> f;
benchmark([&]{ f = {}; });   // 0.42 ns / op ( 2 cycles / op)
benchmark([&]{ f = funA; }); // 4.37 ns / op (21 cycles / op)
benchmark([&]{ f = funB; }); // 4.37 ns / op (21 cycles / op)
benchmark([&]{ f = funC; }); // 4.37 ns / op (21 cycles / op)
benchmark([&]{ f = funD; }); // 1.46 ns / op ( 7 cycles / op)
benchmark([&]{ f = funE; }); // 5.00 ns / op (24 cycles / op)
benchmark([&]{ f = funF; }); // 5.00 ns / op (24 cycles / op)
benchmark([&]{ f = funG; }); // 5.00 ns / op (24 cycles / op)
benchmark([&]{ f = funH; }); // 4.37 ns / op (21 cycles / op)
benchmark([&]{ f = funI; }); // 4.37 ns / op (21 cycles / op)

The result of f = funD suggests that constructing a std::function directly from a lambda is pretty fast. Let’s check that when using different capture sizes:

struct b4  { int32_t x; };
struct b8  { int64_t x; };
struct b16 { int64_t x, y; };

benchmark([&]{ f = [](float, float) { return 0; }; });          // 1.46 ns / op ( 7 cycles / op)
benchmark([&]{ f = [x = b4{}](float, float) { return 0; }; });  // 4.37 ns / op (21 cycles / op)
benchmark([&]{ f = [x = b8{}](float, float) { return 0; }; });  // 4.37 ns / op (21 cycles / op)
benchmark([&]{ f = [x = b16{}](float, float) { return 0; }; }); // 1.66 ns / op ( 8 cycles / op)

I didn't have the patience to untangle the assembly or the libstdc++ implementation to check where this behavior originates. You obviously have to pay for the capture, and I think what we see here is a strange interaction between some kind of small-buffer optimization and the compiler hoisting the construction of b16{} out of our measurement loop.
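One way to probe the small-buffer behavior without reading the library source is to count heap allocations while constructing the std::function. A minimal sketch (my addition, not from the original post; the counter g_alloc_count and the helper allocations_for are made up for illustration):

#include <cstdio>
#include <cstdlib>
#include <functional>
#include <new>
#include <utility>

static long g_alloc_count = 0;

// Replace the global allocation functions so every dynamic allocation is counted.
void* operator new(std::size_t n)
{
    ++g_alloc_count;
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Returns how many allocations constructing a std::function from `f` caused;
// 0 means the callable fit into the small buffer.
template <class F>
long allocations_for(F&& f)
{
    long const before = g_alloc_count;
    std::function<float(float, float)> fn = std::forward<F>(f);
    return g_alloc_count - before;
}

int main()
{
    struct b16 { long long x, y; };
    std::printf("no capture : %ld\n", allocations_for([](float, float) { return 0.0f; }));
    std::printf("16B capture: %ld\n", allocations_for([x = b16{}](float, float) { return 0.0f; }));
}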

Summary

I think there is a lot of fearmongering regarding std::function, and not all of it is justified.

My benchmarks suggest that on a modern microarchitecture, the following overheads can be expected with hot data and instruction caches:

calling a non-inlined function: 4 cycles
calling a function pointer: 4 cycles
calling a std::function of a lambda: 5 cycles
calling a std::function of a function or function pointer: 7 cycles
constructing an empty std::function: 7 cycles
constructing a std::function from a function or function pointer: 21 cycles
copying a std::function: 21..24 cycles
constructing a std::function from a non-capturing lambda: 7 cycles
constructing a std::function from a capturing lambda: 21+ cycles

A word of caution: the benchmarks really only represent the overhead relative to a + b. Different functions show slightly different overhead behavior as they might use different scheduler ports and execution units that might overlap differently with what the loop requires. Also, a lot of this depends on how willing the compiler is to inline.

We’ve only measured the throughput. The results are only valid for “calling the same function many times with different arguments”, not for “calling many different functions”. But that is a topic for another post.

