
C++ std::function的開(kāi)銷

時(shí)間:2024-02-15

經(jīng)常看到說(shuō)function的開(kāi)銷比較大,慎用function之類的討論。

那function究竟哪里開(kāi)銷大,我找到了一篇為function做profile的文章,這篇文章中的英文比較簡(jiǎn)單,我就不翻譯了,英文吃力的朋友也可以直接看下面的數(shù)據(jù):

Popular folklore demands that you avoid std::function if you care about performance.

But is it really true? How bad is it?

Nanobenchmarking std::function

Benchmarking is hard. Microbenchmarking is a dark art. Many people insist that nanobenchmarking is out of reach for us mortals.

But that won’t stop us: let’s benchmark the overhead of creating and calling a std::function.

We have to tread extra carefully here. Modern desktop CPUs are insanely complex, often with deep pipelines, out-of-order execution, sophisticated branch prediction, prefetching, multiple levels of cache, hyperthreading, and many more arcane performance-enhancing features.

The other enemy is the compiler.

Any sufficiently advanced optimizing compiler is indistinguishable from magic.

We’ll have to make sure that our code-to-be-benchmarked is not being optimized away. Luckily, volatile is still not fully deprecated and can be (ab)used to prevent many optimizations. In this post we will only measure throughput (how long does it take to call the same function 1000000 times?). We’re going to use the following scaffold:

#include <chrono>
#include <iostream>

template <class F>
void benchmark(F&& f, float a_in = 0.0f, float b_in = 0.0f)
{
    auto constexpr count = 1'000'000;

    // volatile inputs and output prevent the compiler from folding the loop away
    volatile float a = a_in;
    volatile float b = b_in;
    volatile float r;

    auto const t_start = std::chrono::high_resolution_clock::now();

    for (auto i = 0; i < count; ++i)
        r = f(a, b);

    auto const t_end = std::chrono::high_resolution_clock::now();

    auto const dt = std::chrono::duration<double>(t_end - t_start).count();
    std::cout << dt / count * 1e9 << " ns / op" << std::endl;
}

Double-checking on Godbolt, we can verify that the compiler does not optimize away the function body even though we only compute 0.0f + 0.0f in a loop. The loop itself has some overhead, and sometimes the compiler will unroll parts of it.
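As an aside, volatile is not the only way to keep the optimizer honest. A GCC/Clang-specific alternative, popularized by Chandler Carruth's benchmarking talks and used in a similar form by Google Benchmark (my addition here; the article's scaffold does not use it), is an empty inline-assembly barrier:

// Optimizer barriers (a sketch; GCC/Clang only, not part of the article's scaffold).
// escape() tells the compiler that the data behind p may be read or written
// by the asm statement; clobber() tells it that all of memory may be.
// Both compile to zero instructions.
static void escape(void* p)
{
    asm volatile("" : : "g"(p) : "memory");
}

static void clobber()
{
    asm volatile("" : : : "memory");
}

Writing r = f(a, b); escape(&r); would keep the computation alive while still letting a and b live in registers, which volatile does not allow.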

Baseline

Our test system in the following benchmarks is an Intel Core i9-9900K running at 4.8 GHz (a modern high-end consumer CPU at the time of writing). The code is compiled with clang-7 and the libstdc++ standard library using -O2 and -march=native.
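For reference, the full compiler invocation would have looked something like clang++-7 -O2 -march=native bench.cc -o bench (the file name is my invention; the article only specifies the compiler, the standard library, and the flags, and clang-7's default of gnu++14 is recent enough for this code).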

We start with a few basic tests:

benchmark([](float, float) { return 0.0f; });      // 0.21 ns / op (1 cycle / op)
benchmark([](float a, float b) { return a + b; }); // 0.22 ns / op (1 cycle / op)
benchmark([](float a, float b) { return a / b; }); // 0.62 ns / op (3 cycles / op)

The baseline is about 1 cycle per operation and the a / b test verifies that we can reproduce the throughput of basic operations (a good reference is AsmGrid, X86 Perf on the upper right). (I’ve repeated all benchmarks multiple times and chose the mode of the distribution.)
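That last step is easy to automate. Below is a minimal sketch of such a repetition harness (my addition, assuming a hypothetical benchmark_ns_per_op() variant of the scaffold above that returns ns / op instead of printing it):

#include <algorithm>
#include <limits>

// Repeat a measurement and keep the fastest run. The author picked the mode
// of the distribution instead; the minimum is a simpler robust statistic that
// also filters out runs disturbed by unrelated system activity.
// benchmark_ns_per_op() is hypothetical, not part of the article's code.
template <class F>
double benchmark_best_of(F&& f, int repeats = 10)
{
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i)
        best = std::min(best, benchmark_ns_per_op(f));
    return best;
}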

Calling Functions

The first thing we want to know: How expensive is a function call?

using fun_t = float(float, float);

// inlineable direct call
float funA(float a, float b) { return a + b; }

// non-inlined direct call
__attribute__((noinline)) float funB(float a, float b) { return a + b; }

// non-inlined indirect call
fun_t* funC; // set externally to funA

// visible lambda
auto funD = [](float a, float b) { return a + b; };

// std::function with visible function
auto funE = std::function<fun_t>(funA);

// std::function with non-inlined function
auto funF = std::function<fun_t>(funB);

// std::function with function pointer
auto funG = std::function<fun_t>(funC);

// std::function with visible lambda
auto funH = std::function<fun_t>(funD);

// std::function with direct lambda
auto funI = std::function<fun_t>([](float a, float b) { return a + b; });

The results:

benchmark(funA); // 0.22 ns / op (1 cycle  / op)
benchmark(funB); // 1.04 ns / op (5 cycles / op)
benchmark(funC); // 1.04 ns / op (5 cycles / op)
benchmark(funD); // 0.22 ns / op (1 cycle  / op)
benchmark(funE); // 1.67 ns / op (8 cycles / op)
benchmark(funF); // 1.67 ns / op (8 cycles / op)
benchmark(funG); // 1.67 ns / op (8 cycles / op)
benchmark(funH); // 1.25 ns / op (6 cycles / op)
benchmark(funI); // 1.25 ns / op (6 cycles / op)

This suggests that only A and D are inlined, and that there is some additional optimization possible when using std::function with a lambda (presumably because the wrapper's internal invoker can call the lambda's call operator directly, whereas a stored function pointer costs one extra indirection).

Constructing std::function

We can also measure how long it takes to construct or copy a std::function (strictly speaking, this needs an overload of the benchmark() scaffold that accepts nullary functors):

std::function<float(float, float)> f;
benchmark([&]{ f = {}; });   // 0.42 ns / op ( 2 cycles / op)
benchmark([&]{ f = funA; }); // 4.37 ns / op (21 cycles / op)
benchmark([&]{ f = funB; }); // 4.37 ns / op (21 cycles / op)
benchmark([&]{ f = funC; }); // 4.37 ns / op (21 cycles / op)
benchmark([&]{ f = funD; }); // 1.46 ns / op ( 7 cycles / op)
benchmark([&]{ f = funE; }); // 5.00 ns / op (24 cycles / op)
benchmark([&]{ f = funF; }); // 5.00 ns / op (24 cycles / op)
benchmark([&]{ f = funG; }); // 5.00 ns / op (24 cycles / op)
benchmark([&]{ f = funH; }); // 4.37 ns / op (21 cycles / op)
benchmark([&]{ f = funI; }); // 4.37 ns / op (21 cycles / op)

The result of f = funD suggests that constructing a std::function directly from a lambda is pretty fast. Let’s check that when using different capture sizes:

struct b4  { int32_t x; };
struct b8  { int64_t x; };
struct b16 { int64_t x, y; };

benchmark([&]{ f = [](float, float) { return 0; }; });          // 1.46 ns / op ( 7 cycles / op)
benchmark([&]{ f = [x = b4{}](float, float) { return 0; }; });  // 4.37 ns / op (21 cycles / op)
benchmark([&]{ f = [x = b8{}](float, float) { return 0; }; });  // 4.37 ns / op (21 cycles / op)
benchmark([&]{ f = [x = b16{}](float, float) { return 0; }; }); // 1.66 ns / op ( 8 cycles / op)

I didn't have the patience to untangle the assembly or the libstdc++ implementation to check where this behavior originates. You obviously have to pay for the capture, and I think what we see here is a strange interaction between some kind of small function optimization and the compiler hoisting the construction of b16{} out of our measurement loop.
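What the author calls a small function optimization is commonly known as the small-buffer optimization: std::function stores sufficiently small (and, depending on the implementation, trivially copyable) callables inline instead of on the heap. You can probe the threshold without reading assembly. Here is a minimal sketch (my addition, not from the article) that replaces the global operator new with a counting version and watches when construction starts to heap-allocate; the exact cutoff is an implementation detail of your standard library:

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <new>

// Count heap allocations by replacing the global operator new/delete.
// (A sketch: the nothrow and aligned overloads are omitted for brevity.)
static int alloc_count = 0;

void* operator new(std::size_t size)
{
    ++alloc_count;
    if (void* p = std::malloc(size))
        return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Construct a std::function from the given lambda and report how many heap
// allocations that construction performed. printf is used instead of
// iostream so that the reporting itself cannot allocate.
template <class Lambda>
void probe(char const* name, Lambda l)
{
    alloc_count = 0;
    std::function<float(float, float)> f = l;
    std::printf("%-10s %d heap allocation(s)\n", name, alloc_count);
}

int main()
{
    struct b8  { std::int64_t x; };
    struct b16 { std::int64_t x, y; };
    struct b32 { std::int64_t x, y, z, w; };

    probe("empty",    [](float a, float b) { return a + b; });
    probe("8 bytes",  [x = b8{}](float a, float b) { (void)x; return a + b; });
    probe("16 bytes", [x = b16{}](float a, float b) { (void)x; return a + b; });
    probe("32 bytes", [x = b32{}](float a, float b) { (void)x; return a + b; });
}

A jump in construction cost like the one from 7 to 21 cycles above is consistent with crossing such a threshold, though, as the author notes, compiler hoisting can muddy individual data points.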

Summary

I think there is a lot of fearmongering regarding std::function, and not all of it is justified.

My benchmarks suggest that on a modern microarchitecture the following overhead can be expected on hot data and instruction caches:

calling a non-inlined function                                      4 cycles
calling a function pointer                                          4 cycles
calling a std::function of a lambda                                 5 cycles
calling a std::function of a function or function pointer           7 cycles
constructing an empty std::function                                 7 cycles
constructing a std::function from a function or function pointer   21 cycles
copying a std::function                                         21..24 cycles
constructing a std::function from a non-capturing lambda            7 cycles
constructing a std::function from a capturing lambda               21+ cycles

A word of caution: the benchmarks really only represent the overhead relative to a + b. Different functions show slightly different overhead behavior as they might use different scheduler ports and execution units that might overlap differently with what the loop requires. Also, a lot of this depends on how willing the compiler is to inline.

We’ve only measured the throughput. The results are only valid for “calling the same function many times with different arguments”, not for “calling many different functions”. But that is a topic for another post.

