Transaction speed running C++ on Linux #45

Open
JeremyBowen opened this issue May 12, 2011 · 12 comments

@JeremyBowen

I've got a potential problem with the speed of calls between a client process and a server process running the C++ code under Linux.

The issue is that I get very low throughput in terms of transactions per second. We've tried C# on Windows and can get tens of thousands of calls per second, whether the client and server run on the same PC (localhost) or across a network.

Under Linux running the C++ code, I can't get anywhere near that number of transactions.

I have a simple loop that performs two RPC calls 100 times, which I time with the Linux "time" command. This takes approx. 16 seconds when running the client and server on the same PC, and roughly the same across a network.

Running the Linux C++ client against the Windows C# server results in approximately the same slow behaviour (~16 s per 200 calls).
Running the Windows C# client against the Linux C++ server results in approximately the same slow behaviour.
Running a Linux Python client against a Linux C++ server results in the same slow behaviour (~16 s per 200 calls).

It looks like the Linux C++ code is possibly waiting for a timeout, or maybe something is blocking while waiting to set up a connection. I'm guessing here, but the CPU barely notices, so it doesn't appear to be a lack of raw CPU cycles.

We've been unable to create a C++ Windows version to compare C++ vs C#, nor have I been able to figure out the relevant Python syntax to create a Python server under Linux.

If I can figure out how to attach some sample code to this issue I will do so; otherwise I'll post it here shortly.

@tomerfiliba
Owner

hi, i assume you're related to @miksovic?

anyway, thank you very much for your effort -- no further tests are necessary. it's clear that the c++ implementation is considerably slower, and you say it doesn't spend much time in CPU, so it seems to be waiting on locks or IO. my first thought was that maybe my c++ skills were rusty and i used lots of copy-constructing or something in that spirit, but if it's not CPU bound, that's less likely to be the cause. it's more probable that i'm either doing the IO wrong, or that my Mutex class is very slow.

some pointers:

a good test would be running under strace, to see where it's waiting. i'd appreciate it if you could do that for me.
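
for per-call timings (rather than the -c summary), strace's -T flag prints the time spent inside each syscall, and -e trace=network restricts the output to socket calls:

$ strace -T -e trace=network ./client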

as for attaching files -- it would be easiest to just create a repo on github and upload your files there (that way it's revision-controlled and easy to share), or you could use http://gist.github.com (or pastebin, etc.)

and last but not least -- i'm busy over my head at the moment, so it might take me several days to get to it. any progress you guys make would be greatly appreciated (you don't have to solve the problem -- just locate it). the best way is to fork the project, commit your changes/tests, and then send me a pull request.

thanks,
-tomer

@JeremyBowen
Author

Hi yes, @miksovic is primarily working on Windows and I'm working on Linux.

Thanks for your response. I'm not sure this is entirely C++ related; maybe it's more a Linux thing. I've run strace and ltrace over both my client and server programs and don't have a huge amount of new information.

I'll post the results of my tests as separate comments:

@JeremyBowen
Author

First up is testing my Linux clients (Python and C++) against a Linux C++ server:

Python Client

$ ltrace -c -S python ./client.py
% time     seconds  usecs/call     calls function
------ ----------- ----------- --------- --------------------
 73.34   62.395172       87388       714 SYS_select
 26.60   22.630919    22630919         1 Py_Main
  0.02    0.014398          15       916 SYS_open
  0.01    0.007676          18       408 SYS_sendto
...
real 0m8.250s
user 0m0.172s
sys 0m0.072s

Again with strace:
$ time strace -c python ./client.py
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.64    0.000081           0       408           sendto
 21.36    0.000022           0       295       248 stat
  0.00    0.000000           0       221           read
  0.00    0.000000           0        50           write
...
real 0m8.253s
user 0m0.144s
sys 0m0.084s

C++ Client

$ ltrace -c -S ./client
% time     seconds  usecs/call     calls function
------ ----------- ----------- --------- --------------------
 27.29   12.051400         107    112473 mcount
 21.32    9.411614       15253       617 _ZNSi4readEPcl
 10.50    4.634701        7511       617 _ZNSt15basic_streambufIcSt11char_traitsIcEE6xsgetnEPcl
 10.42    4.598853       15695       293 _ZNSt15basic_streambufIcSt11char_traitsIcEE5uflowEv
  7.53    3.323090        8144       408 _ZNSo5writeEPKcl
  5.43    2.397605       23505       102 _ZNSo5flushEv
  4.70    2.075017         919      2257 _ZNKSt15basic_streambufIcSt11char_traitsIcEE5pbaseEv
  4.67    2.061238        1260      1635 _ZNKSt9basic_iosIcSt11char_traitsIcEE3eofEv
  3.29    1.453525        4960       293 recvmsg
  3.22    1.421815        4803       296 SYS_recvmsg
...
real 0m11.073s
user 0m1.756s
sys 0m5.576s
Now strace on C++ client:

$ time strace -c ./client
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.004000          19       209           recvmsg
  0.00    0.000000           0        11           read
  0.00    0.000000           0        51           write
  0.00    0.000000           0        12           open
  0.00    0.000000           0        16           close
...
real 0m8.101s
user 0m0.024s
sys 0m0.024s

@JeremyBowen
Author

Now the Linux clients against the Windows C# server process on a networked machine.

Python Client on Linux

$ time ltrace -c -S python ./client.py
% time     seconds  usecs/call     calls function
------ ----------- ----------- --------- --------------------
 72.59   60.406528       84602       714 SYS_select
 27.34   22.755401    22755401         1 Py_Main
  0.02    0.013669          14       916 SYS_open
  0.01    0.009206          22       408 SYS_sendto
...
real 0m20.849s
user 0m0.164s
sys 0m0.068s

$ time strace -c python ./client.py
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 59.46    0.000022           0       916       755 open
 40.54    0.000015           0        61           munmap
  0.00    0.000000           0       221           read
  0.00    0.000000           0        50           write
...
real 0m20.630s
user 0m0.128s
sys 0m0.092s

C++ Client on Linux

$ time ltrace -c -S ./client
% time     seconds  usecs/call     calls function
------ ----------- ----------- --------- --------------------
 19.65   45.960395       74490       617 _ZNSi4readEPcl
 18.48   43.230169       70065       617 _ZNSt15basic_streambufIcSt11char_traitsIcEE6xsgetnEPcl
 18.47   43.195086      423481       102 _ZNSt15basic_streambufIcSt11char_traitsIcEE5uflowEv
 18.32   42.847775      420076       102 recvmsg
 18.31   42.833417      407937       105 SYS_recvmsg
  4.06    9.488322          90    104441 mcount
  2.24    5.228938       12816       408 _ZNSo5writeEPKcl
  0.16    0.377865        3704       102 _ZNSo5flushEv
...
real 0m21.221s
user 0m1.652s
sys 0m4.892s

$ time strace -c ./client
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  -nan    0.000000           0        11           read
  -nan    0.000000           0        51           write
  -nan    0.000000           0        12           open
  -nan    0.000000           0        16           close
  -nan    0.000000           0        13           fstat
  -nan    0.000000           0        35           mmap
  -nan    0.000000           0        24           mprotect
...
real 0m20.486s
user 0m0.016s
sys 0m0.028s

@JeremyBowen
Author

Running C# -> C# on the Windows box is extremely fast by comparison, although I don't have timings (too fast to measure).
Running the C# client on Windows against the C++ server on Linux is again slow (~16 seconds for the same test).

I'm beginning to wonder if my Linux box has some sort of problem. It's obviously not restricted to C++, as the C++, C#, and Python clients all behave the same way, as far as I can tell, regardless of the language the server is implemented in.

I appreciate that you're busy and I don't expect you to solve this for me. I'll keep trying to identify the cause of the poor performance and I'll do anything I can to locate the problem.

Cheers

@tomerfiliba
Owner

thanks for the effort!

-tomer

@tomerfiliba
Owner

C++ to C++, linux box, modified the C++ test to include a small performance test:

// test performance
boost::posix_time::ptime t0(boost::posix_time::microsec_clock::local_time());
int tmp = 0;
const int cycles = 100;
for (int i = 0; i < cycles; i++) {
    double res = adam->think(188, i+1);
    if (res > 2) {
        tmp = 1; // to make sure this loop is not optimized out
    }
}
boost::posix_time::ptime t1(boost::posix_time::microsec_clock::local_time());
double interval = (t1 - t0).total_microseconds() / 1000000.0;
cout << "performing " << cycles << " cycles took " << interval << " seconds" << endl;

results in "performing 100 cycles took 8.00376 seconds"... bummer

@tomerfiliba
Owner

okay, strace seems to reveal some interesting facts. first of all, boost uses sendmsg and recvmsg instead of send and recv, but i don't suppose there's a performance difference between them. on the other hand, the last two recvmsg calls are extremely slow (the time spent in each system call is printed in angle brackets at the end of the line):

client side

sendmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\7", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000023>
sendmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\35", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000010>
sendmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000008>
sendmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"\1\0\r\2744\0\0\0\0\10\211.0@g\200\0\0\0\0\0@\0\0\0\0\0\0\0", 29}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 29 <0.000007>
recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\7\0\0\0\0\0@g\200\0\0\0\0\0\0\0\0\tIDL_MAGIC\0\0"..., 504}], msg_controllen=0, msg_flags=0}, 0) = 4 <0.039792>
recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\t\0\0\0\0\0@W\200\0\0\0\0\0\0\0\0\tIDL_MAGIC\0\0"..., 504}], msg_controllen=0, msg_flags=0}, 0) = 17 <0.039703>

server side

recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\7\0\0\0\0\1\0\r\2744\0\0\0\0\10\211.0@g\200\0\0\0\0\0?\360\0"..., 504}], msg_controllen=0, msg_flags=0}, 0) = 4 <0.039833>
recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\35\0\0\0\0\1\0\r\2744\0\0\0\0\10\211.0@g\200\0\0\0\0\0@\0\0"..., 504}], msg_controllen=0, msg_flags=0}, 0) = 37 <0.039705>
sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\7", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000024>
sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\t", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000011>
sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000009>
sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0@W\200\0\0\0\0\0", 9}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9 <0.000008>

both the server and the client seem to spend a long time blocking on recvmsg, waiting for incoming data... 0.04 seconds. this happens twice for each invocation, so every invocation takes roughly 0.08 sec, which adds up to 8 seconds per 100 calls.
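
a ~40 ms stall on a small read, right after a burst of tiny sendmsg calls, is the classic signature of nagle's algorithm interacting with delayed ACKs (linux's delayed-ACK timer is on the order of 40 ms). that's my main suspect, though it's an assumption at this point, not a confirmed diagnosis. a quick experiment would be disabling nagle on the underlying socket. a minimal sketch (the host/port are made up, and i'm assuming boost's tcp::iostream exposes its socket streambuf via rdbuf(), which accepts set_option):

#include <iostream>
#include <boost/asio.hpp>

using boost::asio::ip::tcp;

int main() {
    // hypothetical endpoint, just for illustration
    tcp::iostream stream("localhost", "17788");
    if (!stream) {
        std::cerr << "connect failed" << std::endl;
        return 1;
    }
    // disable nagle's algorithm so small writes are sent immediately
    // instead of waiting for the previous segment to be ACKed;
    // rdbuf() returns a basic_socket_streambuf, which is itself a socket
    stream.rdbuf()->set_option(tcp::no_delay(true));
    // ... continue with the normal protocol exchange on `stream` ...
    return 0;
}

if the 8 seconds collapse to milliseconds, that confirms it. the other fix is to buffer each message and flush it with a single sendmsg, so nagle never gets a chance to hold anything back.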

@ghost assigned tomerfiliba on Jun 7, 2011
@JeremyBowen
Author

I don't know if you've been working on this lately, but how difficult would it be to replace the Boost socket handling with a simple socket implementation?

I've been experimenting with some simple socket handling code, and the strace info reveals no problems in send/receive times (~8 seconds per 40,000 calls, compared with 8 seconds per 100 calls). A sketch of the kind of test I mean is below.

At first glance, however, linking this into the rest of the C++ framework in libagnos would require quite a bit of work, as I'm not really familiar with the boundaries of the boost code and which features you are making use of.
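
For reference, this is roughly what my experiment looks like: a blocking ping-pong over a raw TCP socket. This is a sketch only; the port, payload, and the echo-style server it talks to are made up for the test, not agnos code:

#include <cstdio>
#include <cstring>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// minimal round-trip benchmark: send a tiny payload, block until the
// peer echoes it back, and repeat; timed externally with "time"
int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(17788);            // hypothetical port
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (sockaddr*)&addr, sizeof(addr)) < 0) {
        std::perror("connect");
        return 1;
    }
    char buf[32];
    for (int i = 0; i < 40000; i++) {
        send(fd, "ping", 4, 0);              // one small write per call
        recv(fd, buf, sizeof(buf), 0);       // block for the echo
    }
    close(fd);
    return 0;
}

This completes the ~40,000 round trips in about 8 seconds on the same box, so the raw socket path itself seems fine.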

@tomerfiliba
Owner

yes, i think the problem is boost's tcp::iostream, so i started a branch using tcp::socket instead (to directly do send() and recv()). but then i wouldn't be able to use compression (boost's zlib filter), so i'll have to resort to writing my own version of tcp::iostream. it's on my todo list.
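
for reference, the core of such a replacement is fairly small. a sketch of a std::streambuf over a tcp::socket (the names, buffer sizes, and error handling are mine, not agnos code -- just the shape of the idea):

#include <cstddef>
#include <streambuf>
#include <boost/asio.hpp>

using boost::asio::ip::tcp;

// socket-backed streambuf: writes accumulate in obuf_ and go out in a
// single send on flush, so each message costs one sendmsg instead of
// several tiny ones; reads are pulled through ibuf_ in large chunks
class socket_streambuf : public std::streambuf {
public:
    explicit socket_streambuf(tcp::socket& sock) : sock_(sock) {
        setp(obuf_, obuf_ + sizeof(obuf_));  // empty output buffer
        setg(ibuf_, ibuf_, ibuf_);           // empty input buffer
    }

protected:
    // output buffer full: flush it, then stash the pending char
    virtual int_type overflow(int_type ch) {
        if (flush_output() < 0)
            return traits_type::eof();
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            return sputc(traits_type::to_char_type(ch));
        return traits_type::not_eof(ch);
    }

    // called by std::flush / pubsync
    virtual int sync() { return flush_output(); }

    // input buffer exhausted: refill with one read_some
    virtual int_type underflow() {
        boost::system::error_code ec;
        std::size_t n = sock_.read_some(boost::asio::buffer(ibuf_), ec);
        if (ec)
            return traits_type::eof();
        setg(ibuf_, ibuf_, ibuf_ + n);
        return traits_type::to_int_type(ibuf_[0]);
    }

private:
    int flush_output() {
        std::size_t n = pptr() - pbase();
        if (n == 0)
            return 0;
        boost::system::error_code ec;
        boost::asio::write(sock_, boost::asio::buffer(pbase(), n), ec);
        if (ec)
            return -1;
        setp(obuf_, obuf_ + sizeof(obuf_));  // reset for the next message
        return 0;
    }

    tcp::socket& sock_;
    char ibuf_[8192];
    char obuf_[8192];
};

wrapping this in a std::iostream gives a drop-in for tcp::iostream, and the zlib filter could sit on top of it the same way it does now. the important property is that one logical message becomes one sendmsg, which would also sidestep the blocking recvmsg stalls above.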

@davidsj2
Contributor

Any updates on this issue?

@tomerfiliba
Owner

no, i'm not actively maintaining agnos any more...
