shorne | stekern: olofk: for next series of changes I am thinking the memcpy memset funcs, some change from sebastien macke and getting our gpio driver into staging (and updated to latest gpio api) | 05:37 |
---|
ZipCPU | Let's see ... I annotated the disassembly of the memcpy function. | 15:04 |
---|---|---|
ZipCPU | olofk: Here's an annotated disassembly of the memcpy function: https://justpaste.it/yci0 | 15:11 |
_franck_ | I started to port this one to or1k assembly some time ago: http://git.musl-libc.org/cgit/musl/tree/src/string/memcpy.c | 15:13 |
ZipCPU | I'm now using -O3 and I've verified that the strcmp, memcpy, and strcpy calls are all the optimized newlib versions | 16:26 |
shorne | pick eb6b230 openrisc: Add optimized memcpy routine | 18:32 |
shorne | _franck_: you can look at my memcpy routine, I also send it to the kernel list and got some response on it | 18:34 |
ZipCPU | OpenRISC has no memcpy() or strcmp() instructions like the VAX did, so ... you'd expect a bit of a difference. | 08:37 |
---|---|---|
ZipCPU | kc5tja: Let's discuss loop unrolling for a moment. When measuring the ZipCPU's performance, I unrolled the loops of the strcmp, strcpy, and memcpy manually. | 14:38 |
zump | The HW guys use some sort of system called AXI. I have to somehow remove that in each verilog block and replace it with, you know, memcpy. | 10:18 |
---|
shorne | so... from the response from Jonas I spent the past day and a half looking into adding a memcpy builtin to the openrisc gcc backend | 08:06 |
---|
olofk | I've been thinking about that too. It's a bit annoying that we have to add memcpy and friends to all OSes and libc implementations. A single one in gcc (+llvm I guess) would be nice | 08:59 |
---|
shorne | olofk: I did some more testing on my memcpy routines. Interestingly doing the microblaze type unaligned word copies is not that much better than byte copies it seems | 02:59 |
---|---|---|
shorne | anyway some numbers: memcpy word copies (supporting non-alignment) with loop unrolls, during linux boot avg cycles 1880.3 | 03:05 |
shorne | memcpy byte copies, during linux boot avg cycles 7603.0 | 03:06 |
shorne | The memcpy stuff is here: | 03:46 |
olofk | Or did you mean memcpy? | 05:31 |
shorne | olofk: that's what it looks like. sorry memcpy... | 05:31 |
olofk | Would be great to get that upstreamed as well. memcpy will probably improve things during runtime as well | 05:32 |
wallento | I get memcpy undefined :) | 06:12 |
wallento | ERROR: "memcpy" [crypto/sha256_generic.ko] undefined! | 06:17 |
wallento | ERROR: "memcpy" [crypto/jitterentropy_rng.ko] undefined! | 06:17 |
wallento | ERROR: "memcpy" [crypto/hmac.ko] undefined! | 06:17 |
wallento | ERROR: "memcpy" [crypto/echainiv.ko] undefined! | 06:17 |
wallento | ERROR: "memcpy" [crypto/drbg.ko] undefined! | 06:17 |
shorne | I have been playing with memcpy optimization for linux... something is bit strange | 04:57 |
---|---|---|
shorne | nice, finally fixed some stupid bugs in my memcpy | 06:51 |
shorne | olofk: you did memset, did you have a look at memcpy? | 17:22 |
---|---|---|
olofk | shorne: I never got around to doing memcpy. It was slightly more complicated than memset, so I let it be | 17:26 |
shorne | microblaze. this one http://lxr.free-electrons.com/source/arch/microblaze/lib/memcpy.c#L36? | 17:30 |
olofk | Four times faster memcpy should probably help a bit in many cases | 18:12 |
shorne | So I am playing with memcpy, for now I think the microblaze one is a bit too much and I just dont want to copy. Also it doesnt unroll loops | 19:12 |
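The word copy with loop unrolling discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not shorne's routine or the microblaze one: the name `unrolled_memcpy` is hypothetical, the word path is taken only when both pointers are 4-byte aligned, and the `may_alias` attribute is a GCC/Clang extension used to keep the word accesses well defined.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: word_t accesses may alias any object type. */
typedef uint32_t __attribute__((may_alias)) word_t;

/* Hypothetical name; a sketch, not the routine from the log. */
void *unrolled_memcpy(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Take the word path only when both pointers are 4-byte aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & 3) == 0) {
        word_t *dw = (word_t *)d;
        const word_t *sw = (const word_t *)s;

        /* Unrolled loop: four words (16 bytes) per iteration. */
        while (n >= 16) {
            dw[0] = sw[0];
            dw[1] = sw[1];
            dw[2] = sw[2];
            dw[3] = sw[3];
            dw += 4;
            sw += 4;
            n -= 16;
        }
        /* Remaining whole words. */
        while (n >= 4) {
            *dw++ = *sw++;
            n -= 4;
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Byte tail, or the whole copy when the pointers are not aligned. */
    while (n--)
        *d++ = *s++;
    return dest;
}
```

The unroll factor of four is arbitrary here; the useful value depends on the core and its memory latency.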
dalias | (for example if we wanted to compare different memcpys) | 16:44 |
---|---|---|
olofk | But for something like memcpy, you would probably have to factor in memory latency as well. | 16:48 |
sheridp | I'm hitting an undefined reference to memcpy as a result of the following line: int colList [4] = {0,1,2,3}; | 23:27 |
---|---|---|
sheridp | does anyone know where to grab memcpy from? | 23:27 |
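Some background on the question above: GCC may emit a call to memcpy when it initializes a local array from constant data, even when no C library is linked, so a bare-metal build has to provide the symbol itself. A minimal sketch of a byte-wise fallback is shown here. The name `simple_memcpy` is hypothetical (a freestanding build would call it `memcpy` so the compiler-generated calls resolve), and the code is not taken from newlib or any OpenRISC toolchain.

```c
#include <stddef.h>

/* Hypothetical byte-wise fallback. In a freestanding build this would be
 * named memcpy so that compiler-generated calls resolve at link time. */
void *simple_memcpy(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Copy one byte per iteration; slow but has no alignment requirements. */
    while (n--)
        *d++ = *s++;
    return dest;
}
```

Linking against a libc that already provides memcpy (newlib, for example) is the other usual way to resolve the reference.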
_franck__ | olofk: while you are working on this you should try the C version of memcpy from musl | 12:48 |
---|---|---|
_franck__ | it's the same kind of algorithm you can find in arch/microblaze/lib/memcpy.c | 12:49 |
_franck__ | dalias: olofk : what I started to do some time ago is to optimize the assembly version of memcpy from musl | 09:02 |
---|---|---|
olofk | _franck__: You do memcpy and I do memset then :) | 09:03 |
olofk | IIRC memcpy and memset were the biggest contributors. I did some experiments to optimize that, and by stealing the memset code from Microblaze, it dropped down from the top 10 | 08:28 |
---|---|---|
olofk | Can you use DMA for memcpy easily under Linux btw? | 08:32 |
olofk | True. I haven't got a clue how large the common memcpy is | 08:33 |
poke53282 | http://stackoverflow.com/questions/25521422/dma-memcpy-operation-in-linux | 08:39 |
olofk | And DMA for memcpy seems not worth the trouble, even though I think that our break-even block size would be much smaller than 1MB since we have smaller caches than an x86 | 08:45 |
maxpaln | we are trying to assess performance on a memcpy. The way it is currently implemented (possibly by the C code or more likely by the assembler as interpreted by the compiler) is at a word-by-word basis. So copying a block of, say, 1KB, requires a 1000x loop that reads a byte and writes it somewhere else. | 13:05 |
---|---|---|
_franck__ | maxpaln: are you talking about memcpy from Linux or from the libc ? | 13:09 |
poke53282 | I will figure out the correct code line tomorrow. It must be one before the memcpy (case REL_COPY:) if I remember correctly the disasm code. | 06:13 |
---|
_franck__ | I started something but I switched to something else. I compiled musl memcpy, disassembled it and started to optimize it in order to put the result in the kernel | 12:59 |
---|
dalias | _franck_, fyi, i have a pretty generic high-performance "C" implementation of memcpy that does the word shuffling for alignment, as part of musl libc | 00:46 |
---|---|---|
dalias | it's modeled after the way you'd do it in asm for most risc isa's, and is competitive with the asm implementations for most (iirc it's something like 80-90% of the speed of android's memcpy.s on arm) | 00:47 |
stekern | _franck_: Linux doesn't use libgcc for memcpy in the generic case: http://lxr.free-electrons.com/source/lib/string.c#L589 | 03:29 |
stekern | and if you have written an efficient copy_to/from_user (we haven't), then you basically have an efficient memcpy too | 03:33 |
_franck_ | (memcpy) should we have an arch specific memcpy then ? looks like most (all ?) of Linux arch that have a memcpy do it in assembler | 16:31 |
_franck_ | dalias: (memcpy) good to know. If I had to make an asm memcpy, I would compile it down and tweak it by hand like pgavin said | 16:35 |
_franck_ | well, memcpy compiled with -O3 gives me 387 lines of assembly code... I hope there is some room for optimization | 16:44 |
_franck_ | I don't understand something in the calling of "void *memcpy(void *restrict dest, const void *restrict src, size_t n)" | 19:42 |
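The "word shuffling for alignment" that dalias describes above can be sketched as follows. This is an illustrative reconstruction of the general idea, not musl's actual code: the name `shuffle_memcpy` and the `MERGE` macro are hypothetical, `may_alias` and the `__BYTE_ORDER__` macros are GCC/Clang extensions, and no unrolling is attempted. The source is aligned first, then each destination word is assembled from two neighbouring source words with shifts.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: word_t accesses may alias any object type. */
typedef uint32_t __attribute__((may_alias)) word_t;

/* Combine the last (4 - r) bytes of w with the first r bytes of x. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define MERGE(w, x, r) (((w) >> (8 * (r))) | ((x) << (8 * (4 - (r)))))
#else
#define MERGE(w, x, r) (((w) << (8 * (r))) | ((x) >> (8 * (4 - (r)))))
#endif

/* Hypothetical name; a sketch of the technique only. */
void *shuffle_memcpy(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Byte-copy until the source is word aligned. */
    while (n && ((uintptr_t)s & 3)) {
        *d++ = *s++;
        n--;
    }

    /* r = bytes needed to bring the destination to a word boundary. */
    unsigned r = (unsigned)(-(uintptr_t)d & 3);

    if (r == 0) {
        /* Both aligned: plain word copy. */
        while (n >= 4) {
            *(word_t *)d = *(const word_t *)s;
            d += 4;
            s += 4;
            n -= 4;
        }
    } else if (n >= 4) {
        /* Load the first aligned source word, then align the destination. */
        uint32_t w = *(const word_t *)s;
        for (unsigned i = 0; i < r; i++)
            *d++ = *s++;
        n -= r;

        /* d is aligned; s sits r bytes into w. Each step loads the next
         * aligned source word and stores one merged destination word. */
        while (n >= 8 - r) {
            uint32_t x = *(const word_t *)(s + (4 - r));
            *(word_t *)d = MERGE(w, x, r);
            w = x;
            d += 4;
            s += 4;
            n -= 4;
        }
    }

    /* Byte tail. */
    while (n--)
        *d++ = *s++;
    return dest;
}
```

Every load and store stays inside the requested ranges, which is why the merge loop requires `8 - r` bytes to remain before it reads the next source word.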
_franck_ | stekern: blueCmd : is that a reason why we don't have __HAVE_ARCH_MEMCPY (with an asm version of memcpy doing word transfers and handling alignment) in the kernel ? | 22:13 |
---|---|---|
olofk | http://www.juliusbaxter.net/openrisc-irc/search?q=memcpy | 22:14 |
_franck_ | I would have said that the Kernel doesn't use the GCC memcpy but now I'm not sure. It is linked with libgcc right ? | 22:19 |
stekern | jonibo: you mentioned an assembly version of memcpy in openrisc linux, but I can't seem to find that, only the __copy_to/from_user, which AFAICT does about what memcpy does, but it's not used for memcpy | 12:24 |
---|---|---|
jonibo | yeah... fair enough... I thought I remembered an existing memcpy but apparently it doesn't exist... just the "user" mem copy functions | 12:24 |
stekern | interestingly enough, my grepping yesterday also showed that arm can be configured to use their memcpy implementation for the copy_to/from_user | 12:27 |
jonibo | i think maybe GCC has the memcpy function (byte-by-byte)... I'm pretty sure I've seen it somewhere | 12:28 |
jonibo | yeah, but that should be fine for memcpy too because you should simply never hit an exception there... | 12:33 |
jonibo | anyway, this is one of the BIG missing pieces of openrisc GCC... optimized versions of the builtin functions... strcpy, memcpy, et cetera, et cetera... | 12:54 |
---|---|---|
stekern | or did you just mean that we are using generic versions of memcpy et al? | 14:36 |
jonibo | stekern: I meant that the functions memcpy, strchr, memset, et al are either not optimized or poorly optimized... memcpy has an assembly version, but it's really quite poor (and I think GCC has the same poor version) | 15:39 |
jonibo | memcpy currently just copies byte-by-byte | 15:45 |
poke53281 | not that I know. unaligned memcpy or memset maybe | 03:10 |
---|
juliusb | memcpy()? | 08:45 |
---|
stekern | oh, and I have got it to run the dhrystone test as fast as gcc-4.8, by tweaking the limit where memcpys vs inserted load/stores goes | 20:49 |
---|
stekern | clang was way behind at 0.5, I checked what slowed it down so much and it was the strcpys being done as memcpys, could be me doing something suboptimal in our backend | 12:09 |
---|