SSE2 Beginner's Guide

Wangchao other · author unknown · 2006-01-09

A reposted article, and a good one!

Intro

What is SSE2?

SSE2 is an extension to the x86 instruction set that allows programs to execute one operation on multiple pieces of data at a time. Because SSE2 instructions are part of the instruction set, however, they only work on processors that support them. If a program tries to execute them on a machine that does not, the processor raises an invalid-opcode exception and the program is typically terminated with an "illegal instruction" error. Luckily, there are easy ways to tell whether the processor(s) you are running on support SSE2.
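One such check, as a minimal sketch: GCC and Clang provide the __builtin_cpu_supports builtin on x86 targets, which queries the processor at run time. The function name CpuHasSse2 is hypothetical, and other compilers need a different mechanism.

```cpp
// Assumes GCC or Clang targeting x86 or x86-64.
// CpuHasSse2 is a hypothetical helper name, not part of any library.
inline bool CpuHasSse2()
{
    __builtin_cpu_init();                        // initialise the CPU feature data
    return __builtin_cpu_supports("sse2") != 0;  // nonzero when SSE2 is available
}
```

Every x86-64 processor supports SSE2, so in practice the check only matters for 32-bit builds that may run on older hardware.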

Basic Structure of SSE2

SSE2 works just like any other set of assembly instructions. There are registers in which data can be stored and operations that can execute on these registers. Each register is 16 bytes wide (2 doubles). In 32-bit code there are 8 of them, named xmm0 through xmm7; 64-bit code adds xmm8 through xmm15.

Basics

Some Code

inline void Add(double *x, double *y, double *retval)
{
    asm
    {
        // Load the three pointer values into general-purpose registers
        mov eax, x
        mov ecx, y
        mov edx, retval
        // Copy the first 16 bytes into xmm0, starting at the memory x points to
        movupd xmm0, [eax]
        // Copy the first 16 bytes into xmm1, starting at the memory y points to
        movupd xmm1, [ecx]
        // Add the 2 doubles in xmm1 to the 2 doubles in xmm0, and put the
        // result in xmm0, overwriting the previous data stored there
        addpd xmm0, xmm1
        // Copy the 16 bytes of data in xmm0 to the memory retval points to
        movupd [edx], xmm0
    }
}

Hopefully my comments before each line were enough to let you know what was going on. In case they weren't, I'll go into a little more detail about each line.

asm{}

This keyword lets your compiler know that the code you are giving it is assembly and that it should compile it as such (Microsoft Visual C++ spells the keyword __asm). The instructions are emitted in place, exactly where the block appears, so there is no call overhead for the asm block.

mov eax, x
mov ecx, y
mov edx, retval

Inside an asm block, the name of a C/C++ variable refers to the variable itself. Since x, y and retval are pointers, these three lines copy the addresses they hold into general-purpose registers. The block uses 32-bit registers, so it assumes a 32-bit build.

movupd xmm0, [eax]
movupd xmm1, [ecx]

This command copies data from the second operand to the first; as always in Intel syntax, the asm is in dest, src order. By putting brackets around the register, we tell the mov command to copy the data at the address held in the register, not the address itself. The square brackets can be thought of as a method of dereferencing a pointer.

addpd xmm0, xmm1

This is the line that does the actual arithmetic. It takes the value from the 2nd operand, src, adds it to the 1st operand, dest, and stores the resulting value in the 1st operand, dest.

movupd [edx], xmm0

Here, we copy the data that is in xmm0 to the memory that retval points to. Again, the square brackets dereference the address in edx in the same way that a '*' does in C/C++.
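The same addition can also be written with the SSE2 compiler intrinsics from <emmintrin.h>, which avoids inline assembly and builds for both 32-bit and 64-bit targets. This is a sketch rather than the author's method; the function name AddIntrinsics is hypothetical, while the _mm_* intrinsics are the standard ones.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Adds two pairs of doubles: retval[i] = x[i] + y[i] for i = 0 and 1.
// AddIntrinsics is a hypothetical name used only for this illustration.
inline void AddIntrinsics(const double *x, const double *y, double *retval)
{
    __m128d a = _mm_loadu_pd(x);     // unaligned load, like movupd
    __m128d b = _mm_loadu_pd(y);     // unaligned load, like movupd
    __m128d sum = _mm_add_pd(a, b);  // packed add, like addpd
    _mm_storeu_pd(retval, sum);      // unaligned store, like movupd
}
```

With intrinsics the compiler chooses the registers, so there is nothing to load into eax or ecx by hand.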

Some more operations

subpd dest, src // subtract src from dest, store in dest
mulpd dest, src // multiply dest and src, store in dest
divpd dest, src // divide dest by src, store in dest
minpd dest, src // store the smallest value, either dest or src, in dest
maxpd dest, src // store the largest value, either dest or src, in dest
sqrtpd dest, src // take the square root of src and put the result in dest

A full list of SSE2 operations and a description of each can be found at HAYES Technologies.
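To make the operand order of these instructions concrete, here is a small sketch using the corresponding intrinsics from <emmintrin.h>. The struct PairResults and the function ApplyPackedOps are hypothetical names for this illustration; each intrinsic takes its arguments in (dest, src) order and works on both doubles at once.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical container for the results of each packed operation.
struct PairResults
{
    double difference[2];  // dest - src
    double quotient[2];    // dest / src
    double smaller[2];     // min(dest, src)
    double larger[2];      // max(dest, src)
    double root[2];        // sqrt(src)
};

inline PairResults ApplyPackedOps(const double *dest, const double *src)
{
    __m128d d = _mm_loadu_pd(dest);
    __m128d s = _mm_loadu_pd(src);
    PairResults r;
    _mm_storeu_pd(r.difference, _mm_sub_pd(d, s));  // subpd
    _mm_storeu_pd(r.quotient,   _mm_div_pd(d, s));  // divpd
    _mm_storeu_pd(r.smaller,    _mm_min_pd(d, s));  // minpd
    _mm_storeu_pd(r.larger,     _mm_max_pd(d, s));  // maxpd
    _mm_storeu_pd(r.root,       _mm_sqrt_pd(s));    // sqrtpd
    return r;
}
```

For dest = {8, 9} and src = {4, 16}, the difference is {4, -7} and the quotient is {2, 0.5625}, which shows that subtraction and division both treat dest as the left-hand operand.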

Making the Most Of SSE2

The Faster Move Instruction

Up until now, we have been using movupd to move data to and from our registers. This can be much slower than the instruction movapd, which does the exact same thing but assumes that the data is 16-byte aligned. This means that the address supplied must be divisible by 16; if it is not, movapd faults. This becomes a rather large problem if you are compiling your code with gcc or the compiler supplied with Microsoft Visual C++, because the memory they hand you is not guaranteed to be 16-byte aligned. One solution to this problem is to use a different compiler, such as the one that Intel provides. The inherent problem with this is that the Intel compiler is not free like gcc. If you have already spent money on some other compiler, you probably do not want to spend more on a new one. One hack that I have come up with is the following:

#include <stdint.h> // uintptr_t: an integer wide enough to hold a pointer

// Round a pointer up to the next 16-byte boundary
#define AllignData(data) ((void *)(((uintptr_t)(data) + 15) & ~(uintptr_t)0x0F))

// or an inline function if you prefer (use one or the other, not both):
inline void *AllignData(void *data)
{
    return (void *)(((uintptr_t)data + 15) & ~(uintptr_t)0x0F);
}

int main()
{
    const int sizeofdata = 512;
    double *lotsofdata;
    double *tempptr;

    tempptr = new double[sizeofdata + 2];
    lotsofdata = (double *)AllignData(tempptr);

    asm
    {
        // load the aligned address, then read 16 bytes from it
        mov eax, lotsofdata
        movapd xmm0, [eax]
        // do lots of CPU intensive SSE2 operations on lotsofdata here
    }

    delete [] tempptr;
    lotsofdata = 0;
    // do not delete lotsofdata, just set it to 0, the memory it
    // points to is no longer valid
    return 0;
}

What we need for this instruction to work is memory that has a 16-byte aligned memory address.

[Figure: memory that is 16-byte aligned]

When we use the new operator in C++ to allocate memory, we are given memory that may or may not be 16-byte aligned. So, instead of using the very first byte of our memory block, we start using it at the first address that is 16-byte aligned.

[Figure: yellow is unusable, green is what we actually use]

By doing this, we waste the memory that comes before the first 16-byte aligned address. Normally this is not too big of a problem, as the overhead is paid only once per memory block, and if we are allocating large amounts of memory at once, losing 1 or even 12 bytes will hardly make a difference. The only problem is that by not using the beginning of our memory, we end up with less usable memory than we asked for. In the worst case, 15 bytes are not usable.

Therefore, to make sure that we get a specific number of usable bytes, we allocate at least 15 extra bytes. In the case of allocating double values, this means we must allocate an extra 2 doubles, giving us 16 extra bytes.

If you are not using doubles, determine the size of the data type and compute how many extra elements are needed to give you at least 15 bytes of padding.

Another important thing to notice is that we delete through the variable that holds the original pointer to our memory. If you try to delete starting in the middle of the memory block, the behavior is undefined; in practice you can expect a crash or a corrupted heap.
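The over-allocate-and-round-up idea can be packaged so that the original pointer and the aligned pointer always travel together. The following is a minimal sketch; the names AlignedDoubles, AllocateAligned and FreeAligned are hypothetical, and it assumes a C++11 compiler for <cstdint>.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical pairing of the pointer returned by new[] with the
// 16-byte aligned pointer derived from it.
struct AlignedDoubles
{
    double *raw;      // what new[] returned; the only pointer to delete[]
    double *aligned;  // first 16-byte aligned address inside the block
};

inline AlignedDoubles AllocateAligned(std::size_t count)
{
    AlignedDoubles block;
    block.raw = new double[count + 2];  // 2 extra doubles = 16 bytes of padding
    std::uintptr_t address = reinterpret_cast<std::uintptr_t>(block.raw);
    address = (address + 15) & ~static_cast<std::uintptr_t>(15);  // round up
    block.aligned = reinterpret_cast<double *>(address);
    return block;
}

inline void FreeAligned(AlignedDoubles &block)
{
    delete[] block.raw;  // always free through the original pointer
    block.raw = 0;
    block.aligned = 0;
}
```

Keeping both pointers in one struct makes it harder to call delete[] on the aligned pointer by mistake.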

Ordering Your Operations

One thing that is often overlooked is the order in which registers are used. When an operation is performed, there is a delay before its result becomes available in the destination register. If the next operation requires this value, it must wait for it. If, however, the next operation does not need this data, it does not need to wait; it can go ahead and execute while the previous result is still being produced.

For instance, there will be a speed difference between the following two code segments:

#1:

asm
{
    movupd xmm0, [x] // xmm0 = x
    movupd xmm1, [y] // xmm1 = y
    movupd xmm2, [z] // xmm2 = z
    movupd xmm3, [w] // xmm3 = w
    movapd xmm4, xmm2 // xmm4 = z
    movapd xmm5, xmm1 // xmm5 = y
    movapd xmm6, xmm0 // xmm6 = x
    addpd xmm2, xmm3 // xmm2 = z + w
    addpd xmm1, xmm2 // xmm1 = y + z + w
    addpd xmm0, xmm1 // xmm0 = x + y + z + w
    mulpd xmm4, xmm3 // xmm4 = z * w
    mulpd xmm5, xmm4 // xmm5 = y * z * w
    mulpd xmm6, xmm5 // xmm6 = x * y * z * w
    divpd xmm0, xmm6 // xmm0 = (x + y + z + w) / (x * y * z * w)
    movupd [ret], xmm0 // ret = (x + y + z + w) / (x * y * z * w)
}

#2:

asm
{
    movupd xmm0, [x] // xmm0 = x
    movupd xmm1, [y] // xmm1 = y
    movupd xmm2, [z] // xmm2 = z
    movupd xmm3, [w] // xmm3 = w
    movapd xmm4, xmm2 // xmm4 = z
    movapd xmm5, xmm1 // xmm5 = y
    movapd xmm6, xmm0 // xmm6 = x
    addpd xmm2, xmm3 // xmm2 = z + w
    mulpd xmm4, xmm3 // xmm4 = z * w
    addpd xmm1, xmm2 // xmm1 = y + z + w
    mulpd xmm5, xmm4 // xmm5 = y * z * w
    addpd xmm0, xmm1 // xmm0 = x + y + z + w
    mulpd xmm6, xmm5 // xmm6 = x * y * z * w
    divpd xmm0, xmm6 // xmm0 = (x + y + z + w) / (x * y * z * w)
    movupd [ret], xmm0 // ret = (x + y + z + w) / (x * y * z * w)
}

The second piece of code can run faster. This is because in the second case there are only 2 places where an instruction relies on the result of the instruction immediately before it (the divpd and the final movupd), compared with 6 in the first. Because of this, most instructions can start right after the previous one is issued instead of waiting for its result to reach the registers.
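The interleaved ordering can be sketched with intrinsics as well, keeping the sum chain and the product chain independent until the final division. The function name SumOverProduct is hypothetical, and the compiler remains free to reschedule these operations itself, so treat the source order as an illustration of the dependency structure rather than a guarantee about the generated code.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Computes (x + y + z + w) / (x * y * z * w) on two doubles at a time.
// SumOverProduct is a hypothetical name used only for this illustration.
inline void SumOverProduct(const double *x, const double *y,
                           const double *z, const double *w, double *ret)
{
    __m128d vx = _mm_loadu_pd(x);
    __m128d vy = _mm_loadu_pd(y);
    __m128d vz = _mm_loadu_pd(z);
    __m128d vw = _mm_loadu_pd(w);

    // Alternate between the two chains: each step depends on the step
    // two lines above it, not on the line directly above.
    __m128d sum     = _mm_add_pd(vz, vw);       // z + w
    __m128d product = _mm_mul_pd(vz, vw);       // z * w
    sum     = _mm_add_pd(vy, sum);              // y + z + w
    product = _mm_mul_pd(vy, product);          // y * z * w
    sum     = _mm_add_pd(vx, sum);              // x + y + z + w
    product = _mm_mul_pd(vx, product);          // x * y * z * w

    _mm_storeu_pd(ret, _mm_div_pd(sum, product));  // sum / product
}
```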

Don't get carried away

A common mistake made by people new to SSE2 is to convert a lot of their old and future code into SSE2. This can actually result in slower code. The reason for this is the very large overhead for the CPU to copy memory to the registers. If you have an application that is doing a small number of operations on a large data set, you can expect to be less efficient than if you are doing a lot of operations on a small amount of data.

Compiling SSE2 with gcc/g++

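The first thing that you need to remember when you want to compile C/C++ code with embedded SSE2 using gcc/g++ is to pass the -masm=intel switch when compiling, because gcc expects AT&T syntax by default. You should also put ".intel_syntax noprefix" in front of your asm code and surround it with quotes like this:

asm(".intel_syntax noprefix\n");
asm(" mov eax, x\n");
asm(" movupd xmm0, [eax+0x00]\n");
asm(" movupd xmm1, [eax+0x10]\n");
asm(" addpd xmm0, xmm1\n");
asm(" movupd [eax+0x20], xmm0\n");

or, as a single statement built from adjacent string literals:

asm(".intel_syntax noprefix\n"
    " mov eax, x\n"
    " movupd xmm0, [eax+0x00]\n"
    " movupd xmm1, [eax+0x10]\n"
    " addpd xmm0, xmm1\n"
    " movupd [eax+0x20], xmm0\n");

The single-statement form is the safer of the two, since gcc does not promise to keep separate asm statements next to each other. Every line needs its own "\n" because a string literal cannot contain a raw line break.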

Note that the asm block is inside "()" not "{}". Also, if you want to refer to a variable declared in your C/C++ code by name, as above, you must define it globally. Variables defined locally, whether inside your main function, inside a for loop, etc., have no symbol name that the assembler and linker can see, so referring to them produces an "undefined reference" error.
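GCC's extended asm syntax offers a way around the global-variable restriction: local variables are passed in as operands and the compiler picks registers for them. The sketch below is an assumption-laden illustration, not the author's method. It uses gcc's default AT&T syntax (operands in src, dest order, registers prefixed with %), so it assumes the file is compiled without -masm=intel, and the function name AddLocal is hypothetical.

```cpp
// Assumes GCC or Clang on x86 or x86-64, compiled WITHOUT -masm=intel.
// AddLocal is a hypothetical name used only for this illustration.
inline void AddLocal(const double *x, const double *y, double *retval)
{
    __asm__ volatile(
        "movupd (%[px]), %%xmm0\n\t"  // xmm0 = the 2 doubles x points to
        "movupd (%[py]), %%xmm1\n\t"  // xmm1 = the 2 doubles y points to
        "addpd  %%xmm1, %%xmm0\n\t"   // xmm0 = xmm0 + xmm1
        "movupd %%xmm0, (%[pr])\n\t"  // store the result where retval points
        :                                             // no output operands
        : [px] "r"(x), [py] "r"(y), [pr] "r"(retval)  // pointers in registers
        : "xmm0", "xmm1", "memory");                  // what the block overwrites
}
```

The clobber list tells the compiler that xmm0, xmm1 and memory are modified, so it does not keep values cached across the block.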

End

Questions? Comments? Suggestions? mis-spellings? Grammatical Errors? Email me at shilindalian@msn.com

 
 
 