[GDK] - [C++] Visual Studio Inline Assembler

WLGfx

Joined: 1st Nov 2007

Location: NW United Kingdom

Posted: 18th Sep 2011 14:24

Visual Studio Inline Assembler x86

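I've finally come back to experiment with the inline assembler within Visual Studio, mainly because I need an extremely fast random number generator, as a lot of my future code will be working with procedural content. Now that I've got a complete but very simple random number generator working, I thought I'd post it.

When using the inline assembler within Visual Studio, an __asm block in a normal function is free to use the EAX, EBX, ECX, EDX, ESI and EDI registers, because the compiler saves and restores the ones it needs. In a __declspec(naked) function there is no compiler-generated prolog or epilog, so EBX, ESI, EDI and EBP have to be preserved by hand if they are used. For an integer result, whatever is left in EAX is the return value.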

Why the inline assembler?

Because hand-written assembler can be extremely fast when it's done well. Simple...

Inline assembler example code

I've left some of the old code commented out so you can see where I started with this.

+ Code Snippet

// in a naked function only EAX, ECX and EDX are free to clobber;
// EBX, ESI, EDI and EBP would have to be saved and restored by hand
// a standard asm function returns its int result in eax

__declspec(naked)
int		frandasm(int seed)
{
	static int rnd_seed;	// never re-defined on subsequent calls (static)

	//if (seed != -1) {

	__asm {
		;push	edx				// the registers don't need preserving
		mov		eax, [esp+4]	// seed argument: naked means no EBP frame, so read it via ESP
		test	eax, eax		// mov doesn't set the flags, so test the value first
		jns		set_seed		// sign flag clear (seed >= 0): store the new seed
		mov		eax,[rnd_seed]
		mov		edx, 0019660dh	// mix its value up
		mul		edx
		add		eax, 3c6ef35fh
		mov		[rnd_seed], eax
		;pop	edx
		ret						// returns eax as the int
set_seed:
		mov		[rnd_seed],eax	// store new seed value
		ret
	}

	//} else rnd_seed = seed;

//	return rnd_seed;
}

You may note the '//' lines in the above code. That's because whilst I was learning the inline assembler I originally mixed it with C++ itself, but at the end of the day I wanted a pure inline assembler function.

The above function is defined as:

+ Code Snippet

int frandasm(int seed = -1);

This allows two types of call: passing a non-negative value sets the seed, and calling with empty parentheses (so the default of -1 is used) returns the next random number.
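For cross-checking the asm routine's output, here is a rough portable sketch of the same generator in plain C++. The name frand_ref is made up for illustration, and it assumes the intended behaviour is "non-negative argument reseeds, -1 steps the generator". It uses unsigned arithmetic, which wraps modulo 2^32 just like the low half of MUL that ends up in EAX.

```cpp
#include <cstdint>

// Hypothetical reference version of frandasm (not from the original post).
// Same multiplier (0019660Dh) and increment (3C6EF35Fh) as the asm.
int frand_ref(int seed = -1)
{
    static std::uint32_t rnd_seed = 0;   // persists between calls

    if (seed >= 0) {                     // non-negative argument: reseed
        rnd_seed = static_cast<std::uint32_t>(seed);
        return seed;                     // the asm also leaves the seed in EAX
    }

    // One step of the linear congruential generator, wrapping at 32 bits
    rnd_seed = rnd_seed * 1664525u + 1013904223u;
    return static_cast<int>(rnd_seed);
}
```

Seeding both versions with the same value and comparing the first few results should show quickly whether the asm is doing what it is meant to.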

Using the inline assembler, this should be a lot faster than using:

+ Code Snippet

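// meant to return a 32 bit unsigned random number; with MSVC's 15 bit rand()
// (RAND_MAX is 0x7fff) bits 15 and 31 of the result are always zero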
#define rand32() ( ( (unsigned int)rand() << 16 ) + (unsigned int)rand() )

To test whether this inline assembler function did actually work I threw a simple test in using Pure GDK (can use GDK too of course):

+ Code Snippet

void test4() {
	// test the asm random number generator
	int v;
	while (windowEvent() != WM_CLOSE && dbEscapeKey() == 0) {
		//
		dbClearScreen();	// GDK use dbCLS()
		frandasm(1);		// init the seed
		for (v=0; v < 40; v++) {
			itoa(frandasm(),gstring,10);	// global string pointer
			dbText(0,v*12,gstring);
		}
		dbSync();
	}
}

As I experiment as well as re-learn assembler programming I will post more examples up and hopefully others will find them useful too.

Warning! May contain Nuts!

WLGfx

Posted: 18th Sep 2011 21:46

Okay, this next function is a bit rubbish, but I'm now getting used to the inline assembler. This time I've mixed C++ and assembly (note the removal of __declspec(naked)).

A few times in the past, and in these forums, I've been told not to worry about optimising my code, but bad habits don't die so quick. So now I'm teaching myself x86 after many years of z80 and 680x0.

With the built-in drawing commands in DBPro, GDK or PGDK you'd struggle to get close to the speed of this:

+ Code Snippet

void	test4a()
{
	int bmptr, bmpit, bmwid;

	dbLockPixels();

	bmptr = dbGetPixelsPointer();
	bmpit = dbGetPixelsPitch();
	bmwid = ipref.width;	// width of current display (32-bit pixels assumed)

	__asm {
		// preserve EAX, ECX, EDX, EBX, original ESP, EBP, ESI, and EDI
		pushad				// give me some registers to play with
							// prob don't need to

		mov		eax,50
		mul		[bmpit]
		add		eax,[bmptr]	// esi = ptr + pit * 50 (start of line 50)
		mov		esi,eax

		mov		eax,0		// original pixel value to write to screen
		mov		ecx,200		// loop counter y (draw 200 lines)
loop_y:	mov		ebx,esi		// ebx = first pixel of this line
		mov		edx,[bmwid]	// loop counter x
loop_x:	mov		[ebx],eax	// write pixel to screen
		add		ebx,4		// next pixel position
		dec		edx
		jnz		loop_x
		add		esi,[bmpit]	// next line: step by the pitch, which can be more than width*4
		add		eax,010101h	// inc colour value
		dec		ecx
		jnz		loop_y

		popad				// restore the registers
	}
	dbUnlockPixels();
}
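As a sanity check for the loop logic, the same fill can be sketched in portable C++ against an ordinary memory buffer. The names fill_gradient and read_pixel are invented for this sketch, and it assumes 32-bit pixels and a pitch given in bytes. The point it illustrates is that each scanline starts at base + row * pitch, which is not necessarily width * 4 bytes after the previous one.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: fills `rows` scanlines starting at `first_row`
// with a grey ramp, one shade per line. `pitch` is in bytes.
void fill_gradient(std::uint8_t* base, std::size_t pitch, std::size_t width,
                   std::size_t first_row, std::size_t rows)
{
    std::uint32_t colour = 0;
    for (std::size_t y = 0; y < rows; ++y) {
        std::uint8_t* row = base + (first_row + y) * pitch;  // start of this line
        for (std::size_t x = 0; x < width; ++x)
            std::memcpy(row + x * 4, &colour, sizeof colour); // write one pixel
        colour += 0x010101u;                                  // next shade
    }
}

// Hypothetical helper: reads back one 32-bit pixel.
std::uint32_t read_pixel(const std::uint8_t* base, std::size_t pitch,
                         std::size_t x, std::size_t y)
{
    std::uint32_t value;
    std::memcpy(&value, base + y * pitch + x * 4, sizeof value);
    return value;
}
```

Running this on a std::vector<std::uint8_t> whose pitch is deliberately larger than width * 4 shows whether the padding bytes at the end of each line are left alone.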

Mr Bigglesworth

Joined: 4th Mar 2008

Posted: 19th Sep 2011 05:53

This looks like something I may want to learn, given its complexity

WLGfx

Posted: 20th Sep 2011 03:10

It's not actually that complex; it just takes more instructions to do something simple like an addition.

To get the instruction reference just google "x86 reference".

I'm just starting to play about with the FPU (Floating Point Unit) so that I can do heavy duty maths calculations much faster than C++ can.

Keeping variables in registers across a small piece of code also speeds up execution, compared with reading and writing memory locations each time.

WLGfx

Posted: 25th Sep 2011 18:31 Edited at: 25th Sep 2011 18:51

Simple use of the FPU:

After a good few hours of fiddling with this tiny little bit of code I did manage to get a result. I had to remove the (naked) and have it return a value in a variable.

+ Code Snippet

//__declspec(naked)	// pure asm routine
float	float_calc(int inum, float fnum) {
	float res;
	static float val=1.5;	// fld takes no immediate operand, so the constant lives in memory
	__asm {
		fld	fnum		// push float to the FPU stack
		fild	inum		// push int to the FPU stack
		fadd	st(0),st(1)	// add them (result in st(0)); fnum stays in st(1)
		fld	val			// push float to the FPU stack
		fmul	st(0),st(1)	// multiply them (result in st(0))
		fstp	res		// pop st(0) result into res
					// note: the sum and fnum are still left on the FPU stack here
	}
	return res;
}
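For checking the result of the FPU routine, a plain C++ equivalent of the same arithmetic might look like the sketch below. The name float_calc_ref is made up here. The x87 keeps intermediates at higher precision than float, so for awkward inputs the last bit could differ from the asm result.

```cpp
// Hypothetical reference version: (inum + fnum) * 1.5, as in the __asm block.
float float_calc_ref(int inum, float fnum)
{
    const float val = 1.5f;                          // same constant as the static in the asm version
    return (static_cast<float>(inum) + fnum) * val;  // add, then multiply
}
```

For example, float_calc_ref(2, 1.0f) is (2 + 1) * 1.5.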

The reason I'm learning the basics of asm and the use of the FPU is that, at the end of my project, I want to be able to optimise some of the heavier calculations.

The FPU gives you direct access to cosine, sine, tangent, square root and a built-in Pi constant (powers have to be built up from fyl2x and f2xm1), which can give a good speed up of heavy calculations.

If the above code was a function that calculated a 3D distance formula and that was required over hundreds to thousands of times during a game loop then the speed increase would help.

Bad habits die hard...

EDIT: The only problem I'm having at the moment is that because I've had to remove the (naked) part the compiler is now adding extra code to save registers which I'm not using in the function. I'm determined to get around this because if a function returns a float value then it should be left on the FPU's stack in st(0).

//__declspec(naked)	// pure asm routine
float	float_calc(int inum, float fnum) {
005DEEC0  push        ebp  
005DEEC1  mov         ebp,esp 
005DEEC3  sub         esp,44h 
005DEEC6  push        ebx  
005DEEC7  push        esi  
005DEEC8  push        edi  
	float res;
	static float val=1.5;	// couldn't push immediate vals so had to use this way
	__asm {
		fld		fnum		// push float to the FPU stack
005DEEC9  fld         dword ptr [fnum] 
		fild	inum		// push int to the FPU stack
005DEECC  fild        dword ptr [inum] 
		fadd	st(0),st(1)	// add them (result in st(0)
005DEECF  fadd        st,st(1) 
		fld		val			// push float to the FPU stack
005DEED1  fld         dword ptr [val (6A86C0h)] 
		fmul	st(0),st(1)	// multiple them
005DEED7  fmul        st,st(1) 
		fstp	res;		// pop st(0) result in res
005DEED9  fstp        dword ptr [res] 
	}
	return res;
005DEEDC  fld         dword ptr [res] 
}
005DEEDF  pop         edi  
005DEEE0  pop         esi  
005DEEE1  pop         ebx  
005DEEE2  mov         esp,ebp 
005DEEE4  pop         ebp  
005DEEE5  ret

Da_Rhyno

Joined: 25th May 2011

Posted: 25th Sep 2011 19:40 Edited at: 25th Sep 2011 20:00

If you need any assistance/tutelage on the FPU instructions, I have a link to a website which, while rather old, is very usable and helped me learn.

Here you are: http://www.website.masmforum.com/tutorials/fptute/

(This is mainly for anyone who may need some help with it, though I probably should have put a ASM tutorial there as well, but the one I used growing up is no longer online.)

WLGfx

Posted: 25th Sep 2011 23:52

I'm not getting very far with using the __declspec(naked) and getting a return from the FPU in st(0) so I'm having to stick with the compiler inserting the prolog and epilog code.

Ah well, a few clock cycles won't matter that much. At least I can get away with (naked) functions with returns as integers.

Apparently you can do it if you're just using MASM but I'm just figuring out x86 after being so used to the 680x0.

@Da_Rhyno - I've got that site already bookmarked and I've been referring to it quite a bit lately. Thank you.

Da_Rhyno

Posted: 26th Sep 2011 02:32

I'm wondering if it's possible that the compiler is sticking something else in that register. The reason I bring it up is because I think that the FPU shares the same registers as the MMX routines IIRC.

WLGfx

Posted: 26th Sep 2011 02:47 Edited at: 26th Sep 2011 02:48

@Da_Rhyno - On the disassembly everything seems fine when using the __declspec (naked) but I just don't get the results expected. I also get two different sets of results when I run my test code in Debug and Release. For the time being I'll have to stick with the restrictions of the MSVC inline assembler.

I did also notice that the st(0) register would be empty after returning from the function during debugging.

On a positive note, I've managed to convert the perlin noise function that is called the most when generating height maps and textures.

The original C++ code:

+ Code Snippet

double PerlinNoise::Noise_old(int x, int y) const
{
	int n = x + y * 57;
	n = (n << 13) ^ n;
	int t = (n * (n * n * 15731 + 789221) + 1376312589) & 0x7fffffff;
	return 1.0 - double(t) * 0.931322574615478515625e-9;/// 1073741824.0);
}

And converted to ASM: (not using memory addressing variables constantly)

+ Code Snippet

double PerlinNoise::Noise(int x, int y) const
{
	static double v1 = 0.931322574615478515625e-9;
	double res;
	int n;

	__asm {		// note: the ^ in the C++ above is bitwise XOR, not "power of"
		// n=x+y*57
		mov	eax,dword ptr [y]
		imul	eax,eax,39h
		add	eax,dword ptr [x]
		mov	ebx,eax				// is n
		// n=(n<<13)^n
		shl	eax,0Dh
		xor	eax,ebx
		mov	ebx,eax				// is n again
		// int t = (n * (n * n * 15731 + 789221) + 1376312589) & 0x7fffffff;
		imul	eax,ebx
		imul	eax,eax,3D73h
		add	eax,0C0AE5h			// 789221
		imul	eax,ebx
		add	eax,5208DD0Dh
		and	eax,7FFFFFFFh
		mov	dword ptr [n],eax
		// return 1.0 - double(t) * 0.931322574615478515625e-9;
		fild	dword ptr [n]
		fmul	qword ptr [v1]
		fld1
		fsubrp	st(1),st
		fstp	qword ptr [res]
	}
	return res;
}
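When hand-converting decimal constants to hex it is easy to drop a digit, so a few static_asserts can confirm each one at compile time. The sketch below also gives a portable reference for the function (noise_ref is a made-up name). It uses unsigned arithmetic, since the original relies on signed integer overflow wrapping around, which standard C++ does not guarantee. The low 32 bits come out the same as with imul.

```cpp
#include <cstdint>

// Each hex constant used in the asm, checked against the C++ source value
static_assert(0x39 == 57, "y multiplier");
static_assert(0x0D == 13, "shift count");
static_assert(0x3D73 == 15731, "inner multiplier");
static_assert(0xC0AE5 == 789221, "inner addend");
static_assert(0x5208DD0D == 1376312589, "outer addend");

// Hypothetical reference version of PerlinNoise::Noise as a free function.
double noise_ref(int x, int y)
{
    std::uint32_t n = static_cast<std::uint32_t>(x)
                    + static_cast<std::uint32_t>(y) * 57u;
    n = (n << 13) ^ n;                                   // ^ is XOR
    const std::uint32_t t =
        (n * (n * n * 15731u + 789221u) + 1376312589u) & 0x7fffffffu;
    return 1.0 - static_cast<double>(t) * 0.931322574615478515625e-9;
}
```

Feeding the same (x, y) pairs to this and to the asm version is a cheap way to confirm the conversion before worrying about speed.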

After reviewing this, there isn't actually any noticeable difference in speed, as the MSVC compiler does a commendable job. That said, if the prolog and epilog code could be removed in MSVC there would be another slight increase. The difference only really comes from being able to keep values in registers instead of memory-located variables during large math formulas.
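To put a number on "no noticeable difference", a small timing helper along these lines could be used. This is only a sketch: time_ms is an invented name, and it assumes a Release build and enough iterations to swamp timer resolution. The results are summed into sink so the optimiser cannot throw the calls away.

```cpp
#include <chrono>

// Hypothetical helper: calls fn(i) for i in [0, iterations) and returns
// the elapsed wall-clock time in milliseconds.
template <typename Fn>
double time_ms(Fn fn, int iterations, double& sink)
{
    const auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (int i = 0; i < iterations; ++i)
        sum += fn(i);                       // keep every result live
    const auto stop = std::chrono::steady_clock::now();
    sink = sum;                             // hand the total back to the caller
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Timing the old and new noise functions with the same iteration count, several times each, gives a fairer comparison than a single run.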

WLGfx

Posted: 27th Sep 2011 00:27 Edited at: 27th Sep 2011 00:34

Converting C++ to Inline Assembler - More FPU code

This time I've managed to figure out what I was doing with converting the Interpolate function from the Perlin Noise class. I had to reverse the 'fmul' as well as pop from the FPU stack.

Before I did convert to inline assembler, I decided to use a different interpolate method, using the Cosine instead of Cubic method so I get an even better speed increase and still get the detail required.

The original C++ function:

+ Code Snippet

double PerlinNoise::Interpolate_cos(double a,double b,double x) const
{
	double f=(1.0-cos(x * 3.1415927))* 0.5;
	return a*(1.0-f)+b*f;
}

The final converted version in Inline Assembler:

+ Code Snippet

double PerlinNoise::Interpolate(double a,double b,double x) const
{
	static double half=0.5f;
	double res;
	__asm {
		fld	qword ptr [x]	// load x
		fldpi			// load PI (FPU's version)
		fmulp	st(1),st	// multiply and just leave result
		fcos			// built in FPU Cosine
		fld1			// built in FPU 1.0000
		fsubrp	st(1),st	// Subtract and just leave result
		fmul	qword ptr [half]// My 0.5
		fstp	qword ptr [res]	// pop and store in res
		fld1			// built in FPU 1.0000
		fsub	qword ptr [res]	// subtract res from 1.0000
		fmul	qword ptr [a]	// multiply by a
		fld	qword ptr [b]	// Load b on stack
		fmul	qword ptr [res]	// multiply b with res
		faddp	st(1),st	// add, leave result in st(1) and pop
		fstp	qword ptr [res]	// store final result in res
	}
	return res;
}
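A quick way to check a cosine interpolation is at its end points and midpoint: x = 0 should give a, x = 1 should give b, and x = 0.5 should give the average. Here is a rough free-function sketch for that (interpolate_ref is a made-up name, and it uses a full-precision Pi closer to what fldpi loads than the 3.1415927 in the C++ original).

```cpp
#include <cmath>

// Hypothetical reference version of the cosine interpolation.
double interpolate_ref(double a, double b, double x)
{
    const double pi = 3.14159265358979323846;
    const double f = (1.0 - std::cos(x * pi)) * 0.5;  // eases from 0 to 1
    return a * (1.0 - f) + b * f;                     // blend a and b
}
```

Comparing these three points against the asm version, with a small tolerance, would catch most FPU stack slips.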

Tracking down errors in inline assembler code!!! A nightmare!!!

In the above code I have these lines, which are now fixed.

+ Code Snippet

		fld	qword ptr [x]
		fldpi
		fmulp	st(1),st

I originally used this:

+ Code Snippet

		fmul	st,st(1)

'fmulp st(1),st' multiplies st(0) by st(1), leaves the result in st(1) and then pops the stack, so just the result is left in st(0). My original 'fmul st,st(1)' doesn't pop, and that was my error: I was leaving values on the stack and getting the wrong results. The FPU stack only has eight registers, so leaked values don't crash the program; once the stack fills up, later loads just produce invalid results. It was only through looking at the disassembled code that I realised this. The learning curve of inline assembler and studying the FPU is proving difficult but not impossible.
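The difference between the two instructions can be seen with a toy model of the FPU stack. This is purely illustrative: FpuStack and its member names are invented, and the real x87 stack is a fixed set of eight registers rather than a growable vector.

```cpp
#include <cstddef>
#include <vector>

// Toy model of the x87 register stack: regs.back() plays the role of st(0).
struct FpuStack {
    std::vector<double> regs;

    void fld(double v) { regs.push_back(v); }           // push a value

    void fmul_st0_st1() {                               // fmul st,st(1): no pop
        regs[regs.size() - 1] *= regs[regs.size() - 2];
    }

    void fmulp_st1_st0() {                              // fmulp st(1),st: multiply, then pop
        regs[regs.size() - 2] *= regs[regs.size() - 1];
        regs.pop_back();
    }

    double fstp() {                                     // store st(0) and pop it
        const double v = regs.back();
        regs.pop_back();
        return v;
    }

    std::size_t depth() const { return regs.size(); }   // values still on the stack
};
```

Pushing two values, multiplying and storing the result leaves depth() at 1 with fmul_st0_st1 but at 0 with fmulp_st1_st0, which is exactly the leaked value described above.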

IanM

Retired Moderator

Joined: 11th Sep 2002

Location: In my moon base

Posted: 3rd Oct 2011 22:05

Quote: "I had to remove the (naked) and have it return a value in a variable."

IIRC, any floating-point type is returned in ST(0). In addition, you need to ensure that your floating-point stack is empty when your function returns (except for the possible return value).

You can do that by issuing an 'finit' at some point (again taking care not to lose any floating-point return value).

Here's a quick breakdown of the return registers.
float / double - returned in ST(0)
up to 32 bits - returned in EAX
up to 64 bits - returned in EAX (low) / EDX (high)
over 64 bits - done by storing the value in memory and returning a pointer to that memory in EAX.

DBPro's float values are returned as if they were 32 bit values rather than floats (ie, in EAX rather than ST(0)).

Quote: "if the removal of the prolog and epilog code was possible in MSVC"

Project properties -> C/C++ -> Optimisation -> Omit Frame Pointers = Yes

This frees up the EBP register too, so potentially means more variables in registers during calculations and therefore faster processing.

TBH though, although it's a great learning experience, unless you know the actual timing of machine code instructions and their optional modifiers, and how each one affects instruction pipelining, caching etc, then you might as well stick with highly optimised C++ - YMMV.

For instance, I don't use machine code in my plug-ins, except where it can do something that C++ itself can't (for instance, returning both double, 8 byte and 4 byte results to DBPro at the same time when there's no way to know what the caller is expecting).

Get the latest Matrix1Utility plug-ins (11-SEPT-2011)

WLGfx

Posted: 4th Oct 2011 04:33

It's definitely an old habit of mine as all my old stuff on the Atari and the Amiga was done in C and 680x0 assembler code. And while I've been reading up on this new stuff, I've learned a lot about the pipelines and caches. Mainly using integer registers I've found you can get almost blitter like speed with optimised code, which has made me think about memory manipulation and transfers, especially with bitmaps. Which has made me start to learn some more. Now with the dual core etc out, some of the code can run better than almost "ghost mode" like.

Just like the old days (20 years plus ago), most of my code will be in C, with end-of-the-line optimisations done in assembly, and only when I need to speed something important up. This learning curve has opened my eyes to this new technology, and being able to have up to 5 or 6 instructions run in a single clock cycle is just amazing.

I've got lots more ideas for DBP plugin features but I will have to study such things as accessing DBP itself from within the plugin, ie for bitmaps, images, sounds, object, etc.

My main project is soon about to start after many months of toying with this, learning that, and it has come in very useful finally re-learning assembly code. Although the C++ side is still an issue, as one of the game programming guides says, C code is good enough. I've enough knowledge about C++ to pull it off. And along the way I'll very likely still be trying out new things.

I'm actually going to start using a separate assembler soon instead of using the inline assembler for some of the changes, as the inline assembler is actually restricted on many things that MASM isn't. Such as creating a normal function to return FPU ST(0). Slightly difficult in MSVC. It also has better control for segments. All still a part of re-learning it all over again.

The FPU I've almost mastered after many experiments. Something I never used to do on the Amiga. MMX and SSE instructions I've not looked into yet but will get around to it.

As always, I'll always code in C/C++ then polish it later on.

Mental arithmetic? Me? (That's for computers) I can't subtract a fart from a plate of beans!