The following is another selection of blog posts from the development of PlayBASIC to DLL over the past months.
[large]PlayBASIC To Dll - ARGB & RGB Colour Crunching[/large] (October 02, 2013)
Rounded up the RGB + ARGB colour crunching routines after lunch. The generator looks at the parameters and tries to pre-compute any literal fields. This can reduce some calls to as little as 3 instructions, which is about 1/3 of what it costs to call a function, let alone perform its action.
A function call like Result=ARGB(255,255,G,255) would normally push 4 fields onto the stack, call the ARGB function and then handle the result. Now it's solved in only 4 opcodes, assuming the parameters are integers to begin with. If not, they have to be recast first.
Tested all the combinations and they're all faster inline. The following tidbit outputs less than 1/2 the code that it would have originally:
ThisRGB=ARGB(255,255,255,255)
ThisRGB=ARGB(255,255,255,b)
ThisRGB=ARGB(255,255,g,255)
ThisRGB=ARGB(255,255,g,b)
ThisRGB=ARGB(255,r,255,255)
ThisRGB=ARGB(255,r,255,b)
ThisRGB=ARGB(255,r,g,255)
ThisRgb=ARgb(255,r,g,b)
ThisRGB=ARGB(a,255,255,255)
ThisRGB=ARGB(a,255,255,b)
ThisRGB=ARGB(a,255,g,255)
ThisRGB=ARGB(a,255,g,b)
ThisRGB=ARGB(a,r,255,255)
ThisRGB=ARGB(a,r,255,b)
ThisRGB=ARGB(a,r,g,255)
ThisRgb=ARgb(a,r,g,b)
Tested both methods against some competitors and it's 5.5 times faster than one and 2.2 times faster than another. But I knew that already.
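To picture the literal pre-computation, here's a minimal sketch in Python rather than PlayBASIC. It is not the actual generator. It assumes ARGB packs four 8-bit fields as A<<24 | R<<16 | G<<8 | B and masks each field to 8 bits, and the names `plan_argb` and `run_plan` are made up for the illustration.

```python
def plan_argb(a, r, g, b):
    """Translate-time step. Each argument is an int literal or a variable name (str).

    Literal fields are folded into one constant; only variable fields
    are left as work for run time.
    """
    const = 0
    runtime_ops = []
    for value, shift in ((a, 24), (r, 16), (g, 8), (b, 0)):
        if isinstance(value, int):
            const |= (value & 0xFF) << shift      # folded now, costs nothing later
        else:
            runtime_ops.append((value, shift))    # needs a mask/shift/or at run time
    return const, runtime_ops


def run_plan(plan, variables):
    """Run-time step: start from the folded constant and merge the variable fields."""
    const, runtime_ops = plan
    result = const
    for name, shift in runtime_ops:
        result |= (variables[name] & 0xFF) << shift
    return result
```

With ARGB(255,255,g,255) the plan is one constant plus a single variable field, which is roughly why the call shrinks to a handful of opcodes.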
[large]PlayBASIC To Dll - BASIC 2D Graphics Commands[/large] (October 04, 2013)
The weather has finally started to fine up, so I've been out doing such useful things as collecting firewood and training for the 100K challenge event. It's a shame my bike keeps breaking down. But between the day to day grind I've been dropping in some more command sets. The first was the intersection library, which includes things like LinesIntersect and various other vector operations. While doing that, I found the translator would emit integer to float casts in the generated code wherever a function wanted float parameters.
So an expression like Status=LinesIntersect(00,0,800,600,800,0,0,600), where all the parameters are literal integers, would spit out an int->float cast for each parameter, then push the temp floats onto the stack. Which is pretty wasteful when you consider the translator could do this operation at compile time, rather than making code that does it at run time. Trapping such situations just makes it produce cleaner and ultimately faster code.
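Here's a rough Python sketch of that idea, assuming a translator that emits a simple list of operations. The op names (`push_float`, `cast_int_to_float`) and the function `lower_float_args` are invented for the illustration and aren't the real translator's output.

```python
def lower_float_args(args):
    """Emit ops for a call whose parameters are all floats.

    args holds numeric literals (int/float) or integer variable names (str).
    Literals are converted while translating; variables still need a
    run-time cast into a temp before being pushed.
    """
    ops = []
    for arg in args:
        if isinstance(arg, (int, float)):
            # Literal: do the int->float conversion now, emit a single push.
            ops.append(("push_float", float(arg)))
        else:
            # Variable: value unknown until run time, so cast then push the temp.
            ops.append(("cast_int_to_float", arg))
            ops.append(("push_float", "temp"))
    return ops
```

For the all-literal LinesIntersect call above, that is 8 pushes and no casts at all.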
Been dropping in 2D graphics commands at lunch time today. These are primitive stuff like DOT, LINE, BOX etc. These commands are VM bound, which means they're actually part of the VM instruction set, rather than being nice easy functions I can point at. So to call them, I've got to write wrappers for most if not all of them, but the up side is that once the wrapping is done, calling a command from an external DLL is the same as calling it from the VM.
Here's a test code snippet. Running it as is (purely on the VM) gives about 19->20 fps on my Athlon system. If I compile the function DLL_FASTDOTFLL, it runs at about 55 fps. Which is not too bad when you consider the equivalent loop in C/C++ is only 5 fps faster.
t=timer()
For demolp=0 to 1000
lockbuffer
ThisRGb=point(0,0)
DLL_FASTDOTFLL(0,0,GetSCreenWidth()-1,GetScreenHeight()-1)
unlockbuffer
Sync
next
print timer()-t
Sync
waitkey
Function DLL_FASTDOTFLL(x1,y1,x2,y2)
Static ThisColour
CurrentRGB=ThisCOLOUR
ThisColour++
for ylp=y1 to y2
For xlp=x1 to x2
fastDot xlp,ylp,CurrentRGB
next
CurrentRGB++
next
EndFunction
Of course the demo is pretty silly, we're basically calling a function hundreds of thousands of times to draw strips of colour. It's much more efficient to lock the buffer and poke 32 bits into it directly.
To really drive the point home, do the same thing with BOXC and the VM version already runs at 230->250 FPS. Converting that to DLL would give us some more FPS, but not a lot.
t=timer()
For demolp=0 to 1000
lockbuffer
ThisRGb=point(0,0)
DLL_FASTDOTFLL2(0,0,GetSCreenWidth()-1,GetScreenHeight()-1)
unlockbuffer
Sync
next
print timer()-t
Sync
waitkey
Function DLL_FASTDOTFLL2(x1,y1,x2,y2)
Static ThisColour
CurrentRGB=ThisCOLOUR
ThisColour++
for ylp=y1 to y2
boxc x1,ylp,x2,ylp+1,true,CurrentRGB
CurrentRGB++
next
EndFunction
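The gap between the two demos comes mostly from how many times a drawing command gets called. Here's a small Python sketch that fills a pretend frame buffer both ways and counts the calls. The flat list standing in for the screen and both function names are assumptions made for the example.

```python
def fill_per_pixel(width, height, colour):
    """One drawing call per pixel, like the FastDot loop."""
    buf = [0] * (width * height)
    calls = 0
    for y in range(height):
        for x in range(width):
            buf[y * width + x] = colour & 0xFFFFFFFF
            calls += 1
        colour += 1          # next strip gets the next colour
    return buf, calls


def fill_per_row(width, height, colour):
    """One drawing call per row, like the BoxC loop."""
    buf = [0] * (width * height)
    calls = 0
    for y in range(height):
        start = y * width
        buf[start:start + width] = [colour & 0xFFFFFFFF] * width
        calls += 1
        colour += 1
    return buf, calls
```

Both produce the same picture, but at an assumed 800*600 screen the first makes 480,000 calls per frame and the second makes 600, so there is far less call overhead left for machine code to remove.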
[large]PlayBASIC To Dll - File Streams, Surfaces & Trig Commands Go online[/large] (October 10, 2013)
The last day or so has been pretty productive. Solved the issues with the PRINT/TEXT statements from the translator, added INK/PEN command set wrappers, added the surface control command set (stuff like RenderToImage, lockbuffer etc), added the Trig function library (cos/sin stuff) and finally got to the File Stream command set. I'm not sure of the exact number, but that brings us up to about 190->200 commands. I've tested the big stuff, but not every single command, so there's bound to be a few teething problems.
The file stream commands cover sequential and random access file handling, and are another one of those counter intuitive areas where we see a number of programming myths come to the forefront. Disc access is an aspect of programming where, as hardware improves, so do the assumptions. People seem to forget that accessing data on a spinning disc imposes a seek time. Every time you grab a chunk of data, part (if not most) of that access time is spent just waiting for the device to find it.
Take a look at this code fragment. The function is literally spooling a file into a memory bank byte by byte. Every time you read a byte, that action also includes a seek. If you test it with a small file the function performs OK, but the bigger the file, the more seek overhead there is and the slower loading becomes.
Function DLL_ReadFileToBank(Filename$,File_Size)
Thisbank=NewBank(File_Size)
fh=readnewfile(filename$)
if fh
ptr=getbankptr(ThisBank)
For lp =0 to File_Size-1
PokeByte Ptr+lp,ReadByte(fh)
next
Closefile fh
endif
EndFunction ThisBank
Compiling the loop to machine code would make it faster, right?
Nope. When you test the two versions side by side, loading the same file [color=green]100 times[/color], the machine code version takes [color=green]2100 milliseconds[/color], compared to [color=green]2245 milliseconds[/color] on the VM. Why?... Well, because the speed of the for/next loop isn't the issue here. It's simply that every byte you fetch has a seek time from the spinning hard drive. So the bottleneck is the mechanical device (the hard drive).
The solution is to change the size of the blocks you read from the disc. Reading Integers is effectively 4 times faster than reading bytes, since there's only 1 seek for every 4 bytes. So it makes sense that the bigger the block you fetch, the more data will be burst from the device into memory in one hit. Hence why we have the ReadMemory command. This command fetches a couple of K per hit (going from memory), which virtually negates the seek time impact upon the overall load time.
This version of the function can load the same file 100 times in 22 milliseconds.
Function DLL_ReadFileToBank_PBReadMemory(Filename$,File_Size)
Thisbank=NewBank(File_Size)
fh=readnewfile(filename$)
if fh
ptr=getbankptr(ThisBank)
ReadMemory fh,Ptr,FIle_Size
Closefile fh
endif
EndFunction ThisBank
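The same comparison can be sketched outside PlayBASIC. The Python below is only an illustration of the per-read overhead, not a reproduction of the timings above. It opens the file unbuffered so each read really is a separate request to the OS, and all the function names are made up for the example.

```python
import os
import tempfile


def read_bytewise(path, size):
    """Fill the bank one byte per read request, like the ReadByte loop."""
    bank = bytearray(size)
    reads = 0
    with open(path, "rb", buffering=0) as fh:
        for i in range(size):
            bank[i] = fh.read(1)[0]
            reads += 1
    return bank, reads


def read_block(path, size):
    """Fill the bank with as few large reads as possible, like ReadMemory."""
    bank = bytearray(size)
    view = memoryview(bank)
    filled = 0
    reads = 0
    with open(path, "rb", buffering=0) as fh:
        while filled < size:
            got = fh.readinto(view[filled:])
            if not got:
                break            # hit end of file early
            filled += got
            reads += 1
    return bank, reads


def demo(size=4096):
    """Write a temp file, load it both ways, report (same data?, reads, reads)."""
    data = bytes(i % 256 for i in range(size))
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        a, reads_a = read_bytewise(path, size)
        b, reads_b = read_block(path, size)
    finally:
        os.remove(path)
    return bytes(a) == data and bytes(b) == data, reads_a, reads_b
```

Both loaders return identical data. The only difference is the number of read requests, which is where the load time goes.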
[large]PlayBASIC To Dll - Maps[/large] (October 28, 2013)
It's been a busy few days in the real world but the mapping support is slowly getting done. Needed to rearrange the command blocking a little to make it easier to export. Which kind of feels like groundhog day but the end result is easier to deal with. Was necessary as some of the query functions had been blocked into a general set. Now mapping commands with a common purpose are in sub groups.
In exported programs, the actual DLL only initializes the command sets you use. So if you don't use banks say, then your dll doesn't need those hooks. With stuff like maps which are broken into 4 sub command sets, the same benefit applies. If you don't use collision or occlusion commands, then the exported dll doesn't need that block of hooks. Which helps make the exported libraries smaller. Previously if you used any map command the entire command set table would be exported. Just extra bloat for nothing.
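As a rough picture of that, here's a Python sketch that works out which hook blocks an exported library would need from the commands it actually calls. The command names are ones that appear in these posts, but the grouping into sets and the set names are purely hypothetical and don't reflect the real command tables.

```python
# Hypothetical grouping of commands into command sets (lower case keys).
COMMAND_SETS = {
    "newbank": "banks",
    "getbankptr": "banks",
    "pokebyte": "banks",
    "readnewfile": "files",
    "readbyte": "files",
    "readmemory": "files",
    "closefile": "files",
    "fastdot": "graphics",
    "boxc": "graphics",
    "point": "graphics",
}


def hooks_needed(used_commands):
    """Return the sorted list of command sets that need their hooks exported."""
    return sorted({COMMAND_SETS[name.lower()] for name in used_commands})
```

A program that only draws would pull in one block of hooks, and nothing for banks or files.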
Size wise the DLLs are pretty small. The current test project, for example, is around 1800 lines and produces an 80K byte code file. When we export this to assembly, it creates about 250K of assembly, which assembles down to about 50K of DLL. The DLLs don't need the data tables etc in them, so they should generally be smaller than the total byte code. The byte code segment is only about 10K though. The code distribution is pretty similar. Some stuff is smaller in byte code, where other stuff ends up smaller in machine code. So it'll balance out.
[large]PlayBASIC To Dll - Select Statement Support[/large] (November 19, 2013)
Unlike most of the control statements, Select/Case statement blocks are only now being implemented in the translator. PlayBASIC's support for Select Cases is rather unique. Generally select/switch statement blocks are explicitly literal, so cases must be constant. Which helps the compiler back end when producing the appropriate logic. The assumption most programmers make is that their low level compiler is building a jump table, but this actually isn't always true. Visual C/C++ produces a range of different solutions for switch statements, since the case values have to fit the jump table model, and even then it often doesn't build one.
In most BASIC languages (PB included), select/case blocks are generally nothing more elaborate than IF/EndIF structures. The benefit is that the 'select variable' can be cached in a register, unlike a block of user IF/Then statements. The caching removes most of the memory access when falling through the structure. Converting this logic directly to assembly is a cake walk and will certainly perform very well. Hey, if it's good enough for Visual C it's good enough for us.
However, the PlayBASIC compiler & runtime support variables, expressions, floats and even strings in case statements. So you can really mix anything you like into them (within reason). This means the translator has to look ahead at any following case statements when producing code, to try and suss out if a register can be cached or not. The current version doesn't bother with that for the minute, just wanted to get the structure up and running first.
Here's the current working test code, in this example we're building an Integer select on the A Variable with literal case matching.
Function DLL_SelectCase(A,b#,s$)
Select A
case 0
print "a=0"
case 1
print "a=1"
case 5 to 2
print "a=2 to 5"
case 10,6,7,8,9
print "a=6,7,8,9,10"
default
print "no Match"
EndSelect
EndFunction
The translator already includes generation logic to pre-flip literal case terms (so case 5 to 2 becomes 2 to 5) and to sort + recast case rows into order. So in the case 10,6,7,8,9 line, the literals are pre-sorted. If the set has more than 3 values, it inserts a bounds check for you. So if the select variable is outside the range, it moves on without falling through this block of compares.
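A small Python sketch of that clean-up step, under the assumption that a case row is just a list of literal values and (low, high) ranges. `normalise_row` and `row_matches` are illustrative names, not the translator's own.

```python
def normalise_row(terms):
    """terms: int literals, or (a, b) tuples for 'a to b' ranges.

    Returns (ranges, values, bounds) where ranges are flipped into
    low->high order, values are sorted, and bounds is a (min, max)
    pre-check that is only added when the row has more than 3 values.
    """
    ranges = sorted((min(t), max(t)) for t in terms if isinstance(t, tuple))
    values = sorted(t for t in terms if not isinstance(t, tuple))
    bounds = (values[0], values[-1]) if len(values) > 3 else None
    return ranges, values, bounds


def row_matches(a, row):
    """Test the select variable against one normalised case row."""
    ranges, values, bounds = row
    for low, high in ranges:
        if low <= a <= high:
            return True
    if bounds is not None and not (bounds[0] <= a <= bounds[1]):
        return False                      # outside the set, skip every compare
    return any(a == v for v in values)    # otherwise fall through the compares
```

So case 5 to 2 ends up as the range (2, 5), and case 10,6,7,8,9 becomes the sorted set 6..10 with a bounds check in front of it.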
Edit #1:
Working on dynamic versions of the select statement block. The dynamic versions give the exporter support for variables in case statements. The current focus is getting good range detection without branching. Have worked out a way, but its behaviour is different from the VM. The VM treats the range as inclusive->inclusive, but the assembly version is currently inclusive->exclusive. Which might not sound like a big deal, but it means the same program behaves differently between the two.
Function DLL_IntegerSelectCaseTest(A )
b#=123.456
Variable=15
Variable2=20
r1=50
r2=60
Select A
case 0
result=0
case 1
result=1
case 5 to 2
result=2
case 10,6,7,8,9
result=3
case 66.5
result=4
case Variable
result=5
case B#
result=6
case Variable to 20
result=7
case r1 to r2
result=8
default
result=-1
EndSelect
EndFunction result
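On the range detection mentioned in Edit #1, one well known way to test a range with a single compare is to subtract the lower bound and compare as unsigned. I don't know that this is what the exporter does, so treat the Python below as a sketch of the general trick only. It assumes 32-bit integers and low <= high, and it shows how small the difference between the inclusive and exclusive versions is.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic


def in_range_inclusive(a, low, high):
    """low <= a <= high, as one unsigned compare (VM style behaviour)."""
    return ((a - low) & MASK) <= ((high - low) & MASK)


def in_range_exclusive(a, low, high):
    """low <= a < high, as one unsigned compare."""
    return ((a - low) & MASK) < ((high - low) & MASK)
```

With case Variable to 20 and Variable=15, a value of 20 matches the inclusive form but not the exclusive one, which is exactly the sort of mismatch between VM and assembly described above.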
[large]PlayBASIC To Dll - Flex Rotater[/large] (December 03, 2013)
This is something of a tech demo, much like the fractal render from a while back. The idea here is to take a bigger chunk of code (in this case an old Amiga demo effect) and build that into a DLL library. The routine is an old z rotater styled effect where the spans are rotated and the result is drawn pixel by pixel to the screen. The render loop is interpolating 800*600 pixels and drawing them via the awfully generic FastDot. Moreover, the texture fetch is just reading a 2D integer array. So the code was slapped together to work, rather than to be fast.
Running the original routine on the PB VM returns about a 500 plus millisecond refresh on my 8 year old (single core) Athlon system. The first conversion to DLL cuts that to about 61-62 milliseconds. I suspect that once the array reading and FastDot are removed, it'll be possible to at least double, possibly triple, that speed. The only trouble at the moment is the machine code version isn't acting the same as the VM version. So there's some math operation not behaving the same way.
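For anyone who hasn't met this kind of effect, here's a generic rotozoomer style sketch in Python. To be clear, it's not the Flex Rotater source, just the textbook version of the idea: each screen row is a span through the texture, stepped by a rotated delta per pixel, with the texture read from a 2D integer array. The function name and parameters are assumptions for the example.

```python
import math


def rotate_spans(texture, width, height, angle, scale=1.0):
    """Return a height x width frame sampled from a tiled, rotated texture."""
    tex_h = len(texture)
    tex_w = len(texture[0])
    # Texture-space step for moving one pixel to the right on screen.
    du = math.cos(angle) * scale
    dv = math.sin(angle) * scale
    frame = []
    for y in range(height):
        # Start of this span: moving one row down steps by (-dv, du).
        u = -y * dv
        v = y * du
        row = []
        for x in range(width):
            # Texture fetch from the 2D array, wrapped so the texture tiles.
            row.append(texture[math.floor(v) % tex_h][math.floor(u) % tex_w])
            u += du
            v += dv
        frame.append(row)
    return frame
```

At 800*600 that inner loop runs 480,000 times per frame, so the per pixel texture fetch and plot are where nearly all of the time goes.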
Below is an example of what the original effect running on the Amiga looks like.