PBFX V1.73b - Instruction Set Cont.
It shouldn't come as any great surprise, but today the instruction set journey continues. As outlined earlier, my plan is to have unknown opcodes (from VM2) fall harmlessly back to VM1. Making this work means bridging how VM1 stores data, i.e. making the data VM2 friendly. This is needed so that both sides see the same data (opcodes/variables etc.); without that, the two sides are unaware of each other. For example, if you set a variable to a value or do a basic math operation, that runs on the VM2 side. If you then use this value with, say, a built-in function/command, the runtime currently has to eject from VM2 and run the command through VM1. To do this, the parameter values need to be grabbed from (and sent back to) the VM2 runtime, so each such operation has to be patched to read from the VM2 tables. Otherwise we'd be sending it whatever's in the VM1 tables, which should be nothing.
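To make that a bit more concrete, here's the general technique in simplified C. This is purely an illustration, the names (VarTable, run_vm1, run_vm2) are made-up shorthand and not the actual PBFX internals: VM2 executes the opcodes it knows and ejects everything else back to VM1, with both sides reading and writing the same variable tables.

/* Simplified sketch, not the real PBFX source. Both VMs share the
   same variable banks, so values written by VM2 opcodes are visible
   when an unknown opcode ejects back to VM1 (and vice versa). */
typedef struct {
    int   *ints;    /* shared integer variable bank */
    float *floats;  /* shared float variable bank   */
} VarTable;

enum { OP_ADD = 0 /* , ... other VM2-native opcodes ... */ };

/* Stub: the real thing would decode and run one VM1 instruction,
   then return the new program counter. */
static int run_vm1(int *code, int pc, VarTable *vars)
{
    (void)code; (void)vars;
    return pc + 1;
}

static void run_vm2(int *code, int len, VarTable *vars)
{
    int pc = 0;
    while (pc < len) {
        switch (code[pc]) {
        case OP_ADD:  /* handled natively on VM2 */
            vars->ints[code[pc+1]] = vars->ints[code[pc+2]]
                                   + vars->ints[code[pc+3]];
            pc += 4;
            break;
        default:
            /* Unknown opcode: fall harmlessly back to VM1. Because
               the tables are shared, VM1 sees exactly the values
               VM2 just produced. */
            pc = run_vm1(code, pc, vars);
            break;
        }
    }
}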
So far we've only got a handful or so more operations running on VM2 than before, but more importantly the bridging between the two has improved a lot. This combination is enabling it to run more complicated programs than it could yesterday: basic loops, decisions, some data structures (simple arrays), math/string operations and (built-in) command calling. That's enough to get it running stuff like the standard bench test. Which, to my surprise, after a little tinkering, runs. The surprising thing about it, aside from it working, was that it's already slightly faster than 1.72d. About a third of a second faster (screen shots attached). Which is odd, as I was expecting it to be about a second slower. Why? Well, the basic math operations are running on VM2 but the loop controls are still running on VM1, so it's changing contexts constantly. I think in this case the speed up just happens to come from doing the group of math operations on VM2. i.e..
ts=Timer()
For i = 0 To MaxTests
	temp=(temp+1)*lp/(lp+1)-lp
Next i
t=Timer()-ts
Results#(test)=Results#(test)+t
Print "For Next: Add + Sub + Mult + Div:"+Str$(Results#(test)/frames)
test=test+1
This is pretty encouraging, as it should mean we'll still see a fairly reasonable code processing improvement when PB is finally running pure VM2 code. While I knew VM2 was going to be a faster design (I've had VM2 tech demos that date back 3 years!), I was a tad worried that the performance gain might not be as great today when compared to the modern VM1 design. Originally VM2 was about 3 to 4 times faster than VM1, when they were first envisaged. Today, I think a more realistic outlook is to hopefully make VM2 somewhere in the 25% to 50% range faster than VM1.
Anyway, before you get your hopes up just yet, it's a long way from that today. When I run some other simple bench marks (such as FastDot), we see the expected slow down from the massive context changes. In the official FastDot test, we see a slow down from drawing 800*600*32 pixels at 21fps down to 12fps. i.e.
LockBuffer
c2=Point(0,0)
For ypoint=0 To h-1
	c2=c
	For xpoint=0 To w-1
		FastDot xpoint,ypoint,c2
	Next
	c=c+xpoint+1
Next
UnLockBuffer
Currently this loop is VM1 biased; it's still calling VM2 but falling out immediately, which really eats up the time. I'm surprised it runs as fast as it does actually. I thought it was going to be much slower.. but anyway.. back to it.
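To give a feel for where the time goes, each built-in command call from VM2 currently involves something like the following round trip (again simplified C with illustrative names, not the real source):

/* Simplified sketch of the eject path, not the real PBFX source. */
typedef struct { int *ints; } VarTable;

/* Stub standing in for VM1's built-in command handler. */
static void vm1_command(int cmd, int *args, int count)
{
    (void)cmd; (void)args; (void)count;
}

static void eject_to_vm1(int cmd, VarTable *vars,
                         int *params, int count)
{
    int args[8];
    /* 1. Marshal each parameter out of the shared VM2 tables. */
    for (int i = 0; i < count && i < 8; i++)
        args[i] = vars->ints[params[i]];
    /* 2. Context switch: hand off to VM1's command handler. */
    vm1_command(cmd, args, count);
    /* 3. Fall back into the VM2 decode loop. In the FastDot test,
          this round trip happens for every single pixel. */
}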
PBFX V1.73 - Looping On VM2.
The opcodes are broken up into instruction levels, banks if you will. This helps VM2 pick its way through the program code quickly. Previously the only code on VM2 was level 0 and level 4, which are the basic math operations, compares and a few other bits and bobs. So today I've been moving the loop controls over. Doing this requires changing how the compiler outputs the loop opcodes, as VM2 uses a different set of opcodes for looping than VM1. The first port of call has been the FOR/NEXT control, the foundation of pretty much every BASIC program. So it's important this structure is as lean as possible, both in terms of memory and VM readability.
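Roughly speaking (simplified C again, illustrative rather than the actual VM2 opcodes), a lean FOR/NEXT comes down to two opcodes, with NEXT doing the increment, compare and branch in a single step:

/* Simplified sketch of a compact FOR/NEXT opcode pair -
   not the actual PBFX VM2 instruction set. */
typedef struct { int counter, limit, step, body_pc; } ForFrame;

enum { OP_FOR = 100, OP_NEXT = 101 };

/* OP_FOR operands: start, limit, step.
   OP_NEXT: increment + compare + branch in one opcode. */
static int step_loop(int *code, int pc, ForFrame *f)
{
    switch (code[pc]) {
    case OP_FOR:
        f->counter = code[pc+1];
        f->limit   = code[pc+2];
        f->step    = code[pc+3];
        f->body_pc = pc + 4;       /* first opcode of the loop body */
        return f->body_pc;
    case OP_NEXT:
        f->counter += f->step;
        if (f->counter <= f->limit)
            return f->body_pc;     /* jump straight back to the body */
        return pc + 1;             /* loop finished, fall through */
    }
    return pc + 1;
}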
The results have been very pleasing, with today's alpha able to outperform yesterday's fastest build (running the standard test) by around another half a second. It routinely runs the standard test now in 1.6 seconds, at least for the short term. The test represents approximately 200*40000*20 cycles, which is 160 million plus total cycles. For the sake of comparison, that's only 0.1 of a second slower than PB's two main 'machine code' (cough) competitors. Not that that really means much.
On the subject of native code: coming from almost 20 years of assembly background, it's been difficult to fight the urge to move down the source-to-native-code compilation road. Native code generation makes perfect sense on static/fixed architectures, such as consoles; systems where the hardware environment is uniform, same cpu, same memory, same video/sound etc. The PC is not such an environment. I wouldn't be that surprised if the pre-compilation model becomes obsolete on the PC.
Why? Because of the near-infinite array of target hardware. The crux of the problem is that when compilers generate native machine code, they favor certain series of instructions for particular operations. They do this under the assumption these operations are the fastest, shortest way of achieving a particular result. While this might indeed be true on the CPU targeted at compile time, that targeting can disadvantage the code when running on other flavors of cpu. Which is further compounded when you factor in memory speed / cache, which is unknown at compile time, since the program is built upon the developer's PC and not the target user's.
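A trivial C example of the kind of thing I mean (the exact instruction choices vary by compiler and tuning flags, so treat the comments as typical rather than guaranteed):

/* The same source line can come out as quite different machine code
   depending on which CPU the compiler is tuned for. */
int times5(int x)
{
    /* One target may get:   lea eax,[rdi+rdi*4]   */
    /* another may prefer:   imul eax,edi,5        */
    /* Whichever is "fastest" on the build machine may
       not be fastest on the user's actual CPU. */
    return x * 5;
}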
This is fine when building code for today's super scalar behemoths, but those are obsolete models as we move towards parallel multi / mini core environments; ones that use simpler, reduced instruction set cores with absolutely no modern frills. These cores are a world away from the p3/p4/64bit and dual/quad core systems of today. Which brings up an obvious question: how on earth can any compiler produce optimal native code today for the future? While I certainly don't have an answer, the model that makes the most sense to me would be a modular, compilation-upon-demand based architecture. The trouble is, if code generation is then 'fixed', it might as well be pre-compiled to native. So the generation engine would need to build the best version of the source routine it can for the given environment. hmmm the mind boggles...
Anyway, here's another screenie.