Pulling the image data from the device can be very expensive (the GPU and CPU aren't on the same bus), so it's not an ideal structure for doing something in real time.
Anyway, the cost per pixel in the GetImageRGBA function seems pretty high, with two loops and some redundant calcs inside the inner loops.
So you should be able to restructure it and improve the throughput of that part of the code, but the bottleneck may well be pulling the data from video memory to begin with, or freeing the buffer each time.
So, taking a look at this part:
For n = startByte To imgDataSize-1
pixelArray[n-startByte] = GetMemblockByte(imgMem, n)
Next
For i = 0 To pixelData.Length
pixelData[i].red = pixelArray[(i*4)]
pixelData[i].green = pixelArray[(i*4)+1]
pixelData[i].blue = pixelArray[(i*4)+2]
pixelData[i].alpha = pixelArray[(i*4)+3]
Next
The second loop repeats the pixel offset calc (i*4) for every channel, so it'd be easier on the runtime to only compute this once per pixel.
For n = startByte To imgDataSize-1
pixelArray[n-startByte] = GetMemblockByte(imgMem, n)
Next
For i = 0 To pixelData.Length
; compute offset once, removes 3 calcs per pixel
Offset= i*4
pixelData[i].red = pixelArray[Offset]
pixelData[i].green = pixelArray[Offset+1]
pixelData[i].blue = pixelArray[Offset+2]
pixelData[i].alpha = pixelArray[Offset+3]
Next
Depending upon the runtime, FOR/NEXT loops can either compute the END expression every loop, or they'll pre-compute the END loop value once and protect it within a local. If it computes the expression every loop, then some free speed can be had just by computing the end values up front.
imgDataSize_MinusOne =imgDataSize-1
For n = startByte To imgDataSize_MinusOne
pixelArray[n-startByte] = GetMemblockByte(imgMem, n)
Next
; make sure we only compute size once
pixelDataSize = pixelData.Length
For i = 0 To pixelDataSize
; compute offset once, removes 3 calcs per pixel
Offset= i*4
pixelData[i].red = pixelArray[Offset]
pixelData[i].green = pixelArray[Offset+1]
pixelData[i].blue = pixelArray[Offset+2]
pixelData[i].alpha = pixelArray[Offset+3]
Next
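The problem with this is there are still two loops going over the data: the first runs once per byte and the second once per pixel. So a 256*256 image gives 256*256*4 + 256*256 loop iterations, which is a lot of empty overhead for a runtime to soak up.
You can get rid of most of that looping just by merging them into a single pass that steps one pixel at a time.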
imgDataSize_MinusOne =imgDataSize-1
; make sure we only compute size once
pixelDataSize = pixelData.Length
; Probably should check the target buffer is big enough for the data, but we won't here
For n = startByte To imgDataSize_MinusOne step 4
; compute dest offset (assumes integer division)
i = (n - startByte) / 4
; unroll the read to grab the 4 channel bytes of this pixel
pixelData[i].red = GetMemblockByte(imgMem, n)
pixelData[i].green = GetMemblockByte(imgMem, n+1)
pixelData[i].blue = GetMemblockByte(imgMem, n+2)
pixelData[i].alpha = GetMemblockByte(imgMem, n+3)
Next
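Since none of the BASIC here has been run, here's a rough Python model of the same idea that can at least be checked in isolation. It's only a sketch: the bytearray stands in for the memblock, get_memblock_byte is a stand-in for GetMemblockByte, and the 12 byte header (playing the role of startByte) is an assumption for the example, not something taken from your code.

```python
HEADER_BYTES = 12  # assumed header size, plays the role of startByte


def get_memblock_byte(mem, offset):
    """Stand-in for GetMemblockByte: return the byte at the given offset."""
    return mem[offset]


def two_pass(mem, start_byte):
    """Original shape: copy every byte out, then regroup the bytes into pixels."""
    steps = 0
    pixel_array = []
    for n in range(start_byte, len(mem)):
        pixel_array.append(get_memblock_byte(mem, n))
        steps += 1
    pixels = []
    for i in range(len(pixel_array) // 4):
        offset = i * 4  # computed once per pixel
        pixels.append(tuple(pixel_array[offset:offset + 4]))
        steps += 1
    return pixels, steps


def merged(mem, start_byte):
    """Merged shape: one pass, stepping four bytes (one pixel) at a time."""
    steps = 0
    pixels = []
    for n in range(start_byte, len(mem), 4):
        pixels.append((
            get_memblock_byte(mem, n),      # red
            get_memblock_byte(mem, n + 1),  # green
            get_memblock_byte(mem, n + 2),  # blue
            get_memblock_byte(mem, n + 3),  # alpha
        ))
        steps += 1
    return pixels, steps


def make_test_memblock(width, height):
    """Build a fake memblock: zeroed header followed by a repeating byte pattern."""
    body = bytes((k * 7) % 256 for k in range(width * height * 4))
    return bytearray(HEADER_BYTES) + bytearray(body)
```

Both functions return the same pixel list; the difference is only in the loop trips. The two pass version makes one trip per byte plus one per pixel, the merged version makes one per pixel.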
If you assume a 1 to 1 ratio of runtime opcodes to user code (which is very unlikely, but easy to visualize), the cost per pixel is about, say, 15 operations compared to 28/29 in the original loop.
Even so, it's still not going to sing if you throw a big image at it. Although if you know the section that has changed, then you can selectively grab just that section from the memblock->array. So only when the entire image changed would you need to brute force the buffer.
You could just read the pixels as Integers and split up the RGBs, but that's something for you to do. The split cost might be heavier than a memblock peek, though... but that's where the fun is in this stuff.
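To give a feel for the "only grab the changed section" idea, here's a small Python sketch. Everything in it is an assumption made for illustration: refresh_region is a made-up helper, the pixels are kept as a flat row-major list, the memblock is modelled as a bytearray with a 12 byte header, and the changed area is a simple rectangle whose right and bottom edges are exclusive.

```python
BYTES_PER_PIXEL = 4
HEADER_BYTES = 12  # assumed header size, same role as startByte


def refresh_region(mem, pixels, img_width, x0, y0, x1, y1, start_byte=HEADER_BYTES):
    """Re-read only the pixels inside the rectangle [x0, x1) x [y0, y1).

    mem    : bytearray standing in for the memblock
    pixels : flat row-major list of [r, g, b, a] lists, updated in place
    Returns the number of pixels that were refreshed.
    """
    count = 0
    for y in range(y0, y1):
        # byte offset of the first changed pixel on this row, computed once per row
        row_base = start_byte + (y * img_width + x0) * BYTES_PER_PIXEL
        for x in range(x0, x1):
            n = row_base + (x - x0) * BYTES_PER_PIXEL
            pixels[y * img_width + x] = list(mem[n:n + BYTES_PER_PIXEL])
            count += 1
    return count
```

The work is then proportional to the size of the changed rectangle rather than the whole image. It doesn't do anything about the cost of getting the data out of video memory in the first place.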
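If you want a starting point for the integer version, the split looks roughly like this in Python. Treat it as a sketch with assumptions: get_memblock_int is a stand-in for whatever 32 bit read your runtime offers, the read is assumed to be little-endian, and the bytes are assumed to sit in memory in red, green, blue, alpha order as in the loops above.

```python
import struct


def get_memblock_int(mem, offset):
    """Stand-in for a 32-bit memblock read (assumed little-endian, unsigned)."""
    return struct.unpack_from("<I", mem, offset)[0]


def split_rgba(value):
    """Split one 32-bit pixel into channels, assuming R is the lowest byte."""
    red = value & 0xFF
    green = (value >> 8) & 0xFF
    blue = (value >> 16) & 0xFF
    alpha = (value >> 24) & 0xFF
    return red, green, blue, alpha
```

That trades four byte reads for one integer read plus three shifts and four masks per pixel, so whether it wins depends entirely on how expensive a memblock read is compared to the bit ops in your runtime.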
NOTE: All this is untested.