I've been looking into artificial intelligence as a project of mine and cooked this up. The code implements an AI capable of generalised learning: rather than being coded for a specific problem, it can teach itself to solve problems the programmer hasn't foreseen. Be warned, it is fairly processor and memory intensive and needs a bit of time to make decent progress (but you can watch it actively learning).
Set display mode 1280,800,32,0
sync rate 0
set window on
sync on
`SARSA variables
statesize=18
statenu=262144
actionnu=3
greed#=0.4
`lambda#=0.3125
lambda#=0.7
gamma#=1.0
learnrate#=0.1
base#=0.5
probtarg#=2
learning#=1
`Declare varables
total# as double float
`sub-routines
gosub init_physics
gosub init_agent
do
if spacekey()=0 then gosub Update_agent
gosub Draw
sync
loop
```````````````````
physics:
`PHYSICS
ink rgb(0,0,0),rgb(0,0,0)
box 0,0,1280,800
cart_vel#=cart_vel#+(time_step#*cart_acc#)
cart_pos#=cart_pos#+(time_step#*cart_vel#)
angle_vel#=angle_vel#+(time_step#*angle_acc#)
angle#=angle#+(time_step#*angle_vel#)
angle#=(3.141592654*wrapvalue(180*angle#/3.141592654)/180)
angle_acc#=(((g#*sin(180*angle#/3.141592654))+(cos(180*angle#/3.141592654)*(-force#-(mass_pole#*length#*angle_vel#*angle_vel#)))/(mass_pole#+mass_cart#)))/((length#*1.3333333)-(((mass_pole#*cos(180*angle#/3.141592654)*cos(180*angle#/3.141592654))/(mass_cart#+mass_pole#))))
cart_acc#=(force#+(mass_pole#*length#)*((angle_vel#*angle_vel#*sin(180*angle#/3.141592654))-(angle_acc#*cos(180*angle#/3.141592654))))/(mass_cart#+mass_pole#)
steps=steps+1
return
`````````````````
Draw:
`Draw
ink rgb(255,255,255),rgb(0,0,0)
line cart_pos#+640,500,cart_pos#+640+(300*sin(180*angle#/3.141592654)),500-(300*cos(180*angle#/3.141592654))
line 0,500,1280,500
ink rgb(255,0,0),rgb(0,0,0)
box 235,450,240,550
box 1035,450,1040,550
ink rgb(255,255,255),rgb(0,0,0)
for markers=1 to 7
line 240+markers*100,490,240+markers*100,510
NEXT markers
`Give a basic overview of performance and current environmental conditions
text 6,5,"cart position : +/- 400 :"
text 360,5,str$(cart_pos#)
text 6,40,"pole angle : <1.57 / > 4.71 radians :"
text 360,40,str$(angle#)
text 700,6,"input state :"
text 900,6,str$(state)
sum#=0
for a=1 to ti-1
sum#=sum#+times(a)
NEXT a
text 6,80,"average num of physics steps until failure : "
text 370,80,str$(sum#/(ti-1))
text 6,120,"previous num of physics steps until failure : "
text 370,120,str$(times(ti-1))
if spacekey()=1
for a=1 to ti
line ((a-1)*5)+700,300-times(a-1)*0.04,((a-1)*5)+700,300-times(a)*0.04
NEXT a
ink rgb(250,0,0),rgb(250,0,0)
for a=1 to target-1
line ((a-1)*5)+700,300-moving_aver(a-1)*0.1,(a*5)+700,300-moving_aver(a)*0.1
NEXT a
line 700,300,1100,300
line 700,300,700,100
for a=0 to 30
line 700+(a*20),300,700+(a*20),305
line 695,300-(a*20),700,300-(a*20)
next a
target=ceil(ti*0.04)
endif
return
``````````````````````
Update_agent:
`learnrate#=1/(ti^0.4)
`Check whether there has been a state change
if prev_state<>state
`If the state has changed, apply the force chosen by the agent's current action
if action_cur=1 then force#=1000.0
if action_cur=2 then force#=-1000.0
if action_cur=3 then force#=0.0
endif
gosub Physics
`Check to see there has been a state change
if prev_state<>state
`Assume state is non-terminal
rew#=0.0
target=ceil(ti*0.05)
if angle#>1.5708 and angle#<4.7124 or cart_pos#>400 or cart_pos#<-400 `or steps>400
`If terminal, set reward to one
` if steps<=400
rew#=1.0
` endif
`Record the time in seconds
times(ti)=steps
sum#=0
if ti>24
for a=ti-24 to ti
sum#=sum#+times(a)
next a
target=ceil(ti*0.04)
moving_aver(target)=sum#*0.04
endif
ti=ti+1
if ti=20000
ti=1
endif
`Next state = failure since outside bounds; the next state's expected reward must be one, hence exprearray(state,action_nex) ---> gamma#*1
delta_err#=rew#+gamma#-exprearray(prev_state,action_cur)
`update eleg, update softmax
for a=1 to statenu
for b=1 to actionnu
if sarsaelegib(a,b)>0.001
exprearray(a,b)=exprearray(a,b)+(learnrate#*delta_err#*sarsaelegib(a,b))
sarsaelegib(a,b)=gamma#*lambda#*sarsaelegib(a,b)
endif
next b
NEXT a
gosub init_physics
gosub clear_traces
new_episode=1
t#=timer()
endif
endif
`Observe state
`convert to binary, convert to num ---angle 2^4 for 2x, angle vel 2^4 for 4x, angle acc 2^4 for 2x, cart pos 2^4 /50, cart vel 2^4 /15, cart acc 2^3 /40
`Convert "analogue" values to binary input state
if angle#<1.5708 then newangle#=angle#*10.15
if angle#>4.7124 then newangle#=(angle#-3.13)*10.15
analogue=floor(newangle#)
for b=1 to 5
binarydata(b)=binary(analogue,5,b)
next b
analogue=floor((angle_vel#+2)*4)
if analogue<0 then analogue=0
if analogue>15 then analogue=15
for b=1 to 4
binarydata(b+5)=binary(analogue,4,b)
next b
analogue=floor((cart_pos#+400)*0.04)
if analogue<0 then analogue=0
if analogue>31 then analogue=31
for b=1 to 5
binarydata(b+9)=binary(analogue,5,b)
next b
analogue=floor((cart_vel#+120)*0.0667)
if analogue<0 then analogue=0
if analogue>15 then analogue=15
for b=1 to 4
binarydata(b+14)=binary(analogue,4,b)
next b
inputdata=0
`convert from binary
for b=0 to statesize-1
inputdata=inputdata+((2^b)*binarydata(b+1))
NEXT b
`given input state, choose action
prev_state=state+0
state=inputdata+0
if prev_state<>state
action_nex=rnd(actionnu-1)+1
if exprearray(state,1)<=exprearray(state,2)
if exprearray(state,1)<=exprearray(state,3)
action_nex=1
endif
endif
if exprearray(state,3)<=exprearray(state,1)
if exprearray(state,3)<=exprearray(state,2)
action_nex=3
endif
endif
`
if exprearray(state,2)<=exprearray(state,1)
if exprearray(state,2)<=exprearray(state,3)
action_nex=2
endif
endif
target=ceil(ti*0.04)
`text 500,120,str$(moving_aver(target))
probability#=(steps*probtarg#*2)/((moving_aver(target))*(moving_aver(target)+1))
if controlkey()=0 and rnd(10000)<probability#*10000 then action_nex=rnd(2)+1 : gosub clear_traces
if returnkey()=1 then sync rate 0
if shiftkey()=1 then sync rate 60
`lambda#=1.0-probabilty#
`action_nex=rnd(2)+1
`update
delta_err#=rew#+gamma#*exprearray(state,action_nex)-exprearray(prev_state,action_cur)
if new_episode=0
sarsaelegib(prev_state,action_cur)=1
else
new_episode=0
endif
`update eleg, update softmax
for a=1 to statenu
for b=1 to actionnu
if sarsaelegib(a,b)>0.0001
exprearray(a,b)=exprearray(a,b)+(learnrate#*delta_err#*sarsaelegib(a,b))
sarsaelegib(a,b)=gamma#*lambda#*sarsaelegib(a,b)
endif
next b
NEXT a
action_cur=action_nex+0.0
endif
return
init_physics:
`Initialize Physics
mass_cart#=2.0
mass_pole#=9.0
g#=9.81
length#=32.0
time_step#=0.02
if rnd(1)=1
angle#=0.06981+(rnd(18)*0.01745)
else
angle#=6.21337-(rnd(18)*0.01745)
endif
angle_vel#=0.0
angle_acc#=0.0
cart_pos#=0.0
cart_vel#=0.0
cart_acc#=0.0
force#=0.0
steps=0
return
`Clear eligibility traces
clear_traces:
for a=1 to statenu
for b=1 to actionnu
sarsaelegib(a,b)=0.0
NEXT b
next a
return
init_agent:
`Initialise SARSA variables and arrays
`Array to record the number of state changes before failure
dim times(20000) as double float
dim moving_aver(800) as double float
`Sarsa expected reward array
dim exprearray(statenu,actionnu) as double float
`Eligibility trace array
dim sarsaelegib(statenu,actionnu) as double float
total#=0.0
`Array for converting physics values to binary
dim binarydata(statesize) as boolean
action_cur=0
rew#=0.0
new_episode=1
for a=1 to statenu
for b=1 to actionnu
exprearray(a,b)=base#
sarsaelegib(a,b)=0.0
NEXT b
next a
`start timing episode
`episode number
ti=1
return
`Function for retrieving a single binary digit from converting an integer to a binary number
function binary(number,max_places,place)
current=number
digit=0
for a=0 to (max_places-1)
if current-(2^(max_places-1-a))=>0
current=current-(2^(max_places-1-a))
if a=(max_places-place)
digit=1
exit
Endif
ENDIF
NEXT a
ENDFUNCTION digit
controls:
Space key: shows the moving average graph
Hold control key: turns off AI exploration, improving short-term performance
Shift key: sets the frame rate to 60 frames per second for normal speed
Return key: shifts back to accelerated learning at the max frame rate
It's a bit messy, but if anyone is interested I can clean it up. It's essentially a cart and pole problem that tests an AI algorithm called SARSA(lambda). The objective is for the AI to balance the pole for as long as possible without it falling over or moving the pole's base (the cart) outside the bounds. The AI assigns fitness values to possible actions when in a given state, as dictated by its inputs, with more "fit" values being more likely to result in the agent transferring to a future state where it receives a reward.
This is where the algorithm's name comes in: SARSA stands for State Action Reward State Action. The fitness values of a past state-action pair are updated according to the product of an error term, an exponential time decay function (the eligibility trace) and the learning rate constant. The time decay constant is lambda. The error term is the difference between the fitness value of the next state-action pair and that of the current one, summed with any reward that may have been received in between states.
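For anyone who prefers it spelled out, here is a minimal sketch of that update in Python rather than DarkBASIC (the function name and the dictionary-based Q/E tables are my own illustration, standing in for exprearray and sarsaelegib):

```python
# A minimal SARSA(lambda) update sketch; Q holds fitness values per
# (state, action) pair and E holds the eligibility traces.
from collections import defaultdict

GAMMA = 1.0    # discount factor (gamma# in the code)
LAMBDA = 0.7   # trace decay constant (lambda# in the code)
ALPHA = 0.1    # learning rate (learnrate# in the code)

def sarsa_lambda_update(Q, E, s, a, reward, s_next, a_next):
    """One SARSA(lambda) step: compute the TD error, then sweep the traces."""
    # Error term: next pair's value minus the current pair's, plus any reward.
    delta = reward + GAMMA * Q[(s_next, a_next)] - Q[(s, a)]
    E[(s, a)] = 1.0  # replacing trace for the pair just visited
    for key in list(E):
        Q[key] += ALPHA * delta * E[key]  # credit past pairs by their trace
        E[key] *= GAMMA * LAMBDA          # exponential decay, as in the code
    return delta
```

The sweep over all traced pairs is what lets a single error update every recently visited state-action pair at once, with older pairs getting exponentially less credit.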
The next action is chosen using a policy that attempts to balance exploration and exploitation. More exploration gives the agent better knowledge of the environment at the expense of poor short-term performance, but better long-term performance; exploitation gives good short-term performance. The probability of choosing a random action rather than one with a good fitness value (i.e. exploring) increases with the length of the current test run compared to the moving average.
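This exploration rule can be sketched like so (my reading of the probability line in the code; the function names are illustrative, not from the listing):

```python
# Sketch of the exploration policy: the longer the current run lasts relative
# to the moving average, the more likely a random action becomes.
import random

PROB_TARGET = 2.0  # probtarg# in the code

def explore_probability(steps_this_run, moving_average):
    """Chance of acting randomly, growing as the run outlasts the average."""
    return (steps_this_run * PROB_TARGET * 2.0) / (moving_average * (moving_average + 1.0))

def choose_action(greedy_action, n_actions, steps_this_run, moving_average, rng=random):
    """Pick a random action with the above probability, else the greedy one."""
    if rng.random() < explore_probability(steps_this_run, moving_average):
        return rng.randrange(1, n_actions + 1)  # explore
    return greedy_action                         # exploit
```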
Problems:
- Tabular state-based learning algorithm, so it performs poorly with very large state spaces; each state must be tested and its fitness recorded, putting strain on memory.
- In an attempt to reduce the state space I used rather coarse inputs (the inputs passed to the AI must be in binary, so some rounding must be done). This speeds up learning but may reduce the AI's consistency at solving the problem, or may prevent it from solving it altogether (I have not managed to get the AI to produce a perfect solution to the pole balance problem yet).
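The coarse binary inputs boil down to clamping each continuous value to a range and bucketing it; a sketch of that idea in Python (function names and the example ranges are mine, not from the listing):

```python
# Illustrative coarse discretisation: clamp a continuous value to a range,
# bucket it into 2**bits levels, then expand the bucket into binary digits.
def discretise(value, low, high, bits):
    """Clamp value to [low, high] and map it onto one of 2**bits buckets."""
    buckets = 2 ** bits
    scaled = int((value - low) * buckets / (high - low))
    return max(0, min(buckets - 1, scaled))

def to_bits(bucket, bits):
    """Expand a bucket index into its binary digits, most significant first."""
    return [(bucket >> (bits - 1 - i)) & 1 for i in range(bits)]
```

With 5 bits per input, a cart position anywhere in a 25-unit stretch lands in the same bucket, which is exactly the rounding that shrinks the state space but costs precision.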
I may attempt implementing a neural network using SARSA and backpropagation to reduce the state space size and allow the AI to make inferences, removing the need to visit every state required to solve the problem. Any comments are welcome.