I've been looking into artificial intelligence as a project of mine and cooked this up. The code implements an AI capable of generalised learning: rather than being coded for a specific problem, it can teach itself to solve problems the programmer hasn't foreseen. Be warned, it is fairly processor and memory intensive and needs a bit of time to make decent progress (but you can watch it actively learning).
Set display mode 1280,800,32,0
sync rate 0
set window on
sync on
`SARSA variables
statesize=18
statenu=262144
actionnu=3
greed#=0.4
`lambda#=0.3125
lambda#=0.7
gamma#=1.0
learnrate#=0.1
base#=0.5
probtarg#=2
learning#=1
`Declare varables
total# as double float
`sub-routines
gosub init_physics
gosub init_agent
do
if spacekey()=0 then gosub Update_agent
gosub Draw
sync
loop
```````````````````
physics:
`PHYSICS
ink rgb(0,0,0),rgb(0,0,0)
box 0,0,1280,800
cart_vel#=cart_vel#+(time_step#*cart_acc#)
cart_pos#=cart_pos#+(time_step#*cart_vel#)
angle_vel#=angle_vel#+(time_step#*angle_acc#)
angle#=angle#+(time_step#*angle_vel#)
angle#=(3.141592654*wrapvalue(180*angle#/3.141592654)/180)
angle_acc#=(((g#*sin(180*angle#/3.141592654))+(cos(180*angle#/3.141592654)*(-force#-(mass_pole#*length#*angle_vel#*angle_vel#)))/(mass_pole#+mass_cart#)))/((length#*1.3333333)-(((mass_pole#*cos(180*angle#/3.141592654)*cos(180*angle#/3.141592654))/(mass_cart#+mass_pole#))))
cart_acc#=(force#+(mass_pole#*length#)*((angle_vel#*angle_vel#*sin(180*angle#/3.141592654))-(angle_acc#*cos(180*angle#/3.141592654))))/(mass_cart#+mass_pole#)
steps=steps+1
return
`````````````````
Draw:
`Draw
ink rgb(255,255,255),rgb(0,0,0)
line cart_pos#+640,500,cart_pos#+640+(300*sin(180*angle#/3.141592654)),500-(300*cos(180*angle#/3.141592654))
line 0,500,1280,500
ink rgb(255,0,0),rgb(0,0,0)
box 235,450,240,550
box 1035,450,1040,550
ink rgb(255,255,255),rgb(0,0,0)
for markers=1 to 7
line 240+markers*100,490,240+markers*100,510
NEXT markers
`Give a basic overview of performance and current environmental conditions
text 6,5,"cart position : +/- 400 :"
text 360,5,str$(cart_pos#)
text 6,40,"pole angle : <1.57 / > 4.71 radians :"
text 360,40,str$(angle#)
text 700,6,"input state :"
text 900,6,str$(state)
sum#=0
for a=1 to ti-1
sum#=sum#+times(a)
NEXT a
text 6,80,"average num of physics steps until failure : "
text 370,80,str$(sum#/(ti-1))
text 6,120,"previous num of physics steps until failure : "
text 370,120,str$(times(ti-1))
if spacekey()=1
for a=1 to ti
line ((a-1)*5)+700,300-times(a-1)*0.04,((a-1)*5)+700,300-times(a)*0.04
NEXT a
ink rgb(250,0,0),rgb(250,0,0)
for a=1 to target-1
line ((a-1)*5)+700,300-moving_aver(a-1)*0.1,(a*5)+700,300-moving_aver(a)*0.1
NEXT a
line 700,300,1100,300
line 700,300,700,100
for a=0 to 30
line 700+(a*20),300,700+(a*20),305
line 695,300-(a*20),700,300-(a*20)
next a
target=ceil(ti*0.04)
endif
return
``````````````````````
Update_agent:
`learnrate#=1/(ti^0.4)
`Check whether there has been a state change
if prev_state<>state
`If the state has changed, apply the force chosen by the agent's current action
if action_cur=1 then force#=1000.0
if action_cur=2 then force#=-1000.0
if action_cur=3 then force#=0.0
endif
gosub Physics
`Check to see there has been a state change
if prev_state<>state
`Assume state is non-terminal
rew#=0.0
target=ceil(ti*0.05)
if angle#>1.5708 and angle#<4.7124 or cart_pos#>400 or cart_pos#<-400 `or steps>400
`If terminal, set reward to one
` if steps<=400
rew#=1.0
` endif
`Record the time in seconds
times(ti)=steps
sum#=0
if ti>24
for a=ti-24 to ti
sum#=sum#+times(a)
next a
target=ceil(ti*0.04)
moving_aver(target)=sum#*0.04
endif
ti=ti+1
if ti=20000
ti=1
endif
`Next state = failure since outside bounds; the next state's expected reward must be one, hence exprearray(state,action_nex) ---> gamma#*1
delta_err#=rew#+gamma#-exprearray(prev_state,action_cur)
`update eleg, update softmax
for a=1 to statenu
for b=1 to actionnu
if sarsaelegib(a,b)>0.001
exprearray(a,b)=exprearray(a,b)+(learnrate#*delta_err#*sarsaelegib(a,b))
sarsaelegib(a,b)=gamma#*lambda#*sarsaelegib(a,b)
endif
next b
NEXT a
gosub init_physics
gosub clear_traces
new_episode=1
t#=timer()
endif
endif
`Observe state
`convert to binary, convert to num ---angle 2^4 for 2x, angle vel 2^4 for 4x, angle acc 2^4 for 2x, cart pos 2^4 /50, cart vel 2^4 /15, cart acc 2^3 /40
`Convert "analogue" values to binary input state
if angle#<1.5708 then newangle#=angle#*10.15
if angle#>4.7124 then newangle#=(angle#-3.13)*10.15
analogue=floor(newangle#)
for b=1 to 5
binarydata(b)=binary(analogue,5,b)
next b
analogue=floor((angle_vel#+2)*4)
if analogue<0 then analogue=0
if analogue>15 then analogue=15
for b=1 to 4
binarydata(b+5)=binary(analogue,4,b)
next b
analogue=floor((cart_pos#+400)*0.04)
if analogue<0 then analogue=0
if analogue>31 then analogue=31
for b=1 to 5
binarydata(b+9)=binary(analogue,5,b)
next b
analogue=floor((cart_vel#+120)*0.0667)
if analogue<0 then analogue=0
if analogue>15 then analogue=15
for b=1 to 4
binarydata(b+14)=binary(analogue,4,b)
next b
inputdata=0
`convert from binary
for b=0 to statesize-1
inputdata=inputdata+((2^b)*binarydata(b+1))
NEXT b
`given input state, choose action
prev_state=state+0
state=inputdata+0
if prev_state<>state
action_nex=rnd(actionnu-1)+1
if exprearray(state,1)<=exprearray(state,2)
if exprearray(state,1)<=exprearray(state,3)
action_nex=1
endif
endif
if exprearray(state,3)<=exprearray(state,1)
if exprearray(state,3)<=exprearray(state,2)
action_nex=3
endif
endif
`
if exprearray(state,2)<=exprearray(state,1)
if exprearray(state,2)<=exprearray(state,3)
action_nex=2
endif
endif
target=ceil(ti*0.04)
`text 500,120,str$(moving_aver(target))
probability#=(steps*probtarg#*2)/((moving_aver(target))*(moving_aver(target)+1))
if controlkey()=0 and rnd(10000)<probability#*10000 then action_nex=rnd(2)+1 : gosub clear_traces
if returnkey()=1 then sync rate 0
if shiftkey()=1 then sync rate 60
`lambda#=1.0-probabilty#
`action_nex=rnd(2)+1
`update
delta_err#=rew#+gamma#*exprearray(state,action_nex)-exprearray(prev_state,action_cur)
if new_episode=0
sarsaelegib(prev_state,action_cur)=1
else
new_episode=0
endif
`update eleg, update softmax
for a=1 to statenu
for b=1 to actionnu
if sarsaelegib(a,b)>0.0001
exprearray(a,b)=exprearray(a,b)+(learnrate#*delta_err#*sarsaelegib(a,b))
sarsaelegib(a,b)=gamma#*lambda#*sarsaelegib(a,b)
endif
next b
NEXT a
action_cur=action_nex+0.0
endif
return
init_physics:
`Initialize Physics
mass_cart#=2.0
mass_pole#=9.0
g#=9.81
length#=32.0
time_step#=0.02
if rnd(1)=1
angle#=0.06981+(rnd(18)*0.01745)
else
angle#=6.21337-(rnd(18)*0.01745)
endif
angle_vel#=0.0
angle_acc#=0.0
cart_pos#=0.0
cart_vel#=0.0
cart_acc#=0.0
force#=0.0
steps=0
return
`Clear eligibility traces
clear_traces:
for a=1 to statenu
for b=1 to actionnu
sarsaelegib(a,b)=0.0
NEXT b
next a
return
init_agent:
`Initialise SARSA variables and arrays
`Array to record the number of state changes before failure
dim times(20000) as double float
dim moving_aver(800) as double float
`Sarsa expected reward array
dim exprearray(statenu,actionnu) as double float
`Eligibility trace array
dim sarsaelegib(statenu,actionnu) as double float
total#=0.0
`Array for converting physics values to binary
dim binarydata(statesize) as boolean
action_cur=0
rew#=0.0
new_episode=1
for a=1 to statenu
for b=1 to actionnu
exprearray(a,b)=base#
sarsaelegib(a,b)=0.0
NEXT b
next a
`start timing episode
`episode number
ti=1
return
`Function for retrieving a single binary digit from converting an integer to a binary number
function binary(number,max_places,place)
current=number
digit=0
for a=0 to (max_places-1)
if current-(2^(max_places-1-a))=>0
current=current-(2^(max_places-1-a))
if a=(max_places-place)
digit=1
exit
Endif
ENDIF
NEXT a
ENDFUNCTION digit
controls:
Space key: shows the moving average graph
Hold control key: turns off AI exploration, improving short-term performance
Shift key: sets the frame rate to 60 frames per second for normal speed
Return key: shifts back to accelerated learning at the max frame rate
It's a bit messy, but if anyone is interested I can clean it up. It's essentially a cart and pole problem that tests an AI algorithm called SARSA(lambda). The objective is for the AI to balance the pole for as long as possible without it falling over or moving the pole's base (the cart) outside the bounds. The AI assigns fitness values to possible actions when in a given state, as dictated by its inputs, with more "fit" values being more likely to result in the agent transferring to a future state where it receives a reward.
This is where the algorithm's name comes in: SARSA stands for State Action Reward State Action. The fitness values of a past state-action pair are updated according to the product of an error term, an exponential time decay function (the eligibility trace) and the learning rate constant. The time decay constant is lambda. The error term is the difference between the fitness value of the next state-action pair and that of the current one, summed with any reward that may have been received in between states.
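For anyone who prefers it spelled out, here is a minimal sketch of that update in Python rather than DarkBASIC (the function name and the dictionary-based Q/E tables are my own illustration, standing in for exprearray and sarsaelegib):

```python
# A minimal SARSA(lambda) update sketch; Q holds fitness values per
# (state, action) pair and E holds the eligibility traces.
from collections import defaultdict

GAMMA = 1.0    # discount factor (gamma# in the code)
LAMBDA = 0.7   # trace decay constant (lambda# in the code)
ALPHA = 0.1    # learning rate (learnrate# in the code)

def sarsa_lambda_update(Q, E, s, a, reward, s_next, a_next):
    """One SARSA(lambda) step: compute the TD error, then sweep the traces."""
    # Error term: next pair's value minus the current pair's, plus any reward.
    delta = reward + GAMMA * Q[(s_next, a_next)] - Q[(s, a)]
    E[(s, a)] = 1.0  # replacing trace for the pair just visited
    for key in list(E):
        Q[key] += ALPHA * delta * E[key]  # credit past pairs by their trace
        E[key] *= GAMMA * LAMBDA          # exponential decay, as in the code
    return delta
```

The sweep over all traced pairs is what lets a single error update every recently visited state-action pair at once, with older pairs getting exponentially less credit.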
The next action is chosen using a policy that attempts to balance exploration and exploitation. More exploration gives the agent better knowledge of the environment at the expense of poor short-term performance, but better long-term performance; exploitation gives good short-term performance. The probability of choosing a random action rather than one with a good fitness value (i.e. exploring) increases with the length of the current test run compared to the moving average.
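This exploration rule can be sketched like so (my reading of the probability line in the code; the function names are illustrative, not from the listing):

```python
# Sketch of the exploration policy: the longer the current run lasts relative
# to the moving average, the more likely a random action becomes.
import random

PROB_TARGET = 2.0  # probtarg# in the code

def explore_probability(steps_this_run, moving_average):
    """Chance of acting randomly, growing as the run outlasts the average."""
    return (steps_this_run * PROB_TARGET * 2.0) / (moving_average * (moving_average + 1.0))

def choose_action(greedy_action, n_actions, steps_this_run, moving_average, rng=random):
    """Pick a random action with the above probability, else the greedy one."""
    if rng.random() < explore_probability(steps_this_run, moving_average):
        return rng.randrange(1, n_actions + 1)  # explore
    return greedy_action                         # exploit
```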
Problems:
- Tabular state-based learning algorithm, so it performs poorly with very large state spaces; each state must be tested and its fitness recorded, putting strain on memory.
- In an attempt to reduce the state space I used rather coarse inputs (the inputs passed to the AI must be in binary, so some rounding must be done). This speeds up learning but may reduce the AI's consistency at solving the problem, or may prevent it from solving it altogether (I have not managed to get the AI to produce a perfect solution to the pole balance problem yet).
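The coarse binary inputs boil down to clamping each continuous value to a range and bucketing it; a sketch of that idea in Python (function names and the example ranges are mine, not from the listing):

```python
# Illustrative coarse discretisation: clamp a continuous value to a range,
# bucket it into 2**bits levels, then expand the bucket into binary digits.
def discretise(value, low, high, bits):
    """Clamp value to [low, high] and map it onto one of 2**bits buckets."""
    buckets = 2 ** bits
    scaled = int((value - low) * buckets / (high - low))
    return max(0, min(buckets - 1, scaled))

def to_bits(bucket, bits):
    """Expand a bucket index into its binary digits, most significant first."""
    return [(bucket >> (bits - 1 - i)) & 1 for i in range(bits)]
```

With 5 bits per input, a cart position anywhere in a 25-unit stretch lands in the same bucket, which is exactly the rounding that shrinks the state space but costs precision.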
I may attempt implementing a neural network using SARSA and backpropagation to reduce the state space size and allow the AI to make inferences, removing the need to visit every state required to solve the problem. Any comments are welcome.