The problem is that the castle is one giant object. For each ellipsoid being tested against it, the collision system has to scan across every one of those 12,000 polys.
So, looking at the numbers:
1 ellip : 12,000 scans (per frame)
2 ellips: 24,000 scans (per frame)
3 ellips: 36,000 scans (per frame)
To increase the performance of the system, divide your castle into chunks. If each chunk were 500 polygons or fewer, it'd drive the performance way up. With each ellipsoid in range of just one chunk, our new scan list would look similar to this:
1 ellip : 500 scans (per frame)
2 ellips: 1,000 scans (per frame)
3 ellips: 1,500 scans (per frame)
In short, that's 1,500-3,000 scans (per frame), depending on how many chunks each ellipsoid is in range of, compared to 36,000 scans (per frame) for three ellipsoids. That makes an enormous difference in speed.
The collision system skips any chunk that is out of an ellipsoid's range, which is why you get the massive performance increase shown above.
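To make the arithmetic concrete, here is a rough Python sketch of the idea, not the collision system's actual code. It assumes a simple bounding-sphere range test, and the names `Chunk`, `Ellipsoid`, `in_range` and `polys_scanned` are made up for illustration. The castle layout (24 chunks of 500 polys in a row) is also just an assumption to get to 12,000 polys.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    center: tuple      # centre of the chunk's bounding sphere (hypothetical)
    radius: float      # radius of that bounding sphere
    poly_count: int    # polys that must be scanned if the chunk is in range


@dataclass
class Ellipsoid:
    center: tuple
    radius: float      # largest ellipsoid radius, used as a bounding sphere


def in_range(ellipsoid, chunk):
    """Cheap reject test: do the two bounding spheres overlap?"""
    gap = math.dist(ellipsoid.center, chunk.center)
    return gap <= ellipsoid.radius + chunk.radius


def polys_scanned(ellipsoids, chunks):
    """Count polys scanned in one frame; out-of-range chunks are skipped."""
    total = 0
    for e in ellipsoids:
        for c in chunks:
            if in_range(e, c):
                total += c.poly_count
    return total


# The castle as one giant 12,000-poly object.
whole_castle = [Chunk((1150.0, 0.0, 0.0), 1300.0, 12000)]

# The same castle split into 24 chunks of 500 polys, 100 units apart.
chunked_castle = [Chunk((i * 100.0, 0.0, 0.0), 40.0, 500) for i in range(24)]

# Three ellipsoids, each close to exactly one chunk.
near_one = [Ellipsoid((x, 0.0, 0.0), 5.0) for x in (0.0, 500.0, 1000.0)]

# Three ellipsoids, each straddling the gap between two chunks.
straddling = [Ellipsoid((x, 0.0, 0.0), 12.0) for x in (50.0, 550.0, 1050.0)]

print(polys_scanned(near_one, whole_castle))      # 36000
print(polys_scanned(near_one, chunked_castle))    # 1500
print(polys_scanned(straddling, chunked_castle))  # 3000
```

The range test itself still runs once per chunk, but it is a single distance check rather than a scan over hundreds of polys, so the cost is small next to what it saves.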