Improve allocation cost for clang-cl

We need 40 seconds to compile Node.cpp in blink on native Windows
(Z620). However, it needed more than 4 minutes on wine + clang-cl on
Linux (Z620). With this improvement, compile of Node.cpp is just 40
seconds on wine + clang-cl, which is as fast as on native Windows.

We have 2 improvement points.

(1) Fix wine performance bug.
In the existing wine implementation sometimes search freelist from
one smaller class because of wrong memory size calculation. Arena in
such class does not fulfill the required size, so it always fail until
moving to next size class. Search unnecessary freelist made compile
really slow.

This improves clang-cl compile speed by 4x (4m -> 1m).

(2) Use fine grained free list class.
clang-cl allocates 200~500 bytes memory quite a lot, and often allocates
around 2000 bytes memory. So, for less than 512 bytes memory, we
introduce 16B grained freelist. For less than 4K bytes memory, we use
256B grained freelist.
Also remove loop from get_freelist_index() function to make it faster.
It's called a lot.

This improves clang-cl speed more (1m -> 40s).

BUG=b/34110848

Change-Id: I467140dc71185d7179ffb09426b39056644b3309
diff --git a/dlls/ntdll/heap.c b/dlls/ntdll/heap.c
index a908f19..9cd5c5c 100644
--- a/dlls/ntdll/heap.c
+++ b/dlls/ntdll/heap.c
@@ -114,9 +114,27 @@
     ((flags & HEAP_TAIL_CHECKING_ENABLED) || RUNNING_ON_VALGRIND ? ALIGNMENT : 0)
 
 /* Max size of the blocks on the free lists */
+/* NOTE. clang-cl allocates 200~500 bytes memory a lot. So, using fine
+   grained freelist improves clang-cl performance a lot (e.g. 60s -> 40s
+   to compile Node.cpp in blink on Z620.
+
+   heap freelist consists of 3 types of classes.
+   1. small (<= 512B). 16B each. 16B, 32B, 48B, ..., 512B.
+   2. large (<= 4096B). 256B each. 512B, 768B, ..., 4096B.
+   3. huge  (> 4K).
+ */
 static const SIZE_T HEAP_freeListSizes[] =
 {
-    0x10, 0x20, 0x30, 0x40, 0x60, 0x80, 0x100, 0x200, 0x400, 0x1000, ~0UL
+    /* small classes; 32 classes */
+    0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80,
+    0x90, 0xA0, 0xB0, 0xC0, 0xD0, 0xE0, 0xF0, 0x100,
+    0x110, 0x120, 0x130, 0x140, 0x150, 0x160, 0x170, 0x180,
+    0x190, 0x1A0, 0x1B0, 0x1C0, 0x1D0, 0x1E0, 0x1F0, 0x200,
+    /* large classes; 14 classes */
+    0x300, 0x400, 0x500, 0x600, 0x700, 0x800, 0x900,
+    0xA00, 0xB00, 0xC00, 0xD00, 0xE00, 0xF00, 0x1000,
+    /* huge class; 1 class */
+    ~0UL
 };
 #define HEAP_NB_FREE_LISTS  (sizeof(HEAP_freeListSizes)/sizeof(HEAP_freeListSizes[0]))
 
@@ -300,11 +318,21 @@
 /* size is the size of the whole block including the arena header */
 static inline unsigned int get_freelist_index( SIZE_T size )
 {
-    unsigned int i;
-
-    size -= sizeof(ARENA_FREE);
-    for (i = 0; i < HEAP_NB_FREE_LISTS - 1; i++) if (size <= HEAP_freeListSizes[i]) break;
-    return i;
+    /* Note: this index is based on HEAP_freeListSizes. */
+    if (size == 0) {
+        return 0;
+    }
+    /* For small class, 16B each. */
+    if (size <= 0x200) {
+        return (size - 1) / 0x10;
+    }
+    /* For large class, 256B each */
+    if (size <= 0x1000) {
+        return (size - 1) / 0x100 + 30;
+    }
+    /* For huge class. Only 1 index. */
+    /* HEAP_NB_FREE_LISTS should be 47 now. */
+    return HEAP_NB_FREE_LISTS - 1;
 }
 
 /* get the memory protection type to use for a given heap */
@@ -466,7 +494,7 @@
  */
 static inline void HEAP_InsertFreeBlock( HEAP *heap, ARENA_FREE *pArena, BOOL last )
 {
-    FREE_LIST_ENTRY *pEntry = heap->freeList + get_freelist_index( pArena->size + sizeof(*pArena) );
+    FREE_LIST_ENTRY *pEntry = heap->freeList + get_freelist_index( pArena->size );
     if (last)
     {
         /* insert at end of free list, i.e. before the next free list entry */
@@ -992,7 +1020,7 @@
     SUBHEAP *subheap;
     struct list *ptr;
     SIZE_T total_size;
-    FREE_LIST_ENTRY *pEntry = heap->freeList + get_freelist_index( size + sizeof(ARENA_INUSE) );
+    FREE_LIST_ENTRY *pEntry = heap->freeList + get_freelist_index( size );
 
     /* Find a suitable free list, and in it find a block large enough */
 
@@ -1001,7 +1029,8 @@
     {
         ARENA_FREE *pArena = LIST_ENTRY( ptr, ARENA_FREE, entry );
         SIZE_T arena_size = (pArena->size & ARENA_SIZE_MASK) +
-                            sizeof(ARENA_FREE) - sizeof(ARENA_INUSE);
+                             sizeof(ARENA_FREE) - sizeof(ARENA_INUSE);
+
         if (arena_size >= size)
         {
             subheap = HEAP_FindSubHeap( heap, pArena );