Browse Source

Merge commit '2fb95c0' into scrypt

Luke Dashjr 13 years ago
parent
commit
a208d3fbae
15 changed files with 386 additions and 71 deletions
  1. 9 0
      README
  2. 136 0
      SCRYPT-README
  3. 5 5
      configure.ac
  4. 2 2
      diablo120724.cl
  5. 2 2
      diakgcn120724.cl
  6. 86 6
      driver-opencl.c
  7. 5 0
      driver-opencl.h
  8. 3 3
      findnonce.h
  9. 19 4
      miner.c
  10. 12 0
      miner.h
  11. 91 30
      ocl.c
  12. 0 2
      ocl.h
  13. 2 2
      phatk120724.cl
  14. 2 2
      poclbm120724.cl
  15. 12 13
      scrypt120724.cl

+ 9 - 0
README

@@ -152,6 +152,7 @@ Options for both config file and command line:
 --scan-time|-s <arg> Upper bound on time spent scanning current work, in seconds (default: 60)
 --sched-start <arg> Set a time of day in HH:MM to start mining (a once off without a stop time)
 --sched-stop <arg>  Set a time of day in HH:MM to stop mining (will quit without a start time)
+--scrypt            Use the scrypt algorithm for mining (litecoin only)
 --sharelog <arg>    Append share log to file
 --shares <arg>      Quit after mining N shares (default: unlimited)
 --socks-proxy <arg> Set socks4 proxy (host:port)
@@ -198,6 +199,14 @@ GPU only options:
 --worksize|-w <arg> Override detected optimal worksize - one value or comma separated list
 
 
+SCRYPT only options:
+
+--lookup-gap <arg>  Set GPU lookup gap for scrypt mining, comma separated
+--thread-concurrency <arg> Set GPU thread concurrency for scrypt mining, comma separated
+
+See SCRYPT-README for more information regarding litecoin mining.
+
+
 FPGA mining boards(BitForce, Icarus, ModMiner, Ztex) only options:
 
 --scan-serial|-S <arg> Serial port to probe for FPGA mining device

+ 136 - 0
SCRYPT-README

@@ -0,0 +1,136 @@
+If you wish to donate to the author, Con Kolivas, in LTC, please submit your
+donations to:
+
+Lc8TWMiKM7gRUrG8VB8pPNP1Yvt1SGZnoH
+
+Otherwise, please donate in BTC as per the main README.
+
+---
+
+
+Scrypt mining, AKA litecoin mining, for GPU is completely different to sha256
+used for bitcoin mining. The algorithm was originally developed in a manner
+that it was anticipated would make it suitable for mining on CPU but NOT GPU.
+Thanks to some innovative work by Artforz and mtrlt, this was proven to be
+wrong. However, it has very different requirements to bitcoin mining and is a
+lot more complicated to get working well. Note that it is a ram dependent
+workload, and requires you to have enough system ram as well as fast enough
+GPU ram.
+
+There are 5 main parameters to tuning scrypt, 2 of which you MUST set, and
+the others are optional for further fine tuning. When you start scrypt mining
+with the --scrypt option, cgminer will fail IN RANDOM WAYS. They are all due
+to parameters being outside what the GPU can cope with. Not giving cgminer a
+hint as to your GPU type, it will hardly ever perform well.
+
+
+Step 1 on linux:
+export GPU_MAX_ALLOC_PERCENT=100
+If you do not do this, you may find it impossible to scrypt mine. You may find
+a value of 40 is enough and increasing this further has little effect.
+
+export GPU_USE_SYNC_OBJECTS=1
+may help CPU usage a little as well.
+
+--shaders XXX
+
+is a new option where you tell cgminer how many shaders your GPU has. This
+helps cgminer try to choose some meaningful baseline parameters. Use this table
+below to determine how many shaders your GPU has, and note that there are some
+variants of these cards, and nvidia shaders are much much lower and virtually
+pointless trying to mine on.
+
+GPU  Shaders
+7750 512
+7770 640
+7850 1024
+7870 1280
+7950 1792
+7970 2048
+
+6850 960
+6870 1120
+6950 1408
+6970 1536
+6990 (6970x2)
+
+6570 480
+6670 480
+6790 800
+
+6450 160
+
+5670 400
+5750 720
+5770 800
+5830 1120
+5850 1440
+5870 1600
+5970 (5870x2)
+
+These are only used as a rough guide for cgminer, and it is rare that this is
+all you will need to set.
+
+
+--intensity XX
+
+Just like in bitcoin mining, scrypt mining takes an intensity, however the
+scale goes from 0 to 20 to mimic the "Aggression" used in mtrlt's reaper. The
+reason this is crucial is that too high an intensity can actually be
+disastrous with scrypt because it CAN run out of ram. Intensities over 13
+start writing over the same ram and it is highly dependent on the GPU, but they
+can start actually DECREASING your hashrate, or even worse, start producing
+garbage with rejects skyrocketing.
+
+
+Optional parameters to tune:
+-g, --thread-concurrency, --lookup-gap
+
+-g:
+Once you have found the optimal shaders and intensity, you can start increasing
+the -g value till cgminer fails to start. Rarely will you be able to go over
+about -g 4 and each increase in -g only increases hashrate slightly.
+
+--thread-concurrency:
+This tunes the optimal size of work that scrypt can do. It is internally tuned
+by cgminer to be the highest reasonable multiple of shaders that it can
+allocate on your GPU. Ideally it should be a multiple of your shader count.
+vliw5 architecture (R5XXX) would be best at 5x shaders, while VLIW4 (R6xxx and
+R7xxx) are best at 4x. Setting thread concurrency overrides anything you put
+into --shaders.
+
+--lookup-gap
+This tunes a compromise between ram usage and performance. Performance peaks
+at a gap of 2, but increasing the gap can save you some GPU ram, but almost
+always at the cost of significant loss of hashrate. Setting lookup gap
+overrides the default of 2, but cgminer will use the --shaders value to choose
+a thread-concurrency if you haven't chosen one.
+
+
+Overclocking for scrypt mining:
+First of all, do not underclock your memory initially. Scrypt mining requires
+memory speed and on most, but not all, GPUs, lowering memory speed lowers
+mining performance.
+
+Second, absolute engine clock speeds do NOT correlate with hashrate. The ratio
+of engine clock speed to memory matters, so if you set your memory to the
+default value, and then start overclocking as you are running it, you should
+find a sweet spot where the hashrate peaks and then it might actually drop if
+you increase the engine clock speed further. Unless you wish to run with a
+dynamic intensity, do not go over 13 without testing it while it's running to
+see that it increases hashrate AND utility WITHOUT increasing your rejects.
+
+
+Suggested values for 7970 for example:
+export GPU_MAX_ALLOC_PERCENT=100
+--shaders 2048 -g 5 --gpu-engine 1135 --gpu-memclock 1375
+
+
+---
+
+If you wish to donate to the author, Con Kolivas, in LTC, please submit your
+donations to:
+
+Lc8TWMiKM7gRUrG8VB8pPNP1Yvt1SGZnoH
+
+Otherwise, please donate in BTC as per the main README.

+ 5 - 5
configure.ac

@@ -355,11 +355,11 @@ fi
 
 AC_DEFINE_UNQUOTED([CGMINER_PREFIX], ["$prefix/bin"], [Path to bfgminer install])
 
-AC_DEFINE_UNQUOTED([PHATK_KERNNAME], ["phatk120223"], [Filename for phatk kernel])
-AC_DEFINE_UNQUOTED([POCLBM_KERNNAME], ["poclbm120327"], [Filename for poclbm kernel])
-AC_DEFINE_UNQUOTED([DIAKGCN_KERNNAME], ["diakgcn120427"], [Filename for diakgcn kernel])
-AC_DEFINE_UNQUOTED([DIABLO_KERNNAME], ["diablo120328"], [Filename for diablo kernel])
-AC_DEFINE_UNQUOTED([SCRYPT_KERNNAME], ["scrypt120713"], [Filename for scrypt kernel])
+AC_DEFINE_UNQUOTED([PHATK_KERNNAME], ["phatk120724"], [Filename for phatk kernel])
+AC_DEFINE_UNQUOTED([POCLBM_KERNNAME], ["poclbm120724"], [Filename for poclbm kernel])
+AC_DEFINE_UNQUOTED([DIAKGCN_KERNNAME], ["diakgcn120724"], [Filename for diakgcn kernel])
+AC_DEFINE_UNQUOTED([DIABLO_KERNNAME], ["diablo120724"], [Filename for diablo kernel])
+AC_DEFINE_UNQUOTED([SCRYPT_KERNNAME], ["scrypt120724"], [Filename for scrypt kernel])
 
 
 AC_SUBST(JANSSON_LIBS)

+ 2 - 2
diablo120328.cl → diablo120724.cl

@@ -1242,8 +1242,8 @@ void search(
     
     ZA[924] = (ZCh(ZA[922], ZA[920], ZA[918]) + ZA[923]) + ZR26(ZA[922]);
     
-#define FOUND (0x80)
-#define NFLAG (0x7F)
+#define FOUND (0x800)
+#define NFLAG (0x7FF)
 
 #if defined(VECTORS4)
 	bool result = any(ZA[924] == 0x136032EDU);

+ 2 - 2
diakgcn120427.cl → diakgcn120724.cl

@@ -571,8 +571,8 @@ __kernel
 
 	V[7] += V[3] + W[12] + ch(V[0], V[1], V[2]) + rotr26(V[0]);
 
-#define FOUND (0x80)
-#define NFLAG (0x7F)
+#define FOUND (0x800)
+#define NFLAG (0x7FF)
 
 #ifdef VECTORS4
 	if ((V[7].x == 0x136032edU) ^ (V[7].y == 0x136032edU) ^ (V[7].z == 0x136032edU) ^ (V[7].w == 0x136032edU))

+ 86 - 6
driver-opencl.c

@@ -344,6 +344,83 @@ char *set_worksize(char *arg)
 	return NULL;
 }
 
+#ifdef USE_SCRYPT
+char *set_shaders(char *arg)
+{
+	int i, val = 0, device = 0;
+	char *nextptr;
+
+	nextptr = strtok(arg, ",");
+	if (nextptr == NULL)
+		return "Invalid parameters for set lookup gap";
+	val = atoi(nextptr);
+
+	gpus[device++].shaders = val;
+
+	while ((nextptr = strtok(NULL, ",")) != NULL) {
+		val = atoi(nextptr);
+
+		gpus[device++].shaders = val;
+	}
+	if (device == 1) {
+		for (i = device; i < MAX_GPUDEVICES; i++)
+			gpus[i].shaders = gpus[0].shaders;
+	}
+
+	return NULL;
+}
+
+char *set_lookup_gap(char *arg)
+{
+	int i, val = 0, device = 0;
+	char *nextptr;
+
+	nextptr = strtok(arg, ",");
+	if (nextptr == NULL)
+		return "Invalid parameters for set lookup gap";
+	val = atoi(nextptr);
+
+	gpus[device++].lookup_gap = val;
+
+	while ((nextptr = strtok(NULL, ",")) != NULL) {
+		val = atoi(nextptr);
+
+		gpus[device++].lookup_gap = val;
+	}
+	if (device == 1) {
+		for (i = device; i < MAX_GPUDEVICES; i++)
+			gpus[i].lookup_gap = gpus[0].lookup_gap;
+	}
+
+	return NULL;
+}
+
+char *set_thread_concurrency(char *arg)
+{
+	int i, val = 0, device = 0;
+	char *nextptr;
+
+	nextptr = strtok(arg, ",");
+	if (nextptr == NULL)
+		return "Invalid parameters for set thread concurrency";
+	val = atoi(nextptr);
+
+	gpus[device++].thread_concurrency = val;
+
+	while ((nextptr = strtok(NULL, ",")) != NULL) {
+		val = atoi(nextptr);
+
+		gpus[device++].thread_concurrency = val;
+	}
+	if (device == 1) {
+		for (i = device; i < MAX_GPUDEVICES; i++)
+			gpus[i].thread_concurrency = gpus[0].thread_concurrency;
+	}
+
+	return NULL;
+}
+#endif
+
 static enum cl_kernels select_kernel(char *arg)
 {
 	if (!strcmp(arg, "diablo"))
@@ -1212,13 +1289,13 @@ static cl_int queue_diablo_kernel(_clState *clState, dev_blk_ctx *blk, cl_uint t
 #ifdef USE_SCRYPT
 static cl_int queue_scrypt_kernel(_clState *clState, dev_blk_ctx *blk, __maybe_unused cl_uint threads)
 {
-	char *midstate = blk->work->midstate;
+	unsigned char *midstate = blk->work->midstate;
 	cl_kernel *kernel = &clState->kernel;
 	unsigned int num = 0;
 	cl_uint le_target;
 	cl_int status = 0;
 
-	le_target = ~swab32((uint32_t)blk->work->target[7]);
+	le_target = *(cl_uint *)(blk->work->target + 28);
 	clState->cldata = blk->work->data;
 	status = clEnqueueWriteBuffer(clState->commandQueue, clState->CLbuffer0, true, 0, 80, clState->cldata, 0, NULL,NULL);
 
@@ -1237,7 +1314,12 @@ static void set_threads_hashes(unsigned int vectors, unsigned int *threads,
 			       int64_t *hashes, size_t *globalThreads,
 			       unsigned int minthreads, int intensity)
 {
-	*threads = 1 << (15 + intensity);
+	if (opt_scrypt) {
+		if (intensity < 0)
+			intensity = 0;
+		*threads = 1 << intensity;
+	} else
+		*threads = 1 << (15 + intensity);
 	if (*threads < minthreads)
 		*threads = minthreads;
 	*globalThreads = *threads;
@@ -1651,7 +1733,7 @@ static int64_t opencl_scanhash(struct thr_info *thr, struct work *work,
 	unsigned int threads;
 	int64_t hashes;
 
-	if (gpu->dynamic || opt_scrypt)
+	if (gpu->dynamic)
 		blocking = CL_TRUE;
 	else
 		blocking = CL_FALSE;
@@ -1676,8 +1758,6 @@ static int64_t opencl_scanhash(struct thr_info *thr, struct work *work,
 			if (gpu->gpu_us_average > dynamic_us) {
 				if (gpu->intensity > MIN_INTENSITY)
 					--gpu->intensity;
-				else
-					nmsleep(opt_dynamic_interval / 2 ? : 1);
 			} else if (gpu->gpu_us_average < dynamic_us / 2) {
 				if (gpu->intensity < MAX_INTENSITY)
 					++gpu->intensity;

+ 5 - 0
driver-opencl.h

@@ -18,6 +18,11 @@ extern char *set_temp_target(char *arg);
 extern char *set_intensity(char *arg);
 extern char *set_vector(char *arg);
 extern char *set_worksize(char *arg);
+#ifdef USE_SCRYPT
+extern char *set_shaders(char *arg);
+extern char *set_lookup_gap(char *arg);
+extern char *set_thread_concurrency(char *arg);
+#endif
 extern char *set_kernel(char *arg);
 void manage_gpu(void);
 extern void pause_dynamic_threads(int gpu);

+ 3 - 3
findnonce.h

@@ -4,10 +4,10 @@
 #include "config.h"
 
 #define MAXTHREADS (0xFFFFFFFEULL)
-#define MAXBUFFERS (0xFF)
+#define MAXBUFFERS (0xFFF)
 #define BUFFERSIZE (sizeof(uint32_t) * MAXBUFFERS)
-#define FOUND (0x80)
-/* #define NFLAG (0x7F) Just for reference */
+#define FOUND (0x800)
+/* #define NFLAG (0x7FF) Just for reference */
 
 #ifdef HAVE_OPENCL
 extern void precalc_hash(dev_blk_ctx *blk, uint32_t *state, uint32_t *data);

+ 19 - 4
miner.c

@@ -855,6 +855,11 @@ static struct opt_table opt_config_table[] = {
 	OPT_WITH_ARG("--gpu-vddc",
 		     set_gpu_vddc, NULL, NULL,
 		     "Set the GPU voltage in Volts - one value for all or separate by commas for per card"),
+#endif
+#ifdef USE_SCRYPT
+	OPT_WITH_ARG("--lookup-gap",
+		     set_lookup_gap, NULL, NULL,
+		     "Set GPU lookup gap for scrypt mining, comma separated"),
 #endif
 	OPT_WITH_ARG("--intensity|-I",
 		     set_intensity, NULL, NULL,
@@ -965,6 +970,9 @@ static struct opt_table opt_config_table[] = {
 	OPT_WITHOUT_ARG("--scrypt",
 			opt_set_bool, &opt_scrypt,
 			"Use the scrypt algorithm for mining (litecoin only)"),
+	OPT_WITH_ARG("--shaders",
+		     set_shaders, NULL, NULL,
+		     "GPU shaders per card for tuning scrypt, comma separated"),
 #endif
 	OPT_WITH_ARG("--sharelog",
 		     set_sharelog, NULL, NULL,
@@ -1007,6 +1015,11 @@ static struct opt_table opt_config_table[] = {
 			opt_hidden
 #endif
 	),
+#ifdef USE_SCRYPT
+	OPT_WITH_ARG("--thread-concurrency",
+		     set_thread_concurrency, NULL, NULL,
+		     "Set GPU thread concurrency for scrypt mining, comma separated"),
+#endif
 	OPT_WITH_ARG("--url|-o",
 		     set_url, NULL, NULL,
 		     "URL for bitcoin JSON-RPC server"),
@@ -4344,15 +4357,17 @@ bool hashtest(const struct work *work, bool checktarget)
 
 bool test_nonce(struct work *work, uint32_t nonce, bool checktarget)
 {
-	uint32_t *work_nonce = (uint32_t *)(work->data + 64 + 12);
-
 	if (opt_scrypt) {
+		uint32_t *work_nonce = (uint32_t *)(work->data + 64 + 12);
+
 		*work_nonce = nonce;
 		return true;
 	}
 
-	*work_nonce = htobe32(nonce);
-
+	work->data[64 + 12 + 0] = (nonce >> 0) & 0xff;
+	work->data[64 + 12 + 1] = (nonce >> 8) & 0xff;
+	work->data[64 + 12 + 2] = (nonce >> 16) & 0xff;
+	work->data[64 + 12 + 3] = (nonce >> 24) & 0xff;
 
 	return hashtest(work, checktarget);
 }

+ 12 - 0
miner.h

@@ -358,10 +358,17 @@ struct cgpu_info {
 	int virtual_adl;
 	int intensity;
 	bool dynamic;
+
 	cl_uint vwidth;
 	size_t work_size;
 	enum cl_kernels kernel;
+	cl_ulong max_alloc;
 
+#ifdef USE_SCRYPT
+	int lookup_gap;
+	int thread_concurrency;
+	int shaders;
+#endif
 	struct timeval tv_gpustart;;
 	struct timeval tv_gpuend;
 	double gpu_us_average;
@@ -613,8 +620,13 @@ extern void add_pool_details(bool live, char *url, char *user, char *pass);
 
 #define MIN_INTENSITY -10
 #define _MIN_INTENSITY_STR "-10"
+#ifdef USE_SCRYPT
+#define MAX_INTENSITY 20
+#define _MAX_INTENSITY_STR "20"
+#else
 #define MAX_INTENSITY 14
 #define _MAX_INTENSITY_STR "14"
+#endif
 
 extern struct list_head scan_devices;
 extern int nDevs;

+ 91 - 30
ocl.c

@@ -385,6 +385,7 @@ _clState *initCl(unsigned int gpu, char *name, size_t nameSize)
 {
 	_clState *clState = calloc(1, sizeof(_clState));
 	bool patchbfi = false, prog_built = false;
+	struct cgpu_info *cgpu = &gpus[gpu];
 	cl_platform_id platform = NULL;
 	char pbuff[256], vbuff[255];
 	cl_platform_id* platforms;
@@ -540,16 +541,25 @@ _clState *initCl(unsigned int gpu, char *name, size_t nameSize)
 	}
 	applog(LOG_DEBUG, "Max work group size reported %d", clState->max_work_size);
 
+	status = clGetDeviceInfo(devices[gpu], CL_DEVICE_MAX_MEM_ALLOC_SIZE , sizeof(cl_ulong), (void *)&cgpu->max_alloc, NULL);
+	if (status != CL_SUCCESS) {
+		applog(LOG_ERR, "Error %d: Failed to clGetDeviceInfo when trying to get CL_DEVICE_MAX_MEM_ALLOC_SIZE", status);
+		return NULL;
+	}
+	applog(LOG_DEBUG, "Max mem alloc size is %u", cgpu->max_alloc);
+
 	/* Create binary filename based on parameters passed to opencl
 	 * compiler to ensure we only load a binary that matches what would
 	 * have otherwise created. The filename is:
 	 * name + kernelname +/- g(offset) + v + vectors + w + work_size + l + sizeof(long) + .bin
+	 * For scrypt the filename is:
+	 * name + kernelname + g + lg + lookup_gap + tc + thread_concurrency + w + work_size + l + sizeof(long) + .bin
 	 */
 	char binaryfilename[255];
 	char filename[255];
 	char numbuf[10];
 
-	if (gpus[gpu].kernel == KL_NONE) {
+	if (cgpu->kernel == KL_NONE) {
 		if (opt_scrypt) {
 			applog(LOG_INFO, "Selecting scrypt kernel");
 			clState->chosen_kernel = KL_SCRYPT;
@@ -571,9 +581,9 @@ _clState *initCl(unsigned int gpu, char *name, size_t nameSize)
 			applog(LOG_INFO, "Selecting phatk kernel");
 			clState->chosen_kernel = KL_PHATK;
 		}
-		gpus[gpu].kernel = clState->chosen_kernel;
+		cgpu->kernel = clState->chosen_kernel;
 	} else {
-		clState->chosen_kernel = gpus[gpu].kernel;
+		clState->chosen_kernel = cgpu->kernel;
 		if (clState->chosen_kernel == KL_PHATK &&
 		    (strstr(vbuff, "844.4") || strstr(vbuff, "851.4") ||
 		     strstr(vbuff, "831.4") || strstr(vbuff, "898.1") ||
@@ -610,7 +620,7 @@ _clState *initCl(unsigned int gpu, char *name, size_t nameSize)
 			strcpy(filename, SCRYPT_KERNNAME".cl");
 			strcpy(binaryfilename, SCRYPT_KERNNAME);
 			/* Scrypt only supports vector 1 */
-			gpus[gpu].vwidth = 1;
+			cgpu->vwidth = 1;
 			break;
 		case KL_NONE: /* Shouldn't happen */
 		case KL_DIABLO:
@@ -619,24 +629,61 @@ _clState *initCl(unsigned int gpu, char *name, size_t nameSize)
 			break;
 	}
 
-	if (gpus[gpu].vwidth)
-		clState->vwidth = gpus[gpu].vwidth;
+	if (cgpu->vwidth)
+		clState->vwidth = cgpu->vwidth;
 	else {
 		clState->vwidth = preferred_vwidth;
-		gpus[gpu].vwidth = preferred_vwidth;
+		cgpu->vwidth = preferred_vwidth;
 	}
 
 	if (((clState->chosen_kernel == KL_POCLBM || clState->chosen_kernel == KL_DIABLO || clState->chosen_kernel == KL_DIAKGCN) &&
 		clState->vwidth == 1 && clState->hasOpenCL11plus) || opt_scrypt)
 			clState->goffset = true;
 
-	if (gpus[gpu].work_size && gpus[gpu].work_size <= clState->max_work_size)
-		clState->wsize = gpus[gpu].work_size;
+	if (cgpu->work_size && cgpu->work_size <= clState->max_work_size)
+		clState->wsize = cgpu->work_size;
 	else if (strstr(name, "Tahiti"))
 		clState->wsize = 64;
 	else
 		clState->wsize = (clState->max_work_size <= 256 ? clState->max_work_size : 256) / clState->vwidth;
-	gpus[gpu].work_size = clState->wsize;
+	cgpu->work_size = clState->wsize;
+
+#ifdef USE_SCRYPT
+	if (opt_scrypt) {
+		cl_ulong ma = cgpu->max_alloc, mt;
+		int pow2 = 0;
+
+		if (!cgpu->lookup_gap) {
+			applog(LOG_DEBUG, "GPU %d: selecting lookup gap of 2", gpu);
+			cgpu->lookup_gap = 2;
+		}
+		if (!cgpu->thread_concurrency) {
+			cgpu->thread_concurrency = ma / 32768 / cgpu->lookup_gap;
+			if (cgpu->shaders && cgpu->thread_concurrency > cgpu->shaders) {
+				cgpu->thread_concurrency -= cgpu->thread_concurrency % cgpu->shaders;
+				if (cgpu->thread_concurrency > cgpu->shaders * 5)
+					cgpu->thread_concurrency = cgpu->shaders * 5;
+			}
+				
+			applog(LOG_DEBUG, "GPU %d: selecting thread concurrency of %u",gpu,  cgpu->thread_concurrency);
+		}
+
+		/* If we have memory to spare, try to find a power of 2 value
+		 * >= required amount to map nicely to an intensity */
+		mt = cgpu->thread_concurrency * 32768 * cgpu->lookup_gap;
+		if (ma > mt) {
+			while (ma >>= 1)
+				pow2++;
+			ma = 1;
+			while (--pow2 && ma < mt)
+				ma <<= 1;
+			if (ma >= mt) {
+				cgpu->max_alloc = ma;
+				applog(LOG_DEBUG, "Max alloc decreased to %lu", cgpu->max_alloc);
+			}
+		}
+	}
+#endif
 
 	FILE *binaryfile;
 	size_t *binary_sizes;
@@ -662,24 +709,21 @@ _clState *initCl(unsigned int gpu, char *name, size_t nameSize)
 		return NULL;
 	}
 
-#ifdef USE_SCRYPT
-	if (opt_scrypt) {
-		clState->lookup_gap = 1;
-		clState->thread_concurrency = 6144;
-	}
-#endif
-
 	strcat(binaryfilename, name);
 	if (clState->goffset)
 		strcat(binaryfilename, "g");
-	strcat(binaryfilename, "v");
-	sprintf(numbuf, "%d", clState->vwidth);
-	strcat(binaryfilename, numbuf);
-	strcat(binaryfilename, "w");
-	sprintf(numbuf, "%d", (int)clState->wsize);
+	if (opt_scrypt) {
+#ifdef USE_SCRYPT
+		sprintf(numbuf, "lg%dtc%d", cgpu->lookup_gap, cgpu->thread_concurrency);
+		strcat(binaryfilename, numbuf);
+#endif
+	} else {
+		sprintf(numbuf, "v%d", clState->vwidth);
+		strcat(binaryfilename, numbuf);
+	}
+	sprintf(numbuf, "w%d", (int)clState->wsize);
 	strcat(binaryfilename, numbuf);
-	strcat(binaryfilename, "l");
-	sprintf(numbuf, "%d", (int)sizeof(long));
+	sprintf(numbuf, "l%d", (int)sizeof(long));
 	strcat(binaryfilename, numbuf);
 	strcat(binaryfilename, ".bin");
 
@@ -743,7 +787,7 @@ build:
 #ifdef USE_SCRYPT
 	if (opt_scrypt)
 		sprintf(CompilerOptions, "-D LOOKUP_GAP=%d -D CONCURRENT_THREADS=%d -D WORKSIZE=%d",
-			(int)clState->lookup_gap, (int)clState->thread_concurrency, (int)clState->wsize);
+			cgpu->lookup_gap, cgpu->thread_concurrency, (int)clState->wsize);
 	else
 #endif
 	{
@@ -930,12 +974,29 @@ built:
 
 #ifdef USE_SCRYPT
 	if (opt_scrypt) {
-		size_t ipt = (1024 / clState->lookup_gap + (1024 % clState->lookup_gap > 0));
-		size_t bufsize = 128 * ipt * clState->thread_concurrency;
-
-		clState->CLbuffer0 = clCreateBuffer(clState->context, CL_MEM_READ_ONLY, 80, NULL, &status);
-		clState->padbuffer8 = clCreateBuffer(clState->context, CL_MEM_READ_WRITE, bufsize, NULL, &status);
+		size_t ipt = (1024 / cgpu->lookup_gap + (1024 % cgpu->lookup_gap > 0));
+		size_t bufsize = 128 * ipt * cgpu->thread_concurrency;
+
+		/* Use the max alloc value which has been rounded to a power of
+		 * 2 greater >= required amount earlier */
+		if (bufsize > cgpu->max_alloc) {
+			applog(LOG_WARNING, "Maximum buffer memory device %d supports says %u, your scrypt settings come to %u",
+			       gpu, cgpu->max_alloc, bufsize);
+		} else
+			bufsize = cgpu->max_alloc;
+		applog(LOG_DEBUG, "Creating scrypt buffer sized %d", bufsize);
 		clState->padbufsize = bufsize;
+		clState->padbuffer8 = clCreateBuffer(clState->context, CL_MEM_READ_WRITE, bufsize, NULL, &status);
+		if (status != CL_SUCCESS) {
+			applog(LOG_ERR, "Error %d: clCreateBuffer (padbuffer8), decrease CT or increase LG", status);
+			return NULL;
+		}
+
+		clState->CLbuffer0 = clCreateBuffer(clState->context, CL_MEM_READ_ONLY, 128, NULL, &status);
+		if (status != CL_SUCCESS) {
+			applog(LOG_ERR, "Error %d: clCreateBuffer (CLbuffer0)", status);
+			return NULL;
+		}
 	}
 #endif
 	clState->outputBuffer = clCreateBuffer(clState->context, CL_MEM_WRITE_ONLY, BUFFERSIZE, NULL, &status);

+ 0 - 2
ocl.h

@@ -18,8 +18,6 @@ typedef struct {
 #ifdef USE_SCRYPT
 	cl_mem CLbuffer0;
 	cl_mem padbuffer8;
-	size_t lookup_gap;
-	size_t thread_concurrency;
 	size_t padbufsize;
 	void * cldata;
 #endif

+ 2 - 2
phatk120223.cl → phatk120724.cl

@@ -387,8 +387,8 @@ void search(	const uint state0, const uint state1, const uint state2, const uint
 	W[117] += W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(59+64)) + s1(64+59)+ ch(59+64),Vals[1],Vals[2]) -
 		(-(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(59+64))  + s1(64+59)+ ch(59+64)));
 
-#define FOUND (0x80)
-#define NFLAG (0x7F)
+#define FOUND (0x800)
+#define NFLAG (0x7FF)
 
 #ifdef VECTORS4
 	bool result = W[117].x & W[117].y & W[117].z & W[117].w;

+ 2 - 2
poclbm120327.cl → poclbm120724.cl

@@ -1311,8 +1311,8 @@ Vals[1]+=(rotr(W[9],17)^rotr(W[9],19)^(W[9]>>10U));
 Vals[1]+=K[59];
 Vals[1]+=Vals[5];
 
-#define FOUND (0x80)
-#define NFLAG (0x7F)
+#define FOUND (0x800)
+#define NFLAG (0x7FF)
 
 #if defined(VECTORS2) || defined(VECTORS4)
 	Vals[2]+=Ma(Vals[6],Vals[5],Vals[7]);

+ 12 - 13
scrypt120713.cl → scrypt120724.cl

@@ -2,10 +2,7 @@
 #define Ch(x,y,z) bitselect(z,y,x)
 #define Maj(x,y,z) Ch((x^z),y,z)
 
-uint4 EndianSwap4(uint4 n)
-{
-	return rotl(n&0x00FF00FF,24U)|rotl(n&0xFF00FF00,8U);
-}
+#define EndianSwap(n) (rotl(n&0x00FF00FF,24U)|rotl(n&0xFF00FF00,8U))
 
 #define Tr2(x)		(rotl(x, 30U) ^ rotl(x, 19U) ^ rotl(x, 10U))
 #define Tr1(x)		(rotl(x, 26U) ^ rotl(x, 21U) ^ rotl(x, 7U))
@@ -552,7 +549,7 @@ void shittify(uint4 B[8])
 	
 #pragma unroll
 	for(uint i=0; i<4; ++i)
-		B[i] = EndianSwap4(tmp[i]); 
+		B[i] = EndianSwap(tmp[i]);
 
 	tmp[0] = (uint4)(B[5].x,B[6].y,B[7].z,B[4].w);
 	tmp[1] = (uint4)(B[6].x,B[7].y,B[4].z,B[5].w);
@@ -561,7 +558,7 @@ void shittify(uint4 B[8])
 	
 #pragma unroll
 	for(uint i=0; i<4; ++i)
-		B[i+4] = EndianSwap4(tmp[i]); 
+		B[i+4] = EndianSwap(tmp[i]);
 }
 
 void unshittify(uint4 B[8])
@@ -574,7 +571,7 @@ void unshittify(uint4 B[8])
 	
 #pragma unroll
 	for(uint i=0; i<4; ++i)
-		B[i] = EndianSwap4(tmp[i]); 
+		B[i] = EndianSwap(tmp[i]);
 
 	tmp[0] = (uint4)(B[7].x,B[6].y,B[5].z,B[4].w);
 	tmp[1] = (uint4)(B[4].x,B[7].y,B[6].z,B[5].w);
@@ -583,7 +580,7 @@ void unshittify(uint4 B[8])
 	
 #pragma unroll
 	for(uint i=0; i<4; ++i)
-		B[i+4] = EndianSwap4(tmp[i]); 
+		B[i+4] = EndianSwap(tmp[i]);
 }
 
 void salsa(uint4 B[8])
@@ -685,8 +682,8 @@ void scrypt_core(uint4 X[8], __global uint4*restrict lookup)
 	unshittify(X);
 }
 
-#define FOUND (0x80)
-#define NFLAG (0x7F)
+#define FOUND (0x800)
+#define NFLAG (0x7FF)
 
 __attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
 __kernel void search(__global const uint4 * restrict input,
@@ -723,13 +720,15 @@ const uint4 midstate0, const uint4 midstate16, const uint target)
 	SHA256(&tmp0,&tmp1, X[4], X[5], X[6], X[7]);
 	SHA256_fixed(&tmp0,&tmp1);
 	SHA256(&ostate0,&ostate1, tmp0, tmp1, (uint4)(0x80000000U, 0U, 0U, 0U), (uint4)(0U, 0U, 0U, 0x300U));
-	
-	if (!(ostate1.w&target))
+
+	bool found = (EndianSwap(ostate1.w) <= target);
+	if (found)
 		output[FOUND] = output[NFLAG & gid] = gid;
 }
 
 /*-
- * Copyright 2009 Colin Percival, 2011 ArtForz, 2011 pooler, 2012 mtrlt
+ * Copyright 2009 Colin Percival, 2011 ArtForz, 2011 pooler, 2012 mtrlt,
+ * 2012 Con Kolivas.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without