Update CUDA runtime version to 8.0 by acherunilam · Pull Request #4 · bryancatanzaro/damascene

acherunilam · 2017-06-13T23:08:08Z

I've migrated all API calls to the new CUDA SDK, and fixed the illegal memory access issue (#2). The program runs successfully on Ubuntu 14.04 with a Tesla K20 GPU. The run time is about 2 seconds when given the default Polynesia image.

For it to work, I had to create two symlinks: a file at ./lib/libblas.so pointing to /usr/lib/libblas.so.3 (BLAS wasn't being detected otherwise), and a directory at ./lib/acml pointing to the location of the uncompressed ACML package. I also had to set the dynamic library path so that ACML is found at run time, by adding export LD_LIBRARY_PATH="$HOME/acml5.3.1/ifort64/lib/:$LD_LIBRARY_PATH" to my bashrc.
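The setup described above could look roughly like the following. This is a minimal sketch, not the exact commands that were run: `REPO_ROOT` and `ACML_DIR` are hypothetical variables introduced here, and the ACML location `$HOME/acml5.3.1` is an assumption inferred from the `LD_LIBRARY_PATH` line.

```shell
# Sketch only: REPO_ROOT and ACML_DIR are assumed locations; adjust to your machine.
REPO_ROOT="${REPO_ROOT:-$PWD}"            # damascene checkout (assumption: current directory)
ACML_DIR="${ACML_DIR:-$HOME/acml5.3.1}"   # uncompressed ACML package (assumption)

mkdir -p "$REPO_ROOT/lib"

# File symlink so that BLAS is found under ./lib.
ln -sfn /usr/lib/libblas.so.3 "$REPO_ROOT/lib/libblas.so"

# Directory symlink to the uncompressed ACML package.
ln -sfn "$ACML_DIR" "$REPO_ROOT/lib/acml"

# Let the dynamic loader find ACML at run time.
# Put the same line in ~/.bashrc to make it persistent.
export LD_LIBRARY_PATH="$ACML_DIR/ifort64/lib/:$LD_LIBRARY_PATH"
```

`ln -sfn` replaces an existing link instead of failing, so the snippet can be re-run after moving the ACML directory.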

abcherun@gpus5:~/dev/damascene<update_cuda_runtime>$ ./bin/linux/release/damascene damascene/polynesia.ppm
Using cuda device 1: Tesla K20c
Processing: damascene/polynesia.ppm, output in damascene/polynesiaPb.pgm and damascene/polynesia.pb

Eig 9 Tol 0.001000 Texton 1
Image found: 321 x 481 pixels
Available 246022144 bytes on GPU
>+< rgbUtoGrayF | 0.733000 | ms
Convolving
Beginning kmeans
	Changes: 189059
	Changes: 82140
	Changes: 53169
	Changes: 40551
	Changes: 32645
	Changes: 25978
	Changes: 23274
	Changes: 19205
	Changes: -215195205
	8 iterations until termination
Kmeans completed
>+< texton | 330.005005 | ms
>+< rgbUtoLab3F | 1.980000 | ms
>+< normalizeLab | 0.015000 | ms
>+< mirrorImage | 0.988000 | ms
Beginning Local cues computation
>+< 	Bgsmooth: | 13.919000 | ms
>+< 	Bg: | 89.497002 | ms
>+< 	Cgsmooth: | 28.606001 | ms
>+< 	Cga: | 101.626999 | ms
>+< 	Cgsmooth: | 28.462000 | ms
>+< 	Cgb: | 100.380997 | ms
>+< 	Tgsmooth: | 26.927000 | ms
>+< 	Tg: | 79.556999 | ms
Completed Local cues
localcues time: 0.378023 seconds
>+< localcues | 378.028992 | ms
>+< combine | 1.131000 | ms

Max time: 0.001299 seconds
Oriented Max time: 0.000520 seconds
Solve time: 0.001848 seconds
>+< nonmaxsupression | 3.734000 | ms
Intervening contour completed
>+< intervene | 16.999001 | ms
Available 160432128 bytes on GPU
Can fit 220 iterations on GPU
lanczos iteration: 0
lanczos iteration: 100
lanczos iteration: 200
Screened Eigenvalues:
8.746726e-08 1.220966e-04 2.492128e-04 6.079139e-04 1.203148e-03 1.603140e-03 2.338310e-03 3.313430e-03 4.052723e-03
Eigenvalue: 8 has too large a residual 2.778898e-03
lanczos iteration: 300
lanczos iteration: 400
Screened Eigenvalues:
8.522160e-08 4.413837e-05 1.205699e-04 2.042214e-04 2.480132e-04 3.003473e-04 4.872549e-04 6.106148e-04 7.674522e-04
Eigenvalue: 8 has too large a residual 1.742319e-03
lanczos iteration: 500
lanczos iteration: 600
lanczos iteration: 700
lanczos iteration: 800
lanczos iteration: 900
Screened Eigenvalues:
6.426722e-08 4.418454e-05 1.360216e-04 2.063825e-04 2.494206e-04 2.955018e-04 4.003884e-04 4.659711e-04 5.760831e-04
Converged
nIterations = 1000
lanczos Iterations : 1.226173 seconds
Eigenvector calculation: 194.000000 microseconds
>+< generalizedEigensolve | 1243.982056 | ms
>+< spectralPb | 4.027000 | ms
>+< StartCalcGpb | 1.020000 | ms
Skeletonizing ...
	Iteration = 1, Image changed
	Iteration = 2, Image changed
	Iteration = 3, Image unchanged
CUDA Status : no error
>+< PostProcess | 4.072000 | ms
>+< Computation time: | 1.986825 | seconds

Do remember to squash the commits before merging :)

bryancatanzaro · 2017-06-14T00:21:20Z

This is awesome, thanks for doing this. Let me look at your work here... BTW - why do you want to squash the commits before merging?

hao-lh · 2017-06-14T02:11:43Z

Great work. Here are some concerns:

Do we need to upgrade and test it on the newest CUDA 8.0 platform?
From the results, we can see that most of the time was spent on generalizedEigensolve, localcues, and texton.
Can we work on these and see whether further improvement is possible? See Bryan's original paper, "Efficient, High-Quality Image Contour Detection".

acherunilam · 2017-06-14T02:23:18Z

@bryancatanzaro I just thought these changes would be better represented in the log if they were presented as one single "Updated CUDA runtime to 8.0" rather than 16 separate "Updated <module1>", "Updated <module2>", etc. Most projects do squash before merging afaik, but it's up to the maintainer of the repo.
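For illustration, here is one way a maintainer could collapse a branch into a single commit with `git merge --squash`. This is a throwaway sketch in a temporary repository, not the workflow used for this pull request: the file name, the demo identity, and the three sample commits are hypothetical.

```shell
# Throwaway demo in a temp dir; the file name, identity, and sample commits are hypothetical.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Demo"
git config user.email "demo@example.com"

echo base > README
git add README
git commit -q -m "Initial commit"

# Feature branch with several small commits, one per module.
git checkout -q -b update_cuda_runtime
for module in combine convert textons; do
    echo "$module" >> README
    git commit -q -am "Update the $module module"
done

# Stage the combined changes on master without creating a merge commit,
# then record them as one commit.
git checkout -q master
git merge -q --squash update_cuda_runtime
git commit -q -m "Update CUDA runtime version to 8.0"

git log --oneline   # shows two commits: the initial one and the squashed one
```

The per-module history stays available on the `update_cuda_runtime` branch, while `master` gets a single entry in its log.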

EDIT: Changed 7.5 to 8.0

acherunilam · 2017-06-14T03:25:09Z

@hao-lh Correction from my side - this code is compatible with runtime version 8.0, not 7.5. I shall fix the title of the pull request.

As for the scope for improvement, I thought this repo implemented everything that was discussed in "Efficient, High-Quality Image Contour Detection" by Catanzaro et al. Is there any specific optimization that you're referring to?

hao-lh · 2017-06-15T08:40:00Z

@AdithyaBenny Most of Bryan's code was written more than five years ago. Since parallel computing and CUDA have been evolving rapidly in recent years, I was wondering whether there are methods that give better performance. No offense intended to Bryan's original algorithm or your work; I just want this code to run faster :)

pkuCactus · 2017-11-03T09:53:34Z

Hi @AdithyaBenny, I used the code you committed and still hit a cudaErrorIllegalAddress. The error message is CUDA error at lanczos.cu:217 code=77(cudaErrorIllegalAddress) "cudaMemcpy(devVector, d_aVectorQQ, nPixels * sizeof(float), cudaMemcpyDeviceToDevice)". Could you help me? I'm using a Titan X on Ubuntu 14.04, and I downloaded ACML 5.3.0. Thanks a lot.

pkuCactus · 2017-11-03T09:57:20Z

And here is the complete output:
` ./bin/linux/release/damascene damascene/polynesia.ppm
Using cuda device 2: GeForce GTX TITAN X
Processing: damascene/polynesia.ppm, output in damascene/polynesiaPb.pgm and damascene/polynesia.pb

Eig 9 Tol 0.001000 Texton 1
Image found: 321 x 481 pixels
Available 12672958464 bytes on GPU
>+< rgbUtoGrayF | 0.244000 | ms
Convolving
Beginning kmeans
Changes: 150860
Changes: 78580
Changes: 50898
Changes: 38726
Changes: 30185
Changes: 25232
Changes: 21250
Changes: 18425
Changes: -179543699
8 iterations until termination
Kmeans completed
>+< texton | 237.464996 | ms
>+< rgbUtoLab3F | 2.259000 | ms
>+< normalizeLab | 0.016000 | ms
>+< mirrorImage | 0.858000 | ms
Beginning Local cues computation
>+< Bgsmooth: | 7.079000 | ms
>+< Bg: | 35.658001 | ms
>+< Cgsmooth: | 18.371000 | ms
>+< Cga: | 44.307999 | ms
>+< Cgsmooth: | 18.410000 | ms
>+< Cgb: | 44.462002 | ms
>+< Tgsmooth: | 17.982000 | ms
>+< Tg: | 39.193001 | ms
Completed Local cues
localcues time: 0.178665 seconds
>+< localcues | 178.677994 | ms
>+< combine | 1.499000 | ms

Max time: 0.000406 seconds
Oriented Max time: 0.000509 seconds
Solve time: 0.000933 seconds
>+< nonmaxsupression | 6.005000 | ms
Intervening contour completed
>+< intervene | 7.725000 | ms
Available 12572688384 bytes on GPU
Can fit 18306 iterations on GPU
lanczos iteration: 0
CUDA error at lanczos.cu:217 code=77(cudaErrorIllegalAddress) "cudaMemcpy(devVector, d_aVectorQQ, nPixels * sizeof(float), cudaMemcpyDeviceToDevice)" `

acherunilam added 16 commits June 13, 2017 18:10

Change arch in NVCCFLAGS from sm_10 to sm_35 to be compatible with Tesla K20

cbfc7e7

Update paths to CUDA SDK and ACML

d86f82d

Update the stencilMatrixMultiply module

c2b75b7

Update the combine module

cc484f4

Update the convert module

5f596ec

Update the textons module

4544bf2

Update the gPb module

ae53d74

Update the localcues module

65e505b

Update the intervening module

d4540a5

Update the nonmax module

67fc96f

Update the noReorthog module

5321507

Update the postprocess module

874ed57

Update the sPb module

4bfa089

Update the damascene module

dd73b52

Fix invalid write to shared memory error by specifying the dynamically allocated memory per block for the computeGradient kernel

8578e1f

Add missing newline in the output

f1889ea

acherunilam changed the title ~~Update CUDA runtime version to 7.5~~ Update CUDA runtime version to 8.0 Jun 14, 2017

acherunilam wants to merge 16 commits into bryancatanzaro:master from acherunilam:update_cuda_runtime
