Update CUDA runtime version to 8.0 #4

Open
acherunilam wants to merge 16 commits into bryancatanzaro:master from acherunilam:update_cuda_runtime

@acherunilam

I've migrated all API calls to the new CUDA SDK, and fixed the illegal memory access issue (#2). The program runs successfully on Ubuntu 14.04 with a Tesla K20 GPU. The run time is about 2 seconds when given the default Polynesia image.

For it to work, I had to create two symlinks: ./lib/libblas.so, which points to /usr/lib/libblas.so.3 (BLAS wasn't being detected otherwise), and ./lib/acml, a directory link that points to the location of the uncompressed ACML package. I also had to set the dynamic library path so that ACML is detected, by adding export LD_LIBRARY_PATH="$HOME/acml5.3.1/ifort64/lib/:$LD_LIBRARY_PATH" to my bashrc.
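
A minimal sketch of that setup, run from the repository root. The variable ACML_DIR and its default of $HOME/acml5.3.1 are assumptions inferred from the LD_LIBRARY_PATH value above; point it at wherever the ACML package was actually uncompressed.

```shell
# Sketch only: ACML_DIR is an assumed location, adjust it to your ACML install.
ACML_DIR="${ACML_DIR:-$HOME/acml5.3.1}"

mkdir -p lib

# Point ./lib/libblas.so at the system BLAS (path taken from the comment above).
ln -sfn /usr/lib/libblas.so.3 lib/libblas.so

# Point ./lib/acml at the uncompressed ACML package.
ln -sfn "$ACML_DIR" lib/acml

# Let the dynamic loader find ACML in the current shell session.
export LD_LIBRARY_PATH="$ACML_DIR/ifort64/lib/:${LD_LIBRARY_PATH:-}"
```

The export line only affects the current shell; adding it to bashrc, as described above, makes it persistent.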

abcherun@gpus5:~/dev/damascene<update_cuda_runtime>$ ./bin/linux/release/damascene damascene/polynesia.ppm
Using cuda device 1: Tesla K20c
Processing: damascene/polynesia.ppm, output in damascene/polynesiaPb.pgm and damascene/polynesia.pb

Eig 9 Tol 0.001000 Texton 1
Image found: 321 x 481 pixels
Available 246022144 bytes on GPU
>+< rgbUtoGrayF | 0.733000 | ms
Convolving
Beginning kmeans
	Changes: 189059
	Changes: 82140
	Changes: 53169
	Changes: 40551
	Changes: 32645
	Changes: 25978
	Changes: 23274
	Changes: 19205
	Changes: -215195205
	8 iterations until termination
Kmeans completed
>+< texton | 330.005005 | ms
>+< rgbUtoLab3F | 1.980000 | ms
>+< normalizeLab | 0.015000 | ms
>+< mirrorImage | 0.988000 | ms
Beginning Local cues computation
>+< 	Bgsmooth: | 13.919000 | ms
>+< 	Bg: | 89.497002 | ms
>+< 	Cgsmooth: | 28.606001 | ms
>+< 	Cga: | 101.626999 | ms
>+< 	Cgsmooth: | 28.462000 | ms
>+< 	Cgb: | 100.380997 | ms
>+< 	Tgsmooth: | 26.927000 | ms
>+< 	Tg: | 79.556999 | ms
Completed Local cues
localcues time: 0.378023 seconds
>+< localcues | 378.028992 | ms
>+< combine | 1.131000 | ms

Max time: 0.001299 seconds
Oriented Max time: 0.000520 seconds
Solve time: 0.001848 seconds
>+< nonmaxsupression | 3.734000 | ms
Intervening contour completed
>+< intervene | 16.999001 | ms
Available 160432128 bytes on GPU
Can fit 220 iterations on GPU
lanczos iteration: 0
lanczos iteration: 100
lanczos iteration: 200
Screened Eigenvalues:
8.746726e-08 1.220966e-04 2.492128e-04 6.079139e-04 1.203148e-03 1.603140e-03 2.338310e-03 3.313430e-03 4.052723e-03
Eigenvalue: 8 has too large a residual 2.778898e-03
lanczos iteration: 300
lanczos iteration: 400
Screened Eigenvalues:
8.522160e-08 4.413837e-05 1.205699e-04 2.042214e-04 2.480132e-04 3.003473e-04 4.872549e-04 6.106148e-04 7.674522e-04
Eigenvalue: 8 has too large a residual 1.742319e-03
lanczos iteration: 500
lanczos iteration: 600
lanczos iteration: 700
lanczos iteration: 800
lanczos iteration: 900
Screened Eigenvalues:
6.426722e-08 4.418454e-05 1.360216e-04 2.063825e-04 2.494206e-04 2.955018e-04 4.003884e-04 4.659711e-04 5.760831e-04
Converged
nIterations = 1000
lanczos Iterations : 1.226173 seconds
Eigenvector calculation: 194.000000 microseconds
>+< generalizedEigensolve | 1243.982056 | ms
>+< spectralPb | 4.027000 | ms
>+< StartCalcGpb | 1.020000 | ms
Skeletonizing ...
	Iteration = 1, Image changed
	Iteration = 2, Image changed
	Iteration = 3, Image unchanged
CUDA Status : no error
>+< PostProcess | 4.072000 | ms
>+< Computation time: | 1.986825 | seconds

Do remember to squash the commits before merging :)

@bryancatanzaro

Owner

This is awesome, thanks for doing this. Let me look at your work here... BTW - why do you want to squash the commits before merging?

@hao-lh

hao-lh commented Jun 14, 2017

Great work. A couple of questions:

  1. Do we need to upgrade to and test on the newest CUDA 8.0 platform?
  2. From the results, most of the time is spent in generalizedEigensolve, localcues, and texton. Can we work on these and see whether there is still room for improvement? See Bryan's original paper: Efficient, High-Quality Image Contour Detection.
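
To make the second point concrete, here is a small sketch that ranks the top-level stages by time. The timings are copied from the Tesla K20c log earlier in this thread; the file name timings.log and the variable top3 are only illustrative.

```shell
# Top-level stage timings (ms) copied from the Tesla K20c log in this thread.
cat > timings.log <<'EOF'
>+< rgbUtoGrayF | 0.733000 | ms
>+< texton | 330.005005 | ms
>+< localcues | 378.028992 | ms
>+< combine | 1.131000 | ms
>+< nonmaxsupression | 3.734000 | ms
>+< intervene | 16.999001 | ms
>+< generalizedEigensolve | 1243.982056 | ms
>+< spectralPb | 4.027000 | ms
EOF

# Print "time stage" for every ms line, largest first, and keep the top three.
top3=$(awk -F'|' '$3 ~ /ms/ {
    sub(/^[^ ]* */, "", $1)   # drop the leading ">+<" marker
    gsub(/ /, "", $1)         # stage name without padding
    gsub(/ /, "", $2)         # time without padding
    print $2, $1
}' timings.log | LC_ALL=C sort -rn | head -n 3)
echo "$top3"
```

The three stages it lists (generalizedEigensolve, localcues, texton) add up to roughly 1952 ms against the reported total of 1.986825 seconds, so any further speedup would have to come mostly from those three.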

@acherunilam

acherunilam commented Jun 14, 2017

Author

@bryancatanzaro I just thought these changes would be better represented in the log if they were presented as one single "Updated CUDA runtime to 8.0" rather than 16 separate "Updated <module1>", "Updated <module2>", etc. Most projects do squash before merging afaik, but it's up to the maintainer of the repo.
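
For reference, a sketch of what squashing looks like with plain git. It runs in a throwaway repository created with mktemp, so it touches no real checkout; the three "Updated moduleN" commits are stand-ins for the 16 commits in this pull request, and the user name and email are placeholders.

```shell
# Throwaway repository, so nothing here touches a real checkout.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"
git commit -q --allow-empty -m "Initial commit"

# A feature branch with several small commits (stand-ins for the real ones).
git checkout -q -b update_cuda_runtime
for module in module1 module2 module3; do
    echo "updated" > "$module.txt"
    git add "$module.txt"
    git commit -q -m "Updated $module"
done

# Squash: stage the branch's combined changes on master, then commit once.
git checkout -q master
git merge -q --squash update_cuda_runtime
git commit -q -m "Update CUDA runtime version to 8.0"

git log --oneline master
```

After this, master carries a single commit for the whole branch on top of the initial one, while the individual commits remain on update_cuda_runtime.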

EDIT: Changed 7.5 to 8.0

@acherunilam

Author

@hao-lh Correction from my side - this code is compatible with runtime version 8.0, not 7.5. I shall fix the title of the pull request.

As for the scope for improvement, I thought this repo implemented everything that was discussed in "Efficient, High-Quality Image Contour Detection" by Catanzaro et al. Is there any specific optimization that you're referring to?

@acherunilam changed the title from "Update CUDA runtime version to 7.5" to "Update CUDA runtime version to 8.0" on Jun 14, 2017

@hao-lh

hao-lh commented Jun 15, 2017

@AdithyaBenny Most of Bryan's code was written more than five years ago. Since parallel computing and CUDA have been evolving quickly in recent years, I was wondering whether there are methods that would give better performance. No offense at all to Bryan's original algorithm or your work; I just want this code to run faster :)

@pkuCactus

Hi @AdithyaBenny, I used the code you committed and still hit cudaErrorIllegalAddress. The error message is CUDA error at lanczos.cu:217 code=77(cudaErrorIllegalAddress) "cudaMemcpy(devVector, d_aVectorQQ, nPixels * sizeof(float), cudaMemcpyDeviceToDevice)". Could you help me? I'm using a Titan X on Ubuntu 14.04, and I downloaded ACML 5.3.0. Thanks a lot.

@pkuCactus

And here is the complete output:

$ ./bin/linux/release/damascene damascene/polynesia.ppm
Using cuda device 2: GeForce GTX TITAN X
Processing: damascene/polynesia.ppm, output in damascene/polynesiaPb.pgm and damascene/polynesia.pb

Eig 9 Tol 0.001000 Texton 1
Image found: 321 x 481 pixels
Available 12672958464 bytes on GPU
>+< rgbUtoGrayF | 0.244000 | ms
Convolving
Beginning kmeans
Changes: 150860
Changes: 78580
Changes: 50898
Changes: 38726
Changes: 30185
Changes: 25232
Changes: 21250
Changes: 18425
Changes: -179543699
8 iterations until termination
Kmeans completed
>+< texton | 237.464996 | ms
>+< rgbUtoLab3F | 2.259000 | ms
>+< normalizeLab | 0.016000 | ms
>+< mirrorImage | 0.858000 | ms
Beginning Local cues computation
>+< Bgsmooth: | 7.079000 | ms
>+< Bg: | 35.658001 | ms
>+< Cgsmooth: | 18.371000 | ms
>+< Cga: | 44.307999 | ms
>+< Cgsmooth: | 18.410000 | ms
>+< Cgb: | 44.462002 | ms
>+< Tgsmooth: | 17.982000 | ms
>+< Tg: | 39.193001 | ms
Completed Local cues
localcues time: 0.178665 seconds
>+< localcues | 178.677994 | ms
>+< combine | 1.499000 | ms

Max time: 0.000406 seconds
Oriented Max time: 0.000509 seconds
Solve time: 0.000933 seconds
>+< nonmaxsupression | 6.005000 | ms
Intervening contour completed
>+< intervene | 7.725000 | ms
Available 12572688384 bytes on GPU
Can fit 18306 iterations on GPU
lanczos iteration: 0
CUDA error at lanczos.cu:217 code=77(cudaErrorIllegalAddress) "cudaMemcpy(devVector, d_aVectorQQ, nPixels * sizeof(float), cudaMemcpyDeviceToDevice)"
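
An illegal-address error reported at a cudaMemcpy does not necessarily mean the copy itself is at fault: the CUDA runtime can return an error from an earlier asynchronous kernel launch at a later API call. One possible way to narrow this down (a sketch, not a diagnosis) is to rerun under cuda-memcheck, assuming the tool from the CUDA 8.0 toolkit is on PATH, or with CUDA_LAUNCH_BLOCKING=1 as a fallback. The BIN and IMG variables simply hold the paths used in this thread.

```shell
# Paths are the ones used in this thread; run from the repository root.
BIN=./bin/linux/release/damascene
IMG=damascene/polynesia.ppm

if [ -x "$BIN" ] && command -v cuda-memcheck >/dev/null 2>&1; then
    # cuda-memcheck reports which kernel and thread made the bad access.
    cuda-memcheck "$BIN" "$IMG"
elif [ -x "$BIN" ]; then
    # Fallback: make kernel launches synchronous, so the error is reported
    # right after the faulting kernel instead of at some later call.
    CUDA_LAUNCH_BLOCKING=1 "$BIN" "$IMG"
else
    echo "build damascene first: $BIN not found"
fi
```

Either run should point closer to the kernel that performs the out-of-bounds access than the cudaMemcpy at lanczos.cu:217 does.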
