Message boards :
Seventeen or Bust :
WUs stalled out?
I have 2 WUs that appear to be stalled out: llr_sob_71249911_3 & llr_sob_71263408_2. The time elapsed continues to increment, but the progress bar and time remaining values have not changed in a couple of days. I have several other WUs that are progressing normally.
Suggestions? | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14037 ID: 53948 Credit: 477,373,019 RAC: 303,129
|
Have you tried rebooting your computer?
____________
My lucky number is 75898^524288+1 | |
|
|
I had identical symptoms a few weeks ago running GFN WR units on my GPUs. I found that a short-term fix was to just suspend and then resume the task in question. That was unsatisfactory though as they'd just "stall" again, sometimes hours later. There was always a message something like "the boinc client requested we should quit" in the log.
At the time my preferences were set to only run GPU work (two 570s), and I was running a 100% full load of PRPnet on the CPU (4 cores on the FPS challenge). For whatever reason I had previously set my boinc manager preferences to only use 25% of the available cores. When I set that back to 100%, the misbehavior stopped. I know that's not ironclad proof, but it did seem related. Not sure if it was directly due to the percentage, or if boinc thought my computer was "in use" (by prpnet), or a combination. On my other computer, which has only 1 GPU and was set to use 100% of cores (and was running a similar work mix), I never had a problem.
--Gary | |
|
|
1. I have tried exiting BOINC, manually killing the boinctray process, then restarting the program. --> no success
2. I have restarted the computer (shut down, then power back on). --> no success
3. I have no suitable GPUs.
4. In order to see if the problem had something to do with running 6-8 SoB tasks simultaneously, I have suspended most of them and will run a couple at a time to completion. Since I am 7 days away from the deadline for all my SoB tasks, I will leave the problem tasks suspended until all others are complete. I don't want to risk losing "good" tasks by wasting time troubleshooting problematic ones.
I'll report back in a few days. In the meantime, any other suggestions are still welcome. | |
|
|
Well, it appears that all of my SoB WUs want to stall out around the 40% completion mark. Once a WU hits the magic combination of 35-40% complete/~190 hrs elapsed/~40 hrs remaining, it exhibits the following behavior: both the time elapsed and time remaining begin to increment higher, & the % completion bar effectively halts.
I have tried suspending all other tasks except for one SoB task so that it could have as many resources as it wanted, but I woke up the next morning to see that all it did was increase the time elapsed/remaining.
Unless I get some guidance, I will abort these tasks and take a hiatus from this project. I am beginning to feel like I have wasted the >1500 CPU hours accumulated so far on these tasks. | |
|
Michael Goetz Volunteer moderator Project administrator
|
As far as I know, the problem you're having is unique. Therefore, there isn't any solid advice I can give you because it's not clear what the problem is.
If you have already aborted those tasks, I'd suggest trying to run some shorter LLR tasks, such as SGS, PPS, or PPSE, all of which take no more than a couple of hours. It might be helpful to know if the problem occurs at 40% or at 190 hours.
If you have not yet aborted the tasks and want to try to identify the problem, take a look inside some of the text files in the boinc directory. There might be something in there that would indicate what's happening.
The first file to look at is stderr.txt, in the slot directory where one of the SoB tasks is running. Please post the contents of that file.
The second file can be found by looking in a file called llr.out in the same slot directory. Inside the llr.out file will be something like this:
<soft_link>../../projects/www.primegrid.com/llr_sr5_189626089_2_0</soft_link>
The last part of that path (llr_sr5_189626089_2_0 in this example) is the name of the output file this task will create, and can be found in the boinc/projects/www.primegrid.com directory. The actual file name will be different from what you see in my example. If this output file exists in the www.primegrid.com directory, please post the contents of that file. It might not exist, however.
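The two lookups described above can be sketched in a few lines of Python. This is only an illustration, not an official PrimeGrid or BOINC tool: the function names are made up for this example, and it assumes that llr.out holds a single <soft_link> entry whose path is relative to the slot directory, as in the example above.

```python
import re
from pathlib import Path

# Matches the <soft_link>...</soft_link> entry shown in the example above.
SOFT_LINK = re.compile(r"<soft_link>(.*?)</soft_link>")

def resolve_output_file(slot_dir):
    """Return the path named by the <soft_link> entry in slot_dir/llr.out, or None."""
    link_file = Path(slot_dir) / "llr.out"
    if not link_file.is_file():
        return None
    match = SOFT_LINK.search(link_file.read_text(errors="replace"))
    if match is None:
        return None
    # Assumption: the link is relative to the slot directory (../../projects/...).
    return (Path(slot_dir) / match.group(1).strip()).resolve()

def report(slot_dir):
    """Collect stderr.txt and the linked output file (if it exists) as one string."""
    slot = Path(slot_dir)
    parts = []
    stderr_file = slot / "stderr.txt"
    if stderr_file.is_file():
        parts.append("--- stderr.txt ---\n" + stderr_file.read_text(errors="replace"))
    else:
        parts.append("stderr.txt not found in " + str(slot))
    output = resolve_output_file(slot)
    if output is None:
        parts.append("no <soft_link> entry found in llr.out")
    elif output.is_file():
        parts.append(f"--- {output.name} ---\n" + output.read_text(errors="replace"))
    else:
        parts.append(f"{output.name} does not exist (it may simply not have been written yet)")
    return "\n".join(parts)
```

Calling print(report(slot_path)), where slot_path is the slot directory of the affected task (its location depends on your BOINC installation), would print both files ready for pasting into a reply.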
|
|
Here are the complete contents of the stderr.txt file for task llr_sob_71270569_2:
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
02:30:47 (4332): No heartbeat from core client for 30 sec - exiting
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting
Major OS version: 6; Minor OS version: 1
FFT length: 1920K
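A log like this is easier to read once the repeated stanzas are counted. The sketch below is a hypothetical helper (summarize_stderr is not part of BOINC or LLR) that tallies how often the wrapper started and collects any "No heartbeat" lines; it assumes those two message strings appear exactly as in the log above.

```python
def summarize_stderr(text):
    """Count wrapper starts and collect heartbeat-exit lines from stderr.txt text."""
    starts = 0
    heartbeat_exits = []
    for line in text.splitlines():
        if "wrapper: starting" in line:
            # One of these is logged each time the task is (re)started.
            starts += 1
        elif "No heartbeat from core client" in line:
            heartbeat_exits.append(line.strip())
    return starts, heartbeat_exits
```

Applied to the log above, this counts 15 wrapper starts and one heartbeat exit (the 02:30:47 line).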
Here are the complete contents of the output file for the same task:
1000000000:P:0:2:25755459 21295606
For the record, this task currently shows 39.181% complete, 196:29:xx elapsed, 46:49:xx remaining, and both times are increasing at about 50% real-time speed (i.e., 1 sec every 2 actual secs).
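As a rough cross-check on those numbers, a straight-line extrapolation from percent complete and elapsed time can be compared with the displayed time remaining. This is only a sketch: it assumes progress is linear in elapsed time, which is not necessarily how BOINC computes its own estimate, and the function names are made up for this example.

```python
def hms_to_hours(hours, minutes=0, seconds=0):
    """Convert an h:m:s reading to fractional hours."""
    return hours + minutes / 60 + seconds / 3600

def linear_remaining_hours(fraction_done, elapsed_hours):
    """Estimate hours remaining, assuming progress is linear in elapsed time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours

# Figures from the post above: 39.181% complete after 196:29 elapsed.
estimate = linear_remaining_hours(0.39181, hms_to_hours(196, 29))
```

With the figures above the estimate comes out at roughly 305 hours, far more than the 46:49 shown, so the displayed time remaining and the progress bar disagree with each other under this assumption.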
| |
|
Michael Goetz Volunteer moderator Project administrator
|
That looks like the input file, not the output file.
The stderr.txt file didn't have anything unusual in it.
|
|
I just noticed something. When I open the llr.out, it gives me the following:
<soft_link>../../projects/www.primegrid.com/llr_sob_71270569_2_0</soft_link>
Note the "_2_0" at the end of the file name above.
When I view the contents of the /projects/www.primegrid.com directory, I see what I assume are the output files for the 12 tasks I currently have on board. However, each file does not have the "_x_y" suffix (as bolded above) in its filename. Is this significant? | |
|
Michael Goetz Volunteer moderator Project administrator
|
Very significant.
The files without the _x_y suffix are the input files and the files with the _x_y suffix are the output files.
Under normal circumstances, the output files won't be there while the task is still running, but they might be there if something unusual happened during processing. If they're not there, then there's nothing to be learned from them.
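The naming rule can be turned into a quick check over a directory listing. The sketch below is a heuristic only: it assumes output files end in two numeric fields ("_x_y", as in llr_sob_71270569_2_0) and that input file names do not, which matches the examples in this thread but would misclassify an input file whose own name happened to end in two numeric fields. The names is_output_file and split_files are made up for this example.

```python
import re

# Two trailing numeric fields, e.g. the "_2_0" in llr_sob_71270569_2_0.
OUTPUT_SUFFIX = re.compile(r"_\d+_\d+$")

def is_output_file(name):
    """Return True if the file name ends with an _x_y suffix."""
    return OUTPUT_SUFFIX.search(name) is not None

def split_files(names):
    """Split file names into (input_files, output_files) using the suffix rule."""
    inputs = [n for n in names if not is_output_file(n)]
    outputs = [n for n in names if is_output_file(n)]
    return inputs, outputs
```

Passing the file names from the www.primegrid.com directory to split_files would give an empty second list when no output files have been written yet.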
|
|
There are definitely no output files in my project directory.
Since I last posted the progress of this WU, the progress bar is now at 40%, with 203 hrs elapsed and 49 hrs remaining. For a dual Xeon E5520 w/ 24GB RDIMMs things should be (and have been) speedier than this.
I am going to abort all 12 WUs and revert to crunching SIMAP again.
Thanks for your attempts to troubleshoot this problem with me. It is unfortunate that so much time was wasted. | |
|