All,
The last three WUs sent to me have resulted in errors. This is from a desktop that has had few problems with SoB. How do I diagnose this?
____________
Thanks,
Jim
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,165,326 RAC: 1,015,136
|
I'm not much of a Mac person, so I can't help too much, but I can say that it doesn't look like a hardware problem. You're not getting computation errors. That, of course, is good news.
It looks like something external to the app is causing the tasks to be terminated. Three of the four tasks get a "signal 4", which I believe is what shut down the tasks. The fourth has an error message about BOINC being unable to write one of the files in the slot directory. That last error happened before our app even started to run.
Regardless of the OS, there are only a few things that would prevent BOINC from being able to write to a file: the disk is full, the directory permissions on the BOINC directories are incorrect, anti-virus software (or something else) is interfering, or there's some sort of problem (hardware or drivers) with the disks.
It's easy to check if you're running out of disk space, and re-installing BOINC often fixes the permission problems.
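The first two causes listed above (a full disk, wrong permissions) can be checked with a few lines of Python. This is only a minimal sketch: the function name `check_boinc_dir`, the 1 GiB threshold, and the data directory path in the usage section are assumptions for illustration, not anything BOINC itself provides, and the check must be run as the same user that BOINC runs as to be meaningful.

```python
import os
import shutil
import tempfile


def check_boinc_dir(path, min_free_bytes=1 * 1024**3):
    """Report free disk space and whether a file can be created under `path`.

    `min_free_bytes` is an arbitrary illustrative threshold (1 GiB by default).
    """
    usage = shutil.disk_usage(path)
    report = {
        "free_bytes": usage.free,
        "enough_space": usage.free >= min_free_bytes,
        "writable": False,
    }
    try:
        # Creating (and automatically deleting) a temporary file is a more
        # reliable permission test than inspecting mode bits.
        with tempfile.NamedTemporaryFile(dir=path):
            report["writable"] = True
    except OSError:
        pass
    return report


if __name__ == "__main__":
    # Assumed location of the BOINC data directory on a Mac; adjust as needed.
    data_dir = "/Library/Application Support/BOINC Data"
    if os.path.isdir(data_dir):
        print(check_boinc_dir(data_dir))
    else:
        print("Directory not found:", data_dir)
```

If `writable` comes back `False` while there is plenty of free space, that points toward the permissions explanation rather than a full disk.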
Beyond that, I don't have much other advice, sorry.
____________
My lucky number is 75898524288+1 |
|
|
|
Hi Jim,
These are really mysterious…
This task http://www.primegrid.com/result.php?resultid=628546564 fails with an error related to copying files at startup, but had accumulated 146,000s of runtime. It seems that in this failure mode you don't get the stderr text returned, but it seems that LLR was running. The actual cause of the fault is probably something like Mike suggested.
The other two, e.g. http://www.primegrid.com/result.php?resultid=635893994, fail with "process got signal 4". Signal 4 is 'Illegal Instruction', which is just plain weird, since I can't think of any reason why LLR or BOINC would try to execute an instruction that is not supported by your CPU (Haswell). I also spotted a PSP task with the same error: http://www.primegrid.com/result.php?resultid=635327367
The only thing I can think of that might link these two is that you have some disk corruption, and your LLR binary got subtly broken, as well as maybe some other things in the BOINC directory. I would recommend you uninstall BOINC, run a full disk check with Apple's 'Disk Utility', and then, once you have fixed any errors that it reports, reinstall BOINC and try again.
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! |
|
|
|
All is well since I performed the HD check and the reinstall. However, I have not run an SoB or a PSP WU yet.
Jim |
|
|
|
I got another PSP failure with a signal 4. For right now, I am stopping my PSP and SoB runs until I know how to fix this. Are there any diagnostics I can run?
____________
Thanks,
Jim
|
|
|
|
Iain,
My disk checking software found a file called ss.config.xml in the BOINC data directory and says it is not a .xml file. Could this be the source of, or at least related to, my problem?
____________
Thanks,
Jim
|
|
|
JimB Honorary cruncher
Joined: 4 Aug 11 Posts: 920 ID: 107307 Credit: 989,246,873 RAC: 201,706
|
Just so you know, Iain is away on holiday this week. |
|
|
|
Thanks,
Jim |
|
|
|
The SS_config.XML file is used for the screen saver. Can you view the file (preferably in something like a text editor) and paste the contents into a message here? It probably would not be considered a normal XML file in the type of search you did.
Sig 4 is an illegal instruction. Were you running another non-PrimeGrid project at the same time as the SoB?
I see you are using Beta software, but I do not think that should make a difference. When you installed BOINC, did you allow it to use the screen saver? That may be where the conflict arises: the screen saver may be running for a different project and send an instruction that causes SoB trouble when it tries to swap memory at the same time, since the amount of data being moved around is pretty huge.
So my suggestion would be to reinstall BOINC, turning off the option for the screen saver, and see what happens then.
|
|
|
|
Hey Bear..
Here are the xml file contents:
<default_gfx_duration>90</default_gfx_duration>
<science_gfx_duration>90</science_gfx_duration>
<science_gfx_change_interval>30</science_gfx_change_interval>
I run Einstein, Test4Theory and Atlas simultaneously with PrimeGrid.
This problem happened before running the Beta software so I doubt that is an issue.
I have noted a couple of things...
It started occurring after the LLR upgrade to 7.02. However, TRP and ESP seem to work fine.
I also upgraded my OS on July 2, one day before the 7.02 versions were installed. Boy, I hope that is not the issue!
What'cha think?
____________
Thanks,
Jim
|
|
|
|
Those are standard options for the screen saver. I personally turn off the screen saver part during the install. It has been known to have some problems with some projects. Not allowing it to run is an option and may be something to try, especially since I know Einstein is pretty big about using the screen saver. I am not as knowledgeable about the other projects.
So for now, I'd suggest setting your screen saver to off and just having the monitor go blank (I'm unsure how to do that on the Mac without getting behind one again), then trying again. That will rule out my suspicion of a collision happening. It may be that the new LLR version is utilizing memory in a different way and causing that.
|
|
|
|
I generally run at about 75% memory usage total, sometimes more. BOINC itself can get to 6 GB if three ATLAS tasks are running on my 4-core machine.
____________
Thanks,
Jim
|
|
|
|
Just for grins, I aborted all of my WUs and I am now running PrimeGrid: PSP only. For me, that is about a 2-3 day job. So we'll see what happens. |
|
|
|
All four of my PSP WUs ended in error. Here is the event log:
Thu Jul 30 01:14:12 2015 | PrimeGrid | Computation for task psp_llr_240041469_2 finished
Thu Jul 30 01:14:12 2015 | PrimeGrid | Output file psp_llr_240041469_2_0 for task psp_llr_240041469_2 absent
Thu Jul 30 01:14:13 2015 | PrimeGrid | Computation for task psp_llr_240041267_2 finished
Thu Jul 30 01:14:13 2015 | PrimeGrid | Output file psp_llr_240041267_2_0 for task psp_llr_240041267_2 absent
Thu Jul 30 01:14:14 2015 | PrimeGrid | Computation for task psp_llr_240041745_1 finished
Thu Jul 30 01:14:14 2015 | PrimeGrid | Output file psp_llr_240041745_1_0 for task psp_llr_240041745_1 absent
Thu Jul 30 01:14:15 2015 | PrimeGrid | Computation for task psp_llr_240041728_0 finished
Thu Jul 30 01:14:15 2015 | PrimeGrid | Output file psp_llr_240041728_0_0 for task psp_llr_240041728_0 absent
Thu Jul 30 01:14:21 2015 | | Suspending computation - CPU is busy
Thu Jul 30 01:14:31 2015 | | Resuming computation
Thu Jul 30 01:24:59 2015 | PrimeGrid | Sending scheduler request: To report completed tasks.
Thu Jul 30 01:24:59 2015 | PrimeGrid | Reporting 4 completed tasks
Thu Jul 30 01:24:59 2015 | PrimeGrid | Not requesting tasks: "no new tasks" requested via Manager
Thu Jul 30 01:25:01 2015 | PrimeGrid | Scheduler request completed
Notice that at the same time (within seconds), my non-BOINC CPU usage exceeded my limit, which is set to 50%. In the stderr output, signal 4 is cited as the cause.
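To make the "all at once" pattern easier to spot across several nights of logs, the event log can be scanned for failure lines that cluster in time. The following is a rough sketch only: the function name `failure_bursts` and the 10-second window are made up for illustration, and it assumes the `timestamp | project | message` layout and English day/month names seen in the log above.

```python
from datetime import datetime, timedelta

# Four lines copied from the event log above, used as sample input.
SAMPLE_LOG = """\
Thu Jul 30 01:14:12 2015 | PrimeGrid | Output file psp_llr_240041469_2_0 for task psp_llr_240041469_2 absent
Thu Jul 30 01:14:13 2015 | PrimeGrid | Output file psp_llr_240041267_2_0 for task psp_llr_240041267_2 absent
Thu Jul 30 01:14:14 2015 | PrimeGrid | Output file psp_llr_240041745_1_0 for task psp_llr_240041745_1 absent
Thu Jul 30 01:14:15 2015 | PrimeGrid | Output file psp_llr_240041728_0_0 for task psp_llr_240041728_0 absent
"""


def failure_bursts(log_text, window_seconds=10):
    """Group 'Output file ... absent' events that occur close together in time.

    Returns a list of bursts; each burst is a list of datetime objects where
    consecutive events are at most `window_seconds` apart.
    """
    times = []
    for line in log_text.splitlines():
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue
        message = parts[2]
        if message.startswith("Output file") and message.endswith("absent"):
            times.append(datetime.strptime(parts[0], "%a %b %d %H:%M:%S %Y"))
    times.sort()

    bursts = []
    for t in times:
        if bursts and t - bursts[-1][-1] <= timedelta(seconds=window_seconds):
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


if __name__ == "__main__":
    for burst in failure_bursts(SAMPLE_LOG):
        print(len(burst), "task(s) failed starting at", burst[0])
```

A burst containing several unrelated tasks is a hint that something outside the individual tasks is responsible, and the burst start times can then be compared against whatever else is scheduled on the machine.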
Not sure what to do next. Any and all suggested debug steps gratefully accepted.
____________
Thanks,
Jim
|
|
|
|
FYI...
It turns out that for PSP and SoB, all work units aborted at 1:15 am. I have a daily backup program that finishes around that time every night. Therefore, my new working theory is that there is some incompatibility between PSP or SoB and my backup software.
I have now made that backup software an exclusive application. We'll see if that helps.
I don't know where the problem lurks. I did a Mac OS upgrade on July 2 and PG went to version 7.02 of those two apps on July 3.
Will update as needed.
____________
Thanks,
Jim
|
|
|
Michael Goetz Volunteer moderator Project administrator
|
FYI...
It turns out that for PSP and SoB, all work units aborted at 1:15 am. I have a daily backup program that finishes around that time every night. Therefore, my new working theory is that there is some incompatibility between PSP or SoB and my backup software.
I have now made that backup software an exclusive application. We'll see if that helps.
I don't know where the problem lurks. I did a Mac OS upgrade on July 2 and PG went to version 7.02 of those two apps on July 3.
Will update as needed.
You can try configuring the backup program to ignore the BOINC data directory, or at least the slot directories. Chances are that backups of the slot directories are pointless anyway. It's really tricky restoring BOINC from a backup without losing the tasks and it usually doesn't work.
____________
My lucky number is 75898524288+1 |
|
|
|
Michael,
Good thought. I will do that.
____________
Thanks,
Jim
|
|
|
|
Well, I had BOINC stopped from running during my backup and had quit backing up the BOINC library. My 3 PSP tasks aborted anyway with signal 4 errors. I'm going to have to quit running PSP and SoB WUs until some update happens.
Sigh...
Jim |
|
|
|
Sorry to be late to the party. But the words "signal 4" made me skim some interwebs, and I'm glad to report that "signal 4" is SIGILL (illegal instruction) on Darwin too. But! I've found some noise that the handling of SIGTRAP (debugging thingy) by Darwin isn't reliable, and it was suggested to replace SIGTRAP with SIGILL instead. Does that trigger anything?
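For anyone who wants to confirm the number-to-name mapping on their own machine rather than trusting the interwebs, Python's standard `signal` module can do the lookup. This is just a small sketch; the helper name `describe_signal` is made up for illustration.

```python
import signal


def describe_signal(number):
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to a signal known on this platform.
        return "unknown"


if __name__ == "__main__":
    # On macOS and Linux this prints SIGILL.
    print(describe_signal(4))
```

This only confirms what the number means; it says nothing about why the process received the signal.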
____________
I'm counting for science,
Points just make me sick. |
|
|
|
I think I knew that, but I surely didn't research it like you did. What now is interesting is that I've just upgraded to the latest version of Mac OS. All of the smaller 7.02 WUs seem to be fine. So what that may imply is that I don't have a 7.02 problem, but I do have a problem with very long 7.02 WUs. I have turned on PSP tasks to see if the OS upgrade helped.
Also, the last three PSP WUs that failed...they failed all at the same time, within a second. Not sure what that implies.
____________
Thanks,
Jim
|
|
|
|
To the admins: Does this discussion need to be moved to the PSP message boards?
____________
Thanks,
Jim
|
|
|
Michael Goetz Volunteer moderator Project administrator
|
To the admins: Does this discussion need to be moved to the PSP message boards?
Your choice if you want to move it or not; I don't mind if it stays here. However, this isn't a PSP problem, or an SoB problem. It's a generic LLR problem (or a generic BOINC problem) that you have which appears to be more likely to occur on longer tasks. The best place for it, in my opinion, is either "Number Crunching" or "Problems and Help".
(Functionally, there's no difference between a PPSE-LLR task and an SoB-LLR task. It's the same software, testing the same type of number, and using the same algorithm. The only difference is the size of the number.)
____________
My lucky number is 75898524288+1 |
|
|
|
If you don't mind, let's leave it here. It doesn't seem to bother others.
____________
Thanks,
Jim
|
|
|
|
I guessed that the LLR algorithm was the same for all. I didn't know, for instance, if memory requirements grew as a function of the size of the prime candidate or what.
____________
Thanks,
Jim
|
|
|
Michael Goetz Volunteer moderator Project administrator
|
I guessed that the LLR algorithm was the same for all. I didn't know, for instance, if memory requirements grew as a function of the size of the prime candidate or what.
Different types of numbers use different algorithms.
The amount of memory used should be roughly proportional to the size of the number being tested (I think). However, I don't think that's related to your problem. The fact that you had a bunch of tasks fail at once points at something external causing the problem.
There are slight differences in the type of FFT used, but that varies within each project in an unpredictable manner and you shouldn't notice any particular pattern between projects; i.e., that wouldn't be responsible for seeing problems with SoB but not PPSE.
By far, the most likely cause for what you're seeing is simply that the tests are longer in duration.
Assume there's something that causes this error every 24 hours and that PPSE tasks take 10 minutes and SoB tasks take 3 days.
100% of the SoB tasks would have errors. 0.7% of the PPSE tasks would have errors. I suspect that if you run short tests for an extended period of time you'll see that a small percentage of them error out with the same error you're seeing on the long tasks.
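The arithmetic behind those two percentages can be written out explicitly. This is a simplified sketch under stated assumptions: tasks run back to back, the disruptive event is instantaneous and happens exactly once per period, and the function name `fraction_hit` is invented for illustration.

```python
def fraction_hit(task_minutes, event_period_minutes=24 * 60):
    """Fraction of back-to-back tasks whose run time contains a periodic event.

    A task longer than the period is always hit, so the result is capped at 1.
    """
    return min(1.0, task_minutes / event_period_minutes)


if __name__ == "__main__":
    ppse = fraction_hit(10)            # 10-minute PPSE task
    sob = fraction_hit(3 * 24 * 60)    # 3-day SoB task
    print(f"PPSE: {ppse:.1%} of tasks affected")
    print(f"SoB:  {sob:.1%} of tasks affected")
```

With a 10-minute task and a 24-hour period the ratio is 10/1440, which rounds to the 0.7% quoted above, while any task longer than 24 hours is certain to overlap the event.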
____________
My lucky number is 75898524288+1 |
|
|
|
Just to throw this out there, but how about trying the new LLR 3.8.16 that is being tested now?
I do agree with Michael that it seems to be external, since the failures all seem to happen simultaneously. It could be a backup program, a virus scan, or any number of other programs that may schedule themselves to run. It's been a while since I ran a Mac, but I believe there is a way to look at some type of task scheduler or auto-run items. Also check any event logs the OS may keep and see if something shows up at the particular time the issue occurs. If nothing else, find some type of monitoring program to watch and record processes as they run and see if it can catch something.
All of this takes time, determination and patience. If the issue is found and able to be rectified, then it will be worth it.
|
|
|
|
TheDawgz had a similar problem.
We could run PPSE without error, but any larger FFTs would fail 99.9% of the time.
It turned out to be bad memory (it took running memtest for a day or so to find it). We replaced the memory and had no more problems.
____________
There's someone in our head but it's not us. |
|
|
|
Thanks for all of the responses.
Well, ESP LLR works, so that is a good thing.
I also kinda suspect that LLR time duration is the culprit. However, I have run PSP and SoB for a few years and this is the first time for these problems. I have also considered temperature issues (it is a Virginia summer), but I usually cut down the number of CPUs that I use in warm weather and I never run them at more than 50% capacity.
I have a memory tester. I'll run it in the background. I'll also take a look at LLR 3.8.16.
Thanks for all of the inputs, folks. I am impatient but persistent (odd combo) so I will spend more effort tracking this down.
Jim |
|
|
|
I have also turned PSP LLR back on to see if anything has changed under the new operating system. |
|
|
|
I think I knew that, but I surely didn't research it like you did. What now is interesting is that I've just upgraded to the latest version of Mac OS. All of the smaller 7.02 WUs seem to be fine. So what that may imply is that I don't have a 7.02 problem, but I do have a problem with very long 7.02 WUs. I have turned on PSP tasks to see if the OS upgrade helped.
Also, the last three PSP WUs that failed...they failed all at the same time, within a second. Not sure what that implies.
It turns out that a PPS Sieve task also failed at the same time with a signal 4. I run the CUDA version.
So, it is not a 7.02 issue. I'm not sure what it is.
Jim |
|
|
|
It is not a PrimeGrid issue either. I have Einstein@home failures also. I have had no failures on Atlas or VHC, which use vBox, but they don't have the highest priority either. So folks, thanks for all of the suggestions, and I am sorry I led us down this path.
Good prime hunting to all.
____________
Thanks,
Jim
|
|
|
|
I had four tasks fail at the same time last night with signal 4 errors. I have included the log file entries around that time. Much as I hate to do this, I'm going to deselect the SoB and PSP LLR tasks, as I am wasting too many cycles. Hope the log file data helps someone.
Wed Sep 16 22:49:52 2015 | PrimeGrid | Sending scheduler request: To fetch work.
Wed Sep 16 22:49:52 2015 | PrimeGrid | Requesting new tasks for NVIDIA GPU
Wed Sep 16 22:49:54 2015 | PrimeGrid | Scheduler request completed: got 0 new tasks
Thu Sep 17 00:04:58 2015 | PrimeGrid | Computation for task psp_llr_240043093_1 finished
Thu Sep 17 00:04:58 2015 | PrimeGrid | Output file psp_llr_240043093_1_0 for task psp_llr_240043093_1 absent
Thu Sep 17 00:04:59 2015 | PrimeGrid | Computation for task psp_llr_240043164_3 finished
Thu Sep 17 00:04:59 2015 | PrimeGrid | Output file psp_llr_240043164_3_0 for task psp_llr_240043164_3 absent
Thu Sep 17 00:05:00 2015 | PrimeGrid | Computation for task llr_sob_250283264_0 finished
Thu Sep 17 00:05:00 2015 | PrimeGrid | Output file llr_sob_250283264_0_0 for task llr_sob_250283264_0 absent
Thu Sep 17 00:05:01 2015 | PrimeGrid | Computation for task llr_sob_250283131_3 finished
Thu Sep 17 00:05:01 2015 | PrimeGrid | Output file llr_sob_250283131_3_0 for task llr_sob_250283131_3 absent
Thu Sep 17 00:05:07 2015 | | Suspending computation - CPU is busy
Thu Sep 17 00:05:17 2015 | | Resuming computation
Thu Sep 17 00:07:50 2015 | | Project communication failed: attempting access to reference site
Thu Sep 17 00:07:51 2015 | | Internet access OK - project servers may be temporarily down.
Thu Sep 17 00:20:13 2015 | PrimeGrid | Sending scheduler request: To report completed tasks.
Thu Sep 17 00:20:13 2015 | PrimeGrid | Reporting 4 completed tasks
Thu Sep 17 00:20:13 2015 | PrimeGrid | Not requesting tasks: don't need (CPU: job cache full; NVIDIA GPU: job cache full)
Thu Sep 17 00:20:15 2015 | PrimeGrid | Scheduler request completed
____________
Thanks,
Jim
|
|
|
|
The problem went away when I upgraded to the latest version of Mac OS. This is good.
____________
Thanks,
Jim
|
|
|