Message boards :
Seventeen or Bust :
Completed, marked as invalid?
I did a SOB, two wingmen got credit but mine was invalid...
Why is a work unit marked as invalid?
| |
|
|
Your result differed from the result of the two wingmen. Somewhere along the way your computer made an error.
____________
| |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14037 ID: 53948 Credit: 477,382,143 RAC: 302,741
|
99.99% of the time, your answer would be right. This was the 0.01%. Credit has now been granted for this result.
In this particular case, the host returned the correct result, but had some errors occur prior to completing the computation. It's not clear what caused those errors (antivirus programs, file permissions, backup programs, or whatever), but the validator saw the errors and rejected the result because of them. That's what the validator is supposed to do.
Even though this was caused by an error on the client computer, the validator could potentially be made smart enough to recognize that the client returned a valid result along with the error and process the good part of the result. Such an improvement is on our list of things to do, but it's not an easy change and I don't have an ETA for when it will happen.
____________
My lucky number is 75898^524288+1 | |
|
|
For a moment I thought I was going crazy, I checked my tasks again and it was completed and validated.... then I came back to the forum and saw your reply, thanks for that!
I also had a SOB 'error while computing' several weeks ago, what would most likely cause an error?
My last few SOB have been ok and I want to run more, but not if they will error or not validate.
| |
|
Michael Goetz Volunteer moderator Project administrator
|
I couldn't find the SoB from weeks ago, but one of your pending results did have the same problem. It would have been rejected eventually. It's fixed too.
I also went and fixed every result currently on the server that has the same problem. There were 15 total.
Go ahead and crunch whatever you wish and don't worry about this problem. It will be fixed before you return any of the results. (I expect it to be fixed today unless it's a lot more complicated than it looks, as long as I have the time to fix it.)
| |
|
Michael Goetz Volunteer moderator Project administrator
|
This is now fixed.
| |
|
|
Hi Michael.
Can you take a look at my result and the wingmen here?
I'm wondering why I didn't get any credit. I was only one day over the deadline, there is no apparent reason in the log, and one of my wingmen even had constant "no heartbeat" messages.
____________
Life is Science, and Science rules. To the universe and beyond
Proud member of BOINC@Heidelberg
| |
|
Michael Goetz Volunteer moderator Project administrator
|
I'll refer you to Michael Millerick's 99.99% answer in the second post in this thread. Actually, since the 0.01% that his answer didn't cover was corrected back in February, I suppose his answer is more like 100.0% correct now.
The usual answer applies here: your computer malfunctioned, for whatever reason, and did not run the calculation correctly. It returned the wrong result. Heat and/or overclocking is the usual suspect for such errors.
| |
|
|
The usual answer applies here: your computer malfunctioned, for whatever reason, and did not run the calculation correctly. It returned the wrong result. Heat and/or overclocking is the usual suspect for such errors.
Hm, if the result really is incorrect, that is even more confusing to me. My computer (AMD X6) ran that task without any heat problems or overclocking (under a VirtualBox Linux system), and it didn't even have heartbeat problems like the other host did (which is normally a problem; I had such things in the past and, afaik, those tasks were never validated!). I also shut down the virtual system normally every time, and there was no unusual system crash that I can think of.
If there was a problem, why is there nothing in the log about what could have caused the error?
Maybe you guys should think about a better way to track down problems on such long tasks.
The only thing I can think of now is that my VirtualBox system doesn't run correctly with it.
I'll risk it and try another task on my other computer, which has the same virtual system; if that also returns an incorrect result, then I know why.
____________
Life is Science, and Science rules. To the universe and beyond
Proud member of BOINC@Heidelberg
| |
|
Michael Goetz Volunteer moderator Project administrator
|
It wasn't any kind of error that could be detected by the software. Somewhere, during the many days of calculations, your computer made an error in a calculation. This was a hardware error of some sort. It wasn't vbox, it wasn't an abrupt shutdown. The CPU malfunctioned at least once during the many billions of instructions it executed while running that task.
Only you, with the computer right in front of you, are in a position to try and diagnose what went wrong and how to prevent it from happening again in the future.
Two other computers returned matching results; yours returned a different value for the calculation.
Prime95 is used by overclockers as a stress test to verify that their computers can run stably. Our LLR program uses the same code inside as Prime95, and is as much of a torture test as Prime95 is. This one time, your computer didn't run the test successfully.
The best advice I can give you is that this would be a good time to blow the dust out of your computer hardware; if the heatsink and other parts are coated or clogged with dust that will cause problems.
If problems such as this persist and can't be corrected, you can try running sieves on that CPU. They are a lot less stressful and the CPU will run cooler.
That's the best advice anyone can give you.
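To make the comparison of results concrete, here is a minimal sketch of how a quorum check of this kind could work. It is an illustration only, not PrimeGrid's actual validator: the `validate` function, the host names, and the residue strings are all made up for the example.

```python
from collections import Counter


def validate(results, quorum=2):
    """Pick the residue reported by at least `quorum` hosts and grade every host.

    `results` maps a (hypothetical) host name to the residue string it returned.
    Returns (canonical_residue, {host: is_valid}); if no residue reaches the
    quorum, returns (None, {}) because nothing can be decided yet.
    """
    counts = Counter(results.values())
    residue, votes = counts.most_common(1)[0]
    if votes < quorum:
        return None, {}
    return residue, {host: value == residue for host, value in results.items()}


# Made-up residues: two hosts agree, the third differs in the last hex digit.
results = {
    "host_a": "3F2A9C0411D7E6B5",
    "host_b": "3F2A9C0411D7E6B5",
    "host_c": "3F2A9C0411D7E6B4",
}
canonical, verdict = validate(results)
for host, ok in verdict.items():
    print(host, "valid" if ok else "invalid")
```

Note that a check like this can only say which result is the odd one out; it cannot say what went wrong on the host that produced it.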
| |
|
|
The best advice I can give you is that this would be a good time to blow the dust out of your computer hardware; if the heatsink and other parts are coated or clogged with dust that will cause problems.
Well, as I said, it couldn't have been caused by heat and/or overclocking. I don't overclock anyway, and I cleaned my system(s) recently (some days before the task was downloaded and started); it stands in a place where it is relatively cool. I have never had heat problems since I started crunching, so that's a little laugh for me. ;-)
If problems such as this persist and can't be corrected, you can try running sieves on that CPU.
Well, whatever caused the calculation error, this is the first time in a long while that it has happened to me on a PG task; I usually don't have problems with LLRs. Bad luck that it happened on a long SoB... :-(
Anyway, thanks for the help. Maybe I'll post in a few days about what happens with the other task I just started. ;-)
____________
Life is Science, and Science rules. To the universe and beyond
Proud member of BOINC@Heidelberg
| |
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
If your units were marked as invalid, I suggest investing in ECC RAM. Only one bit needs to flip and your entire calculation is gone to /dev/null.
Especially if you are working with virtual machines, always use ECC RAM. ECC RAM can detect single-bit and multi-bit errors and can at least correct single-bit errors. In case of a multi-bit error your CPU will get an NMI and stop the host. This is better than living with creeping data corruption over seconds, minutes, or hours, because VMs jump through all available memory as long as they are running. If one bit is unstable, this will cause errors in all running VMs in the shortest time. It depends on the instruction itself whether you see a BSOD/Oops or simply nothing at all, but a calculation will be wrong.
If you use Intel CPUs this is somewhat tricky because, firstly, Intel supports ECC RAM only for their business CPUs and chipsets and, secondly, Core i5/i7 (desktop, mobile) are officially not supported in Xeon chipsets. AMD is more relaxed: if a CPU supports ECC RAM, the corresponding chipset will be no showstopper...
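As a rough illustration of the detect-and-correct behaviour described above, here is a toy single-error-correcting, double-error-detecting (SECDED) Hamming code over 4 data bits. This is a sketch under simplifying assumptions: real ECC modules typically protect 64-bit words with 8 check bits and do it in hardware, and the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED word (list of 0/1 values).

    Index 0 holds an overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions, with check bits at positions 1, 2 and 4.
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of the whole word even
    return code


def decode(word):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(word)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Exactly one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1                      # simulate one flipped bit in memory
recovered, status = decode(word)

word[2] ^= 1                      # a second flipped bit in the same word
_, status_two_flips = decode(word)
```

In the single-flip case the data comes back intact; with two flips the word is only flagged as bad, which corresponds to the situation where the host is stopped rather than allowed to keep computing with corrupted data.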
____________
Best wishes. Knowledge is power. by jjwhalen
| |
|
Michael Goetz Volunteer moderator Project administrator
|
If you've eliminated the usual suspects, then the only thing to do is keep running and see how frequently the errors occur, if at all. There are completely unexplainable errors. Background radiation can flip a bit anywhere in the system and cause an error like this. That's not the only possibility, of course, but it's a good example of an error that's not indicative of a faulty component. Still, it's especially painful when it happens during such a long computation.
If that's the only error you've had, and it doesn't repeat itself, I wouldn't worry too much.
Ron's suggestion about ECC RAM is a possibility, but it's expensive, and while it will protect you against memory errors, it doesn't protect you against errors in the CPU or motherboard. (There was a time, long ago, when all IBM PCs used parity memory, while Apple used cheaper non-parity memory. Alas, that's one trend that Apple set that definitely took the industry in the wrong direction, in my opinion.)
You might want to crunch shorter LLR tasks on that computer for a while to see if there are more errors.
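A small sketch of why a single flipped bit is enough to spoil a multi-day result: in a long chain of modular squarings, every step depends on the previous one, so one wrong bit changes everything after it. The modulus, starting value, and step count below are arbitrary stand-ins chosen for the demo, not a real SoB candidate and not the actual LLR code.

```python
N = 2**127 - 1  # arbitrary demo modulus (assumption), far smaller than a real candidate


def final_residue(steps, flip_step=None, flip_bit=0):
    """Square repeatedly modulo N, optionally flipping one bit after one step."""
    x = 3
    for step in range(steps):
        x = (x * x) % N
        if step == flip_step:
            x ^= 1 << flip_bit  # simulate a single transient hardware error
    return x % N


clean = final_residue(1000)
faulty = final_residue(1000, flip_step=500, flip_bit=3)

print(f"clean : {clean:032x}")
print(f"faulty: {faulty:032x}")
print("residues match" if clean == faulty else "residues differ")
```

Nothing in the faulty run crashes or logs an error; the only symptom is a final residue that does not match the one returned by the other hosts.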
| |
|
|
Perhaps you can write the residue into the stderr output?! So every user can compare his result with the other results.
____________
DeleteNull | |
|
Michael Goetz Volunteer moderator Project administrator
|
It's intentionally not in there. If you think about it, it should be obvious why.
| |
|
|
Maybe I'll post in a few days about what happens with the other task I just started. ;-)
OK, to conclude: the task I mentioned earlier finished and was validated successfully.
So at least I can rule out that the first task, the one I started this discussion with, failed because of VirtualBox, which is what I feared at first. ;-)
It probably really was simply a calculation error by the processor. Even computers are only human then, LOL.
I'm a little sad that I started the task before the extra credits were introduced, obviously it didn't get them. Was probably the last time for a while that I ever did a SoB... ;-)
____________
Life is Science, and Science rules. To the universe and beyond
Proud member of BOINC@Heidelberg
| |
|
Michael Goetz Volunteer moderator Project administrator
|
I'm a little sad that I started the task before the extra credits were introduced, obviously it didn't get them. Was probably the last time for a while that I ever did a SoB... ;-)
No reason to be sad.
Your timing was perfect. All workunits initially validated since the change get the credit bonuses. It's when the initial validation occurs, not when the task started. That task received the bonuses.
| |
|
|
That task received the bonuses.
Really? It didn't seem so to me at first sight, but when I calculate it, you're right. Cool. :-)
If the credit bonus on the SoBs is now 65%, as mentioned in the other thread, the base credit would be about 12,813. That's not quite what I reached before with two or three other tasks on Windows, but this is my first (and probably last) task on Linux on the same computer. The VirtualBox Linux mostly slows things down a little while running, but in the end I'm quite satisfied with that result.
____________
Life is Science, and Science rules. To the universe and beyond
Proud member of BOINC@Heidelberg
| |
|
|
The conclusion: no matter how many credits you received, the important thing is the feeling that you got a bonus! ;-)
____________
| |
|
|
Unfortunately that WU is being needlessly crunched by a third computer :-( I've been in that boat before myself, and I understand why it happens. Oh well...
But congrats on finishing a valid SoB WU! It's a long haul...
--Gary | |
|