Message boards :
Seventeen or Bust :
"verification failed", again
Author |
Message |
|
23-Oct-2014 21:53:14 [SZTAKI Desktop Grid] Computation for task 9d7a4224-bb89-4c99-b93f-16e20e2179a3_1ddc098a-4334-417c-9004-d4079f463b62_1d8d80fd-1c0d-4189-99a1-a62f2d38bf6b_2 finished
23-Oct-2014 21:53:14 [PrimeGrid] [error] Signature verification failed for primegrid_llr_wrapper_6.24_x86_64-pc-linux-gnu
23-Oct-2014 21:53:15 [PrimeGrid] Computation for task llr_sob_222371062_0 finished
23-Oct-2014 21:53:15 [PrimeGrid] Output file llr_sob_222371062_0_0 for task llr_sob_222371062_0 absent
23-Oct-2014 21:53:15 [PrimeGrid] Computation for task llr_sob_222370007_3 finished
23-Oct-2014 21:53:15 [PrimeGrid] Output file llr_sob_222370007_3_0 for task llr_sob_222370007_3 absent
23-Oct-2014 21:53:15 [PrimeGrid] Computation for task llr_sob_222371608_0 finished
23-Oct-2014 21:53:15 [PrimeGrid] Output file llr_sob_222371608_0_0 for task llr_sob_222371608_0 absent
(Times are UTC.) I've skimmed through stdoutdae.txt and I believe this is what happened: there were two PB WUs running, and at 21:46:53 SZTAKI pushed some unknown WU (probably one of PG's) out of the queue. At 21:53:14 SZTAKI finished, then "verification failed" promptly aborted all SOB WUs at once (the wrapper was re-downloaded ~3.5 hours later). Which debugging options should I enable to make verification more verbose?
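A timeline like the one above can be pulled out of stdoutdae.txt mechanically rather than by eye. The following is only a minimal sketch: it assumes every line has the `date time [project] message` shape seen in the excerpt, and the names `parse_line` and `summarize` are made up for illustration.

```python
import re

# Matches lines shaped like the excerpt above, e.g.
# 23-Oct-2014 21:53:14 [PrimeGrid] [error] Signature verification failed for ...
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] (?P<message>.*)$"
)
# Matches the "Output file ... for task ... absent" messages.
ABSENT_RE = re.compile(r"Output file \S+ for task (\S+) absent")


def parse_line(line):
    """Return (timestamp, project, message), or None if the line has another shape."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m.group("stamp"), m.group("project"), m.group("message")


def summarize(lines):
    """Collect verification failures and the tasks whose output file went missing."""
    errors, absent = [], []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, project, message = parsed
        if "verification failed" in message:
            errors.append((stamp, project, message))
        hit = ABSENT_RE.search(message)
        if hit:
            absent.append(hit.group(1))
    return {"errors": errors, "absent": absent}


SAMPLE = """\
23-Oct-2014 21:53:14 [PrimeGrid] [error] Signature verification failed for primegrid_llr_wrapper_6.24_x86_64-pc-linux-gnu
23-Oct-2014 21:53:15 [PrimeGrid] Computation for task llr_sob_222371062_0 finished
23-Oct-2014 21:53:15 [PrimeGrid] Output file llr_sob_222371062_0_0 for task llr_sob_222371062_0 absent
"""

if __name__ == "__main__":
    report = summarize(SAMPLE.splitlines())
    for stamp, project, message in report["errors"]:
        print(stamp, project, message)
    print("tasks with missing output:", report["absent"])
```

Timestamps are kept as plain strings on purpose, so the sketch does not depend on the locale used to parse month names.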
____________
I'm counting for science,
Points just make me sick. | |
|
|
I don't know which flag might help here. Maybe Mike knows. But here are all the options you can use in your cc_config.xml file:
<cc_config>
  <log_flags>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    <task>1</task>
    <android_debug>0</android_debug>
    <app_msg_receive>0</app_msg_receive>
    <app_msg_send>0</app_msg_send>
    <async_file_debug>0</async_file_debug>
    <benchmark_debug>0</benchmark_debug>
    <checkpoint_debug>0</checkpoint_debug>
    <coproc_debug>0</coproc_debug>
    <cpu_sched>0</cpu_sched>
    <cpu_sched_debug>0</cpu_sched_debug>
    <cpu_sched_status>0</cpu_sched_status>
    <dcf_debug>0</dcf_debug>
    <disk_usage_debug>0</disk_usage_debug>
    <file_xfer_debug>1</file_xfer_debug>
    <gui_rpc_debug>0</gui_rpc_debug>
    <heartbeat_debug>0</heartbeat_debug>
    <http_debug>0</http_debug>
    <http_xfer_debug>0</http_xfer_debug>
    <mem_usage_debug>0</mem_usage_debug>
    <network_status_debug>0</network_status_debug>
    <notice_debug>0</notice_debug>
    <poll_debug>0</poll_debug>
    <priority_debug>0</priority_debug>
    <proxy_debug>0</proxy_debug>
    <rr_simulation>0</rr_simulation>
    <rrsim_detail>0</rrsim_detail>
    <sched_op_debug>0</sched_op_debug>
    <scrsave_debug>0</scrsave_debug>
    <slot_debug>0</slot_debug>
    <state_debug>0</state_debug>
    <statefile_debug>0</statefile_debug>
    <suspend_debug>0</suspend_debug>
    <task_debug>0</task_debug>
    <time_debug>0</time_debug>
    <trickle_debug>0</trickle_debug>
    <unparsed_xml>0</unparsed_xml>
    <work_fetch_debug>0</work_fetch_debug>
  </log_flags>
  <options>
    <abort_jobs_on_exit>0</abort_jobs_on_exit>
    <allow_multiple_clients>0</allow_multiple_clients>
    <allow_remote_gui_rpc>0</allow_remote_gui_rpc>
    <client_new_version_text></client_new_version_text>
    <client_version_check_url>http://boinc.berkeley.edu/download.php?xml=1</client_version_check_url>
    <client_download_url>http://boinc.berkeley.edu/download.php</client_download_url>
    <disallow_attach>0</disallow_attach>
    <dont_check_file_sizes>0</dont_check_file_sizes>
    <dont_contact_ref_site>0</dont_contact_ref_site>
    <exit_after_finish>0</exit_after_finish>
    <exit_before_start>0</exit_before_start>
    <exit_when_idle>0</exit_when_idle>
    <fetch_minimal_work>0</fetch_minimal_work>
    <fetch_on_update>0</fetch_on_update>
    <force_auth>default</force_auth>
    <http_1_0>0</http_1_0>
    <http_transfer_timeout>300</http_transfer_timeout>
    <http_transfer_timeout_bps>10</http_transfer_timeout_bps>
    <max_event_log_lines>200</max_event_log_lines>
    <max_file_xfers>12</max_file_xfers>
    <max_file_xfers_per_project>8</max_file_xfers_per_project>
    <max_stderr_file_size>0</max_stderr_file_size>
    <max_stdout_file_size>0</max_stdout_file_size>
    <max_tasks_reported>0</max_tasks_reported>
    <ncpus>4</ncpus>
    <network_test_url>http://www.google.com/</network_test_url>
    <no_alt_platform>0</no_alt_platform>
    <no_gpus>0</no_gpus>
    <no_info_fetch>0</no_info_fetch>
    <no_priority_change>0</no_priority_change>
    <os_random_only>0</os_random_only>
    <proxy_info>
      <socks_server_name></socks_server_name>
      <socks_server_port>80</socks_server_port>
      <http_server_name></http_server_name>
      <http_server_port>80</http_server_port>
      <socks5_user_name></socks5_user_name>
      <socks5_user_passwd></socks5_user_passwd>
      <http_user_name></http_user_name>
      <http_user_passwd></http_user_passwd>
      <no_proxy></no_proxy>
    </proxy_info>
    <rec_half_life_days>10.000000</rec_half_life_days>
    <report_results_immediately>1</report_results_immediately>
    <run_apps_manually>0</run_apps_manually>
    <save_stats_days>30</save_stats_days>
    <skip_cpu_benchmarks>1</skip_cpu_benchmarks>
    <simple_gui_only>0</simple_gui_only>
    <start_delay>0</start_delay>
    <stderr_head>1</stderr_head>
    <suppress_net_info>0</suppress_net_info>
    <unsigned_apps_ok>0</unsigned_apps_ok>
    <use_all_gpus>1</use_all_gpus>
    <use_certs>0</use_certs>
    <use_certs_only>0</use_certs_only>
    <vbox_window>0</vbox_window>
  </options>
</cc_config>
For the on/off entries, change the 0 to 1 to enable an option and back to 0 to disable it.
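Flipping a log flag by hand is easy to get wrong in a file this long, so here is a small sketch that does it programmatically. It is an illustration only: `set_log_flag` is a made-up helper, and `file_xfer_debug` is used purely as an example from the list above. Nothing in this thread establishes which flag, if any, makes signature verification more verbose.

```python
import xml.etree.ElementTree as ET


def set_log_flag(xml_text, flag, enabled):
    """Return cc_config XML text with <log_flags>/<flag> set to 1 or 0.

    The <log_flags> section and the flag element are created if missing.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")
    log_flags = root.find("log_flags")
    if log_flags is None:
        log_flags = ET.SubElement(root, "log_flags")
    node = log_flags.find(flag)
    if node is None:
        node = ET.SubElement(log_flags, flag)
    node.text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    original = (
        "<cc_config><log_flags><task>1</task></log_flags>"
        "<options></options></cc_config>"
    )
    # Example only: enable one of the flags listed above.
    print(set_log_flag(original, "file_xfer_debug", True))
```

The client only picks up the change after it re-reads the file, either on restart or, as far as I know, via `boinccmd --read_cc_config`.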
____________
Largest Primes to Date:
As Double Checker: SR5 109208*5^1816285+1 Dgts-1,269,534
As Initial Finder: SR5 243944*5^1258576-1 Dgts-879,713
| |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 483,897,845 RAC: 622,373
|
Sounds like something is messing with the files in your BOINC folder.
I recommend setting "no new tasks", waiting until everything is completed, then uninstalling BOINC, deleting the entire C:\ProgramData\BOINC directory, and reinstalling BOINC.
____________
My lucky number is 75898524288+1 | |
|
|
I've never seen "the f-word" and "directory" being used in one message. Hint: it's x86_64-pc-linux-gnu down here.
Anyway, you might be correct -- it could be an alpha particle after all. Or worse -- it could be GNOME People messing with my kernel. I think the only thing I can do is mirror the downloads and, when a mismatch happens again (if ever), compare them with the re-downloads. I doubt there would be any difference.
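The mirror-and-compare idea could look roughly like this. It is a sketch under assumptions: the two directories (a mirror of the original downloads and a directory of fresh re-downloads) and the names `sha256_of` and `compare_dirs` are hypothetical, and a byte-for-byte comparison is not the same check as the client's own signature verification. It would only show whether the file on disk changed.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large application files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_dirs(mirror_dir, fresh_dir):
    """Return names of files present in both directories whose contents differ."""
    mirror_dir, fresh_dir = Path(mirror_dir), Path(fresh_dir)
    differing = []
    for mirrored in sorted(mirror_dir.iterdir()):
        fresh = fresh_dir / mirrored.name
        if mirrored.is_file() and fresh.is_file():
            if sha256_of(mirrored) != sha256_of(fresh):
                differing.append(mirrored.name)
    return differing


if __name__ == "__main__":
    import sys

    # Usage: python compare.py <mirror_dir> <fresh_dir>
    for name in compare_dirs(sys.argv[1], sys.argv[2]):
        print("differs:", name)
```

An empty result would be consistent with the expectation that the files are identical, which would point away from a bad download and toward something on the host.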
____________
I'm counting for science,
Points just make me sick. | |
|
|
Anyway, tonight one of the DIMMs wore out (it was painful). I've got the system back online, but the incident left a quirk. I can still see in the terminal that at 09:41 UTC it was ~8.78 days in and ~39.54% done. Then at about 10:00 UTC (probably immediately) it was back at 0%. I searched through stdoutdae.txt -- nothing. Contents of lresults.txt:
Iter: 10956288/27510481, ERROR: ROUND OFF (0.5) > 0.4
Continuing from last save file.
Iter: 10956544/27510481, ERROR: ROUND OFF (0.5) > 0.4
Continuing from last save file.
Iter: 10956288/27510481, ERROR: ROUND OFF (0.5) > 0.4
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iter: 10956288/27510481, ERROR: ROUND OFF (0.5) > 0.4
Continuing from last save file.
Unrecoverable error, Restarting with next larger FFT length...
Iter: 1/27510481, ERROR: ROUND OFF (0.5) > 0.4
Continuing from last save file.
Unrecoverable error, Restarting with next larger FFT length...
My reconstruction is that this happened exactly when the DIMM failed on me (surrounded by the usual restarts):
wrapper: running primegrid_llr -d
FFT length: 2560K
04:49:44 (23471): No heartbeat from core client for 30 sec - exiting
BOINC llr wrapper
Using Jean Penne's llr
Should I assume the worst and abort?
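For anyone wanting to size up an lresults.txt like the one above before deciding, here is a rough sketch that counts round-off errors per iteration and the number of FFT-length restarts. It assumes the exact message wording shown in the excerpt, and `summarize_roundoff` is a made-up name. It reports counts only and makes no judgement about whether the result is salvageable.

```python
import re
from collections import Counter

# Matches e.g. "Iter: 10956288/27510481, ERROR: ROUND OFF (0.5) > 0.4"
ROUNDOFF_RE = re.compile(
    r"Iter: (?P<iter>\d+)/(?P<total>\d+), "
    r"ERROR: ROUND OFF \((?P<err>[\d.]+)\) > (?P<limit>[\d.]+)"
)


def summarize_roundoff(lines):
    """Return (errors per iteration, number of FFT-length restarts)."""
    hits = Counter()
    fft_restarts = 0
    for line in lines:
        m = ROUNDOFF_RE.search(line)
        if m:
            hits[int(m.group("iter"))] += 1
        elif "Restarting with next larger FFT length" in line:
            fft_restarts += 1
    return hits, fft_restarts


SAMPLE = """\
Iter: 10956288/27510481, ERROR: ROUND OFF (0.5) > 0.4
Continuing from last save file.
Iter: 10956544/27510481, ERROR: ROUND OFF (0.5) > 0.4
Continuing from last save file.
Iter: 10956288/27510481, ERROR: ROUND OFF (0.5) > 0.4
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iter: 10956288/27510481, ERROR: ROUND OFF (0.5) > 0.4
Continuing from last save file.
Unrecoverable error, Restarting with next larger FFT length...
Iter: 1/27510481, ERROR: ROUND OFF (0.5) > 0.4
Continuing from last save file.
Unrecoverable error, Restarting with next larger FFT length...
"""

if __name__ == "__main__":
    hits, restarts = summarize_roundoff(SAMPLE.splitlines())
    for iteration, count in sorted(hits.items()):
        print(f"iteration {iteration}: {count} round-off error(s)")
    print("FFT-length restarts:", restarts)
```

A round-off error at iteration 1 after a restart, as in the last lines of the excerpt, is the kind of thing this makes easy to spot.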
____________
I'm counting for science,
Points just make me sick. | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 483,897,845 RAC: 622,373
|
Should I assume the worst and abort?
Yes. Given all the errors and the history of validation errors, there's a good possibility that the calculation is irrevocably corrupted.
____________
My lucky number is 75898524288+1 | |
|
|
Guess what? I ought to abort that one too. However, one of the three made it through.
____________
I'm counting for science,
Points just make me sick. | |
|