aMule Forum
English => en_Bugs => Topic started by: stoatwblr on August 07, 2011, 03:46:46 AM
-
SVN 10594
If aMule is killed for any reason, known.met (or known2_64.met) seems to lose a lot of the AICH/MD4 entries it previously had (but usually not all)...
This results in a lot of unnecessary rehashing. Not good if there are a lot of files on the shared side of things.
-
Funny that, if you throw your hard drive out of a window, it loses them all.
-
That's why we should store that kind of stuff in a database instead of a met.
-
That's why we should store that kind of stuff in a database instead of a met.
Alternatively, create new mets and copy them in
Even just shutting down amule seems to result in a few entries being trashed.
Currently known2 is 733Mb while known is about 200k
-
That's why we should store that kind of stuff in a database instead of a met.
You do realize that the .met format is a database?
-
Kind of. Technically any file with data in it can be considered to be a "database". I had something more robust and sophisticated in mind, like SQLite. With an index. So you don't have to go sequentially through the whole thing every time in CAICHHashSet::SaveHashSet(). And transactions.
And I don't like that CAICHSyncTask::Entry() opens the met for read/write. It should read only. Writing is only for
a) creating the header if it's empty (useless, already happens in CAICHHashSet::SaveHashSet())
b) truncating on error - I'd rather close and reopen it in that special case!
Maybe this is causing the problem?
Anyway, if known2 is 733Mb it's gone way beyond what it was intended for. I'd delete it and start all over from scratch. Let it rehash everything shared for once and be done.
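Just to illustrate what I mean - a minimal sketch of an indexed, transactional store for hash sets using SQLite. The table layout and function names here are made up for the example; this is not anything aMule actually has:

#include <sqlite3.h>

// Illustrative only: one row per known file, keyed by its MD4 hash, with the
// AICH hash set stored as a blob. The primary key gives us an index, and the
// transaction means a crash mid-write leaves the old row intact.
static int store_hashset(sqlite3 *db, const void *key, int keyLen,
                         const void *blob, int blobLen)
{
    sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
    sqlite3_stmt *stmt = NULL;
    int rc = sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO hashsets(filehash, aich) VALUES(?, ?)",
        -1, &stmt, NULL);
    if (rc == SQLITE_OK) {
        sqlite3_bind_blob(stmt, 1, key, keyLen, SQLITE_STATIC);
        sqlite3_bind_blob(stmt, 2, blob, blobLen, SQLITE_STATIC);
        rc = (sqlite3_step(stmt) == SQLITE_DONE) ? SQLITE_OK : SQLITE_ERROR;
        sqlite3_finalize(stmt);
    }
    sqlite3_exec(db, rc == SQLITE_OK ? "COMMIT" : "ROLLBACK", NULL, NULL, NULL);
    return rc;
}

int main()
{
    sqlite3 *db = NULL;
    if (sqlite3_open("hashsets.db", &db) != SQLITE_OK) return 1;
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS hashsets("
                     "filehash BLOB PRIMARY KEY, aich BLOB)", NULL, NULL, NULL);
    const unsigned char md4[16] = {0};          // placeholder file hash
    const unsigned char aich[4] = {1, 2, 3, 4}; // placeholder AICH blob
    int rc = store_hashset(db, md4, sizeof(md4), aich, sizeof(aich));
    sqlite3_close(db);
    return rc == SQLITE_OK ? 0 : 1;
}

Updating one file's entry then touches a single row instead of rewriting the whole met.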
-
Anyway, if known2 is 733Mb it's gone way beyond what it was intended for. I'd delete it and start all over from scratch. Let it rehash everything shared for once and be done.
I suspected you might suggest that.
How big was it intended to grow?
There are only a couple hundred files exported at the moment - I've cut things right back but the exports keep getting trashed.
-
Even just shutting down amule seems to result in a few entries being trashed.
Specifically: shutting down amuled (either with a HUP or a shutdown command via amulecmd) corrupts the hashes of files that were downloading at the instant of the shutdown, so they have to be rehashed before downloads can resume.
On the shared-files hashing side, I noticed that known2's mtime seems to be updated only every 60 minutes. It looks like the file needs a cleaner close at shutdown - and, as Stu suggested, it should only be opened read/write when absolutely needed.
-
I don't think rehashing of downloading files is related to known.met at all.
Look at the logfile after shutdown. Is there "aMule shutdown completed." at the end?
What about the timestamps? The .part.met files must all be the same age as or newer than the .part files.
-
Look at the logfile after shutdown. Is there "aMule shutdown completed." at the end?
Yes.
What about the timestamps? The .part.met files must all be the same age as or newer than the .part files.
I'll have to check this the next time I shut down.
-
What about the timestamps? The .part.met files must all be the same age as or newer than the .part files.
Yes, they all are.
-
FWIW, after a few days it's back up to ~260Mb and the same problem is back.
-
known2 eventually grew to around 550MB, then amuled sat there spinning its wheels at 100% CPU for several hours, with known.met growing to 5-7Mb then shrinking back to zero.
Everything exited apparently cleanly on a HUP; however, when restarting, the following message was logged:
$ grep -i known logfile
!2011-08-27 18:24:31: WARNING: Known file list corrupted, contains invalid header.
2011-08-27 18:24:31: Creditfile loaded, 7986 clients are known
2011-08-27 18:26:41: Found 384 known shared files, 18066 unknown
Amuled is again sitting at 100% with known.met cycling in size while known2 stays static.
Yes, I really do have that many files on share (distros and suchlike).
As Stu says, it appears the met format can't cope with stuff this big. Perhaps some sort of run-time or compile-time alternative, such as a MySQL interface, needs to be considered.
Opinions?
-
After a little more digging...
It may not be changing known2, but it _IS_ recalculating all the hashes on shared files.
Is this expected behaviour if it thinks the header is corrupt?
-
You know, sharing 18000 files makes no sense. But what happens here shouldn't happen of course.
I just shared two wxWidgets folders with 18000 files. Yes, the behavior is silly. Hashing is supposed to be a background task, but every 30 files known.met is rewritten. Completely. That is why its size appears to grow and shrink. And the writing is done in the foreground, which makes the whole hashing-in-background pointless. The GUI is completely blocked.
10601 fixes the problem for me (by updating known.met only after 300Mb and not after 30 files anymore). Please try it.
(And I see I'm tired - I also slipped in my patch to open known2_64.met in read only mode for reading, which was supposed to be an extra revision. :-[ )
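For the curious, the sort of trigger this uses can be pictured roughly like this - an illustrative sketch only, with invented class and member names, not the actual aMule code:

#include <cstdint>

// Sketch: count the bytes hashed since the last save and only rewrite
// known.met once a threshold is crossed, instead of after every N files.
class SaveTriggerSketch {
public:
    explicit SaveTriggerSketch(uint64_t thresholdBytes)
        : m_threshold(thresholdBytes), m_hashedSinceSave(0) {}

    // Call after each file has been hashed; returns true when a save is due.
    bool OnFileHashed(uint64_t fileSize) {
        m_hashedSinceSave += fileSize;
        if (m_hashedSinceSave >= m_threshold) {
            m_hashedSinceSave = 0;   // reset once we decide to save
            return true;
        }
        return false;
    }

private:
    uint64_t m_threshold;
    uint64_t m_hashedSinceSave;
};

// e.g. SaveTriggerSketch trigger(300ULL * 1024 * 1024);  // ~300 MB hashed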
-
I'll give it a try when it shows up on the SVN tarball page :)
-
http://code.google.com/p/amule/source/checkout
-
Ta.
The switch to backgrounding should solve another issue I complained about a long time ago (slow startup with lots of shares enabled)
-
10601 fixes the problem for me (by updating known.met only after 300Mb and not after 30 files anymore). Please try it.
Known.met now seems to be growing to about 500kB before being truncated.
-
So far so good, thanks.
One other thing I noticed but forgot to mention.
If amule(d) is told to shut down, it insists on finishing the entire rehashing list first, instead of just finishing the file it's working on, closing off known.met cleanly, and resuming where it left off when next started.
If there are a lot of files left to rehash, that can result in an apparent "runaway" state.
-
The switch to backgrounding should solve another issue I complained about a long time ago (slow startup with lots of shares enabled)
No. Shared files are still searched in the foreground. aMule takes a good minute for this for me with 18000 shares.
There is a performance boost in 10602, but apparently only on Windows - not on Ubuntu, though maybe on other platforms too.
Known.met now seems to be growing to about 500kB before being truncated.
So far so good, thanks.
Huh? known.met should never be truncated. It can look that way, however, if you check its size while it's being rewritten.
-
That's what I'm doing.
Anyway... after applying the update, same problem (known.met being reported as corrupted and scanning starting over from scratch).
I've cross-checked by dropping the shares down to ~1800 files, waiting until it finished scanning them, and then issuing a shutdown command via amulecmd.
Amuled's been sitting there for about 45 mins so far eating 100% CPU. I'm pretty sure that if I HUP it, known.met will be reported as corrupt again.
Yes, it's Ubuntu (Natty) and I usually run amuled, not amule.
One other thing I spotted - while hashes are being generated, download speed drops to virtually zero, even if there was pretty good throughput before the hashing started. Uploads don't seem to be affected as much.
-
Amuled finally exited 2 hours after being given the shutdown command - during that period no files in the share area were opened, but it was using 100+% CPU and constantly writing to known.met.
Is it possible that the known.met file is being run through a second time after hash generation for some reason?
-
Please make a debug build of amuled and activate Verbose Debug-Logging for
General
Hasher
AICH-Hasher
KnownFileList
ThreadScheduler
That's in amule.conf:
[eMule]
VerboseDebug=1
[Debug]
Cat_General=1
Cat_Hasher=1
Cat_AICH-Hasher=1
Cat_KnownFileList=1
Cat_ThreadScheduler=1
See if you find something useful in the log. You can also mail the log to me if you trust me with your shares. Don't post it here, however.
-
Not much being logged at all :( It seems that logging stops after a while.
Still getting "Known file list corrupted, contains invalid header."
How does amule attempt to recover from this? known2_64.met hasn't been touched in several days but it still seems to want to rehash every single file in known.met
-
known2_64.met hasn't been touched in several days but it still seems to want to rehash every single file in known.met
That's because known.met contains the eD2k, known2_64.met the AICH hashsets for the known files. Neither can be recovered from the other.
-
There must be a better way of doing this. What about using SQLite or MySQL, or some form of atomic updates?
-
Even as a starting point, doing the rehashing in a new file and then renaming it into place would solve a lot of issues.
i.e.:
cp known*.met known*.met.new
rehash the *.new files
cp known*.met known*.met.old
mv known*.met.new known*.met
These files are read-only for the most part during normal operation.
This would be a pig on FAT32 systems but it's fairly painless on *nixen and NTFS and it gets around any issues caused by system crashes or SIGKILL during writes.
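In code, the write-then-rename idea looks roughly like this (a generic sketch with made-up names, not aMule's actual I/O code):

#include <cstdio>
#include <fstream>
#include <string>

// Write the new contents to a temporary name, then move it over the old file.
// A crash or SIGKILL mid-write can only ever trash the .new file; the last
// complete known.met stays in place.
static bool save_safely(const std::string &path, const std::string &contents)
{
    const std::string tmp = path + ".new";
    {
        std::ofstream out(tmp.c_str(), std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out << contents;
        if (!out.flush()) return false;
    }   // file closed here
    // rename() replaces the target atomically on POSIX filesystems; on
    // Windows/FAT32 the old file may have to be removed first.
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}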
-
Rehashing is done in memory. The met files are then written in a single step.
-
I know, but the fact remains that there is a race condition.
Better to write the file out and then rename it into position.
I had similar problems with sendmail db files with a few hundred domains to handle. In that instance it resulted in spam filters failing for a few seconds or mail not being delivered properly - only a few seconds, but enough to cause problems.
-
I know, but the fact remains that there is a race condition.
Where should this race condition be?
-
Try 10613, it improves amulegui's performance when too many files are shared.
-
Or rather 10616, which no longer perpetually rehashes files that are duplicated more than twice, in case that is what you are doing...
(It will still rehash them once, but then finally add them to duplicates correctly and be done.)
-
I know, but the fact remains that there is a race condition.
Where should this race condition be?
The race condition exists while known or known2 are being written out - on my system known is taking about 15 seconds to be fully populated.
By writing the files out and then rotating them into place afterwards, the window for system events or SIGKILLs leaving an incomplete known file is _much_ shorter.
For whatever reason known2 seems to be much more robust.
I know you say that known shouldn't be being truncated, but something is causing the entire file to be rewritten in place. As long as this occurs there's a possibility that it can be corrupted with no valid copy on the system.
-
Or rather 10616, which no longer perpetually rehashes files that are duplicated more than twice, in case that is what you are doing...
(It will still rehash them once, but then finally add them to duplicates correctly and be done.)
There's only one copy of the files onboard.
Amuled certainly seems to be hashing every file twice for some reason - but on the second pass lsof shows none of the shared files are being touched, just known.met continually being rewritten.
-
Or rather 10616...
It starts quickly, then begins logging the following error over and over again. amulegui and amulecmd can't connect.
"total jobs = 100, too many jobs" (about 10 at a time, every 10 seconds - thousands so far)
Google searches suggest it's coming out of libupnp.
-
Google searches suggest it's coming out of libupnp.
See http://comments.gmane.org/gmane.linux.upnp-sdk.general/271 - there was no followup on this though.
I've been seeing the "too many jobs" error message a bit recently, but usually only after about 12-24 hours.
The files I'm currently sharing are 150-300Mb apiece. One of the end targets is to use this method for scientific data distribution - long-term that'd be a few million files at around 200Mb each (planetary albedo data for climate modelling)...
-
ok, 10616:
2011-09-19 19:35:17: amule.cpp(1281): KnownFileList: Failsafe for crash on file hashing creation
2011-09-19 19:35:17: KnownFileList.cpp(131): KnownFileList: start saving known.met
2011-09-19 19:36:45: KnownFileList.cpp(165): KnownFileList: finished saving known.met
2011-09-19 19:36:45: amule.cpp(1275): KnownFileList: Safe adding file to sharedlist: X
2011-09-19 19:36:45: amule.cpp(1281): KnownFileList: Failsafe for crash on file hashing creation
2011-09-19 19:36:45: KnownFileList.cpp(131): KnownFileList: start saving known.met
2011-09-19 19:38:15: KnownFileList.cpp(165): KnownFileList: finished saving known.met
2011-09-19 19:38:15: KnownFileList.cpp(131): KnownFileList: start saving known.met
2011-09-19 19:39:41: KnownFileList.cpp(165): KnownFileList: finished saving known.met
2011-09-19 19:39:42: amule.cpp(1275): KnownFileList: Safe adding file to sharedlist: Y
2011-09-19 19:39:42: amule.cpp(1281): KnownFileList: Failsafe for crash on file hashing creation
2011-09-19 19:39:42: KnownFileList.cpp(131): KnownFileList: start saving known.met
2011-09-19 19:41:09: KnownFileList.cpp(165): KnownFileList: finished saving known.met
Note the amount of time spent saving each known.met - I've benchmarked the disk system at something in excess of 75Mb/s (11 drive ZFS RAIDZ2 with SSD read and write cache), so taking this long to write out ~9Mb is a bit odd.
This long write period is why I suggest writing out, then renaming into place.
-
One thing which has shown up with the extra logging is that what's being written out to known.met is a _long_ way behind what's being opened by amuled (shown with lsof).
So it's not that things are being rehashed twice, but that saving to known.met goes on long after the files have been closed. Why is it lagging so far?
-
taking this long to write out ~9Mb is a bit odd
QUITE a bit, yes. My known.met is ~5MB and took 4s to save, with optimize off.
Are you building with optimize on btw? If not you should try that. You can build with debug AND optimize at the same time.
What's probably happening:
- background hashing finishes, posts event
- foreground event queue processes event and starts writing known.met
- background hashing finishes, posts event
- background hashing finishes, posts event
- background hashing finishes, posts event
- foreground event queue finishes writing known.met
- foreground event queue does other important things
- background hashing finishes, posts event
- foreground event queue processes event and starts writing known.met
...
So events stack up and up, because background hashes faster than foreground can process.
Oh, and each foreground event will block (because knownFileListMutex is locked), and so spawn a new event handler instance. :o
The core of the problem is - why does writing take so long? Is your system thrashing (continuously swapping aMule's memory in and out) ?
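If it helps to picture the pile-up, here's a small generic demo (plain C++ threads, nothing aMule-specific, all names invented) of a producer posting work faster than a single consumer can handle it, so the backlog keeps growing while the producer runs:

#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

int main()
{
    std::queue<int> events;          // stands in for the foreground event queue
    std::mutex m;
    std::condition_variable cv;

    // "Background hasher": finishes a file every 100 ms and posts an event.
    std::thread producer([&] {
        for (int i = 0; i < 50; ++i) {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            std::lock_guard<std::mutex> lock(m);
            events.push(i);
            cv.notify_one();
        }
    });

    // "Foreground": handling one event (the known.met save) takes 300 ms,
    // so while the producer is running the backlog grows steadily.
    std::thread consumer([&] {
        for (int handled = 0; handled < 50; ++handled) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !events.empty(); });
            events.pop();
            std::printf("handled %d, backlog now %zu\n", handled, events.size());
            lock.unlock();
            std::this_thread::sleep_for(std::chrono::milliseconds(300));
        }
    });

    producer.join();
    consumer.join();
    return 0;
}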
-
The core of the problem is - why does writing take so long? Is your system thrashing (continuously swapping aMule's memory in and out) ?
Not at all - there's no swapping happening, and right now there's 1500Mb free (8Gb system). Other processes are writing very quickly.
On the CPU front, one thread is chewing 99.9% (of one CPU), whilst 8 other threads are using 0.0%.
The CPU is an Intel Core 2 Duo @ 2.6GHz.
-
Current build parameters:
./configure --enable-optimize --enable-amule-daemon --enable-amulecmd --enable-webserver --enable-amule-gui --enable-cas --enable-wxcas --enable-alc --enable-alcc --enable-xas --enable-geoip --enable-mmap --enable-fileview --with-zlib --enable-ccache
-
I've moved the amule directory to a dedicated SSD.
Writes are a lot faster, but it's still taking 4 seconds to write known.met, while bonnie can easily hit 100Mb/s (multithreaded) on this drive.
-
The data has to be prepared in a complicated way which takes quite some CPU. You can't compare that against the raw transfer rate.
So how did that affect the overall performance? Did the problem go away?
-
It's gone away for the moment, but returns if a normal HDD is used.
I'd imagine that as more shares are added the problem will return even with SSD.
-
I think the intense hard disk usage during background hashing is playing hell with the slow saving of the known file list in the foreground.
To fix this I propose moving the knownfiles->SafeAddKFile() and knownfiles->Save() calls from CamuleApp::OnFinishedHashing() to CHashingTask::Entry(). Both are mutex-protected.
If SafeAddKFile() fails, no event is posted.
Also, switch from the hashed-size trigger to a fixed time (like 5 min) to trigger a Save(). The timer becomes a member of CKnownFileList (remember the last save/load).
So we will have
void CamuleApp::OnFinishedHashing(CHashingEvent& evt)
{
    <unchanged>
    if (owner) {
        <unchanged>
    } else {
        sharedfiles->SafeAddKFile(result);
    }
}
I can make a patch, but I don't like to commit this at the moment shortly before a release.
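A rough sketch of what I mean by the timer (names are placeholders for illustration, not the real CKnownFileList code):

#include <ctime>

// Sketch: remember when known.met was last saved and only rewrite it if more
// than a fixed interval has passed, checked after each hashed file.
class KnownFileListSketch {
public:
    KnownFileListSketch() : m_lastSaved(std::time(NULL)) {}

    // Called from the hashing thread after each file (list mutex held by caller).
    void MaybeSave() {
        const std::time_t now = std::time(NULL);
        if (std::difftime(now, m_lastSaved) >= kSaveIntervalSeconds) {
            Save();
            m_lastSaved = now;   // also reset this before starting a rehash
        }
    }

private:
    static const int kSaveIntervalSeconds = 5 * 60;   // "like 5 min"
    std::time_t m_lastSaved;
    void Save() { /* write known.met here */ }
};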
-
Yeah, definitely not for before a release.
Technically, hashed size IS a timer. If you take X time to hash every MB, you'll take X*Y to hash Y megabytes. The reason it's a hashed size counter is because this way it's consistent across small and big files, except big files should only save when they finish. Small files will be saved when a bunch is done. This sounds like the optimal way to me, tbh.
-
What you describe as optimal would still be the same. The only difference is that on a fast system we currently save more often than on a slow system. With the timer, it will just check after each hashed file whether the last save was more than 5 min ago. If yes, it will save. (The timer will be reset on each save, and before starting a rehash.)
The time spent feels like an easier to grasp quantity than the megabytes to me.
Anyway, the important part is saving the known.met in background. It really takes too long for a foreground activity with such large shares.
-
This still leaves a window of opportunity for the file to be trashed by a SIGKILL or system crash while the old version is gone and the new version isn't there yet. :(
-
I've changed the write mode in 10620. It now writes to "known.met.new" and renames it after closing.
Though I still don't believe this will solve your actual problem. Still it's good practice.
-
Better yet, use 10621.
-
Better yet, use 10621.
Upload it please to http://amule.sourceforge.net/tarballs/tarballs.xml ;D
-
http://code.google.com/p/amule/source/checkout
(and that change is probably useless for you unless you have similar problems as stoatwblr)
-
http://code.google.com/p/amule/source/checkout
(and that change is probably useless for you unless you have similar problems as stoatwblr)
Thanks for the info! :)
Don't worry, I know that, but I always want to test the latest version in order to "check" all the new features! ;D
Hey, this change is very useful.
Please also consider applying write_safe to other critical or big files (nodes.dat, key_index.dat for example).
Not everybody is running amule on reliable, fast systems. Dirty shutdowns or OOM crashes can happen.
I think the same!
-
Please also consider applying write_safe to other critical or big files (nodes.dat, key_index.dat for example).
Anybody else agree/disagree?
-
nodes.dat: yes
*_index.dat: no
-
Better yet, use 10621.
Works perfectly. :)
(I ran through a few crashes to verify all ok)
Thanks