aMule Forum
English => en_Bugs => Topic started by: stoatwblr on August 07, 2011, 03:46:46 AM
-
SVN 10594
If aMule is killed for any reason, known.met (or known2_64.met) seems to lose a lot of the AICH/MD4 entries it previously had (but usually not all)...
This results in a lot of unnecessary rehashing. Not good if there are a lot of files on the shared side of things.
-
Funny that, if you throw your hard drive out of a window, it loses them all.
-
That's why we should store that kind of stuff in a database instead of a met.
-
That's why we should store that kind of stuff in a database instead of a met.
Alternatively, create new mets and copy them in
Even just shutting down amule seems to result in a few entries being trashed.
Currently known2 is 733Mb while known is about 200k
-
That's why we should store that kind of stuff in a database instead of a met.
You do realize that the .met format is a database?
-
Kind of. Technically any file with data in it can be considered to be a "database". I had something more robust and sophisticated in mind, like SQLite. With an index. So you don't have to go sequentially through the whole thing every time in CAICHHashSet::SaveHashSet(). And transactions.
And I don't like that CAICHSyncTask::Entry() opens the met for read/write. It should read only. Writing is only for
a) creating the header if it's empty (useless, already happens in CAICHHashSet::SaveHashSet())
b) truncating on error - I'd rather close and reopen it in that special case!
Maybe this is causing the problem?
Anyway, if known2 is 733Mb it's gone way beyond what it was intended for. I'd delete it and start all over from scratch. Let it rehash everything shared for once and be done.
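Just to illustrate what I mean - a minimal sketch of an indexed, transactional store for hash sets using SQLite. The table layout and function names here are made up for the example; this is not anything aMule actually has:

#include <sqlite3.h>

// Illustrative only: one row per known file, keyed by its MD4 hash, with the
// AICH hash set stored as a blob. The primary key gives us an index, and the
// transaction means a crash mid-write leaves the old row intact.
static int store_hashset(sqlite3 *db, const void *key, int keyLen,
                         const void *blob, int blobLen)
{
    sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
    sqlite3_stmt *stmt = NULL;
    int rc = sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO hashsets(filehash, aich) VALUES(?, ?)",
        -1, &stmt, NULL);
    if (rc == SQLITE_OK) {
        sqlite3_bind_blob(stmt, 1, key, keyLen, SQLITE_STATIC);
        sqlite3_bind_blob(stmt, 2, blob, blobLen, SQLITE_STATIC);
        rc = (sqlite3_step(stmt) == SQLITE_DONE) ? SQLITE_OK : SQLITE_ERROR;
        sqlite3_finalize(stmt);
    }
    sqlite3_exec(db, rc == SQLITE_OK ? "COMMIT" : "ROLLBACK", NULL, NULL, NULL);
    return rc;
}

int main()
{
    sqlite3 *db = NULL;
    if (sqlite3_open("hashsets.db", &db) != SQLITE_OK) return 1;
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS hashsets("
                     "filehash BLOB PRIMARY KEY, aich BLOB)", NULL, NULL, NULL);
    const unsigned char md4[16] = {0};          // placeholder file hash
    const unsigned char aich[4] = {1, 2, 3, 4}; // placeholder AICH blob
    int rc = store_hashset(db, md4, sizeof(md4), aich, sizeof(aich));
    sqlite3_close(db);
    return rc == SQLITE_OK ? 0 : 1;
}

Updating one file's entry then touches a single row instead of rewriting the whole met.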
-
Anyway, if known2 is 733Mb it's gone way beyond what it was intended for. I'd delete it and start all over from scratch. Let it rehash everything shared for once and be done.
I suspected you might suggest that.
How big was it intended to grow?
There are only a couple hundred files exported at the moment - I've cut things right back but the exports keep getting trashed.
-
Even just shutting down amule seems to result in a few entries being trashed.
Specifically: shutting down amuled (either with a HUP or a shutdown command via amulecmd) corrupts the hashes of files that were downloading at the instant of the shutdown, so they have to be rehashed before downloads can resume.
On the shared-files hashing side, I noticed that known2's mtime seems to be updated only every 60 minutes. It looks like the file needs a cleaner close at shutdown - and, as Stu suggested, it should only be opened read/write when absolutely needed.
-
I don't think rehashing of downloading files is related to known.met at all.
Look at the logfile after shutdown. Is there "aMule shutdown completed." at the end?
What about the timestamps? The .part.met files must all be the same age as or newer than the .part files.
-
Look at the logfile after shutdown. Is there "aMule shutdown completed." at the end?
Yes.
What about the timestamps? The .part.met files must all be the same age as or newer than the .part files.
I'll have to check this the next time I shut down.
-
What about the timestamps? The .part.met files must all be the same age as or newer than the .part files.
Yes, they all are.
-
FWIW, after a few days it's back up to ~260Mb and the same problem is back.
-
known2 eventually grew to around 550MB, then amuled sat there spinning its wheels at 100% CPU for several hours, with known.met growing to 5-7Mb then shrinking back to zero.
Everything exited apparently cleanly on a HUP; however, when restarting, the following message was logged:
$ grep -i known logfile
!2011-08-27 18:24:31: WARNING: Known file list corrupted, contains invalid header.
2011-08-27 18:24:31: Creditfile loaded, 7986 clients are known
2011-08-27 18:26:41: Found 384 known shared files, 18066 unknown
Amuled is again sitting at 100% with known.met cycling in size while known2 stays static.
Yes, I really do have that many files on share (distros and suchlike).
As Stu says, it appears the met format can't cope with stuff this big. Perhaps some sort of run-time or compile-time alternative, such as a MySQL interface, needs to be considered.
Opinions?
-
After a little more digging...
It may not be changing known2, but it _IS_ recalculating all the hashes on shared files.
Is this expected behaviour if it thinks the header is corrupt?
-
You know, sharing 18000 files makes no sense. But what happens here shouldn't happen of course.
I just shared two wxWidgets folders with 18000 files. Yes, the behavior is silly. Hashing is supposed to be a background task, but every 30 files known.met is rewritten. Completely. That is why its size appears to grow and shrink. And the writing is done in the foreground, which makes the whole hashing-in-background pointless. The GUI is completely blocked.
10601 fixes the problem for me (by updating known.met only after 300Mb and not after 30 files anymore). Please try it.
(And I see I'm tired - I also slipped in my patch to open known2_64.met in read only mode for reading, which was supposed to be an extra revision. :-[ )
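For the curious, the sort of trigger this uses can be pictured roughly like this - an illustrative sketch only, with invented class and member names, not the actual aMule code:

#include <cstdint>

// Sketch: count the bytes hashed since the last save and only rewrite
// known.met once a threshold is crossed, instead of after every N files.
class SaveTriggerSketch {
public:
    explicit SaveTriggerSketch(uint64_t thresholdBytes)
        : m_threshold(thresholdBytes), m_hashedSinceSave(0) {}

    // Call after each file has been hashed; returns true when a save is due.
    bool OnFileHashed(uint64_t fileSize) {
        m_hashedSinceSave += fileSize;
        if (m_hashedSinceSave >= m_threshold) {
            m_hashedSinceSave = 0;   // reset once we decide to save
            return true;
        }
        return false;
    }

private:
    uint64_t m_threshold;
    uint64_t m_hashedSinceSave;
};

// e.g. SaveTriggerSketch trigger(300ULL * 1024 * 1024);  // ~300 MB hashed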
-
I'll give it a try when it shows up on the SVN tarball page :)
-
http://code.google.com/p/amule/source/checkout
-
Ta.
The switch to backgrounding should solve another issue I complained about a long time ago (slow startup with lots of shares enabled)
-
10601 fixes the problem for me (by updating known.met only after 300Mb and not after 30 files anymore). Please try it.
Known.met now seems to be growing to about 500kB before being truncated.
-
So far so good, thanks.
One other thing I noticed but forgot to mention.
If amule(d) is told to shut down, it insists on finishing the entire rehashing list first, instead of just finishing the file it's working on, closing off known.met cleanly, and resuming where it left off when next started.
If there are a lot of files left to rehash, that can result in an apparent "runaway" state.
-
The switch to backgrounding should solve another issue I complained about a long time ago (slow startup with lots of shares enabled)
No. Shared files are still searched in the foreground. aMule takes a good minute for this for me with 18000 shares.
There is a performance boost in 10602, but apparently only on Windows - not on Ubuntu, though maybe on other platforms too.
Known.met now seems to be growing to about 500kB before being truncated.
So far so good, thanks.
Huh? known.met should never be truncated. It can look that way, however, if you check its size while it's being rewritten.
-
That's what I'm doing.
Anyway... after applying the update, same problem (known.met being reported as corrupted and scanning starting over from scratch).
I've cross-checked by dropping the shares down to ~1800 files, waiting until it finished scanning them, and then issuing a shutdown command via amulecmd.
Amuled's been sitting there for about 45 mins so far eating 100% CPU. I'm pretty sure that if I HUP it, known.met will be reported as corrupt again.
Yes, it's Ubuntu (Natty) and I usually run amuled, not amule.
One other thing I spotted - while hashes are being generated, download speed drops to virtually zero, even if there was pretty good throughput before the hashing started. Uploads don't seem to be affected as much.
-
Amuled finally exited 2 hours after being given the shutdown command - during that period no files in the share area were opened, but it was using 100+% CPU and constantly writing to known.met.
Is it possible that the known.met file is being run through a second time after hash generation for some reason?
-
Please make a debug build of amuled and activate Verbose Debug-Logging for
General
Hasher
AICH-Hasher
KnownFileList
ThreadScheduler
That's in amule.conf:
[eMule]
VerboseDebug=1
[Debug]
Cat_General=1
Cat_Hasher=1
Cat_AICH-Hasher=1
Cat_KnownFileList=1
Cat_ThreadScheduler=1
See if you find something useful in the log. You can also mail the log to me if you trust me with your shares. Don't post it here, however.
-
Not much being logged at all :( It seems that logging stops after a while.
Still getting "Known file list corrupted, contains invalid header."
How does amule attempt to recover from this? known2_64.met hasn't been touched in several days but it still seems to want to rehash every single file in known.met
-
known2_64.met hasn't been touched in several days but it still seems to want to rehash every single file in known.met
That's because known.met contains the eD2k, known2_64.met the AICH hashsets for the known files. Neither can be recovered from the other.
-
There must be a better way of doing this. What about using SQLite or MySQL, or some form of atomic updates?
-
Even as a starting point, doing the rehashing in a new file and then renaming it into place would solve a lot of issues.
i.e.:
cp known*.met known*.met.new
rehash the *.new files
cp known*.met known*.met.old
mv known*.met.new known*.met
These files are read-only for the most part during normal operation.
This would be a pig on FAT32 systems but it's fairly painless on *nixen and NTFS and it gets around any issues caused by system crashes or SIGKILL during writes.
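In code, the write-then-rename idea looks roughly like this (a generic sketch with made-up names, not aMule's actual I/O code):

#include <cstdio>
#include <fstream>
#include <string>

// Write the new contents to a temporary name, then move it over the old file.
// A crash or SIGKILL mid-write can only ever trash the .new file; the last
// complete known.met stays in place.
static bool save_safely(const std::string &path, const std::string &contents)
{
    const std::string tmp = path + ".new";
    {
        std::ofstream out(tmp.c_str(), std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out << contents;
        if (!out.flush()) return false;
    }   // file closed here
    // rename() replaces the target atomically on POSIX filesystems; on
    // Windows/FAT32 the old file may have to be removed first.
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}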
-
Rehashing is done in memory. The met files are then written in a single step.
-
I know, but the fact remains that there is a race condition.
Better to write the file out and then rename it into position.
I had similar problems with sendmail db files with a few hundred domains to handle. In that instance it resulted in spam filters failing for a few seconds or mail not being delivered properly - only a few seconds, but enough to cause problems.
-
I know, but the fact remains that there is a race condition.
Where should this race condition be?
-
Try 10613, it improves amulegui's performance when too many files are shared.
-
Or rather 10616, which no longer perpetually rehashes files that are duplicated more than twice, in case that is what you are doing...
(It will still rehash them once, but then finally add them to duplicates correctly and be done.)
-
I know, but the fact remains that there is a race condition.
Where should this race condition be?
The race condition exists while known or known2 are being written out - on my system known is taking about 15 seconds to be fully populated.
By writing the files out and then rotating them into place afterwards, the window for system events or SIGKILLs leaving an incomplete known file is _much_ shorter.
For whatever reason known2 seems to be much more robust.
I know you say that known shouldn't be being truncated, but something is causing the entire file to be rewritten in place. As long as this occurs there's a possibility that it can be corrupted with no valid copy on the system.
-
Or rather 10616, which no longer perpetually rehashes files that are duplicated more than twice, in case that is what you are doing...
(It will still rehash them once, but then finally add them to duplicates correctly and be done.)
There's only one copy of the files onboard.
Amuled certainly seems to be hashing every file twice for some reason - but on the second pass lsof shows none of the shared files are being touched, just known.met continually being rewritten.
-
Or rather 10616...
It starts quickly, then begins logging the following error over and over again. amulegui and amulecmd can't connect.
"total jobs = 100, too many jobs" (about 10 at a time, every 10 seconds - thousands so far)
Google searches suggest it's coming out of libupnp.
-
Google searches suggest it's coming out of libupnp.
See http://comments.gmane.org/gmane.linux.upnp-sdk.general/271 - there was no followup on this though.
I've been seeing the "too many jobs" error message a bit recently, but usually only after about 12-24 hours.
The files I'm currently sharing are 150-300Mb apiece. One of the end targets is to use this method for scientific data distribution - long-term that'd be a few million files at around 200Mb each (planetary albedo data for climate modelling)...
-
ok, 10616:
2011-09-19 19:35:17: amule.cpp(1281): KnownFileList: Failsafe for crash on file hashing creation
2011-09-19 19:35:17: KnownFileList.cpp(131): KnownFileList: start saving known.met
2011-09-19 19:36:45: KnownFileList.cpp(165): KnownFileList: finished saving known.met
2011-09-19 19:36:45: amule.cpp(1275): KnownFileList: Safe adding file to sharedlist: X
2011-09-19 19:36:45: amule.cpp(1281): KnownFileList: Failsafe for crash on file hashing creation
2011-09-19 19:36:45: KnownFileList.cpp(131): KnownFileList: start saving known.met
2011-09-19 19:38:15: KnownFileList.cpp(165): KnownFileList: finished saving known.met
2011-09-19 19:38:15: KnownFileList.cpp(131): KnownFileList: start saving known.met
2011-09-19 19:39:41: KnownFileList.cpp(165): KnownFileList: finished saving known.met
2011-09-19 19:39:42: amule.cpp(1275): KnownFileList: Safe adding file to sharedlist: Y
2011-09-19 19:39:42: amule.cpp(1281): KnownFileList: Failsafe for crash on file hashing creation
2011-09-19 19:39:42: KnownFileList.cpp(131): KnownFileList: start saving known.met
2011-09-19 19:41:09: KnownFileList.cpp(165): KnownFileList: finished saving known.met
Note the amount of time spent saving each known.met - I've benchmarked the disk system at something in excess of 75Mb/s (11 drive ZFS RAIDZ2 with SSD read and write cache), so taking this long to write out ~9Mb is a bit odd.
This long write period is why I suggest writing out, then renaming into place.
-
One thing which has shown up with the extra logging is that what's being written out to known.met is a _long_ way behind what's being opened by amuled (shown with lsof).
So it's not that things are being rehashed twice, but that saving to known.met goes on long after the files have been closed. Why is it lagging so far?
-
taking this long to write out ~9Mb is a bit odd
QUITE a bit, yes. My known.met is ~5MB and took 4s to save, with optimize off.
Are you building with optimize on btw? If not you should try that. You can build with debug AND optimize at the same time.
What's probably happening:
- background hashing finishes, posts event
- foreground event queue processes event and starts writing known.met
- background hashing finishes, posts event
- background hashing finishes, posts event
- background hashing finishes, posts event
- foreground event queue finishes writing known.met
- foreground event queue does other important things
- background hashing finishes, posts event
- foreground event queue processes event and starts writing known.met
...
So events stack up and up, because background hashes faster than foreground can process.
Oh, and each foreground event will block (because knownFileListMutex is locked), and so spawn a new event handler instance. :o
The core of the problem is - why does writing take so long? Is your system thrashing (continuously swapping aMule's memory in and out) ?
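If it helps to picture the pile-up, here's a small generic demo (plain C++ threads, nothing aMule-specific, all names invented) of a producer posting work faster than a single consumer can handle it, so the backlog keeps growing while the producer runs:

#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

int main()
{
    std::queue<int> events;          // stands in for the foreground event queue
    std::mutex m;
    std::condition_variable cv;

    // "Background hasher": finishes a file every 100 ms and posts an event.
    std::thread producer([&] {
        for (int i = 0; i < 50; ++i) {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            std::lock_guard<std::mutex> lock(m);
            events.push(i);
            cv.notify_one();
        }
    });

    // "Foreground": handling one event (the known.met save) takes 300 ms,
    // so while the producer is running the backlog grows steadily.
    std::thread consumer([&] {
        for (int handled = 0; handled < 50; ++handled) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !events.empty(); });
            events.pop();
            std::printf("handled %d, backlog now %zu\n", handled, events.size());
            lock.unlock();
            std::this_thread::sleep_for(std::chrono::milliseconds(300));
        }
    });

    producer.join();
    consumer.join();
    return 0;
}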
-
The core of the problem is - why does writing take so long? Is your system thrashing (continuously swapping aMule's memory in and out) ?
Not at all - there's no swapping happening, and right now there's 1500Mb free (8Gb system). Other processes are writing very quickly.
On the CPU front, one thread is chewing 99.9% (of one CPU), whilst 8 other threads are using 0.0%.
The CPU is an Intel Core 2 Duo @ 2.6GHz.
-
Current build parameters:
./configure --enable-optimize --enable-amule-daemon --enable-amulecmd --enable-webserver --enable-amule-gui --enable-cas --enable-wxcas --enable-alc --enable-alcc --enable-xas --enable-geoip --enable-mmap --enable-fileview --with-zlib --enable-ccache
-
I've moved the amule directory to a dedicated SSD.
Writes are a lot faster, but it's still taking 4 seconds to write known.met, while bonnie can easily hit 100Mb/s (multithreaded) on this drive.
-
The data has to be prepared in a complicated way which takes quite some CPU. You can't compare that against the raw transfer rate.
So how did that affect the overall performance? Did the problem go away?
-
It's gone away for the moment, but returns if a normal HDD is used.
I'd imagine that as more shares are added the problem will return even with SSD.
-
I think the intense hard disk usage during background hashing is playing hell with the slow saving of the known file list in the foreground.
To fix this I propose moving the knownfiles->SafeAddKFile() and knownfiles->Save() calls from CamuleApp::OnFinishedHashing() to CHashingTask::Entry(). Both are mutex-protected.
If SafeAddKFile() fails, no event is posted.
Also, switch from the hashed-size trigger to a fixed time (like 5 min) to trigger a Save(). The timer becomes a member of CKnownFileList (remember the last save/load).
So we will have
void CamuleApp::OnFinishedHashing(CHashingEvent& evt)
{
    <unchanged>
    if (owner) {
        <unchanged>
    } else {
        sharedfiles->SafeAddKFile(result);
    }
}
I can make a patch, but I don't like to commit this at the moment shortly before a release.
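A rough sketch of what I mean by the timer (names are placeholders for illustration, not the real CKnownFileList code):

#include <ctime>

// Sketch: remember when known.met was last saved and only rewrite it if more
// than a fixed interval has passed, checked after each hashed file.
class KnownFileListSketch {
public:
    KnownFileListSketch() : m_lastSaved(std::time(NULL)) {}

    // Called from the hashing thread after each file (list mutex held by caller).
    void MaybeSave() {
        const std::time_t now = std::time(NULL);
        if (std::difftime(now, m_lastSaved) >= kSaveIntervalSeconds) {
            Save();
            m_lastSaved = now;   // also reset this before starting a rehash
        }
    }

private:
    static const int kSaveIntervalSeconds = 5 * 60;   // "like 5 min"
    std::time_t m_lastSaved;
    void Save() { /* write known.met here */ }
};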
-
Yeah, definitely not for before a release.
Technically, hashed size IS a timer. If you take X time to hash every MB, you'll take X*Y to hash Y megabytes. The reason it's a hashed size counter is because this way it's consistent across small and big files, except big files should only save when they finish. Small files will be saved when a bunch is done. This sounds like the optimal way to me, tbh.
-
What you describe as optimal would still be the same. The only difference is that on a fast system we currently save more often than on a slow system. With the timer, it will just check after each hashed file whether the last save was more than 5 min ago. If yes, it will save. (The timer will be reset on each save, and before starting a rehash.)
The time spent feels like an easier to grasp quantity than the megabytes to me.
Anyway, the important part is saving the known.met in background. It really takes too long for a foreground activity with such large shares.
-
This still leaves a window of opportunity for the file to be trashed by a SIGKILL or system crash while the old version is gone and the new version isn't there yet. :(
-
I've changed the write mode in 10620. It now writes to "known.met.new" and renames it after closing.
Though I still don't believe this will solve your actual problem. Still it's good practice.
-
Better yet, use 10621.
-
Better yet, use 10621.
Upload it please to http://amule.sourceforge.net/tarballs/tarballs.xml ;D
-
http://code.google.com/p/amule/source/checkout
(and that change is probably useless for you unless you have similar problems as stoatwblr)
-
http://code.google.com/p/amule/source/checkout
(and that change is probably useless for you unless you have similar problems as stoatwblr)
Thanks for the info! :)
Don't worry, I know that, but I always want to test the latest version in order to "check" all the new features! ;D
Hey, this change is very useful.
Please also consider applying write_safe to other critical or big files (nodes.dat, key_index.dat for example).
Not everybody is running amule on reliable, fast systems. Dirty shutdowns or OOM crashes can happen.
I think the same!
-
Please also consider applying write_safe to other critical or big files (nodes.dat, key_index.dat for example).
Anybody else agree/disagree?
-
nodes.dat: yes
*_index.dat: no
-
Better yet, use 10621.
Works perfectly. :)
(I ran through a few crashes to verify all ok)
Thanks