Topic: Guess missing parts on "Well Known Incomplete Files" (Read 4030 times)

zekkerj · « **on:** July 06, 2006, 06:11:38 PM »

Hi,

Well, this is not exactly a "feature request", but a "request for comments" on an idea. If this is the wrong forum, please forgive me.

From times to times, we come to deal with a "Well Known Incomplete File". In this situation, this download will never end; so, any attempt to guess the missing parts, even that this takes years, will success before the download ends...

My idea: create an external program, or a plugin to aMule, that looks in a partial download and identify missing chunks. So, it starts to modify this chunk, hoping this eventually come to find the correct data.

I know this is a very slow approach, and that will be cpu consuming. But I have a computer that is dedicated to aMule; this task can be run on a low priority thread; and, as I said, 10 years is better than "forever"...

lionel77 · « **Reply #1 on:** July 06, 2006, 07:07:23 PM »

Nice idea. Unfortunately, it would not work.

The problem is that for a given set of data, there is a unique hash associated with it, but the reverse is not true: You can find a number of completely different data sets that will provide you with the same hash, given that the length of the data set is greater than the length of the hash. Actually, the greater the length of the data set (relative to the hash length) the more data sets you will be able to find that have the same hash. If this was not the case we would be dealing with some Insanely Great compression.

So employing your idea would most certainly make you wind up with some garbage in your file that just happened to produce the same hash as the original data. The only situation in which it could work is if you are dealing with a very small text file that has only been partially downloaded, because there the length of the data set would not be much greater than the length of the hash. But that is hardly a useful scenario.

zekkerj · « **Reply #2 on:** July 06, 2006, 07:14:28 PM »

But I'm not willing to fullfill a complete download, only small pieces of partially downloaded data.

I know that there is a possibility of hash collision, but not for "similar data". And even if I got so unlucky to find a false positive, there are another hashs, that should not collide.

lionel77 · « **Reply #3 on:** July 06, 2006, 09:03:14 PM »

Problem is that in this context "small pieces" would mean something like 1 or 2k. What you probably want to do is to "guess" something like 200k or more of missing data, and for that you would get a shit load (and that's a big understatement here) of false positives.

Plus, just think for a second how many different possibilities there would be just for 200k of missing data: 2 raised to the power of 1,638,400 (<-200*1024*8 ). Even if you forget about the whole false positive issue, the time it would take to do those computations would pretty much converge against infinity.

kreegee · « **Reply #4 on:** July 06, 2006, 10:51:52 PM »

besides the shitload you get (i don't really think you get that many - litte collision is one of the quality-characteristics of modern hash-algorithms), it's as good as impossible to calculate trillionions of hashs just to get 100 missing KB.

for 100KB you would need to calculate 2^853'606'400 hashes (100KB are 853'606'400 bites) , and now that's gonna take a shitload of time (even with all computer of the world alltogether) - that's most likely more hashes then cpu-cycles runned since the start of computation.
:]

edit: memo to myself - ready the WHOLE previous post, and not just the first paragraph.

GonoszTopi · « **Reply #5 on:** July 07, 2006, 10:50:48 AM »

Ok, do some real math here, not just the raw estimations seen above, to show why it is impossible.

First, kreegee:

Quote

(100KB are 853'606'400 bites)

you're strongly advised to buy a new calculator, 100*1024*8 = 819 200 here.

Now, to the point:

In a "Well known incomplete" file you have at least 1 part missing - that is 9 728 000 bytes, which is 77 824 000 bits. That means you have 2^77824000 variations. Now just some apporximations: 2^10 ~ 10^3 - so this number has approximately 7782400 (seven million seven hundred eighty two thousend four hundred) zeroes after the 1. In the worst case you'll have to check all, in the best case only the first one. Let's take an average and halve this number - so the average number of variations you have to test has 7782399 zeroes after the leading 1 (optimal case).

Assuming your computer can calculate 1 million hashes every second, which is - unfortunately - not true, and one year consists of 31558149 seconds (one sidereal year, see http://en.wikipedia.org/wiki/Year), your computer can calculate approximately 3*10^13 hashes every year, continuously running. At this speed, you'd need 598646 (five hundred ninety eight thousand six hundred fourty six) years to finish just only one part!

As computers also evolve meanwhile, we can assume an exponential growth in speed (up to infinity, just for the sake of the example). So let's say your computer gets 10 times faster every year (it won't, most probably). Then it would still require approximately 1985 years to complete just one part.

This is somehow too long for me. And please note, that all approximations I did above resulted in shorter time required to complete one part.

If you happen to ever write such a program, good luck, I won't use it.

After saving this post, I noticed that the number of variations is incorrectly calculated, it should have 23 347 200 zeroes after the leading 1. I won't redo the calc, so all the above can be taken as if one part was only 3 242 666 bytes long - which is actually only one third of a part.

lionel77 · « **Reply #6 on:** July 07, 2006, 10:56:35 AM »

kreegee, there is not that much the hash algorithm could do in this scenario if you went though all the possibilities for a given amount of data. Look at it this way: a classic ed2k link contains 32 alphanumeric characters as part of the hash, the newer ed2k links contain an additional 32 for a total of 64 alphanumeric chars. There are 36 (26+10) different alphanumeric characters, so there are 36^64 different possibilities for the new ed2k-links (if you just consider the hash parts). 36 is roughly 2^5, so the number of possibilities expressed with base 2 is very roughly 2^320.

Let's compare this to your 100k (which is actually only 100*1024*8=819,200 bits) example:
So there are 2^819,200 different possible sets of 100k, and only 2^320 possible ed2k-links (see above). That means that on average there would be about 2^818,880 false positives for a given ed2k-link. I'm not sure that that still qualifies as "not that many"...

EDIT:
Looks like someone snuck by me unnoticed while I was trying to represent 100K using my fingers...

aMule Forum

News:

Author Topic: Guess missing parts on "Well Known Incomplete Files" (Read 4030 times)

zekkerj

Guess missing parts on "Well Known Incomplete Files"

lionel77

Re: Guess missing parts on "Well Known Incomplete Files"

zekkerj

Re: Guess missing parts on "Well Known Incomplete Files"

lionel77

Re: Guess missing parts on "Well Known Incomplete Files"

kreegee

Re: Guess missing parts on "Well Known Incomplete Files"

GonoszTopi

Re: Guess missing parts on "Well Known Incomplete Files"

lionel77

Re: Guess missing parts on "Well Known Incomplete Files"