Bug #36806
Performance issue with large plainlist
| Status: | Closed | Start date: | 2012-05-03 |
|---|---|---|---|
| Priority: | Must have | Due date: | |
| Assignee: | Ivan Dharma Kartolo | % Done: | 100% |
| Category: | - | Spent time: | - |
| Target version: | - | | |
| TYPO3 Version: | | PHP Version: | |
| Votes: | 0 | | |
Description
With a large plainlist (e.g. 15000+ recipients), the function tx_directmail_static::cleanPlainList($plainlist) in class.tx_directmail_static.php is very slow, and with lists of 50000+ recipients the server times out (runtime exceeds 30 sec).
With inspiration (read: mostly copy/paste) from this comment, http://dk.php.net/manual/en/function.array-unique.php#97285, the function can be rewritten and made a lot faster.
This improved the runtime of cleanPlainList() from a timeout to ~3 seconds for a list of 50000+ records.
I have included a patch so the rest of you can test this optimization. Let me know how your tests go, and if there are any questions feel free to ask.
Test details:
Allowed Memory Size: 128M
Timeout: 30 sec
plainlist size: 50000+ recipients
Associated revisions
- Bug #36806: Speed boosting on cleaning duplicates in plain list array (thx to Tomas Norre Mikkelsen)
git-svn-id: https://svn.typo3.org/TYPO3v4/Extensions/direct_mail/branches/swiftmailer@61804 735d13b6-9817-0410-8766-e36946ffe9aa
- Bug #36806: Speed boosting on cleaning duplicates in plain list array (thx to Tomas Norre Mikkelsen)
git-svn-id: https://svn.typo3.org/TYPO3v4/Extensions/direct_mail/branches/swiftmailer@61813 735d13b6-9817-0410-8766-e36946ffe9aa
History
Updated by Ivan Dharma Kartolo about 1 year ago
- Status changed from New to Under Review
- Assignee set to Tomas Norre Mikkelsen
Hi,
thanks for the patch. I guess the foreach block should be removed as well, right?
I also gave this problem some thought, and there is another solution that I haven't tested yet:
array_flip(array_flip(array_reverse($input,true)));
It could be much faster, since the flipping preserves only one entry per value.
The reverse is there so that the last element is taken and all the previous elements are removed (if the key matters).
What do you think? Can you check with your 50k data set whether it's faster?
Either way, I'll take this patch into the swiftmailer branch, since that will be the next release.
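To make the flipping idea concrete, here is a minimal sketch on a flat list. The sample addresses and variable names are made up for illustration; only the array_flip/array_reverse expression comes from the comment above.

```php
<?php
// Hypothetical flat list with one duplicate address.
$input = array(
    0 => 'a@domain.tld',
    1 => 'b@domain.tld',
    2 => 'a@domain.tld',
);

// array_flip() swaps keys and values. When a value occurs more than once,
// the key seen last wins, so duplicates collapse into a single entry.
// Reversing first (with keys preserved) means the key seen last is the
// key of the first occurrence in the original list.
$reversed = array_reverse($input, true); // 2 => a, 1 => b, 0 => a
$flipped  = array_flip($reversed);       // a => 0, b => 1
$unique   = array_flip($flipped);        // 0 => a, 1 => b

// The element order of the result can differ from the input order on
// other inputs; ksort() restores it if that matters.
ksort($unique);
print_r($unique);
```

Note that array_flip() only accepts integer and string values, so this only applies to a one-dimensional list.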
Updated by Tomas Norre Mikkelsen about 1 year ago
Hi.
First, no problem. I just had the problem and found a solution, so I had to share :)
Actually I'm not sure whether the foreach can be skipped or not, but I will test it of course.
I will test your suggestion too, but I cannot do that before Monday at work.
When will next version be released?
Updated by Ivan Dharma Kartolo about 1 year ago
- File compareArrayUnique.php added
I compared the two implementations using "only" an array of 10k records.
Array flipping is up to 3 times faster:
1 start: 1336169168.2174
1 end: 1336169168.2537
1: 0.036357879638672
1 count: 9742
2 start: 1336169173.2538
2 end: 1336169173.2669
2: 0.013092041015625
2 count: 9742
1 => array_map("unserialize", array_unique(array_map("serialize", $plainlist)));
2 => array_flip(array_flip(array_reverse($plainlist,true)));
My test implementation is attached.
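For readers without the attachment, a comparison along these lines can be sketched as follows. This is not the attached compareArrayUnique.php; the generated test data and variable names are assumptions, and the timings will vary by machine.

```php
<?php
// Hypothetical test data: 10000 flat addresses, some of them duplicates.
$plainlist = array();
for ($i = 0; $i < 10000; $i++) {
    $plainlist[] = 'test' . mt_rand(0, 200000) . '@domain.tld';
}

// Variant 1: serialize each value, de-duplicate, unserialize again.
$start = microtime(true);
$a = array_map('unserialize', array_unique(array_map('serialize', $plainlist)));
$timeA = microtime(true) - $start;

// Variant 2: double flip on the reversed list (flat lists only).
$start = microtime(true);
$b = array_flip(array_flip(array_reverse($plainlist, true)));
$timeB = microtime(true) - $start;

printf("1: %.6f sec, count: %d\n", $timeA, count($a));
printf("2: %.6f sec, count: %d\n", $timeB, count($b));
```

Both variants keep the key of the first occurrence of each address, so the two results hold the same key/value pairs, although the element order may differ.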
I can't tell you the ETA. It's better to take a little more time for testing than to release half-baked software :) Since integrating swiftmailer means restructuring almost the whole dmailer class, it needs to be tested thoroughly :)
Updated by Ivan Dharma Kartolo about 1 year ago
Tomas Norre Mikkelsen wrote:
First, no problem. I just had the problem and found a solution, so I had to share :)
Of course, thanks for pointing out the problem :)
Actually I'm not sure whether the foreach can be skipped or not, but I will test it of course.
Yes, the foreach is not needed anymore :)
Updated by Tomas Norre Mikkelsen about 1 year ago
Hi,
I just tested with the array_flip() function, and there are challenges regarding the data structure.
The $plainlist = array() you are testing with has the "wrong" structure compared to the direct_mail data structure.
The array from direct_mail looks like this:
0 =>
  array
    'email' => string 'test0@domain.tld' (length=16)
    'name' => string '' (length=0)
1 =>
  array
    'email' => string 'test1@domain.tld' (length=16)
    'name' => string '' (length=0)
2 =>
  array
    'email' => string 'test2@domain.tld' (length=16)
    'name' => string '' (length=0)
3 =>
  array
    'email' => string 'test3@domain.tld' (length=16)
    'name' => string '' (length=0)
4 =>
  array
    'email' => string 'test4@domain.tld' (length=16)
    'name' => string '' (length=0)
Compare that to your test array:
0 => 'test4@domain.tld'
1 => 'test5@domain.tld'
2 => 'test6@domain.tld'
3 => 'test7@domain.tld'
4 => 'test8@domain.tld'
That's the reason we have the foreach, because of the multidimensional array. So with the direct_mail data structure the array_flip is not the solution.
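A small sketch of why the flip approach breaks on this structure: array_flip() can only flip integer and string values, and it skips every other entry with a warning. The sample rows below are made up to match the direct_mail structure shown above.

```php
<?php
// Hypothetical rows in the direct_mail structure (one array per recipient).
$plainlist = array(
    array('email' => 'test0@domain.tld', 'name' => ''),
    array('email' => 'test0@domain.tld', 'name' => ''),
    array('email' => 'test1@domain.tld', 'name' => ''),
);

// Every value is itself an array, so array_flip() skips all of them.
// The warning is suppressed with @ here only to keep the output readable.
$flipped = @array_flip($plainlist);

var_dump(count($flipped)); // int(0): no recipient survives the flip
```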
Updated by Ivan Dharma Kartolo about 1 year ago
Hi Tomas,
Yes, you're right. My solution only works with a one-dimensional array. But the array_map solution only filters out array values that are exactly identical (same email and same name). Consider the following example:
array(
    0 => array(
        'email' => 'test0@mail.com',
        'name' => 'test0'
    ),
    1 => array(
        'email' => 'test1@mail.com',
        'name' => 'test1'
    ),
    2 => array(
        'email' => 'test0@mail.com',
        'name' => 'test2'
    ),
    3 => array(
        'email' => 'test0@mail.com',
        'name' => 'test0'
    ),
);
The array_map solution only removes the fourth element. What about the third element? Does it count as a duplicate?
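Running the array_map expression from the earlier comparison on this example shows the behaviour. The second half is only a hypothetical alternative that keys on the email address alone; it is not what the patch does, and the variable names are made up.

```php
<?php
$plainlist = array(
    0 => array('email' => 'test0@mail.com', 'name' => 'test0'),
    1 => array('email' => 'test1@mail.com', 'name' => 'test1'),
    2 => array('email' => 'test0@mail.com', 'name' => 'test2'),
    3 => array('email' => 'test0@mail.com', 'name' => 'test0'),
);

// Serialized rows are plain strings, so array_unique() can compare them.
// Only rows that match in both email and name collapse.
$cleaned = array_map('unserialize', array_unique(array_map('serialize', $plainlist)));
print_r(array_keys($cleaned)); // keys 0, 1 and 2 remain; key 3 is dropped

// Hypothetical alternative: treat the email alone as the identity and
// keep the first row seen for each address.
$byEmail = array();
foreach ($plainlist as $row) {
    if (!isset($byEmail[$row['email']])) {
        $byEmail[$row['email']] = $row;
    }
}
print_r(array_keys($byEmail)); // test0@mail.com, test1@mail.com
```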
Updated by Tomas Norre Mikkelsen about 1 year ago
Hi,
I get your point, but what is the idea behind having one email registered multiple times with different names?
In my opinion this should be controlled at submission, not at filtering.
Updated by Ivan Dharma Kartolo about 1 year ago
- Status changed from Under Review to Closed
- Assignee changed from Tomas Norre Mikkelsen to Ivan Dharma Kartolo
- % Done changed from 0 to 100
Committed in SVN Branch r61804.
I left the foreach out, because we treat each second-level array as a single value, and only when both the email and the name (in the second-level array) are identical do we assume it is a duplicate. This means the same email with a different name will NOT be treated as a duplicate.
Thanks for the patch :)