forums › forums › SQLyog › Sync tools, Migration, Scheduled Backup and Notifications › Help With Where During Sync
- This topic is empty.
-
AuthorPosts
-
-
May 24, 2011 at 5:50 pm #12343RichMember
Hello,
I have had a hard time figuring out a WHERE clause that will let me limit the row set that is checked for synchronization.
I have a table where the key columns are a varchar and a timestamp, so I only included those columns in the sync job.
I want to limit the sync to timestamp values that occurred in the last 24 hours.
The only one I've been able to come up with is:
TO_DAYS(NOW()) – TO_DAYS(datatime) < 1 This works when testing it in the query tab to return only the rows within the last 24 hours, however when I use the above WHERE clause in a sync job, the time to sync a 2.2M row db table increases substantially (from ~ 137s using no Where clause to 606s; both are too long), whereas I am trying to achieve considerably faster times by considering a tiny fraction of the rows rather than all of them. Is the call to Now() bogging down the process? or the function calls TO_DAYS() are compute intensive? Is there an efficient way to involve only the last 24 hours of rows in the sync operation? Thanks, Rich Running v8.3.2 on Windows 2000 Server, cannot upgrade versions on this OS
-
May 25, 2011 at 7:53 am #32286peterlaursenParticipant
I wonder if you have an index on this DATETIME column? If not a lot of full table scans could happen.
Anyway it is a little difficult for us to understand. Maybe you could post the SHOW CREATE TABLE output for this table as well as the exact WHERE.clauses you use. Next we can frame queries that you can profile/EXPLAIN.
-
May 25, 2011 at 2:00 pm #32287RichMember'peterlaursen' wrote:
I wonder if you have an index on this DATETIME column? If not a lot of full table scans could happen.
Anyway it is a little difficult for us to understand. Maybe you could post the SHOW CREATE TABLE output for this table as well as the exact WHERE.clauses you use. Next we can frame queries that you can profile/EXPLAIN.
I meant to make clear that a primary key is made up of the 'Datatime' (which is timestamp) and 'Tagname' (which is varchar) columns. I am under the impression that they are then indexed.
The SHOW CREATE TABLE has the following result:
CREATE TABLE `tag_data15mins` (
`TAGNAME` varchar(255) NOT NULL,
`DATATIME` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`VALUE` float DEFAULT NULL,
`EXPIRY` timestamp NULL DEFAULT NULL,
`LASTFETCHED` timestamp NULL DEFAULT NULL,
`ERROR_DESC` varchar(512) DEFAULT NULL,
`QUALITY` float DEFAULT NULL,
`WWRESOLUTION` float DEFAULT NULL,
`WWRETRIEVALMODE` varchar(255) DEFAULT NULL,
PRIMARY KEY (`TAGNAME`,`DATATIME`),
CONSTRAINT `TAG_DATA15MINS_FK` FOREIGN KEY (`TAGNAME`) REFERENCES `tagnames` (`TAGNAME`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1
I tried the the following query in sqlYog, which would be ok for my sync job (but extended the time as mentioned in original post), however I would really like to always be looking back a full 24 hours:
SELECT * , TO_DAYS(NOW()) – TO_DAYS(datatime) AS diff FROM tag_data15mins
WHERE tagname LIKE 'WBSF_INCIN1_BREACHOUTLET'
AND TO_DAYS(NOW()) – TO_DAYS(datatime) < 1
ORDER BY datatime DESC
-
May 25, 2011 at 3:48 pm #32288peterlaursenParticipant
I am not sure if this index will work with such where clause. I think it will not. The DATETIME column is not first/leftmost in the index.
Please refer http://dev.mysql.com/doc/refman/5.1/en/mysql-indexes.html “MySQL cannot use an index if the columns do not form a leftmost prefix of the index.”
Could you please try to add an index on that column alone? I know it could take some time to build the index.
Also try to EXPLAIN a query using a WHERE clause on the column. Is the index used?
-
May 25, 2011 at 9:48 pm #32289RichMember'peterlaursen' wrote:
I am not sure if this index will work with such where clause. I think it will not. The DATETIME column is not first/leftmost in the index.
Please refer http://dev.mysql.com/doc/refman/5.1/en/mysql-indexes.html “MySQL cannot use an index if the columns do not form a leftmost prefix of the index.”
Could you please try to add an index on that column alone? I know it could take some time to build the index.
Also try to EXPLAIN a query using a WHERE clause on the column. Is the index used?
OK, the following shows the index present (i.e., dateTimeNdx):
CREATE TABLE `tag_data5mins` (
`TAGNAME` varchar(255) NOT NULL,
`DATATIME` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`VALUE` float DEFAULT NULL,
`EXPIRY` timestamp NULL DEFAULT NULL,
`LASTFETCHED` timestamp NULL DEFAULT NULL,
`ERROR_DESC` varchar(512) DEFAULT NULL,
`QUALITY` float DEFAULT NULL,
`WWRESOLUTION` float DEFAULT NULL,
`WWRETRIEVALMODE` varchar(255) DEFAULT NULL,
PRIMARY KEY (`TAGNAME`,`DATATIME`),
KEY `dateTimeNdx` (`DATATIME`),
CONSTRAINT `TAG_DATA5MINS_FK` FOREIGN KEY (`TAGNAME`) REFERENCES `tagnames` (`TAGNAME`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1
and in the 'possible_keys' column of the EXPLAIN result, the following
PRIMARY,dateTimeNdx
indicates that yes, it is “possibly” being used
-
May 31, 2011 at 2:50 pm #32290RichMember
I have placed an index on the datatime column in both source and target database, yet still the sync time for a table with 2251881 rows is 595s using the WHERE clause when only 68 rows are different.
A sync without using the WHERE clause is 153s.
Is there anything else you can help me do to diagnose and figure out a way to speed this up by limiting the rows?
Thanks
Rich
-
May 31, 2011 at 3:06 pm #32291peterlaursenParticipant
We will reply wiht priority tomorrow.
Did you ever tell your server version? I think I remember some bugs in some MySQL version with indexes on DATETIME columns.
-
June 1, 2011 at 8:10 am #32292vishal.prMember
Hi Rich,
Please tell what all columns from the table `tag_data5mins` is used in the Columns to Compare option. Also the exact where clause you supplied in the SQL Where (nothing more, nothing less). Once you provide these details we can frame a query that gets the rows to sync and you can execute the same in your environment to see how much time it takes. We can then help you to optimize the WHERE clause used.
Regards,
Vishal PR
-
June 1, 2011 at 12:45 pm #32293RichMember'vishal.pr' wrote:
Hi Rich,
Please tell what all columns from the table `tag_data5mins` is used in the Columns to Compare option.
Also the exact where clause you supplied in the SQL Where (nothing more, nothing less).
TO_DAYS(NOW()) – TO_DAYS(datatime) <= 1
Once you provide these details we can frame a query that gets the rows to sync and you can execute the same in your environment to see how much time it takes. We can then help you to optimize the WHERE clause used.
Regards,
Vishal PR
Hi,
Only two columns: Datatime and Tagname
Where Clause: TO_DAYS(NOW()) – TO_DAYS(datatime) <= 1
Also, the mySQL version is 5.1 on both servers
-
June 1, 2011 at 2:17 pm #32294peterlaursenParticipant
Even if it may not be relevant please provide the *full* version (not 5.1 but 5.1.57 or whatever).
But now I understand (I should before) that the WHERE uses a calculation. I think this is the problem. Because the result of the calculation is not stored and thus not indexed either. Some databases have 'virtual columns' (that can be indexed) for handling this. The standard MySQL server does not (but latest MariaDB has).
Anyway I'll let Vishal and our testers comment too.
-
June 1, 2011 at 2:32 pm #32295RichMember'peterlaursen' wrote:
Even if it may not be relevant please provide the *full* version (not 5.1 but 5.1.57 or whatever).
But now I understand (I should before) that the WHERE uses a calculation. I think this is the problem. Because the result of the calculation is not stored and thus not indexed either. Some databases have 'virtual columns' (that can be indexed) for handling this. The standard MySQL server does not (but latest MariaDB has).
Anyway I'll let Vishal and our testers comment too.
5.1.43
-
June 1, 2011 at 3:05 pm #32296vishal.prMember
Can you execute the following query in your environment and see how much time it is taking?
Code:SELECT
`TAGNAME`,
`DATATIME`,
`VALUE`,
`EXPIRY`,
`LASTFETCHED`,
`ERROR_DESC`,
`QUALITY`,
`WWRESOLUTION`,
`WWRETRIEVALMODE`,
MD5(
CONCAT_WS('#', `TAGNAME`, `DATATIME`)
) AS __crc
FROM
`tag_data5mins`
WHERE (1 = 1)
AND (TO_DAYS(NOW()) – TO_DAYS(DATATIME) <= 1)
ORDER BY `TAGNAME` ASC,
`DATATIME` ASC;**Query formatted using SQLyog SQL Formatter.
Regards,
Vishal PR
-
June 1, 2011 at 5:24 pm #32297RichMember
Hi Vishal,
It takes ~ 3.78s consistently
Rich
-
June 2, 2011 at 11:43 am #32298peterlaursenParticipant
OK .. I think we can conclude here.
1)
The query that Vishal asked you to execute is a simplified query that Data Sync executes. It may be executed multiple times. It will calculate such checksums on CHUNKS of rows. Let us say CHUNK = 2000 rows. If The first CHUNK reveals no difference of data (as per the checksum) we will proceed to next chunk. So it you have 200,000 rows in table it will be executed 100 times even if data are completely identical. And if each execution take 4 seconds it wil be 400 seconds in total to get the information from MySQL. Moreover: if the calculation in a CHUNK reveals differences of data the query will be executed again in a recursive manner for smaller CHUNKs in order to identify exactly what rows are different on source and target. So in conclusion: how many times such query is executed depends on 'how much different' source and target are and how the differences are distributed in the table.
2)
The execution time for the sample query show that it is not optimized well by the MySQL server. Because it uses a calculation in the WHERE clause, the result of the calculation is not stored and thus not indexed. Even if EXPLAIN tells that an index is used I believe it will only be for one/first operand of the calculation. I think taht for each (indexed) 1st operand it will still do a full table scan for comparing with the second operand. If you have SQLyog ULTIMATE profiling the query will reveal more information (in particular check the increase of 'full table scans' for the sample query).
3)
Option for improvement (as I can see) include:
* If WHERE is less efficient than NOWHERE, don't use WHERE.
* explore the COLUMNS option in Data Sync. The COLUMNS option defines on what columns checksums shoule be calculated (and NOT what columns are sync'ed. All columns are always synced). Specifying only a few and small columns necessary to ensure uniqueness will improve performance of checksums calculation.
* Tuning of global server parameters could improve a lot. Some buffer settings etc. in MySQL are ridculously low with default parameters and it may result in swapping.
* If you have access (and can manage) to rewrite the application you may store (and index) the TIMEDIFF in the table. If you want to avoid redundant information you may drop one of the existing columns.
* Consider replacing MySQL 5.1 with MariaDB 5.2 (it is a drop-in replacement), define and index a 'virtual column' on the TIMEDIFF. However I cannot promise how mcuh it would help in this scenario. You will have to try for yourself in your environment.
* Consider Replication rather than Data Sync for this.
-
June 2, 2011 at 7:48 pm #32299RichMember'peterlaursen' wrote:
OK .. I think we can conclude here.
1)
The query that Vishal asked you to execute is a simplified query that Data Sync executes. It may be executed multiple times. It will calculate such checksums on CHUNKS of rows. Let us say CHUNK = 2000 rows. If The first CHUNK reveals no difference of data (as per the checksum) we will proceed to next chunk. So it you have 200,000 rows in table it will be executed 100 times even if data are completely identical. And if each execution take 4 seconds it wil be 400 seconds in total to get the information from MySQL. Moreover: if the calculation in a CHUNK reveals differences of data the query will be executed again in a recursive manner for smaller CHUNKs in order to identify exactly what rows are different on source and target. So in conclusion: how many times such query is executed depends on 'how much different' source and target are and how the differences are distributed in the table.
I guess I am naive in how I thought this would be implemented. I would have expected the sync tool to get an initial result set by executing the query with the where condition, then do the sync'ing operations based strictly on that result set, ignoring all rows not meeting the condition.
'peterlaursen' wrote:2)
The execution time for the sample query show that it is not optimized well by the MySQL server. Because it uses a calculation in the WHERE clause, the result of the calculation is not stored and thus not indexed. Even if EXPLAIN tells that an index is used I believe it will only be for one/first operand of the calculation. I think taht for each (indexed) 1st operand it will still do a full table scan for comparing with the second operand. If you have SQLyog ULTIMATE profiling the query will reveal more information (in particular check the increase of 'full table scans' for the sample query).
3)
Option for improvement (as I can see) include:
* If WHERE is less efficient than NOWHERE, don't use WHERE.
* explore the COLUMNS option in Data Sync. The COLUMNS option defines on what columns checksums shoule be calculated (and NOT what columns are sync'ed. All columns are always synced). Specifying only a few and small columns necessary to ensure uniqueness will improve performance of checksums calculation.
* Tuning of global server parameters could improve a lot. Some buffer settings etc. in MySQL are ridculously low with default parameters and it may result in swapping.
* If you have access (and can manage) to rewrite the application you may store (and index) the TIMEDIFF in the table. If you want to avoid redundant information you may drop one of the existing columns.
* Consider replacing MySQL 5.1 with MariaDB 5.2 (it is a drop-in replacement), define and index a 'virtual column' on the TIMEDIFF. However I cannot promise how mcuh it would help in this scenario. You will have to try for yourself in your environment.
* Consider Replication rather than Data Sync for this.
I use only the key columns (2) for compare.
I'm not really disposed to tune the server params although I guess that's how people eventually get educated.
I would like to use Replication but this operation has to be done over an ssh port with the slave outside of the secure network such that it cannot be pointed to the master. If I could find a source of info that would let me replicate such that the master points to the slaves rather than vice versa, I would do that (if can do over ssh)
Thanks for putting significant effort into helping investigate this.
Rich
-
-
AuthorPosts
- You must be logged in to reply to this topic.