Unsupported Screen Size: The viewport size is too small for the theme to render properly.

Help With Where During Sync

  • This topic is empty.
Viewing 14 reply threads
  • Author
    Posts
    • #12343
      Rich
      Member

      Hello,

      I have had a hard time figuring out a WHERE clause that will let me limit the row set that is checked for synchronization.

      I have a table where the key columns are a varchar and a timestamp, so I only included those columns in the sync job.

      I want to limit the sync to timestamp values that occurred in the last 24 hours.

      The only one I've been able to come up with is:

      TO_DAYS(NOW()) – TO_DAYS(datatime) < 1 This works when testing it in the query tab to return only the rows within the last 24 hours, however when I use the above WHERE clause in a sync job, the time to sync a 2.2M row db table increases substantially (from ~ 137s using no Where clause to 606s; both are too long), whereas I am trying to achieve considerably faster times by considering a tiny fraction of the rows rather than all of them. Is the call to Now() bogging down the process? or the function calls TO_DAYS() are compute intensive? Is there an efficient way to involve only the last 24 hours of rows in the sync operation? Thanks, Rich Running v8.3.2 on Windows 2000 Server, cannot upgrade versions on this OS

    • #32286
      peterlaursen
      Participant

      I wonder if you have an index on this DATETIME column? If not a lot of full table scans could happen.

      Anyway it is a little difficult for us to understand. Maybe you could post the SHOW CREATE TABLE output for this table as well as the exact WHERE.clauses you use. Next we can frame queries that you can profile/EXPLAIN.

    • #32287
      Rich
      Member
      'peterlaursen' wrote:

      I wonder if you have an index on this DATETIME column? If not a lot of full table scans could happen.

      Anyway it is a little difficult for us to understand. Maybe you could post the SHOW CREATE TABLE output for this table as well as the exact WHERE.clauses you use. Next we can frame queries that you can profile/EXPLAIN.

      I meant to make clear that a primary key is made up of the 'Datatime' (which is timestamp) and 'Tagname' (which is varchar) columns. I am under the impression that they are then indexed.

      The SHOW CREATE TABLE has the following result:

      CREATE TABLE `tag_data15mins` (

      `TAGNAME` varchar(255) NOT NULL,

      `DATATIME` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,

      `VALUE` float DEFAULT NULL,

      `EXPIRY` timestamp NULL DEFAULT NULL,

      `LASTFETCHED` timestamp NULL DEFAULT NULL,

      `ERROR_DESC` varchar(512) DEFAULT NULL,

      `QUALITY` float DEFAULT NULL,

      `WWRESOLUTION` float DEFAULT NULL,

      `WWRETRIEVALMODE` varchar(255) DEFAULT NULL,

      PRIMARY KEY (`TAGNAME`,`DATATIME`),

      CONSTRAINT `TAG_DATA15MINS_FK` FOREIGN KEY (`TAGNAME`) REFERENCES `tagnames` (`TAGNAME`) ON DELETE CASCADE

      ) ENGINE=InnoDB DEFAULT CHARSET=latin1

      I tried the the following query in sqlYog, which would be ok for my sync job (but extended the time as mentioned in original post), however I would really like to always be looking back a full 24 hours:

      SELECT * , TO_DAYS(NOW()) – TO_DAYS(datatime) AS diff FROM tag_data15mins

      WHERE tagname LIKE 'WBSF_INCIN1_BREACHOUTLET'

      AND TO_DAYS(NOW()) – TO_DAYS(datatime) < 1

      ORDER BY datatime DESC

    • #32288
      peterlaursen
      Participant

      I am not sure if this index will work with such where clause. I think it will not. The DATETIME column is not first/leftmost in the index.

      Please refer http://dev.mysql.com/doc/refman/5.1/en/mysql-indexes.html “MySQL cannot use an index if the columns do not form a leftmost prefix of the index.”

      Could you please try to add an index on that column alone? I know it could take some time to build the index.

      Also try to EXPLAIN a query using a WHERE clause on the column. Is the index used?

    • #32289
      Rich
      Member
      'peterlaursen' wrote:

      I am not sure if this index will work with such where clause. I think it will not. The DATETIME column is not first/leftmost in the index.

      Please refer http://dev.mysql.com/doc/refman/5.1/en/mysql-indexes.html “MySQL cannot use an index if the columns do not form a leftmost prefix of the index.”

      Could you please try to add an index on that column alone? I know it could take some time to build the index.

      Also try to EXPLAIN a query using a WHERE clause on the column. Is the index used?

      OK, the following shows the index present (i.e., dateTimeNdx):

      CREATE TABLE `tag_data5mins` (

      `TAGNAME` varchar(255) NOT NULL,

      `DATATIME` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,

      `VALUE` float DEFAULT NULL,

      `EXPIRY` timestamp NULL DEFAULT NULL,

      `LASTFETCHED` timestamp NULL DEFAULT NULL,

      `ERROR_DESC` varchar(512) DEFAULT NULL,

      `QUALITY` float DEFAULT NULL,

      `WWRESOLUTION` float DEFAULT NULL,

      `WWRETRIEVALMODE` varchar(255) DEFAULT NULL,

      PRIMARY KEY (`TAGNAME`,`DATATIME`),

      KEY `dateTimeNdx` (`DATATIME`),

      CONSTRAINT `TAG_DATA5MINS_FK` FOREIGN KEY (`TAGNAME`) REFERENCES `tagnames` (`TAGNAME`) ON DELETE CASCADE

      ) ENGINE=InnoDB DEFAULT CHARSET=latin1

      and in the 'possible_keys' column of the EXPLAIN result, the following

      PRIMARY,dateTimeNdx

      indicates that yes, it is “possibly” being used

    • #32290
      Rich
      Member

      I have placed an index on the datatime column in both source and target database, yet still the sync time for a table with 2251881 rows is 595s using the WHERE clause when only 68 rows are different.

      A sync without using the WHERE clause is 153s.

      Is there anything else you can help me do to diagnose and figure out a way to speed this up by limiting the rows?

      Thanks

      Rich

    • #32291
      peterlaursen
      Participant

      We will reply wiht priority tomorrow.

      Did you ever tell your server version? I think I remember some bugs in some MySQL version with indexes on DATETIME columns.

    • #32292
      vishal.pr
      Member

      Hi Rich,

      Please tell what all columns from the table `tag_data5mins` is used in the Columns to Compare option. Also the exact where clause you supplied in the SQL Where (nothing more, nothing less). Once you provide these details we can frame a query that gets the rows to sync and you can execute the same in your environment to see how much time it takes. We can then help you to optimize the WHERE clause used.

      Regards,

      Vishal PR

    • #32293
      Rich
      Member
      'vishal.pr' wrote:

      Hi Rich,

      Please tell what all columns from the table `tag_data5mins` is used in the Columns to Compare option.

      Also the exact where clause you supplied in the SQL Where (nothing more, nothing less).

      TO_DAYS(NOW()) – TO_DAYS(datatime) <= 1

      Once you provide these details we can frame a query that gets the rows to sync and you can execute the same in your environment to see how much time it takes. We can then help you to optimize the WHERE clause used.

      Regards,

      Vishal PR

      Hi,

      Only two columns: Datatime and Tagname

      Where Clause: TO_DAYS(NOW()) – TO_DAYS(datatime) <= 1

      Also, the mySQL version is 5.1 on both servers

    • #32294
      peterlaursen
      Participant

      Even if it may not be relevant please provide the *full* version (not 5.1 but 5.1.57 or whatever).

      But now I understand (I should before) that the WHERE uses a calculation. I think this is the problem. Because the result of the calculation is not stored and thus not indexed either. Some databases have 'virtual columns' (that can be indexed) for handling this. The standard MySQL server does not (but latest MariaDB has).

      Anyway I'll let Vishal and our testers comment too.

    • #32295
      Rich
      Member
      'peterlaursen' wrote:

      Even if it may not be relevant please provide the *full* version (not 5.1 but 5.1.57 or whatever).

      But now I understand (I should before) that the WHERE uses a calculation. I think this is the problem. Because the result of the calculation is not stored and thus not indexed either. Some databases have 'virtual columns' (that can be indexed) for handling this. The standard MySQL server does not (but latest MariaDB has).

      Anyway I'll let Vishal and our testers comment too.

      5.1.43

    • #32296
      vishal.pr
      Member

      Can you execute the following query in your environment and see how much time it is taking?

      Code:
      SELECT
      `TAGNAME`,
      `DATATIME`,
      `VALUE`,
      `EXPIRY`,
      `LASTFETCHED`,
      `ERROR_DESC`,
      `QUALITY`,
      `WWRESOLUTION`,
      `WWRETRIEVALMODE`,
      MD5(
      CONCAT_WS('#', `TAGNAME`, `DATATIME`)
      ) AS __crc
      FROM
      `tag_data5mins`
      WHERE (1 = 1)
      AND (TO_DAYS(NOW()) – TO_DAYS(DATATIME) <= 1)
      ORDER BY `TAGNAME` ASC,
      `DATATIME` ASC;

      **Query formatted using SQLyog SQL Formatter.

      Regards,

      Vishal PR

    • #32297
      Rich
      Member

      Hi Vishal,

      It takes ~ 3.78s consistently

      Rich

    • #32298
      peterlaursen
      Participant

      OK .. I think we can conclude here.

      1)

      The query that Vishal asked you to execute is a simplified query that Data Sync executes. It may be executed multiple times. It will calculate such checksums on CHUNKS of rows. Let us say CHUNK = 2000 rows. If The first CHUNK reveals no difference of data (as per the checksum) we will proceed to next chunk. So it you have 200,000 rows in table it will be executed 100 times even if data are completely identical. And if each execution take 4 seconds it wil be 400 seconds in total to get the information from MySQL. Moreover: if the calculation in a CHUNK reveals differences of data the query will be executed again in a recursive manner for smaller CHUNKs in order to identify exactly what rows are different on source and target. So in conclusion: how many times such query is executed depends on 'how much different' source and target are and how the differences are distributed in the table.

      2)

      The execution time for the sample query show that it is not optimized well by the MySQL server. Because it uses a calculation in the WHERE clause, the result of the calculation is not stored and thus not indexed. Even if EXPLAIN tells that an index is used I believe it will only be for one/first operand of the calculation. I think taht for each (indexed) 1st operand it will still do a full table scan for comparing with the second operand. If you have SQLyog ULTIMATE profiling the query will reveal more information (in particular check the increase of 'full table scans' for the sample query).

      3)

      Option for improvement (as I can see) include:

      * If WHERE is less efficient than NOWHERE, don't use WHERE.

      * explore the COLUMNS option in Data Sync. The COLUMNS option defines on what columns checksums shoule be calculated (and NOT what columns are sync'ed. All columns are always synced). Specifying only a few and small columns necessary to ensure uniqueness will improve performance of checksums calculation.

      * Tuning of global server parameters could improve a lot. Some buffer settings etc. in MySQL are ridculously low with default parameters and it may result in swapping.

      * If you have access (and can manage) to rewrite the application you may store (and index) the TIMEDIFF in the table. If you want to avoid redundant information you may drop one of the existing columns.

      * Consider replacing MySQL 5.1 with MariaDB 5.2 (it is a drop-in replacement), define and index a 'virtual column' on the TIMEDIFF. However I cannot promise how mcuh it would help in this scenario. You will have to try for yourself in your environment.

      * Consider Replication rather than Data Sync for this.

    • #32299
      Rich
      Member
      'peterlaursen' wrote:

      OK .. I think we can conclude here.

      1)

      The query that Vishal asked you to execute is a simplified query that Data Sync executes. It may be executed multiple times. It will calculate such checksums on CHUNKS of rows. Let us say CHUNK = 2000 rows. If The first CHUNK reveals no difference of data (as per the checksum) we will proceed to next chunk. So it you have 200,000 rows in table it will be executed 100 times even if data are completely identical. And if each execution take 4 seconds it wil be 400 seconds in total to get the information from MySQL. Moreover: if the calculation in a CHUNK reveals differences of data the query will be executed again in a recursive manner for smaller CHUNKs in order to identify exactly what rows are different on source and target. So in conclusion: how many times such query is executed depends on 'how much different' source and target are and how the differences are distributed in the table.

      I guess I am naive in how I thought this would be implemented. I would have expected the sync tool to get an initial result set by executing the query with the where condition, then do the sync'ing operations based strictly on that result set, ignoring all rows not meeting the condition.

      'peterlaursen' wrote:

      2)

      The execution time for the sample query show that it is not optimized well by the MySQL server. Because it uses a calculation in the WHERE clause, the result of the calculation is not stored and thus not indexed. Even if EXPLAIN tells that an index is used I believe it will only be for one/first operand of the calculation. I think taht for each (indexed) 1st operand it will still do a full table scan for comparing with the second operand. If you have SQLyog ULTIMATE profiling the query will reveal more information (in particular check the increase of 'full table scans' for the sample query).

      3)

      Option for improvement (as I can see) include:

      * If WHERE is less efficient than NOWHERE, don't use WHERE.

      * explore the COLUMNS option in Data Sync. The COLUMNS option defines on what columns checksums shoule be calculated (and NOT what columns are sync'ed. All columns are always synced). Specifying only a few and small columns necessary to ensure uniqueness will improve performance of checksums calculation.

      * Tuning of global server parameters could improve a lot. Some buffer settings etc. in MySQL are ridculously low with default parameters and it may result in swapping.

      * If you have access (and can manage) to rewrite the application you may store (and index) the TIMEDIFF in the table. If you want to avoid redundant information you may drop one of the existing columns.

      * Consider replacing MySQL 5.1 with MariaDB 5.2 (it is a drop-in replacement), define and index a 'virtual column' on the TIMEDIFF. However I cannot promise how mcuh it would help in this scenario. You will have to try for yourself in your environment.

      * Consider Replication rather than Data Sync for this.

      I use only the key columns (2) for compare.

      I'm not really disposed to tune the server params although I guess that's how people eventually get educated.

      I would like to use Replication but this operation has to be done over an ssh port with the slave outside of the secure network such that it cannot be pointed to the master. If I could find a source of info that would let me replicate such that the master points to the slaves rather than vice versa, I would do that (if can do over ssh)

      Thanks for putting significant effort into helping investigate this.

      Rich

Viewing 14 reply threads
  • You must be logged in to reply to this topic.