Optimizing PostGIS/Geoserver schema for huge dataset

Andrew Gaydos

Hi,

I'm trying to efficiently serve forecast streamflow WMS images from a PostGIS database using GeoServer 2.9.x. My dataset (currently > 5 TB) consists of 2.3 million static polylines, each of which has a number of values that vary over a time dimension, a forecast-time dimension, and a model-run dimension.

To manage this dataset, I've set up a static table containing the geometry of each polyline, with an id assigned to each polyline. Additionally, I've set up a 'times' table which assigns a unique id to each time/forecast time/model run combination. Third, I've put the values in a huge partitioned data table whose rows carry the id of a polyline (referencing the static table) and the id of a time combination (referencing the times table). The data table is partitioned by the time id, so each partition contains only the values applicable to a particular time/forecast time/model combination.

So I have:

static table: (2.3 million rows)

                                                       Table     "public.staticflow"
   Column    |              Type              |                          Modifiers                          | Storage | Stats target | Description 
--------------+--------------------------------+-------------------------------------------------------------+---------+--------------+-------------
id           | integer                        | not null default nextval('staticflow_id_seq'::regclass) | plain   |              | 
shape_length | double precision               |                                                             | plain   |              | 
strorder     | integer                        |                                                             | plain   |              | 
wkb_geometry | geometry(MultiLineString,4269) |                                                             | main    |              | 
Indexes:
   "staticflow_pkey" PRIMARY KEY, btree (id)
   "staticflow_id_idx" btree (id)
   "staticflow_wkb_geometry_geom_idx" gist (wkb_geometry) CLUSTER

times table: (10,000 rows)

                                                     Table "public.flow_times"
  Column   |            Type             |                           Modifiers                           | Storage  | Stats target | Description 
-----------+-----------------------------+---------------------------------------------------------------+----------+--------------+-------------
id        | integer                     | not null default nextval('flow_times_id_seq'::regclass)      | plain    |              | 
validtime | timestamp without time zone |                                                               | plain    |              | 
inittime  | timestamp without time zone |                                                               | plain    |              | 
model     | character varying(30)       |                                                               | extended |              | 
Indexes:
    "flow_times_pkey" PRIMARY KEY, btree (id)
    "flow_times_id_idx" btree (id)
    "flow_times_inittime_idx" btree (inittime)
    "flow_times_model_idx" btree (model)
    "flow_times_validtime_idx" btree (validtime)

data table: (10,000 partitions, each with 2.3 million rows)

Table "public.flow"
   Column   |  Type   | Modifiers | Storage | Stats target | Description 
------------+---------+-----------+---------+--------------+-------------
 station_id | integer |           | plain   |              | 
 timeid     | integer |           | plain   |              | 
 streamflow | real    |           | plain   |              | 
 velocity   | real    |           | plain   |              | 
Child tables: flow_15192,
              flow_15193,
              flow_15194,
             ...

Each partition has its own index and check constraint, e.g. for flow_15192:

Indexes:
    "flow_15192_station_id_idx" btree (station_id)
Check constraints:
    "flow_15192_timeid_check" CHECK (timeid = 15192)
Inherits: flow
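
For reference, a child partition under this scheme would be created roughly like this (a sketch using PostgreSQL 9.x table inheritance; the exact DDL is inferred from the listing above):

-- Hypothetical DDL for one child partition (inheritance-based partitioning):
CREATE TABLE flow_15192 (
    CHECK (timeid = 15192)
) INHERITS (flow);

-- Per-partition index on the join key:
CREATE INDEX flow_15192_station_id_idx ON flow_15192 (station_id);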

To serve this with GeoServer, I created a Postgres view that joins it all together, and I query the WMS against this view with a CQL filter on the time id (obtained through a separate query).

CREATE VIEW flow_view AS
 SELECT staticflow.id,
    flow_times.id AS timeid,
    flow_times.validtime,
    flow_times.inittime,
    flow_times.model,
    staticflow.strorder,
    flow.streamflow,
    flow.velocity,
    staticflow.shape_length,
    staticflow.wkb_geometry
   FROM staticflow,
    flow_times,
    flow
  WHERE flow.station_id = staticflow.id AND flow_times.id = flow.timeid;
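
(The separate time-id lookup is something like the following; the timestamp and model values here are placeholders, not from the original post:)

SELECT id
  FROM flow_times
 WHERE validtime = '2017-04-01 12:00:00'
   AND inittime  = '2017-04-01 00:00:00'
   AND model     = 'my_model';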

I've looked at the query plan for the queries that GeoServer is sending to Postgres, e.g:

explain analyze SELECT "streamflow",encode(ST_AsBinary(ST_Simplify(ST_Force2D("wkb_geometry"), 1.0357088025898521E-4)),'base64') as "wkb_geometry" FROM "public"."flow_view" WHERE  ("timeid" = 81388 AND "timeid" IS NOT NULL  AND "wkb_geometry" && ST_GeomFromText('POLYGON ((-115.13723373413087 41.01267720039345, -115.13723373413087 41.04660520510985, -115.09225845336915 41.04660520510985, -115.09225845336915 41.01267720039345, -115.13723373413087 41.01267720039345))', 4269) AND (("streamflow" > -1.0E-4 AND "streamflow" IS NOT NULL  AND "streamflow" < 5.0) OR ("streamflow" > 4.9999 AND "streamflow" IS NOT NULL  AND "streamflow" < 10.0) OR ("streamflow" > 9.9999 AND "streamflow" IS NOT NULL  AND "streamflow" < 15.0) OR ("streamflow" > 14.9999 AND "streamflow" IS NOT NULL  AND "streamflow" < 20.0) OR ("streamflow" > 19.9999 AND "streamflow" IS NOT NULL  AND "streamflow" < 25.0) OR ("streamflow" > 24.9999 AND "streamflow" IS NOT NULL  AND "streamflow" < 30.0) OR ("streamflow" > 29.9999 AND "streamflow" IS NOT NULL  AND "streamflow" < 100000.0)));


     QUERY PLAN                                                                                                                                                                                                                                                                                                                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.70..163.24 rows=11 width=465) (actual time=0.101..0.200 rows=9 loops=1)
   ->  Index Only Scan using conus_flow_times_id_idx on conus_flow_times  (cost=0.29..8.31 rows=1 width=4) (actual time=0.020..0.020 rows=1 loops=1)
         Index Cond: ((id IS NOT NULL) AND (id = 81388))
         Heap Fetches: 1
   ->  Nested Loop  (cost=0.41..154.71 rows=11 width=469) (actual time=0.045..0.104 rows=9 loops=1)
         ->  Index Scan using staticflow_wkb_geometry_geom_idx on staticflow  (cost=0.41..52.62 rows=12 width=465) (actual time=0.026..0.058 rows=9 loops=1)
               Index Cond: (wkb_geometry && '0103000020AD100000010000000500000001000070C8C85CC01F8C10689F81444001000070C8C85CC095E2CB28F785444001000090E7C55CC095E2CB28F785444001000090E7C55CC01F8C10689F81444001000070C8C85CC01F8C10689F814440'::geometry)
         ->  Append  (cost=0.00..8.49 rows=2 width=12) (actual time=0.004..0.005 rows=1 loops=9)
               ->  Seq Scan on conus_flow  (cost=0.00..0.00 rows=1 width=12) (actual time=0.000..0.000 rows=0 loops=9)
                     Filter: ((streamflow IS NOT NULL) AND (timeid = 81388) AND (staticflow.id = station_id) AND (((streamflow > (-0.0001)::double precision) AND (streamflow < 5::double precision)) OR ((streamflow > 4.9999::double precision) AND (streamflow < 10::double precision)) OR ((streamflow > 9.9999::double precision) AND (streamflow < 15::double precision)) OR ((streamflow > 14.9999::double precision) AND (streamflow < 20::double precision)) OR ((streamflow > 19.9999::double precision) AND (streamflow < 25::double precision)) OR ((streamflow > 24.9999::double precision) AND (streamflow < 30::double precision)) OR ((streamflow > 29.9999::double precision) AND (streamflow < 100000::double precision))))
               ->  Index Scan using conus_flow_81388_station_id_idx on conus_flow_81388  (cost=0.43..8.49 rows=1 width=12) (actual time=0.004..0.004 rows=1 loops=9)
                     Index Cond: (station_id = staticflow.id)
                     Filter: ((streamflow IS NOT NULL) AND (timeid = 81388) AND (((streamflow > (-0.0001)::double precision) AND (streamflow < 5::double precision)) OR ((streamflow > 4.9999::double precision) AND (streamflow < 10::double precision)) OR ((streamflow > 9.9999::double precision) AND (streamflow < 15::double precision)) OR ((streamflow > 14.9999::double precision) AND (streamflow < 20::double precision)) OR ((streamflow > 19.9999::double precision) AND (streamflow < 25::double precision)) OR ((streamflow > 24.9999::double precision) AND (streamflow < 30::double precision)) OR ((streamflow > 29.9999::double precision) AND (streamflow < 100000::double precision))))
 Planning time: 3041.998 ms
 Execution time: 1.192 ms

(The streamflow filter styles the polylines by streamflow value using an SLD. Removing it does not seem to speed up the queries significantly.)

My questions are

  1. It seems that for every session, there is a one-time penalty for the first query (several minutes) after which queries tend to run much quicker (about 10 seconds for all the tiles to be served). What is going on here?
  2. Is there a way to optimize GeoServer's queries against this schema, or a more efficient query to try?
  3. Are there other Postgres optimizations that might help?

I'm pretty new to both GeoServer and PostGIS and have a sinking feeling that I could be structuring this dataset and queries more efficiently, but I've run out of ideas and don't have any postgres experts at work to ask, so I'm posting here.

Thanks for any insight!

-Andy

Re: Optimizing PostGIS/Geoserver schema for huge dataset

Andy Colson-2
On 03/31/2017 11:38 AM, Andrew Gaydos wrote:

> Hi,
>
>
>
> My questions are
>
>  1. It seems that for every session, there is a one-time penalty for the first query (several minutes) after which queries tend to run much quicker (about 10 seconds for all the tiles to be served). What is going on here?
>  2. Is there a way to optimize GeoServer's queries against this schema, or a more efficient query to try?
>  3. other postgres optimizations that might help?
>
> I'm pretty new to both GeoServer and PostGIS and have a sinking feeling that I could be structuring this dataset and queries more efficiently, but I've run out of ideas and don't have any postgres experts at work to ask, so I'm posting here.
>
> Thanks for any insight!
>
> -Andy

Andy's Unite!

err.. anyway, here is the problem:

> data table: (10,000 partitions, each with 2.3 million rows)


Lots of partitions will kill planning time. Look at the very bottom of:
https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html

Do you have your heart set on lots of partitions?  How'd you feel about 100? or maybe 1000?
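
(The pruning behavior those docs describe relies on constraint exclusion; 'partition' is the default in 9.x, so this is just a sanity check:)

SHOW constraint_exclusion;             -- expect 'partition' in 9.x
SET constraint_exclusion = partition;  -- lets the planner skip children via their CHECK constraints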

-Andy
Re: Optimizing PostGIS/Geoserver schema for huge dataset

Andrew Gaydos
Thanks for the help!

I originally tried putting everything into a single non-partitioned table but the performance was horrible! Since each set of 2.3M rows shares the same timestamp, I thought this would be a good way to divide up the data when partitioning - I set a constraint on each, e.g.

table 1: constraint: timeid=101
table 2: constraint: timeid=102
etc.

I could try grouping times into a single table, e.g.

table 1: constraint: 100 <= timeid < 110
table 2: constraint: 110 <= timeid < 120
etc.

so that would give me 1000 partitions of 24 million rows each.
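
In DDL terms, I imagine each coarser child would look something like this (a sketch; the names are made up, and each child now needs its own timeid index since it holds several time steps):

CREATE TABLE flow_100_109 (
    CHECK (timeid >= 100 AND timeid < 110)
) INHERITS (flow);

CREATE INDEX flow_100_109_timeid_station_idx ON flow_100_109 (timeid, station_id);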

Is this what you were suggesting? What do you think the optimal balance of partitions and rows would be? 100 partitions of 240 million rows each? 10 partitions of 2.4 billion rows each? At some point I think I would run into the insufferable performance I was getting with a single table, though.

Actually, now that I check, the number of partitions is closer to 17,000 and the number of rows per partition is 2.7M, so about 46 billion rows altogether...

Thanks again!
-Andy

On Fri, Mar 31, 2017 at 6:15 PM, Andy Colson <[hidden email]> wrote:
> Lots of partitions will kill planning time. Look at the very bottom of:
> https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html
>
> Do you have your heart set on lots of partitions?  How'd you feel about 100? or maybe 1000?

Re: Optimizing PostGIS/Geoserver schema for huge dataset

Andy Colson-2
This says 100:
http://stackoverflow.com/questions/6104774/how-many-table-partitions-is-too-many-in-postgres

This says 1000 is too many:
http://dba.stackexchange.com/questions/95977/maximum-partitioning-postgresql

Honestly you'd have to benchmark it, because I have no idea if there is a difference between 100 and 1000.

That being said, I'm surprised a good index isn't fast enough.  Partitions do cut the index size down, which is good, but it still has to scan all the child tables to see which match.
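
(For the unpartitioned variant, "a good index" would be a composite btree along these lines -- the name is hypothetical:)

CREATE INDEX flow_timeid_station_idx ON flow (timeid, station_id);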

Do you ever update or delete data from the flow table?
Correct me if I'm wrong, but it looks like your where clause only uses fields: timeid, wkb_geometry and streamflow.  Yes?


Your explain analyze includes table conus_flow, which isn't in your query or view.  Are you sure that's the right explain analyze?

-Andy


On 03/31/2017 08:18 PM, Andrew Gaydos wrote:

> Thanks for the help!
>
> I originally tried putting everything into a single non-partitioned table but the performance was horrible! Since each set of 2.3M rows shares the same timestamp, I thought this would be a good way to divide up the data when partitioning - I set a constraint on each, e.g.
>
> table 1: constraint: timeid=101
> table 2: constraint: timeid=102
> etc.
>
> I could try grouping times into a single table, e.g.
>
> table 1: constraint: 100 <= timeid < 110
> table 2: constraint: 110 <= timeid < 120
> etc.
>
> so that would give me 1000 partitions of 24 million rows each.
>
> Is this what you were suggesting? What do you think the optimal balance of partitions and rows would be? 100 partitions of 240 million rows each? 10 partitions of 2.4 billion rows each? At some point I think I would run into the insufferable performance I was getting with a single table, though.
>
> Actually, now that I check the number of partitions is closer to 17,000, and number of rows per is 2.7M, so 46 billion rows altogether...
>
> Thanks again!
> -Andy
>
> On Fri, Mar 31, 2017 at 6:15 PM, Andy Colson <[hidden email] <mailto:[hidden email]>> wrote:
>
>     On 03/31/2017 11:38 AM, Andrew Gaydos wrote:
>
>         Hi,
>
>
>
>         My questions are
>
>          1. It seems that for every session, there is a one-time penalty for the first query (several minutes) after which queries tend to run much quicker (about 10 seconds for all the tiles to be served). What is going on here?
>          2. Is there a way to optimize GeoServer's queries against this schema, or a more efficient query to try?
>          3. other postgres optimizations that might help?
>
>         I'm pretty new to both GeoServer and PostGIS and have a sinking feeling that I could be structuring this dataset and queries more efficiently, but I've run out of ideas and don't have any postgres experts at work to ask, so I'm posting here.
>
>         Thanks for any insight!
>
>         -Andy
>
>
>     Andy's Unite!
>
>     err.. anyway, Here is the problem:
>
>         data table: (10,000 partitions, each with 2.3 million rows)
>
>
>
>     Lots of partitions will kill planning time. Look at the very bottom of:
>     https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html <https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html>
>
>     Do you have your heart set on lots of partitions?  How'd you feel about 100? or maybe 1000?
>
>     -Andy
>

_______________________________________________
postgis-users mailing list
[hidden email]
https://lists.osgeo.org/mailman/listinfo/postgis-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Optimizing PostGIS/Geoserver schema for huge dataset

Andy Colson-2
On 03/31/2017 09:33 PM, Andy Colson wrote:
>
> This says 1000 is too many:
> http://dba.stackexchange.com/questions/95977/maximum-partitioning-postgresql

Drat.  That says for 400K records you don't need 1000 partitions.  Sorry, I didn't read that well; ignore this one.

I found another bit of interesting reading for partitions:
http://www.postgresql-archive.org/What-s-a-reasonable-maximum-number-for-table-partitions-td5837767.html

-Andy

Re: Optimizing PostGIS/Geoserver schema for huge dataset

Andrew Gaydos
Yup, timeid, wkb_geometry and streamflow.

'streamflow' is a data value residing in one of the partitions (conus_flow is the master table which is partitioned)
'wkb_geometry' is a static polyline geometry in a 'staticchannels' table.
'timeid' is a unique identifier that is also a key into another table which provides a valid timestamp, forecast timestamp, and model configuration.

GeoServer queries the 'conus_flow_view' view using a timeid, which returns the set of polylines joined with the values for that particular valid/init/model configuration.

I imagine that the EXPLAIN goes through all of the joins of the view, which is where conus_flow (and its partitions) comes into play.

My requirement is to maintain a month's worth of data, so once per day I delete records older than 30 days. Other than that, and the initial data inserts (inserts happen in batches every two hours), the data can be considered 'read only' (i.e. no additional deletes/updates/replaces).
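
(A side benefit of one-partition-per-timeid, assuming expired timeids map to whole partitions: the daily purge can be a DROP TABLE on old children rather than a DELETE, which avoids table bloat entirely. The child name below is hypothetical:)

DROP TABLE conus_flow_15192;  -- removes one expired partition outright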

-Andy



On Fri, Mar 31, 2017 at 8:33 PM, Andy Colson <[hidden email]> wrote:
This says 100:
http://stackoverflow.com/questions/6104774/how-many-table-partitions-is-too-many-in-postgres

This says 1000 is too many:
http://dba.stackexchange.com/questions/95977/maximum-partitioning-postgresql

Honestly you'd have to benchmark it, because I have no idea if there is a difference between 100 and 1000.

That being said, I'm surprised a good index isn't fast enough.  Partitions do cut the index size down, which is good, but it still has to scan all the child tables to see which match.

Do you ever update or delete data from the flow table?
Correct me if I'm wrong, but it looks like your where clause only uses fields: timeid, wkb_geometry and streamflow.  Yes?


Your explain analyze include table conus_flow, which isnt int your query or view.  Are you sure that's the right explain analyze?

-Andy


On 03/31/2017 08:18 PM, Andrew Gaydos wrote:
Thanks for the help!

I originally tried putting everything into a single non-partitioned table but the performance was horrible! Since each set of 2.3M rows shares the same timestamp, I thought this would be a good way to divide up the data when partitioning - I set a constraint on each, e.g.

table 1: constraint: timeid=101
table 2: constraint: timeid=102
etc.

I could try grouping times into a single table, e.g.

table 1: constraint: 100 <= timeid < 110
table 2: constraint: 110 <= timeid < 120
etc.

so that would give me 1000 partitions of 24 million rows each.

Is this what you were suggesting? What do you think the optimal balance of partitions and rows would be? 100 partitions of 240 million rows each? 10 partitions of 2.4 billion rows each? At some point I think I would run into the insufferable performance I was getting with a single table, though.

Actually, now that I check the number of partitions is closer to 17,000, and number of rows per is 2.7M, so 46 billion rows altogether...

Thanks again!
-Andy

On Fri, Mar 31, 2017 at 6:15 PM, Andy Colson <[hidden email] <mailto:[hidden email]>> wrote:

    On 03/31/2017 11:38 AM, Andrew Gaydos wrote:

        Hi,



        My questions are

         1. It seems that for every session, there is a one-time penalty for the first query (several minutes) after which queries tend to run much quicker (about 10 seconds for all the tiles to be served). What is going on here?
         2. Is there a way to optimize GeoServer's queries against this schema, or a more efficient query to try?
         3. other postgres optimizations that might help?

        I'm pretty new to both GeoServer and PostGIS and have a sinking feeling that I could be structuring this dataset and queries more efficiently, but I've run out of ideas and don't have any postgres experts at work to ask, so I'm posting here.

        Thanks for any insight!

        -Andy


    Andy's Unite!

    err.. anyway, Here is the problem:

        data table: (10,000 partitions, each with 2.3 million rows)



    Lots of partitions will kill planning time. Look at the very bottom of:
    https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html <https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html>

    Do you have your heart set on lots of partitions?  How'd you feel about 100? or maybe 1000?

    -Andy


_______________________________________________
postgis-users mailing list
[hidden email]
https://lists.osgeo.org/mailman/listinfo/postgis-users


_______________________________________________
postgis-users mailing list
[hidden email]
https://lists.osgeo.org/mailman/listinfo/postgis-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Optimizing PostGIS/Geoserver schema for huge dataset

Andy Colson-2
On 03/31/2017 11:38 AM, Andrew Gaydos wrote:

> Hi,
>
> explain analyze SELECT "streamflow",encode(ST_AsBinary(ST_Simplify(ST_Force2D("wkb_geometry"), ... <snip>
>
>  Nested Loop  (cost=0.70..163.24 rows=11 width=465) (actual time=0.101..0.200 rows=9 loops=1)


Does that query really only return 9 rows?

> Planning time: 3041.998 ms
>  Execution time: 1.192 ms

So this query by itself only takes about 3 seconds, and nearly all of that is planning time.

> about 10 seconds for all the tiles to be served
How many tiles does it request?  Math (10/3 = ~3) would tell us 3 tiles, but that would be odd.  I'd think 4 or 8 tiles.



>     "flow_times_pkey" PRIMARY KEY, btree (id)
>     "flow_times_id_idx" btree (id)

You have two indexes on the same field on flow_times.
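
(The primary key already indexes id, so the extra one can simply be dropped:)

DROP INDEX flow_times_id_idx;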

-Andy
Re: Optimizing PostGIS/Geoserver schema for huge dataset

Andrew Gaydos
Hi,

The query takes over a minute the first time, then about 3 seconds on subsequent queries. Some on-the-fly structure must be getting built the first time and cached for a period afterwards (though the slow initial behavior recurs after some idle period, like an hour or so).
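
(If the first-hit cost turns out to be cold-cache I/O rather than planning, the pg_prewarm extension -- available since PostgreSQL 9.4 -- can load a table into the buffer cache ahead of time; a sketch:)

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('staticflow');  -- warm the geometry table into shared buffers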

Yes, in this case it only returns 9 rows, because my bounding box is small.

I am getting 4x4=16 tiles - the requests are done in parallel so it's not a straight 3x16 seconds calculation.

Ah yes, duplicate indexes! Thanks for catching that! Would that incur a performance hit on database reads?

Thanks,
-Andy

On Sat, Apr 1, 2017 at 10:28 AM, Andy Colson <[hidden email]> wrote:
> Does that query really only return 9 rows?
>
> How many tiles does it request?
>
> You have two indexes on the same field on flow_times.

Re: Optimizing PostGIS/Geoserver schema for huge dataset

Andy Colson-2

On 04/03/2017 04:39 PM, Andrew Gaydos wrote:
> Hi,
>
> Yes, in this case it only returns 9 rows, because my bounding box is
> small.

Ah, that's the thing that cuts down rows the fastest, so it should be the thing you query on first; then from that subset pull out streamflow and timeid.  Dunno if it's possible to structure the data that way though.
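
(A sketch of that shape -- filter on the geometry first, then join the subset; the bounding-box values are placeholders:)

WITH hits AS (
    SELECT id, wkb_geometry
      FROM staticflow
     WHERE wkb_geometry && ST_MakeEnvelope(-115.14, 41.01, -115.09, 41.05, 4269)
)
SELECT f.streamflow, h.wkb_geometry
  FROM hits h
  JOIN flow f ON f.station_id = h.id
 WHERE f.timeid = 81388;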

 
> I am getting 4x4=16 tiles - the requests are done in parallel so it's
> not a straight 3x16 seconds calculation.

Ah, parallel, yes, that would answer that question.

>
> Ah yes, duplicate indexes! Thanks for catching that! Would that incur
> a performance hit on database reads?

No, it would slow down insert/delete though.

-Andy