Merge pull request #463 from mapbox/stringpool

Limit the depth of the search in the string pool.
Eric Fischer 2017-09-08 10:16:19 -07:00 committed by GitHub
commit e000bcc261
10 changed files with 181 additions and 97 deletions

View File

@ -1,3 +1,7 @@
## 1.24.1
* Limit the size and depth of the string pool for better performance
## 1.24.0
* Add feature filters using the Mapbox GL Style Specification filter syntax

View File

@ -41,7 +41,7 @@ Usage
-----
```sh
$ tippecanoe -o file.mbtiles [file.json file.geobuf ...]
$ tippecanoe -o file.mbtiles [options] [file.json file.geobuf ...]
```
If no files are specified, it reads GeoJSON from the standard input.
@ -52,23 +52,39 @@ You can concatenate multiple GeoJSON features or files together,
and it will parse out the features and ignore whatever other objects
it encounters.
Docker Image
------------
Try this first
--------------
A tippecanoe Docker image can be built from source and executed as a task to
automatically install dependencies and allow tippecanoe to run on any system
supported by Docker.
If you aren't sure what options to use, try this:
```docker
$ docker build -t tippecanoe:latest .
$ docker run -it --rm \
-v /tiledata:/data \
tippecanoe:latest \
tippecanoe --output=/data/output.mbtiles /data/example.geojson
```sh
$ tippecanoe -o out.mbtiles -zg --drop-densest-as-needed in.geojson
```
The commands above will build a Docker image from the source and compile the
latest version. The image supports all tippecanoe flags and options.
The `-zg` option will make Tippecanoe choose a maximum zoom level that should be
high enough to reflect the precision of the original data. (If it turns out still
not to be as detailed as you want, use `-z` manually with a higher number.)
If the tiles come out too big, the `--drop-densest-as-needed` option will make
Tippecanoe try dropping what should be the least visible features at each zoom level.
(If it drops too many features, use `-x` to leave out some feature attributes that
you didn't really need.)
Examples
--------
Create a tileset of TIGER roads for Alameda County, to zoom level 13, with a custom layer name and description:
```sh
$ tippecanoe -o alameda.mbtiles -l alameda -n "Alameda County from TIGER" -z13 tl_2014_06001_roads.json
```
Create a tileset of all TIGER roads, at only zoom level 12, but with higher detail than normal,
with a custom layer name and description, and leaving out the `LINEARID` and `RTTYP` attributes:
```
$ cat tiger/tl_2014_*_roads.json | tippecanoe -o tiger.mbtiles -l roads -n "All TIGER roads, one zoom" -z12 -Z12 -d14 -x LINEARID -x RTTYP
```
Options
-------
@ -116,7 +132,7 @@ If your input is formatted as newline-delimited GeoJSON, use `-P` to make input
### Parallel processing of input
* `-P` or `--read-parallel`: Use multiple threads to read different parts of each input file at once.
* `-P` or `--read-parallel`: Use multiple threads to read different parts of each GeoJSON input file at once.
This will only work if the input is line-delimited JSON with each Feature on its
own line, because it knows nothing of the top-level structure around the Features. Spurious "EOF" error
messages may result otherwise.
@ -127,6 +143,8 @@ If the input file begins with the [RFC 8142](https://tools.ietf.org/html/rfc8142
parallel processing of input will be invoked automatically, splitting at record separators rather
than at all newlines.
Parallel processing will also be automatic if the input file is in Geobuf format.
### Projection of input
* `-s` _projection_ or `--projection=`_projection_: Specify the projection of the input data. Currently supported are `EPSG:4326` (WGS84, the default) and `EPSG:3857` (Web Mercator). In general you should use WGS84 for your input files if at all possible.
@ -142,8 +160,8 @@ than at all newlines.
### Tile resolution
* `-d` _detail_ or `--full-detail=`_detail_: Detail at max zoom level (default 12, for tile resolution of 4096)
* `-D` _detail_ or `--low-detail=`_detail_: Detail at lower zoom levels (default 12, for tile resolution of 4096)
* `-d` _detail_ or `--full-detail=`_detail_: Detail at max zoom level (default 12, for tile resolution of 2^12=4096)
* `-D` _detail_ or `--low-detail=`_detail_: Detail at lower zoom levels (default 12, for tile resolution of 2^12=4096)
* `-m` _detail_ or `--minimum-detail=`_detail_: Minimum detail that it will try if tiles are too big at regular detail (default 7)
All internal math is done in terms of a 32-bit tile coordinate system, so 1/(2^32) of the size of Earth,
@ -166,7 +184,7 @@ resolution is obtained than by using a smaller _maxzoom_ or _detail_.
Example: to find the Natural Earth countries with low `scalerank` but high `LABELRANK`:
```
tippecanoe -o filtered.mbtiles -j '{ "ne_10m_admin_0_countries": [ "all", [ "<", "scalerank", 3 ], [ ">", "LABELRANK", 5 ] ] }' ne_10m_admin_0_countries.geojson
tippecanoe -z5 -o filtered.mbtiles -j '{ "ne_10m_admin_0_countries": [ "all", [ "<", "scalerank", 3 ], [ ">", "LABELRANK", 5 ] ] }' ne_10m_admin_0_countries.geojson
```
### Dropping a fixed fraction of features by zoom level
@ -297,17 +315,6 @@ Environment
Tippecanoe ordinarily uses as many parallel threads as the operating system claims that CPUs are available.
You can override this number by setting the `TIPPECANOE_MAX_THREADS` environmental variable.
Example
-------
```sh
$ tippecanoe -o alameda.mbtiles -l alameda -n "Alameda County from TIGER" -z13 tl_2014_06001_roads.json
```
```
$ cat tiger/tl_2014_*_roads.json | tippecanoe -o tiger.mbtiles -l roads -n "All TIGER roads, one zoom" -z12 -Z12 -d14 -x LINEARID -x RTTYP
```
GeoJSON extension
-----------------
@ -437,6 +444,24 @@ sudo apt-get install -y g++-5
export CXX=g++-5
```
Docker Image
------------
A tippecanoe Docker image can be built from source and executed as a task to
automatically install dependencies and allow tippecanoe to run on any system
supported by Docker.
```docker
$ docker build -t tippecanoe:latest .
$ docker run -it --rm \
-v /tiledata:/data \
tippecanoe:latest \
tippecanoe --output=/data/output.mbtiles /data/example.geojson
```
The commands above will build a Docker image from the source and compile the
latest version. The image supports all tippecanoe flags and options.
Examples
------

View File

@ -255,6 +255,9 @@ std::vector<drawvec_type> readGeometry(protozero::pbf_reader &pbf, size_t dim, d
dv.dv = readMultiLine(coords, lengths, dim, e, true);
} else if (type == MULTIPOLYGON) {
dv.dv = readMultiPolygon(coords, lengths, dim, e);
} else {
// GeometryCollection
return ret;
}
dv.type = type / 2 + 1;

View File

@ -37,7 +37,7 @@ $ brew install tippecanoe
.PP
.RS
.nf
$ tippecanoe \-o file.mbtiles [file.json file.geobuf ...]
$ tippecanoe \-o file.mbtiles [options] [file.json file.geobuf ...]
.fi
.RE
.PP
@ -48,24 +48,42 @@ The GeoJSON features need not be wrapped in a FeatureCollection.
You can concatenate multiple GeoJSON features or files together,
and it will parse out the features and ignore whatever other objects
it encounters.
.SH Docker Image
.SH Try this first
.PP
A tippecanoe Docker image can be built from source and executed as a task to
automatically install dependencies and allow tippecanoe to run on any system
supported by Docker.
If you aren't sure what options to use, try this:
.PP
.RS
.nf
$ docker build \-t tippecanoe:latest .
$ docker run \-it \-\-rm \\
\-v /tiledata:/data \\
tippecanoe:latest \\
tippecanoe \-\-output=/data/output.mbtiles /data/example.geojson
$ tippecanoe \-o out.mbtiles \-zg \-\-drop\-densest\-as\-needed in.geojson
.fi
.RE
.PP
The commands above will build a Docker image from the source and compile the
latest version. The image supports all tippecanoe flags and options.
The \fB\fC\-zg\fR option will make Tippecanoe choose a maximum zoom level that should be
high enough to reflect the precision of the original data. (If it turns out still
not to be as detailed as you want, use \fB\fC\-z\fR manually with a higher number.)
.PP
If the tiles come out too big, the \fB\fC\-\-drop\-densest\-as\-needed\fR option will make
Tippecanoe try dropping what should be the least visible features at each zoom level.
(If it drops too many features, use \fB\fC\-x\fR to leave out some feature attributes that
you didn't really need.)
.SH Examples
.PP
Create a tileset of TIGER roads for Alameda County, to zoom level 13, with a custom layer name and description:
.PP
.RS
.nf
$ tippecanoe \-o alameda.mbtiles \-l alameda \-n "Alameda County from TIGER" \-z13 tl_2014_06001_roads.json
.fi
.RE
.PP
Create a tileset of all TIGER roads, at only zoom level 12, but with higher detail than normal,
with a custom layer name and description, and leaving out the \fB\fCLINEARID\fR and \fB\fCRTTYP\fR attributes:
.PP
.RS
.nf
$ cat tiger/tl_2014_*_roads.json | tippecanoe \-o tiger.mbtiles \-l roads \-n "All TIGER roads, one zoom" \-z12 \-Z12 \-d14 \-x LINEARID \-x RTTYP
.fi
.RE
.SH Options
.PP
There are a lot of options. A lot of the time you won't want to use any of them
@ -122,7 +140,7 @@ specified, the files are all merged into the single named layer, even if they tr
.SS Parallel processing of input
.RS
.IP \(bu 2
\fB\fC\-P\fR or \fB\fC\-\-read\-parallel\fR: Use multiple threads to read different parts of each input file at once.
\fB\fC\-P\fR or \fB\fC\-\-read\-parallel\fR: Use multiple threads to read different parts of each GeoJSON input file at once.
This will only work if the input is line\-delimited JSON with each Feature on its
own line, because it knows nothing of the top\-level structure around the Features. Spurious "EOF" error
messages may result otherwise.
@ -133,6 +151,8 @@ rather than a stream that can only be read sequentially.
If the input file begins with the RFC 8142 \[la]https://tools.ietf.org/html/rfc8142\[ra] record separator,
parallel processing of input will be invoked automatically, splitting at record separators rather
than at all newlines.
.PP
Parallel processing will also be automatic if the input file is in Geobuf format.
.SS Projection of input
.RS
.IP \(bu 2
@ -154,9 +174,9 @@ specified maximum zoom and to any levels added beyond that.
.SS Tile resolution
.RS
.IP \(bu 2
\fB\fC\-d\fR \fIdetail\fP or \fB\fC\-\-full\-detail=\fR\fIdetail\fP: Detail at max zoom level (default 12, for tile resolution of 4096)
\fB\fC\-d\fR \fIdetail\fP or \fB\fC\-\-full\-detail=\fR\fIdetail\fP: Detail at max zoom level (default 12, for tile resolution of 2^12=4096)
.IP \(bu 2
\fB\fC\-D\fR \fIdetail\fP or \fB\fC\-\-low\-detail=\fR\fIdetail\fP: Detail at lower zoom levels (default 12, for tile resolution of 4096)
\fB\fC\-D\fR \fIdetail\fP or \fB\fC\-\-low\-detail=\fR\fIdetail\fP: Detail at lower zoom levels (default 12, for tile resolution of 2^12=4096)
.IP \(bu 2
\fB\fC\-m\fR \fIdetail\fP or \fB\fC\-\-minimum\-detail=\fR\fIdetail\fP: Minimum detail that it will try if tiles are too big at regular detail (default 7)
.RE
@ -188,7 +208,7 @@ Example: to find the Natural Earth countries with low \fB\fCscalerank\fR but hig
.PP
.RS
.nf
tippecanoe \-o filtered.mbtiles \-j '{ "ne_10m_admin_0_countries": [ "all", [ "<", "scalerank", 3 ], [ ">", "LABELRANK", 5 ] ] }' ne_10m_admin_0_countries.geojson
tippecanoe \-z5 \-o filtered.mbtiles \-j '{ "ne_10m_admin_0_countries": [ "all", [ "<", "scalerank", 3 ], [ ">", "LABELRANK", 5 ] ] }' ne_10m_admin_0_countries.geojson
.fi
.RE
.SS Dropping a fixed fraction of features by zoom level
@ -363,19 +383,6 @@ tippecanoe \-o roads.mbtiles \-c 'if [ $1 \-lt 11 ]; then grep "\\"MTFCC\\": \\"
.PP
Tippecanoe ordinarily uses as many parallel threads as the operating system claims that CPUs are available.
You can override this number by setting the \fB\fCTIPPECANOE_MAX_THREADS\fR environmental variable.
.SH Example
.PP
.RS
.nf
$ tippecanoe \-o alameda.mbtiles \-l alameda \-n "Alameda County from TIGER" \-z13 tl_2014_06001_roads.json
.fi
.RE
.PP
.RS
.nf
$ cat tiger/tl_2014_*_roads.json | tippecanoe \-o tiger.mbtiles \-l roads \-n "All TIGER roads, one zoom" \-z12 \-Z12 \-d14 \-x LINEARID \-x RTTYP
.fi
.RE
.SH GeoJSON extension
.PP
Tippecanoe defines a GeoJSON extension that you can use to specify the minimum and/or maximum zoom level
@ -519,6 +526,24 @@ sudo apt\-get install \-y g++\-5
export CXX=g++\-5
.fi
.RE
.SH Docker Image
.PP
A tippecanoe Docker image can be built from source and executed as a task to
automatically install dependencies and allow tippecanoe to run on any system
supported by Docker.
.PP
.RS
.nf
$ docker build \-t tippecanoe:latest .
$ docker run \-it \-\-rm \\
\-v /tiledata:/data \\
tippecanoe:latest \\
tippecanoe \-\-output=/data/output.mbtiles /data/example.geojson
.fi
.RE
.PP
The commands above will build a Docker image from the source and compile the
latest version. The image supports all tippecanoe flags and options.
.SH Examples
.PP
Check out some examples of maps made with tippecanoe \[la]MADE_WITH.md\[ra]

View File

@ -6,7 +6,7 @@ struct memfile {
char *map;
long long len;
long long off;
long long tree;
unsigned long tree;
};
struct memfile *memfile_open(int fd);

View File

@ -1,6 +1,7 @@
#pragma once
#include <assert.h>
#include <math.h>
#include <cmath>
#if defined(_MSC_VER)
#include "msinttypes/stdint.h"
@ -379,10 +380,10 @@ inline void Prettify(std::string &buffer, int length, int k) {
inline std::string dtoa_milo(double value) {
std::string buffer;
if (isnan(value)) {
if (std::isnan(value)) {
return "nan";
}
if (isinf(value)) {
if (std::isinf(value)) {
if (value < 0) {
return "-inf";
} else {

View File

@ -2,47 +2,43 @@
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <math.h>
#include "memfile.hpp"
#include "pool.hpp"
static unsigned char swizzle[256] = {
0x00, 0xBF, 0x18, 0xDE, 0x93, 0xC9, 0xB1, 0x5E, 0xDF, 0xBE, 0x72, 0x5A, 0xBB, 0x42, 0x64, 0xC6,
0xD8, 0xB7, 0x15, 0x74, 0x1C, 0x8B, 0x91, 0xF5, 0x29, 0x46, 0xEC, 0x6F, 0xCA, 0x20, 0xF0, 0x06,
0x27, 0x61, 0x87, 0xE0, 0x6E, 0x43, 0x50, 0xC5, 0x1B, 0xB4, 0x37, 0xC3, 0x69, 0xA6, 0xEE, 0x80,
0xAF, 0x9B, 0xA1, 0x76, 0x23, 0x24, 0x53, 0xF3, 0x5B, 0x65, 0x19, 0xF4, 0xFC, 0xDD, 0x26, 0xE8,
0x10, 0xF7, 0xCE, 0x92, 0x48, 0xF6, 0x94, 0x60, 0x07, 0xC4, 0xB9, 0x97, 0x6D, 0xA4, 0x11, 0x0D,
0x1F, 0x4D, 0x13, 0xB0, 0x5D, 0xBA, 0x31, 0xD5, 0x8D, 0x51, 0x36, 0x96, 0x7A, 0x03, 0x7F, 0xDA,
0x17, 0xDB, 0xD4, 0x83, 0xE2, 0x79, 0x6A, 0xE1, 0x95, 0x38, 0xFF, 0x28, 0xB2, 0xB3, 0xA7, 0xAE,
0xF8, 0x54, 0xCC, 0xDC, 0x9A, 0x6B, 0xFB, 0x3F, 0xD7, 0xBC, 0x21, 0xC8, 0x71, 0x09, 0x16, 0xAC,
0x3C, 0x8A, 0x62, 0x05, 0xC2, 0x8C, 0x32, 0x4E, 0x35, 0x9C, 0x5F, 0x75, 0xCD, 0x2E, 0xA2, 0x3E,
0x1A, 0xC1, 0x8E, 0x14, 0xA0, 0xD3, 0x7D, 0xD9, 0xEB, 0x5C, 0x70, 0xE6, 0x9E, 0x12, 0x3B, 0xEF,
0x1E, 0x49, 0xD2, 0x98, 0x39, 0x7E, 0x44, 0x4B, 0x6C, 0x88, 0x02, 0x2C, 0xAD, 0xE5, 0x9F, 0x40,
0x7B, 0x4A, 0x3D, 0xA9, 0xAB, 0x0B, 0xD6, 0x2F, 0x90, 0x2A, 0xB6, 0x1D, 0xC7, 0x22, 0x55, 0x34,
0x0A, 0xD0, 0xB5, 0x68, 0xE3, 0x59, 0xFD, 0xFA, 0x57, 0x77, 0x25, 0xA3, 0x04, 0xB8, 0x33, 0x89,
0x78, 0x82, 0xE4, 0xC0, 0x0E, 0x8F, 0x85, 0xD1, 0x84, 0x08, 0x67, 0x47, 0x9D, 0xCB, 0x58, 0x4C,
0xAA, 0xED, 0x52, 0xF2, 0x4F, 0xF1, 0x66, 0xCF, 0xA5, 0x56, 0xEA, 0x7C, 0xE9, 0x63, 0xE7, 0x01,
0xF9, 0xFE, 0x0C, 0x99, 0x2D, 0x0F, 0x3A, 0x41, 0x45, 0xA8, 0x30, 0x2B, 0x73, 0xBD, 0x86, 0x81,
};
int swizzlecmp(const char *a, const char *b) {
while (*a || *b) {
int aa = swizzle[(unsigned char) *a];
int bb = swizzle[(unsigned char) *b];
ssize_t alen = strlen(a);
ssize_t blen = strlen(b);
int cmp = aa - bb;
if (cmp != 0) {
return cmp;
}
a++;
b++;
if (strcmp(a, b) == 0) {
return 0;
}
return 0;
long long hash1 = 0, hash2 = 0;
for (ssize_t i = alen - 1; i >= 0; i--) {
hash1 = (hash1 * 37 + a[i]) & INT_MAX;
}
for (ssize_t i = blen - 1; i >= 0; i--) {
hash2 = (hash2 * 37 + b[i]) & INT_MAX;
}
int h1 = hash1, h2 = hash2;
if (h1 == h2) {
return strcmp(a, b);
}
return h1 - h2;
}
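Because the hunk above interleaves the removed byte-by-byte swizzle walk with the added hash-based ordering, here is the added comparison logic consolidated into a standalone sketch. The function name `hashcmp_sketch` is illustrative; the idea is the one visible in the diff: order strings by a cheap 31-bit rolling hash and fall back to a full `strcmp` only when the hashes collide.

```cpp
#include <climits>
#include <cstring>
#include <sys/types.h>

// Order strings by a 31-bit rolling hash, deferring to strcmp only on a
// hash collision. Hashing scatters keys that share long prefixes, which
// keeps the string-pool search tree from degenerating into long chains.
static int hashcmp_sketch(const char *a, const char *b) {
	if (strcmp(a, b) == 0) {
		return 0;  // identical strings must still compare equal
	}

	long long hash1 = 0, hash2 = 0;
	for (ssize_t i = (ssize_t) strlen(a) - 1; i >= 0; i--) {
		hash1 = (hash1 * 37 + a[i]) & INT_MAX;
	}
	for (ssize_t i = (ssize_t) strlen(b) - 1; i >= 0; i--) {
		hash2 = (hash2 * 37 + b[i]) & INT_MAX;
	}

	int h1 = hash1, h2 = hash2;
	if (h1 == h2) {
		return strcmp(a, b);  // rare collision: fall back to byte order
	}
	return h1 - h2;
}
```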
long long addpool(struct memfile *poolfile, struct memfile *treefile, const char *s, char type) {
long long *sp = &treefile->tree;
unsigned long *sp = &treefile->tree;
size_t depth = 0;
// In typical data, traversal depth generally stays under 2.5x
size_t max = 3 * log(treefile->off / sizeof(struct stringpool)) / log(2);
if (max < 30) {
max = 30;
}
while (*sp != 0) {
int cmp = swizzlecmp(s, poolfile->map + ((struct stringpool *) (treefile->map + *sp))->off + 1);
@ -58,6 +54,23 @@ long long addpool(struct memfile *poolfile, struct memfile *treefile, const char
} else {
return ((struct stringpool *) (treefile->map + *sp))->off;
}
depth++;
if (depth > max) {
// Search is very deep, so string is probably unique.
// Add it to the pool without adding it to the search tree.
long long off = poolfile->off;
if (memfile_write(poolfile, &type, 1) < 0) {
perror("memfile write");
exit(EXIT_FAILURE);
}
if (memfile_write(poolfile, (void *) s, strlen(s) + 1) < 0) {
perror("memfile write");
exit(EXIT_FAILURE);
}
return off;
}
}
// *sp is probably in the memory-mapped file, and will move if the file grows.
@ -78,6 +91,16 @@ long long addpool(struct memfile *poolfile, struct memfile *treefile, const char
exit(EXIT_FAILURE);
}
if (off >= LONG_MAX || treefile->off >= LONG_MAX) {
// Tree or pool is bigger than 2GB
static bool warned = false;
if (!warned) {
fprintf(stderr, "Warning: string pool is very large.\n");
warned = true;
}
return off;
}
struct stringpool tsp;
tsp.left = 0;
tsp.right = 0;
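The core of this commit is the depth cap in `addpool`: the search tree is probed to at most roughly 3·log2(number of pooled strings) levels (never fewer than 30), and a string still unmatched at that depth is treated as probably unique and appended to the pool without being linked into the tree. Below is a simplified, in-memory sketch of that policy. The names are illustrative and it uses plain `strcmp` ordering and STL containers rather than the memory-mapped `memfile` structures the real code works against.

```cpp
#include <cmath>
#include <cstring>
#include <string>
#include <vector>

// Illustrative, in-memory stand-in for the string pool: a binary search tree
// of offsets into a flat byte pool, with the same depth cap as addpool above.
struct PoolSketch {
	struct Node {
		size_t left, right;  // child slots; 0 means "no child"
		size_t off;          // offset of the string within `pool`
	};

	std::string pool;                               // concatenated NUL-terminated strings
	std::vector<Node> tree = std::vector<Node>(1);  // tree[0] is an unused sentinel

	size_t add(const char *s) {
		// Cap the traversal at ~3 * log2(tree size), but never below 30,
		// mirroring the `max` computation in the diff.
		size_t max = (size_t) (3 * std::log((double) tree.size()) / std::log(2.0));
		if (max < 30) {
			max = 30;
		}

		size_t *sp = &tree[0].right;  // stand-in for treefile->tree (the root slot)
		size_t depth = 0;
		while (*sp != 0) {
			int cmp = strcmp(s, pool.c_str() + tree[*sp].off);
			if (cmp < 0) {
				sp = &tree[*sp].left;
			} else if (cmp > 0) {
				sp = &tree[*sp].right;
			} else {
				return tree[*sp].off;  // already pooled: reuse the existing copy
			}

			if (++depth > max) {
				// Search is very deep, so the string is probably unique:
				// store it in the pool but leave it out of the search tree.
				size_t off = pool.size();
				pool.append(s, strlen(s) + 1);
				return off;
			}
		}

		// Normal case: append the string and link a new leaf into the tree.
		size_t off = pool.size();
		pool.append(s, strlen(s) + 1);
		Node n;
		n.left = 0;
		n.right = 0;
		n.off = off;
		*sp = tree.size();
		tree.push_back(n);
		return off;
	}
};
```

The apparent trade-off is that a string which hits the depth cap is stored again on each repeat rather than deduplicated, which is cheaper than continuing to search an already-deep, unbalanced tree.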

View File

@ -2,9 +2,9 @@
#define POOL_HPP
struct stringpool {
long long left;
long long right;
long long off;
unsigned long left;
unsigned long right;
unsigned long off;
};
long long addpool(struct memfile *poolfile, struct memfile *treefile, const char *s, char type);

View File

@ -425,7 +425,10 @@ int serialize_feature(struct serialization_state *sst, serial_feature &sf) {
if (sf.geometry.size() > 0 && (sf.bbox[2] < sf.bbox[0] || sf.bbox[3] < sf.bbox[1])) {
fprintf(stderr, "Internal error: impossible feature bounding box %llx,%llx,%llx,%llx\n", sf.bbox[0], sf.bbox[1], sf.bbox[2], sf.bbox[3]);
}
if (sf.bbox[2] - sf.bbox[0] > (2LL << (32 - sst->maxzoom)) || sf.bbox[3] - sf.bbox[1] > (2LL << (32 - sst->maxzoom))) {
if (sf.bbox[0] == LLONG_MAX) {
// No bounding box (empty geometry)
// Shouldn't happen, but avoid arithmetic overflow below
} else if (sf.bbox[2] - sf.bbox[0] > (2LL << (32 - sst->maxzoom)) || sf.bbox[3] - sf.bbox[1] > (2LL << (32 - sst->maxzoom))) {
inline_meta = false;
if (prevent[P_CLIPPING]) {

View File

@ -1,6 +1,6 @@
#ifndef VERSION_HPP
#define VERSION_HPP
#define VERSION "tippecanoe v1.24.0\n"
#define VERSION "tippecanoe v1.24.1\n"
#endif