Hello.
I have one huge file (300M lines) containing CSV data, and I want to insert it into a table using an FS pipeline.
We measured the time required as a function of the split size, and the result depends on the table schema.
I have two questions, and I would appreciate your help.
The data in this file is as follows:
2020-01-02 12:47:42.798,eq98958_0001,c1,z4xgnn79,80afkh0f64,8xxo2nq70kwhdc3,jdoyrn,qj2p_xmh,l3nj,run,1,kafkaExample.Sensor@54248cd8,10.516321182250977,1
Field Description
- field 1: timestamp; each value repeats 100,000 times, and the file is sorted in ascending order by this field.
- field 2: equipment name; the same value on every line.
- field 12 (e.g., kafkaExample…): sensor; 100,000 unique values repeat, with exactly one per timestamp.
Performance Measurement Procedure
- Split the huge file using the Unix command `split -l`.
- Create a pipeline and insert the data of the file into the table (sensor_data_table_in or sensor_data_table_ti); a sketch of the pipeline DDL appears after this list.
- Measure elapsed time and CPU usage.
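A minimal sketch of the FS pipeline, assuming a placeholder pipeline name and split-file directory (the actual names, and the column list mapping the 14 CSV fields to table columns, are omitted here):

```sql
-- Split files were produced beforehand with: split -l <rows_per_file> huge.csv
-- Pipeline name and path are placeholders, not the ones actually used.
CREATE PIPELINE sensor_pipeline AS
LOAD DATA FS '/data/splits/*'
INTO TABLE sensor_data_table_in
FIELDS TERMINATED BY ',';

START PIPELINE sensor_pipeline;
```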
Table schemas:
-- 1st table --
CREATE TABLE sensor_data_table_in (
timestamp timestamp(6),
name char(13),
chid char(2),
lotid char(15),
ppid char(15),
recipeid char(15),
chstep char(15),
stepseq char(13),
partid char(15),
status char(3),
slotno tinyint,
sensor char(28),
value double,
lcl double,
ucl double,
inserttime timestamp(6) DEFAULT CURRENT_TIMESTAMP,
SHARD KEY(sensor),
KEY (inserttime) USING CLUSTERED COLUMNSTORE
);
-- 2nd table --
CREATE TABLE sensor_data_table_ti (
timestamp timestamp(6),
name char(13),
chid char(2),
lotid char(15),
ppid char(15),
recipeid char(15),
chstep char(15),
stepseq char(13),
partid char(15),
status char(3),
slotno tinyint,
sensor char(28),
value double,
lcl double,
ucl double,
SHARD KEY(sensor),
KEY (timestamp) USING CLUSTERED COLUMNSTORE
);
There are two differences between the two tables:
- The 1st table has an extra `inserttime` column (DEFAULT CURRENT_TIMESTAMP).
- Different COLUMNSTORE keys (1st table: `inserttime`, 2nd table: `timestamp`).
Performance measurement result: insert into the 1st table (sensor_data_table_in)
Leaf | partitions | rows per file | size per file | files | batches | elapsed time (sec) | elapsed time (mm:ss) | rows per sec | CPU utilization
---|---|---|---|---|---|---|---|---|---
1 | 24 | 100,000 | 16 MB | 3,000 | 125.0 | 184.4 | 03:04 | 1,627,255 | 43%
1 | 24 | 520,834 | 84 MB | 576 | 24.0 | 148.0 | 02:27 | 2,027,041 | 61%
1 | 24 | 2,083,334 | 337 MB | 144 | 6.0 | 132.2 | 02:12 | 2,268,778 | 73%
1 | 24 | 6,250,000 | 1,010 MB | 48 | 2.0 | 130.1 | 02:10 | 2,305,325 | 75%
1 | 24 | 12,500,000 | 2,020 MB | 24 | 1.0 | 127.6 | 02:07 | 2,351,044 | 75%
1 | 48 | 100,000 | 16 MB | 3,000 | 62.5 | 199.0 | 03:18 | 1,507,910 | 47%
1 | 48 | 520,834 | 84 MB | 576 | 12.0 | 138.4 | 02:18 | 2,167,730 | 72%
1 | 48 | 2,083,334 | 337 MB | 144 | 3.0 | 125.0 | 02:05 | 2,399,122 | 84%
1 | 48 | 6,250,000 | 1,010 MB | 48 | 1.0 | 120.1 | 02:00 | 2,497,858 | 88%
Performance measurement result: insert into the 2nd table (sensor_data_table_ti)
Leaf | partitions | rows per file | size per file | files | batches | elapsed time (sec) | elapsed time (mm:ss) | data size per sec | rows per sec | CPU utilization
---|---|---|---|---|---|---|---|---|---|---
1 | 24 | 100,000 | 16 MB | 3,000 | 125.0 | 226.4 | 03:46 | 214 MB | 1,325,302 | 40%
1 | 24 | 200,000 | 32 MB | 1,500 | 62.5 | 214.2 | 03:34 | 226 MB | 1,400,510 | 46%
1 | 24 | 250,000 | 40 MB | 1,200 | 50.0 | 206.8 | 03:26 | 234 MB | 1,450,960 | 53%
1 | 24 | 520,834 | 84 MB | 576 | 24.0 | 211.1 | 03:31 | 230 MB | 1,421,280 | 58%
1 | 24 | 1,041,667 | 168 MB | 288 | 12.0 | 229.6 | 03:49 | 211 MB | 1,306,447 | 58%
1 | 24 | 2,083,334 | 337 MB | 144 | 6.0 | 243.0 | 04:03 | 199 MB | 1,234,449 | 48%
1 | 24 | 6,250,000 | 1,010 MB | 48 | 2.0 | 265.1 | 04:25 | 183 MB | 1,131,734 | 47%
1 | 24 | 12,500,000 | 2,020 MB | 24 | 1.0 | | | | |
1 | 48 | 100,000 | 16 MB | 3,000 | 62.5 | 231.1 | 03:51 | 210 MB | 1,298,229 | 48%
1 | 48 | 200,000 | 32 MB | 1,500 | 31.3 | 206.7 | 03:26 | 235 MB | 1,451,489 | 60%
1 | 48 | 250,000 | 40 MB | 1,200 | 25.0 | 198.2 | 03:18 | 245 MB | 1,513,947 | 65%
1 | 48 | 520,834 | 84 MB | 576 | 12.0 | 186.5 | 03:06 | 260 MB | 1,608,380 | 76%
1 | 48 | 1,041,667 | 168 MB | 288 | 6.0 | 206.1 | 03:26 | 235 MB | 1,455,902 | 82%
1 | 48 | 2,083,334 | 337 MB | 144 | 3.0 | 218.2 | 03:38 | 222 MB | 1,374,915 | 87%
1 | 48 | 6,250,000 | 1,010 MB | 48 | 1.0 | 232.5 | 03:52 | 209 MB | 1,290,358 | 88%
Questions
- When inserting into the 1st table, the larger the file size, the better the elapsed time. However, the 2nd table does not behave this way. Why? The difference in the columnstore keys looks like the cause, but it is strange that the elapsed time first decreases and then increases again as the files get larger.
Note: the elapsed time was obtained with the following query:
select sum(BATCH_TIME) from information_schema.pipelines_batches_summary;
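To see how much work the background columnstore merger does on each table (a sketch only; I have not correlated this output with the runs above), the merger status can also be inspected:

```sql
-- Inspect background merger activity for each columnstore table.
SHOW COLUMNAR MERGE STATUS FOR sensor_data_table_in;
SHOW COLUMNAR MERGE STATUS FOR sensor_data_table_ti;
```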
- If you are restoring a CSV file backed up from a database, the 2nd table is the more realistic case. Is there a way to improve performance?
- System CPU/memory: 24 cores (48 vCPUs) / 192 GB
- Average line size: 165 to 175 bytes
Note: `LOAD DATA INFILE ...` performs worse than the FS pipeline.
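For reference, the `LOAD DATA INFILE` form compared against the pipeline looked roughly like this (the file path is a placeholder, run once per split file):

```sql
-- Placeholder path; 'xaa' is the first file name that split produces by default.
LOAD DATA INFILE '/data/splits/xaa'
INTO TABLE sensor_data_table_ti
FIELDS TERMINATED BY ',';
```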
Thank you very much.