[chore](log) Standardize S3 failure log formats to enable critical operation monitoring #49813

Merged on Apr 7, 2025 (1 commit)


gavinchou
Contributor

What problem does this PR solve?

This change unifies S3 error logging patterns for four critical operations that require alerting:

  • failed to complete multipart upload
  • failed to upload part
  • failed to put object
  • failed to get object.*\.(dat|idx) (specific to data/index files)
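
As a rough illustration of how a monitoring rule might consume these four patterns, here is a minimal Python sketch. The pattern strings come from the list above; the `ALERT_PATTERNS` names, the `classify` helper, and the sample log lines are all hypothetical and made up for this example, so the real Doris log output may look different.

```python
import re
from typing import List

# The four alerting patterns from the PR description, as regular expressions.
# The dictionary keys are hypothetical labels chosen for this sketch.
ALERT_PATTERNS = {
    "complete_multipart_upload": re.compile(r"failed to complete multipart upload"),
    "upload_part": re.compile(r"failed to upload part"),
    "put_object": re.compile(r"failed to put object"),
    # Only get-object failures on data (.dat) or index (.idx) files should alert.
    "get_object_data_or_index": re.compile(r"failed to get object.*\.(dat|idx)"),
}


def classify(line: str) -> List[str]:
    """Return the label of every alert pattern found in a log line."""
    return [name for name, pattern in ALERT_PATTERNS.items() if pattern.search(line)]


# Hypothetical log lines, invented for illustration only.
SAMPLE_LINES = [
    "W0405 s3: failed to put object, bucket=b key=data/1/seg_0.dat",
    "W0405 s3: failed to get object, key=data/1/seg_0.idx, code=500",
    "W0405 s3: failed to get object, key=meta/manifest.json",
]

if __name__ == "__main__":
    for sample in SAMPLE_LINES:
        print(classify(sample), "<-", sample)
```

The third sample line does not match any pattern: a get-object failure only alerts when the key ends in a data or index file extension, which is what the `\.(dat|idx)` suffix in the fourth pattern expresses.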

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

gavinchou requested a review from dataroaring as a code owner on April 5, 2025 07:35
@Thearas
Contributor

Thearas commented Apr 5, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

gavinchou changed the title from "[chore][log] Standardize S3 failure log formats to enable critical operation monitoring" to "[chore](log) Standardize S3 failure log formats to enable critical operation monitoring" on Apr 5, 2025
@gavinchou
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 35451 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 67a9d5f24b4d1ce96b87e1a5a37cd0014f6f9d76, data reload: false

------ Round 1 ----------------------------------
q1	25602	5335	5310	5310
q2	2079	303	195	195
q3	10355	1331	683	683
q4	10221	1091	566	566
q5	7554	2647	2679	2647
q6	204	173	133	133
q7	1002	801	631	631
q8	9307	1436	1179	1179
q9	6969	5314	5335	5314
q10	6953	2369	1911	1911
q11	500	289	272	272
q12	365	390	233	233
q13	18050	3811	3247	3247
q14	249	241	232	232
q15	578	496	498	496
q16	652	631	571	571
q17	616	908	358	358
q18	7576	7287	7047	7047
q19	1985	1112	602	602
q20	344	351	228	228
q21	4552	3969	2632	2632
q22	1078	1035	964	964
Total cold run time: 116791 ms
Total hot run time: 35451 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5614	5514	5480	5480
q2	252	344	236	236
q3	2213	2729	2475	2475
q4	1562	2075	1563	1563
q5	4589	4401	4378	4378
q6	275	186	131	131
q7	2109	1964	1813	1813
q8	2899	2884	2765	2765
q9	7191	7144	7118	7118
q10	3116	3319	2826	2826
q11	629	527	504	504
q12	695	783	642	642
q13	3623	4198	3330	3330
q14	284	314	293	293
q15	544	489	488	488
q16	668	683	665	665
q17	1211	1617	1505	1505
q18	7914	7533	7216	7216
q19	853	866	1030	866
q20	2038	2049	1895	1895
q21	5739	4943	4937	4937
q22	1102	1066	1000	1000
Total cold run time: 55120 ms
Total hot run time: 52126 ms

@doris-robot

TPC-DS: Total hot run time: 192496 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 67a9d5f24b4d1ce96b87e1a5a37cd0014f6f9d76, data reload: false

query1	1385	1048	1029	1029
query2	6269	1945	1959	1945
query3	10994	4621	4465	4465
query4	25953	23711	22742	22742
query5	4897	637	461	461
query6	314	210	208	208
query7	3991	498	278	278
query8	310	262	237	237
query9	8510	2583	2557	2557
query10	473	327	276	276
query11	15485	15130	14791	14791
query12	162	108	105	105
query13	1556	509	376	376
query14	9261	6085	6074	6074
query15	197	181	167	167
query16	7581	649	473	473
query17	1162	738	614	614
query18	2023	407	319	319
query19	201	188	164	164
query20	126	123	121	121
query21	202	125	108	108
query22	4482	4627	4386	4386
query23	34141	33301	33337	33301
query24	8918	2454	2442	2442
query25	543	480	429	429
query26	1200	274	154	154
query27	2777	523	343	343
query28	4541	2424	2431	2424
query29	723	594	467	467
query30	284	225	201	201
query31	927	888	799	799
query32	79	65	66	65
query33	557	374	314	314
query34	795	929	513	513
query35	820	881	759	759
query36	967	992	894	894
query37	123	143	73	73
query38	4262	4100	4351	4100
query39	1476	1416	1450	1416
query40	204	125	110	110
query41	53	56	53	53
query42	122	108	108	108
query43	530	524	500	500
query44	1334	806	796	796
query45	190	181	164	164
query46	858	1024	648	648
query47	1864	1913	1832	1832
query48	373	431	323	323
query49	729	525	420	420
query50	665	715	407	407
query51	4341	4418	4301	4301
query52	110	114	97	97
query53	234	265	179	179
query54	581	570	504	504
query55	85	79	84	79
query56	309	308	321	308
query57	1165	1201	1140	1140
query58	273	268	262	262
query59	2678	2976	2739	2739
query60	329	311	297	297
query61	136	140	127	127
query62	770	756	685	685
query63	224	185	194	185
query64	3977	1097	747	747
query65	4481	4357	4434	4357
query66	1027	466	315	315
query67	16305	15567	15472	15472
query68	8935	882	514	514
query69	484	308	263	263
query70	1204	1145	1106	1106
query71	469	330	291	291
query72	5731	4692	4661	4661
query73	710	591	348	348
query74	8957	9219	8916	8916
query75	4285	3227	2684	2684
query76	3780	1208	759	759
query77	789	366	288	288
query78	10082	10087	9266	9266
query79	4060	802	561	561
query80	659	514	445	445
query81	471	259	218	218
query82	592	123	94	94
query83	288	252	241	241
query84	301	103	87	87
query85	858	366	345	345
query86	361	291	279	279
query87	4486	4405	4433	4405
query88	3511	2260	2306	2260
query89	448	312	289	289
query90	2026	210	212	210
query91	140	143	110	110
query92	80	61	58	58
query93	3190	950	583	583
query94	680	416	311	311
query95	382	306	293	293
query96	507	562	281	281
query97	3173	3256	3108	3108
query98	229	205	204	204
query99	1441	1417	1273	1273
Total cold run time: 285923 ms
Total hot run time: 192496 ms

@doris-robot

ClickBench: Total hot run time: 31.06 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 67a9d5f24b4d1ce96b87e1a5a37cd0014f6f9d76, data reload: false

query1	0.04	0.03	0.03
query2	0.12	0.11	0.11
query3	0.25	0.20	0.20
query4	1.59	0.19	0.19
query5	0.59	0.58	0.58
query6	1.19	0.72	0.72
query7	0.03	0.01	0.01
query8	0.04	0.04	0.03
query9	0.58	0.52	0.52
query10	0.59	0.58	0.56
query11	0.16	0.11	0.12
query12	0.14	0.11	0.11
query13	0.61	0.60	0.60
query14	2.83	2.73	2.72
query15	0.96	0.86	0.86
query16	0.39	0.38	0.38
query17	1.03	1.03	0.99
query18	0.21	0.20	0.20
query19	1.96	1.97	1.82
query20	0.02	0.01	0.01
query21	15.36	0.89	0.55
query22	0.77	1.15	0.72
query23	14.87	1.35	0.61
query24	7.70	0.76	0.72
query25	0.51	0.18	0.13
query26	0.57	0.17	0.13
query27	0.05	0.05	0.05
query28	9.59	0.86	0.42
query29	12.53	4.01	3.29
query30	0.25	0.09	0.06
query31	2.83	0.61	0.39
query32	3.23	0.54	0.46
query33	3.01	3.05	3.12
query34	15.84	5.11	4.45
query35	4.52	4.52	4.50
query36	0.66	0.48	0.50
query37	0.08	0.06	0.06
query38	0.06	0.04	0.03
query39	0.02	0.02	0.02
query40	0.17	0.13	0.12
query41	0.08	0.03	0.02
query42	0.04	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 106.11 s
Total hot run time: 31.06 s

@doris-robot

BE UT Coverage Report

Increment line coverage 22.73% (5/22) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.23% (13998/26799)
Line Coverage 40.97% (120607/294375)
Region Coverage 39.69% (61346/154566)
Branch Coverage 34.31% (30627/89262)

…eration monitoring

This change unifies S3 error logging patterns for four critical operations that require alerting:
* failed to complete multipart upload
* failed to upload part
* failed to put object
* failed to get object.*\.(dat|idx) (specific to data/index files)
gavinchou force-pushed the gavin-opt-s3-fail-log branch from 67a9d5f to c4a98da on April 5, 2025 14:35
@gavinchou
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 34158 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c4a98da2eb00a34cbbdd7e875f5d86bb3ba911fb, data reload: false

------ Round 1 ----------------------------------
q1	25638	5166	5059	5059
q2	2067	279	186	186
q3	10394	1278	690	690
q4	10234	1015	538	538
q5	7544	2369	2283	2283
q6	188	168	134	134
q7	909	739	615	615
q8	9305	1280	1111	1111
q9	6955	5063	5066	5063
q10	6796	2302	1921	1921
q11	485	288	266	266
q12	351	368	229	229
q13	17790	3777	3130	3130
q14	254	234	229	229
q15	538	511	483	483
q16	633	608	591	591
q17	601	866	362	362
q18	7629	7301	7021	7021
q19	2205	985	557	557
q20	324	325	220	220
q21	4016	3376	2492	2492
q22	1092	1016	978	978
Total cold run time: 115948 ms
Total hot run time: 34158 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5258	5141	5187	5141
q2	247	324	227	227
q3	2125	2708	2314	2314
q4	1478	1868	1482	1482
q5	4572	4417	4403	4403
q6	216	176	129	129
q7	2024	1945	1773	1773
q8	2602	2545	2626	2545
q9	7269	7092	7192	7092
q10	2992	3188	2755	2755
q11	583	527	505	505
q12	692	777	652	652
q13	3547	3862	3422	3422
q14	289	288	288	288
q15	536	482	489	482
q16	635	704	627	627
q17	1161	1524	1420	1420
q18	7675	7525	7406	7406
q19	851	815	811	811
q20	1901	1982	1847	1847
q21	5524	5045	4896	4896
q22	1116	1099	1036	1036
Total cold run time: 53293 ms
Total hot run time: 51253 ms

@doris-robot

TPC-DS: Total hot run time: 193916 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c4a98da2eb00a34cbbdd7e875f5d86bb3ba911fb, data reload: false

query1	1405	1096	1033	1033
query2	6091	1964	1963	1963
query3	11213	4682	4691	4682
query4	25893	23848	23326	23326
query5	4565	649	450	450
query6	311	210	194	194
query7	3999	492	287	287
query8	296	246	240	240
query9	8514	2589	2579	2579
query10	490	315	292	292
query11	15367	15252	14818	14818
query12	155	113	115	113
query13	1574	558	396	396
query14	9617	6444	6367	6367
query15	227	191	179	179
query16	7638	646	495	495
query17	1149	783	606	606
query18	2056	418	340	340
query19	210	201	171	171
query20	135	132	120	120
query21	206	133	117	117
query22	4432	4670	4438	4438
query23	34040	33606	33486	33486
query24	8017	2474	2436	2436
query25	519	489	397	397
query26	723	276	149	149
query27	2761	537	338	338
query28	4549	2467	2487	2467
query29	652	571	450	450
query30	271	240	195	195
query31	877	891	783	783
query32	73	68	63	63
query33	535	379	315	315
query34	789	878	525	525
query35	860	880	807	807
query36	988	1010	919	919
query37	125	106	74	74
query38	4234	4246	4174	4174
query39	1533	1443	1441	1441
query40	214	120	112	112
query41	54	52	52	52
query42	126	105	112	105
query43	518	521	501	501
query44	1373	826	816	816
query45	183	176	174	174
query46	853	1030	650	650
query47	1860	1891	1805	1805
query48	383	421	318	318
query49	718	515	414	414
query50	669	703	417	417
query51	4270	4394	4313	4313
query52	114	109	101	101
query53	229	264	198	198
query54	572	572	524	524
query55	85	85	85	85
query56	313	314	296	296
query57	1172	1224	1126	1126
query58	271	276	296	276
query59	2771	2830	2714	2714
query60	335	309	320	309
query61	141	127	126	126
query62	766	762	685	685
query63	233	191	188	188
query64	2754	1104	728	728
query65	4404	4332	4329	4329
query66	885	424	329	329
query67	16319	15686	15358	15358
query68	9679	894	532	532
query69	485	318	271	271
query70	1217	1101	1106	1101
query71	481	322	287	287
query72	5418	4679	4584	4584
query73	701	554	343	343
query74	8927	9095	8687	8687
query75	4415	3221	2695	2695
query76	4306	1206	753	753
query77	973	359	286	286
query78	9862	10241	9408	9408
query79	1809	810	568	568
query80	707	501	448	448
query81	493	252	223	223
query82	443	125	103	103
query83	287	255	240	240
query84	290	96	88	88
query85	783	358	318	318
query86	332	301	277	277
query87	4449	4519	4471	4471
query88	2789	2197	2206	2197
query89	395	316	282	282
query90	1896	214	213	213
query91	144	153	114	114
query92	78	66	55	55
query93	1078	936	590	590
query94	693	417	306	306
query95	374	305	304	304
query96	477	562	270	270
query97	3190	3291	3164	3164
query98	228	206	203	203
query99	1432	1403	1261	1261
Total cold run time: 278505 ms
Total hot run time: 193916 ms

@doris-robot

ClickBench: Total hot run time: 30.89 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c4a98da2eb00a34cbbdd7e875f5d86bb3ba911fb, data reload: false

query1	0.04	0.04	0.03
query2	0.14	0.10	0.10
query3	0.23	0.20	0.20
query4	1.59	0.20	0.20
query5	0.60	0.59	0.60
query6	1.19	0.72	0.71
query7	0.02	0.02	0.01
query8	0.04	0.04	0.03
query9	0.58	0.52	0.51
query10	0.59	0.59	0.57
query11	0.16	0.12	0.11
query12	0.16	0.12	0.12
query13	0.62	0.59	0.60
query14	2.69	2.82	2.76
query15	0.94	0.86	0.84
query16	0.39	0.40	0.37
query17	1.03	1.02	1.02
query18	0.21	0.20	0.20
query19	1.90	1.95	1.84
query20	0.01	0.01	0.02
query21	15.36	0.93	0.59
query22	0.74	1.15	0.67
query23	14.94	1.39	0.60
query24	7.60	0.97	0.35
query25	0.43	0.19	0.15
query26	0.58	0.16	0.14
query27	0.05	0.06	0.05
query28	9.67	0.83	0.44
query29	12.55	3.95	3.29
query30	0.25	0.10	0.07
query31	2.82	0.58	0.38
query32	3.23	0.55	0.46
query33	3.13	3.00	3.02
query34	16.11	5.19	4.52
query35	4.56	4.56	4.56
query36	0.67	0.49	0.48
query37	0.08	0.06	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.03
query40	0.17	0.13	0.14
query41	0.08	0.03	0.03
query42	0.04	0.03	0.02
query43	0.04	0.04	0.03
Total cold run time: 106.31 s
Total hot run time: 30.89 s

@doris-robot

BE UT Coverage Report

Increment line coverage 22.73% (5/22) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.23% (13998/26799)
Line Coverage 40.97% (120608/294375)
Region Coverage 39.68% (61339/154566)
Branch Coverage 34.31% (30628/89262)

Contributor

github-actions bot commented Apr 7, 2025

PR approved by anyone and no changes requested.

Contributor

@dataroaring dataroaring left a comment

LGTM

Contributor

github-actions bot commented Apr 7, 2025

PR approved by at least one committer and no changes requested.

github-actions bot added the approved label (indicates a PR has been approved by one committer) on Apr 7, 2025
dataroaring merged commit 180e0aa into apache:master on Apr 7, 2025
28 of 30 checks passed
github-actions bot pushed a commit that referenced this pull request Apr 7, 2025
…eration monitoring (#49813)

This change unifies S3 error logging patterns for four critical
operations that require alerting:
* failed to complete multipart upload
* failed to upload part
* failed to put object
* failed to get object.*\.(dat|idx) (specific to data/index files)
gavinchou added a commit that referenced this pull request Apr 7, 2025
… critical operation monitoring #49813 (#49828)

Cherry-picked from #49813

Co-authored-by: Gavin Chou <gavin@selectdb.com>
Labels: approved (indicates a PR has been approved by one committer), dev/3.0.5-merged, p0_b, reviewed

6 participants