View data cleaning task details
Last updated: 2025-05-16
Overview
This API returns the details of a data cleaning task.
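Notes
(1) The fields returned by this API differ from the fields shown on the Qianfan console page:
- Some fields in the API parameters may not appear on the Qianfan console page.
- Some fields on the Qianfan console page may not appear in the API parameters.
- The API will continue to be improved; watch the API documentation for updates.
(2) This API can be called through the Python SDK, Go SDK, Java SDK, and Node.js SDK. For the call flow, see "SDK installation and usage".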
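(3) Permissions
Calling this API requires any one of the permissions below. For an introduction to permissions and how to assign them, see "Role and permission control list" and "Account creation and permission assignment".
- Full control of the Qianfan large model platform: QianfanFullControlAccessPolicy
- Read-only access to the Qianfan large model platform: QianfanReadAccessPolicy
- Full control of data management (except data annotation) on the Qianfan large model platform: QianfanDataFullControlAccessPolicy
- Operations access to data management (except data annotation) on the Qianfan large model platform: QianfanDataOperateAccessPolicy
- Read-only access to data management (except data annotation) on the Qianfan large model platform: QianfanDataReadAccessPolicy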
SDK calls
Call examples
Python
import os
from qianfan import resources

# Initialize the credentials through environment variables.
# Authenticate with the security AK/SK: replace your_iam_ak with your Access Key
# and your_iam_sk with your Secret Key. To obtain them, see
# https://cloud.baidu.com/doc/Reference/s/9jwvz2egb
os.environ["QIANFAN_ACCESS_KEY"] = "your_iam_ak"
os.environ["QIANFAN_SECRET_KEY"] = "your_iam_sk"

resp = resources.console.utils.call_action(
    # Fixed value for this API, do not change it; it is the suffix of the
    # request URL (API documentation: Request structure > Request URL).
    "/wenxinworkshop/etl/detail", "",
    # Body parameters; see "Request parameters" in this document and set them as needed.
    {
        "etlId": "task-9tff1q3h7ngdmgh4"
    }
)

print(resp.body)
Go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/baidubce/bce-qianfan-sdk/go/qianfan"
)

func main() {
	// Authenticate with the security AK/SK, initialized through environment variables.
	// Replace your_iam_ak with your Access Key and your_iam_sk with your Secret Key.
	os.Setenv("QIANFAN_ACCESS_KEY", "your_iam_ak")
	os.Setenv("QIANFAN_SECRET_KEY", "your_iam_sk")

	ca := qianfan.NewConsoleAction()

	res, err := ca.Call(context.TODO(),
		// Fixed value for this API, do not change it; it is the suffix of the
		// request URL (API documentation: Request structure > Request URL).
		"/wenxinworkshop/etl/detail", "",
		// Body parameters; see "Request parameters" in this document and set them as needed.
		map[string]interface{}{
			"etlId": "task-9tff1q3h7ngdmgh4",
		})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(res.Body))
}
Java
import com.baidubce.qianfan.Qianfan;
import com.baidubce.qianfan.model.console.ConsoleResponse;
import com.baidubce.qianfan.util.CollUtils;
import com.baidubce.qianfan.util.Json;
import java.util.Map;

public class Demo {
    public static void main(String[] args) {
        // Authenticate with the security AK/SK: replace your_iam_ak with your
        // Access Key and your_iam_sk with your Secret Key.
        Qianfan qianfan = new Qianfan("your_iam_ak", "your_iam_sk");

        ConsoleResponse<Map<String, Object>> response = qianfan.console()
                // Fixed value for this API, do not change it; it is the suffix of the
                // request URL (API documentation: Request structure > Request URL).
                .route("/wenxinworkshop/etl/detail")
                // When parameters are needed, wrap them in your own request class
                // or build the request Body with Map.of().
                // On Java 8, the SDK's CollUtils.mapOf() can be used instead of Map.of().
                // Body parameters; see "Request parameters" in this document and set them as needed.
                .body(CollUtils.mapOf(
                        "etlId", "task-9tff1q3h7ngdmgh4"
                ))
                .execute();

        System.out.println(Json.serialize(response));
    }
}
Node.js
import {consoleAction, setEnvVariable} from "@baiducloud/qianfan";

// Authenticate with the security AK/SK, initialized through environment variables.
// Replace your_iam_ak with your Access Key and your_iam_sk with your Secret Key.
setEnvVariable('QIANFAN_ACCESS_KEY', 'your_iam_ak');
setEnvVariable('QIANFAN_SECRET_KEY', 'your_iam_sk');

async function main() {
    // base_api_route: fixed value for this API, do not change it; it is the suffix of
    // the request URL (API documentation: Request structure > Request URL).
    // data: Body parameters; see "Request parameters" in this document and set them as needed.
    const res = await consoleAction({
        base_api_route: '/wenxinworkshop/etl/detail',
        data: {
            "etlId": "task-9tff1q3h7ngdmgh4"
        }
    });

    console.log(res);
}

main();
Response example
{
  "log_id": "44k3yj73ms178179",
  "result": {
    "id": 273,
    "etlTaskId": "task-7bynx9aaa1qyex2s",
    "userId": 113,
    "sourceDatasetId": 2235,
    "destDatasetId": 2230,
    "taskId": 5331,
    "entityCount": 1,
    "entityType": 2,
    "operationsV2": {
      "clean": [
        {
          "name": "remove_invisible_character",
          "args": {}
        },
        {
          "name": "replace_uniform_whitespace",
          "args": {}
        },
        {
          "name": "remove_non_meaning_characters",
          "args": {}
        },
        {
          "name": "replace_traditional_chinese_to_simplified",
          "args": {}
        },
        {
          "name": "remove_web_identifiers",
          "args": {}
        },
        {
          "name": "remove_emoji",
          "args": {}
        },
        {
          "name": "save_pipeline_clean",
          "args": {}
        }
      ],
      "deduplication": [
        {
          "name": "deduplication_simhash",
          "args": {
            "distance": 5.6511
          }
        },
        {
          "name": "save_pipeline_deduplication",
          "args": {}
        }
      ],
      "desensitization": [
        {
          "name": "replace_emails",
          "args": {}
        },
        {
          "name": "replace_ip",
          "args": {}
        },
        {
          "name": "replace_identifier",
          "args": {}
        },
        {
          "name": "save_pipeline_desensitization",
          "args": {}
        }
      ],
      "filter": [
        {
          "name": "filter_check_number_words",
          "args": {
            "number_words_max_cutoff": 10000,
            "number_words_min_cutoff": 2.2
          }
        },
        {
          "name": "filter_check_character_repetition_removal",
          "args": {
            "default_character_repetition_max_cutoff": 0.2
          }
        },
        {
          "name": "filter_check_word_repetition_removal",
          "args": {
            "word_repetition_max_cutoff": 0.6
          }
        },
        {
          "name": "filter_check_special_characters",
          "args": {
            "special_characters_max_cutoff": 0.3
          }
        },
        {
          "name": "filter_check_flagged_words",
          "args": {
            "flagged_words_max_cutoff": 0.50556
          }
        },
        {
          "name": "filter_check_lang_id",
          "args": {
            "lang_id_min_cutoff": 0.5
          }
        },
        {
          "name": "filter_check_perplexity",
          "args": {
            "perplexity_max_cutoff": 1110
          }
        },
        {
          "name": "save_pipeline_filter",
          "args": {}
        }
      ]
    },
    "result": {
      "RET_OK": 0,
      "pipeline_stage_result": {
        "clean": {
          "status": "Success",
          "operator_count": 6,
          "entity_match_count": 1,
          "each_operator_result": [
            {
              "name": "remove_invisible_character",
              "remaining_count": 1,
              "drop_count": 0
            },
            {
              "name": "replace_uniform_whitespace",
              "remaining_count": 1,
              "drop_count": 0
            },
            {
              "name": "remove_non_meaning_characters",
              "remaining_count": 1,
              "drop_count": 0
            },
            {
              "name": "replace_traditional_chinese_to_simplified",
              "remaining_count": 1,
              "drop_count": 0
            },
            {
              "name": "remove_web_identifiers",
              "remaining_count": 1,
              "drop_count": 0
            },
            {
              "name": "remove_emoji",
              "remaining_count": 1,
              "drop_count": 0
            }
          ]
        },
        "deduplication": {
          "status": "Success",
          "operator_count": 1,
          "entity_match_count": 0,
          "each_operator_result": [
            {
              "name": "deduplication_simhash",
              "remaining_count": 0,
              "drop_count": 0
            }
          ]
        },
        "desensitization": {
          "status": "Success",
          "operator_count": 3,
          "entity_match_count": 0,
          "each_operator_result": [
            {
              "name": "replace_emails",
              "remaining_count": 0,
              "drop_count": 0
            },
            {
              "name": "replace_ip",
              "remaining_count": 0,
              "drop_count": 0
            },
            {
              "name": "replace_identifier",
              "remaining_count": 0,
              "drop_count": 0
            }
          ]
        },
        "filter": {
          "status": "Success",
          "operator_count": 7,
          "entity_match_count": 1,
          "each_operator_result": [
            {
              "name": "filter_check_number_words",
              "remaining_count": 1,
              "drop_count": 0
            },
            {
              "name": "filter_check_character_repetition_removal",
              "remaining_count": 0,
              "drop_count": 1
            },
            {
              "name": "filter_check_word_repetition_removal",
              "remaining_count": 1,
              "drop_count": 0
            },
            {
              "name": "filter_check_special_characters",
              "remaining_count": 1,
              "drop_count": 0
            },
            {
              "name": "filter_check_flagged_words",
              "remaining_count": 1,
              "drop_count": 0
            },
            {
              "name": "filter_check_lang_id",
              "remaining_count": 1,
              "drop_count": 0
            },
            {
              "name": "filter_check_perplexity",
              "remaining_count": 1,
              "drop_count": 0
            }
          ]
        }
      },
      "export_entity_num": 0,
      "remaining_entity": 0,
      "unprocessed_entity": 0,
      "remove_emoji": {
        "processed_entity": 0
      },
      "remove_url": {
        "processed_entity": 0
      },
      "trad_to_simp": {
        "processed_entity": 0
      },
      "remove_id_card": {
        "processed_entity": 0
      },
      "remove_phone_number": {
        "processed_entity": 0
      },
      "remove_exception_char": {
        "processed_entity": 0
      },
      "replace_sim2trad": {
        "processed_entity": 0
      },
      "replace_trad2sim": {
        "processed_entity": 0
      },
      "replace_upper2lower": {
        "processed_entity": 0
      },
      "cut": {
        "remaining_entity": 0,
        "unprocessed_entity": 0
      },
      "failReason": "",
      "pauseReason": ""
    },
    "processStatus": 2,
    "status": 0,
    "createTime": "2023-11-06T14:31:03+08:00",
    "finishTime": "2023-11-06T14:32:11+08:00",
    "creatorName": "yyw02",
    "sourceDatasetName": "zy_泛文本5-V1",
    "sourceDatasetStrId": "ds-xarnk5tdirfjky2q",
    "destDatasetName": "g423423-V2",
    "destDatasetStrId": "ds-9tf91q1h7n3dm7h4",
    "etlResult": "",
    "remainingEntity": 0,
    "exceptionResult": "",
    "startTime": "2023-11-06 14:31:03",
    "endTime": "2023-11-06 14:32:11",
    "modifyTime": "2023-11-06 14:32:11",
    "logPath": "https://bj.bcebos.com/easydata-qabosqa/qianfan/qianfan1019/_system_/dataset/ds-u7898jqx2aabjp38/cleaning/2235-2230-273-20231106143103.txt?x-bce-security-token=ZjkyZmQ2YmQxZTQ3NDxxxxxZp70QaweY1MNyT32OKRGNCew%3D%3D\u0026authorization=bce-auth-v1%2F24ec282b7c6d11eexxxxx4d33b5123"
  },
  "status": 200,
  "success": true
}
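Request parameters
Name | Type | Required | Description |
---|---|---|---|
etlId | string | Yes | Serial number of the data cleaning task. Notes: (1) The value can be obtained in any of the following ways: · Option 1: from the result field returned by the create data cleaning task API · Option 2: from the etlStrId field returned by the view cleaning task list API · Option 3: from the task serial number shown on the console page Data Processing > Data Cleaning (2) This field now accepts the string type. If you previously passed an int, switch to string, because the int type may be phased out. For example, if you previously took the serial number from the etlId field returned by the view cleaning task list API, take it from the etlStrId field returned by that API instead. |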
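As a minimal sketch of note (2) above, the request Body could be built by a small helper that passes string serial numbers through and flags legacy int values. The helper name `build_detail_body` is hypothetical and is not part of the Qianfan SDK.

```python
import warnings
from typing import Union


def build_detail_body(etl_id: Union[str, int]) -> dict:
    """Build the Body for /wenxinworkshop/etl/detail (hypothetical helper)."""
    if isinstance(etl_id, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise TypeError("etlId must be a string (or a legacy int), not bool")
    if isinstance(etl_id, str):
        cleaned = etl_id.strip()
        if not cleaned:
            raise ValueError("etlId must not be empty")
        return {"etlId": cleaned}
    if isinstance(etl_id, int):
        # The document says the int form may be phased out; warn but still accept it.
        warnings.warn("int etlId may be phased out; prefer the string form",
                      DeprecationWarning, stacklevel=2)
        return {"etlId": etl_id}
    raise TypeError(f"unsupported etlId type: {type(etl_id).__name__}")


body = build_detail_body("task-9tff1q3h7ngdmgh4")
print(body)
```

The resulting dictionary can be passed as the Body argument of any of the SDK calls shown earlier.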
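Response parameters
Note: only some of the returned fields are described below; fields that are not described can be ignored for now.
Name | Type | Description |
---|---|---|
log_id | string | Operation log ID |
result | object | Returned result |
status | int | Status code |
success | bool | Whether the operation succeeded: · true: succeeded · false: failed |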
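The envelope above can be checked before the inner result is read. The following is a minimal sketch, assuming the body is available either as a dict or as JSON text; the names `EtlApiError` and `unwrap_result` are hypothetical.

```python
import json


class EtlApiError(RuntimeError):
    """Raised when the envelope reports a failed call (hypothetical name)."""


def unwrap_result(body):
    """Return the `result` object from a response body, or raise on failure."""
    if isinstance(body, (str, bytes)):
        body = json.loads(body)
    if not body.get("success", False):
        raise EtlApiError(
            f"request failed: status={body.get('status')}, log_id={body.get('log_id')}"
        )
    return body["result"]


raw = '{"log_id": "44k3yj73ms178179", "result": {"etlTaskId": "task-7bynx9aaa1qyex2s"}, "status": 200, "success": true}'
print(unwrap_result(raw)["etlTaskId"])
```

Keeping `log_id` in the error message makes a failed call easier to trace.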
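result (returned result) description
Name | Type | Description |
---|---|---|
id | int | Task serial number. Note: this field will be deprecated; if you use it, switch to the etlTaskId field |
etlTaskId | string | Task serial number |
userId | int | User ID |
sourceDatasetId | int | Version ID of the source dataset before cleaning. Note: this field will be deprecated; if you use it, switch to the sourceDatasetStrId field |
sourceDatasetStrId | string | Version ID of the source dataset before cleaning |
destDatasetId | int | Version ID of the target dataset after cleaning. Note: this field will be deprecated; if you use it, switch to the destDatasetStrId field |
destDatasetStrId | string | Version ID of the target dataset after cleaning |
taskId | int | Data cleaning task ID |
entityCount | int | Number of samples |
entityType | int | Sample type: · 1: image · 2: text · 3: audio · 4: video |
operationsV2 | map[string][]object | Cleaning configuration. Notes: (1) The key is a string with one of the following values: · clean: cleaning · filter: filtering · deduplication: deduplication · desensitization: privacy removal (2) The value is a list of all operators the user selected for that stage · each element is the configuration of one operator, in the format given under "operationsV2 description" · if the user selected no operator for a stage, the value is an empty list |
result | object | Cleaning result |
processStatus | int | Cleaning progress status: · 0: no status, meaning there is no task · 1: in progress · 2: completed · 3: terminated · 4: cleaning failed · 5: task paused |
status | int | Cleaning task status: · 0: normal · 1: deleted |
createTime | string | Creation time |
finishTime | string | Completion time |
creatorName | string | Creator name |
sourceDatasetName | string | Source dataset name |
destDatasetName | string | Target dataset name |
etlResult | string | Cleaning result |
remainingEntity | int | Number of samples remaining after cleaning |
exceptionResult | string | Exception reason |
startTime | string | Task start time |
endTime | string | Task end time |
modifyTime | string | Modification time |
logPath | string | Path of the cleaning log file, for example /minio/v-abc/some/path/1-2-1-20231010181818.txt |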
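The processStatus codes in the table can be turned into labels, and a caller may want to poll until a task is no longer in progress. The sketch below assumes a caller-supplied `fetch_result` function (hypothetical) that calls this API and returns the inner result object; the polling interval and limit are illustrative values, not documented ones.

```python
import time
from typing import Callable

# Labels taken from the processStatus row of the table above.
PROCESS_STATUS = {
    0: "no status (no task)",
    1: "in progress",
    2: "completed",
    3: "terminated",
    4: "cleaning failed",
    5: "task paused",
}


def describe_process_status(code: int) -> str:
    return PROCESS_STATUS.get(code, f"unknown ({code})")


def wait_until_done(fetch_result: Callable[[], dict],
                    interval_s: float = 5.0,
                    max_polls: int = 60) -> dict:
    """Poll until processStatus is anything other than 1 (in progress)."""
    for _ in range(max_polls):
        result = fetch_result()
        if result.get("processStatus") != 1:
            return result
        time.sleep(interval_s)
    raise TimeoutError("task still in progress after max_polls polls")


# Simulated responses stand in for real API calls in this sketch.
fake_states = iter([{"processStatus": 1}, {"processStatus": 2}])
final = wait_until_done(lambda: next(fake_states), interval_s=0)
print(describe_process_status(final["processStatus"]))
```

Note that this sketch also returns on status 5 (task paused); a caller that wants to keep waiting through a pause would need to treat that code separately.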
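operationsV2 description
Name | Type | Description |
---|---|---|
name | string | Operator name. The operators of each stage are listed below this table. |
args | object | Operator arguments; the format depends on the operator: · for Clean stage operators, args is empty · for Desensitization stage operators, args is empty · for Deduplication and Filter stage operators, see "args description" |
(1) Clean stage operators
- remove_emoji: removes emoji from the document
- remove_invisible_character: removes certain invisible characters, for example character codes in the ranges 0-32 and 127-160
- replace_uniform_whitespace: converts the various Unicode space characters, such as U+2008, to a regular space
- remove_non_meaning_characters: removes garbled and meaningless Unicode characters
- replace_traditional_chinese_to_simplified: converts Traditional Chinese to Simplified Chinese, for example "不經意,妳的笑容" becomes "不经意,你的笑容"
- remove_web_identifiers: removes HTML tags from the document, such as <html>, <div>, <p>
(2) Filter stage operators
- filter_check_number_words: checks the word count of the document; documents whose word count is outside the specified range are filtered out, for example [1,10000] for Chinese
- filter_check_word_repetition_removal: checks the word repetition ratio; a ratio that is too high means the document contains too many repeated words, and the document is filtered out
- filter_check_character_repetition_removal: checks the character repetition ratio; a ratio that is too high means the document contains too many repeated characters, and the document is filtered out
- filter_check_special_characters: checks the special character ratio; a ratio that is too high means the document contains too many special characters, and the document is filtered out
- filter_check_flagged_words: checks the ratio of pornographic and violent words; if the ratio is too high, the document is filtered out
- filter_check_lang_id: checks the language probability of the document; if the probability is too low, the document is filtered out
- filter_check_perplexity: checks the perplexity of the document; if the perplexity is too high, the document is filtered out
(3) Deduplication stage operators
- deduplication_simhash: measures document similarity by Hamming distance; two documents are considered similar when the Hamming distance between them is less than or equal to the configured distance
(4) Desensitization stage operators
- replace_emails: removes email addresses
- replace_ip: removes IPv4 and IPv6 addresses
- replace_identifier: removes numeric and alphanumeric identifiers such as phone numbers, credit card numbers, and hexadecimal hashes, while skipping years and simple numbers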
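To illustrate the comparison that deduplication_simhash is described as making, the sketch below counts the differing bits between two fingerprints and compares the count with a threshold. It shows only the Hamming-distance check: how the platform computes its fingerprints is not documented here, and the integer fingerprints used are made-up values.

```python
def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Number of bit positions in which two non-negative fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")


def is_near_duplicate(fp_a: int, fp_b: int, distance: int = 5) -> bool:
    """Treat two documents as similar when their fingerprints differ in at most `distance` bits."""
    return hamming_distance(fp_a, fp_b) <= distance


# Made-up 16-bit fingerprints for illustration only.
doc_a = 0b1111_0000_1010_1100
doc_b = 0b1111_0000_1010_0101  # differs from doc_a in 2 bits
print(hamming_distance(doc_a, doc_b))
print(is_near_duplicate(doc_a, doc_b, distance=4))
```

A larger distance treats more documents as duplicates, so more samples are dropped.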
args description
- For Clean stage operators, args is empty
- For Desensitization stage operators, args is empty
- For Deduplication stage operators, args is described below
Name | Type | Description |
---|---|---|
distance | int | Range: 4-6 |
- For Filter stage operators, args is described below
Name | Type | Description |
---|---|---|
number_words_min_cutoff | float | Minimum word count · range: [1,10000] · required when name=filter_check_number_words |
number_words_max_cutoff | float | Maximum word count · range: [1,10000] · required when name=filter_check_number_words |
word_repetition_max_cutoff | float | Maximum word repetition ratio of a document · range: 0-1 · required when name=filter_check_word_repetition_removal |
default_character_repetition_max_cutoff | float | Maximum character repetition ratio of a document · range: 0-1 · required when name=filter_check_character_repetition_removal |
special_characters_max_cutoff | float | Maximum special character ratio; documents with a higher ratio are filtered out · range: 0-1 · required when name=filter_check_special_characters |
flagged_words_max_cutoff | float | Maximum ratio of pornographic and violent words; documents with a higher ratio are filtered out · range: 0-1 · required when name=filter_check_flagged_words |
lang_id_min_cutoff | float | Minimum language probability; documents with a lower probability are filtered out · range: 0-1 · required when name=filter_check_lang_id |
perplexity_max_cutoff | float | Maximum perplexity; documents with a higher perplexity are filtered out · range: 1-5000 · required when name=filter_check_perplexity |
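The ranges in the args tables can be checked on the client side when inspecting or preparing an operator configuration. This is a minimal sketch: the bounds are copied from the tables above, treating every bound as inclusive is an assumption, and `check_args` is a hypothetical helper name.

```python
# (low, high) bounds copied from the args tables; inclusive bounds are an assumption.
ARG_RANGES = {
    "distance": (4, 6),
    "number_words_min_cutoff": (1, 10000),
    "number_words_max_cutoff": (1, 10000),
    "word_repetition_max_cutoff": (0, 1),
    "default_character_repetition_max_cutoff": (0, 1),
    "special_characters_max_cutoff": (0, 1),
    "flagged_words_max_cutoff": (0, 1),
    "lang_id_min_cutoff": (0, 1),
    "perplexity_max_cutoff": (1, 5000),
}


def check_args(args: dict) -> list:
    """Return a list of problems found in one operator's args; empty means none."""
    problems = []
    for key, value in args.items():
        if key not in ARG_RANGES:
            problems.append(f"{key}: not a documented argument")
            continue
        low, high = ARG_RANGES[key]
        if not low <= value <= high:
            problems.append(f"{key}={value}: outside [{low}, {high}]")
    min_words = args.get("number_words_min_cutoff")
    max_words = args.get("number_words_max_cutoff")
    if min_words is not None and max_words is not None and min_words > max_words:
        problems.append("number_words_min_cutoff exceeds number_words_max_cutoff")
    return problems


print(check_args({"number_words_max_cutoff": 10000, "number_words_min_cutoff": 2.2}))
print(check_args({"distance": 7}))
```

The first call uses the values from the response example and reports no problems; the second reports that distance is out of range.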
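result (cleaning result) description
Name | Type | Description |
---|---|---|
RET_OK | int | Cleaning result |
pipeline_stage_result | object | Per-stage pipeline results |
export_entity_num | int | Number of exported samples |
remaining_entity | int | Number of remaining samples |
unprocessed_entity | int | Number of samples not yet cleaned |
remove_emoji | object | Contains a single int field, processed_entity: the number of rows the operator was applied to |
remove_url | object | Contains a single int field, processed_entity: the number of rows the operator was applied to |
trad_to_simp | object | Contains a single int field, processed_entity: the number of rows the operator was applied to |
remove_id_card | object | Contains a single int field, processed_entity: the number of rows the operator was applied to |
remove_phone_number | object | Contains a single int field, processed_entity: the number of rows the operator was applied to |
remove_exception_char | object | Contains a single int field, processed_entity: the number of rows the operator was applied to |
replace_sim2trad | object | Contains a single int field, processed_entity: the number of rows the operator was applied to |
replace_trad2sim | object | Contains a single int field, processed_entity: the number of rows the operator was applied to |
replace_upper2lower | object | Contains a single int field, processed_entity: the number of rows the operator was applied to |
cut | object | Truncation: · remaining_entity: number of remaining samples · unprocessed_entity: number of samples not yet cleaned |
failReason | string | Failure reason |
pauseReason | string | Pause reason |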
pipeline_stage_result description
Name | Type | Description |
---|---|---|
clean | object | Execution result of the clean stage |
deduplication | object | Execution result of the deduplication stage |
desensitization | object | Execution result of the desensitization stage |
filter | object | Execution result of the filter stage |
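Execution result description
The execution results of the clean, deduplication, desensitization, and filter stages share the same fields:
Name | Type | Description |
---|---|---|
status | string | Execution result of the stage, for example "Success" |
operator_count | int | Number of operators in the stage |
entity_match_count | int | Number of matched samples |
each_operator_result | List<object> | List of cleaning results for each operator |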
each_operator_result description
Name | Type | Description |
---|---|---|
name | string | Operator name |
remaining_count | int | Number of samples remaining after this operator |
drop_count | int | Number of samples removed by this operator |
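The per-stage and per-operator fields described above can be flattened into rows to see which operators removed samples. The following is a minimal sketch; the function names are hypothetical, and the sample data is a shortened copy of the response example.

```python
def summarize_stages(pipeline_stage_result: dict) -> list:
    """Flatten stage results into (stage, operator, remaining, dropped) tuples."""
    rows = []
    for stage, outcome in pipeline_stage_result.items():
        for op in outcome.get("each_operator_result", []):
            rows.append((stage, op["name"], op["remaining_count"], op["drop_count"]))
    return rows


def operators_that_dropped(pipeline_stage_result: dict) -> list:
    """Keep only the operators whose drop_count is greater than zero."""
    return [
        (stage, name, dropped)
        for stage, name, _remaining, dropped in summarize_stages(pipeline_stage_result)
        if dropped > 0
    ]


# Shortened copy of pipeline_stage_result from the response example.
sample = {
    "deduplication": {
        "status": "Success", "operator_count": 1, "entity_match_count": 0,
        "each_operator_result": [
            {"name": "deduplication_simhash", "remaining_count": 0, "drop_count": 0},
        ],
    },
    "filter": {
        "status": "Success", "operator_count": 7, "entity_match_count": 1,
        "each_operator_result": [
            {"name": "filter_check_number_words", "remaining_count": 1, "drop_count": 0},
            {"name": "filter_check_character_repetition_removal", "remaining_count": 0, "drop_count": 1},
        ],
    },
}

print(operators_that_dropped(sample))
# prints [('filter', 'filter_check_character_repetition_removal', 1)]
```

In the response example, this points to filter_check_character_repetition_removal as the operator that removed the only sample.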