add a script to feed fastpath from failed measurements s3 bucket #404
Conversation
Ansible Run Output 🤖

| | |
| --- | --- |
| Pusher | @aagbsn |
| Action | pull_request |
| Working Directory | |
| Workflow | .github/workflows/check_ansible.yml |
| Last updated | Wed, 15 Apr 2026 12:36:48 GMT |
```python
# Configuration from environment (set these in your shell)
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")  # required if not using IAM role/profile
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")  # required if not using IAM role/profile
```
The source bucket and destination bucket are different and require two different access keys, so these parameters should be separated.
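One way to separate the two credential sets is to read them from distinct environment variables per bucket. The `SRC_`/`DST_` variable names below are illustrative, not from the PR:

```python
import os

def load_bucket_credentials():
    """Read separate credentials for the source and destination buckets.

    The SRC_/DST_ prefixes are an assumption for illustration; pick whatever
    naming convention fits the deployment.
    """
    src = {
        "aws_access_key_id": os.getenv("SRC_AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": os.getenv("SRC_AWS_SECRET_ACCESS_KEY"),
    }
    dst = {
        "aws_access_key_id": os.getenv("DST_AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": os.getenv("DST_AWS_SECRET_ACCESS_KEY"),
    }
    return src, dst
```

Each dict can then be splatted into its own `boto3.client("s3", **creds)` call so the two buckets never share a key.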
```python
p = Path(local_path)
msmt_id = p.stem
with p.open("r", encoding="utf-8") as f:
    data = json.load(f)
```
You should use ujson here since it's significantly faster than the stock python json parser.
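Since `ujson` mirrors the stdlib `loads`/`dumps` API, swapping it in is a one-line change; a guarded import keeps the script working where `ujson` is not installed (the fallback is an addition here, not in the PR):

```python
try:
    import ujson as json  # C implementation, significantly faster, same loads/dumps API
except ImportError:
    import json  # fall back to the stdlib parser if ujson is unavailable

def parse_measurement(raw: str) -> dict:
    """Parse a raw measurement body into a dict."""
    return json.loads(raw)
```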
```python
print("S3_BUCKET_NAME environment variable is required.")
return
s3 = get_s3_client()
for prefix, subs, objs in walk(s3, BUCKET_NAME, ""):
```
You should support a dry-run mode which doesn't actually do any copy or submission, to make sure that everything is working as intended.
```python
s3 = get_s3_client()
for prefix, subs, objs in walk(s3, BUCKET_NAME, ""):
    print(f"PREFIX: {prefix} subdirs={len(subs)} objects={len(objs)}")
    with ThreadPoolExecutor(max_workers=50) as _exe:
```
I wouldn't run this inside of a thread pool to avoid concurrency issues. I think we are OK with this not going too fast.
```python
content = data.get('content')
endpoint = f"{FASTPATH_API}/{msmt_id}"
try:
    resp = requests.post(endpoint, json=content, timeout=30)
```
You need to post the full payload (including the `format` and `content` keys), not just the value of `content`.
Because of that you don't actually need to parse the JSON body; you can just treat it as a binary blob.
You should not be using the `json` option of `requests`, since it re-serializes the JSON, which can lead to the bytes being sent not matching the hash embedded in the measurement_uid: https://github.com/ooni/backend/blob/master/ooniapi/services/ooniprobe/src/ooniprobe/routers/reports.py#L162
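The mismatch is easy to demonstrate with the stdlib alone: round-tripping the body through a JSON parser changes whitespace, so any hash over the original bytes stops matching. (The endpoint URL in the comment is a placeholder.)

```python
import hashlib
import json

# Raw measurement body exactly as stored in S3 (note the double space).
raw = b'{"format": "json",  "content": {"test_name": "web_connectivity"}}'

# requests.post(url, json=...) would effectively send this instead:
reserialized = json.dumps(json.loads(raw)).encode()

# Same JSON value, different bytes, so a hash over the body no longer matches.
assert hashlib.sha256(raw).hexdigest() != hashlib.sha256(reserialized).hexdigest()

# The fix: post the original bytes untouched, e.g.
#   requests.post(endpoint, data=raw,
#                 headers={"Content-Type": "application/json"}, timeout=30)
```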
```python
def process_postcan(s3, bucket, key, local_path):
    try:
        print("Downloading", key)
        s3.download_file(bucket, key, local_path)
```
You don't actually need to download this to a local file and re-read it; you can just stream it from S3 directly into the POST request.
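A minimal sketch of that streaming path, with the S3 client and the POST function injected so nothing here depends on the PR's wiring: `boto3`'s `get_object` returns a file-like `Body`, and `requests` accepts a file-like object for `data` and streams it out. The function name and parameters are illustrative:

```python
def stream_postcan(s3, bucket: str, key: str, endpoint: str, post):
    """Stream an S3 object straight into an HTTP POST without touching disk.

    `s3` is any client exposing boto3's get_object interface; `post` is a
    callable like requests.post. This is a sketch, not the PR's actual code.
    """
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    # The file-like Body is handed to the HTTP client unread; the client
    # streams it out, so the full object never needs to sit in memory or on disk.
    return post(endpoint, data=body,
                headers={"Content-Type": "application/json"}, timeout=30)
```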
This script reads from the ooniprobe-failed-reports bucket and submits the measurements to fastpath.