زمان جایگزینی IN() با EXISTS() – سوالات فرعی همبسته و نامرتبط

ek3nk4r 2024-12-26

0 4 خواندن این مطلب 8 دقیقه زمان میبرد

زمان جایگزینی IN() با EXISTS() – سوالات فرعی همبسته و نامرتبط

پیشنهاد ویژه

خرید فالوور واقعی خرید لایک اینستاگرام خرید ویو اینستاگرام خرید فالوور اینستاگرام

اگر کسی پیشنهاد می کند که پرس و جوی SQL شما را بازنویسی کنید تا از EXISTS به جای پرس و جو فرعی IN بدون بررسی طرح اجرا استفاده کنید، ممکن است اشتباه کرده باشد. در زیر دو مثال نشان داده شده است که پرسش‌های فرعی ناهمبسته و همبسته را نشان می‌دهد.

فهرست مطالب

مثال

من دو جدول دارم: “کاربر” و “پیام”.
یک پیام از یک کاربر منشأ می گیرد، بنابراین ما “user_id” را هم به عنوان کلید اصلی برای “کاربران” و یک کلید خارجی برای “پیام ها” می یابیم. زمانی که یک پیام نویسنده ای نداشته باشد، می تواند پوچ باشد. کاربران می‌توانند به یک کشور (“user_country”) تعلق داشته باشند، و پیام‌ها می‌توانند از یک کشور (“message_country”):

create extension if not exists pgcrypto;

create table users (
 user_id uuid primary key default gen_random_uuid()
 , user_country text
 , active boolean default true
);

create table messages (
 message_id uuid primary key default gen_random_uuid()
 , user_id uuid references users
 , message_country text
);

insert into users (active, user_country)
select 
 random()>0.1
 ,(ARRAY['CH','FR','ES'])[3*random()]
from generate_series(1,10)
;

insert into messages (user_id,message_country)
select 
 case when random()>0.1 then user_id end
 ,(ARRAY['CH','FR','ES'])[3*random()]
from users, generate_series(1,1000)
;

create index on users(user_country);
create index on messages(message_country asc, user_id asc);
analyze users, messages;

استعلام فرعی ناهمبسته

یک پرس و جو فرعی ناهمبسته نیازی به اجرا برای هر ردیفی که در آن ارزیابی می شود نیست، زیرا به ستون های آن سطرها ارجاع نمی دهد و همیشه نتیجه یکسانی را ایجاد می کند.

پرس و جوی زیر پیام هایی را که کاربر برای آنها فعال است فهرست می کند. کاربران فعال در یک پرسش فرعی که “user_id” را برمی گرداند، انتخاب می کند و از آن در یک عبارت Where استفاده می کند. IN ()، = ANY () یا = SOME () (اینها معادل هستند):

yugabyte=# explain (analyze, dist, costs off, summary off)
select m.*
from messages m
where user_id in (
 select user_id
 from users u
 where active
)
;

                                 QUERY PLAN
----------------------------------------------------------------------------
 Hash Join (actual time=3.013..9.913 rows=8091 loops=1)
   Hash Cond: (m.user_id = u.user_id)
   ->  Seq Scan on messages m (actual time=1.248..6.190 rows=10000 loops=1)
         Storage Table Read Requests: 12
         Storage Table Read Execution Time: 4.078 ms
         Storage Table Rows Scanned: 10000
   ->  Hash (actual time=1.745..1.745 rows=9 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on users u (actual time=1.731..1.734 rows=9 loops=1)
               Storage Filter: active
               Storage Table Read Requests: 1
               Storage Table Read Execution Time: 0.884 ms
               Storage Table Rows Scanned: 10

نه ردیف از “کاربران” خوانده شده و به “پیام ها” پیوسته است.

همین را می‌توانست به صورت پیوست نوشته شود:

yugabyte=# explain (analyze, dist, costs off, summary off)
select m.*
from messages m
join users u
on (m.user_id=u.user_id)
where u.active
;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Hash Join (actual time=2.469..9.494 rows=8091 loops=1)
   Hash Cond: (m.user_id = u.user_id)
   ->  Seq Scan on messages m (actual time=1.274..6.371 rows=10000 loops=1)
         Storage Table Read Requests: 12
         Storage Table Read Execution Time: 4.188 ms
         Storage Table Rows Scanned: 10000
   ->  Hash (actual time=1.178..1.179 rows=9 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on users u (actual time=1.127..1.131 rows=9 loops=1)
               Storage Filter: active
               Storage Table Read Requests: 1
               Storage Table Read Execution Time: 1.052 ms
               Storage Table Rows Scanned: 10

همچنین می توان با EXISTS نوشت:

yugabyte=# explain (analyze, dist, costs off, summary off)
select m.*
from messages m
where exists (
 select from users u
 where
 m.user_id = u.user_id
 and active
)
;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Hash Join (actual time=1.875..8.790 rows=8091 loops=1)
   Hash Cond: (m.user_id = u.user_id)
   ->  Seq Scan on messages m (actual time=1.154..6.126 rows=10000 loops=1)
         Storage Table Read Requests: 12
         Storage Table Read Execution Time: 4.002 ms
         Storage Table Rows Scanned: 10000
   ->  Hash (actual time=0.706..0.706 rows=9 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on users u (actual time=0.692..0.695 rows=9 loops=1)
               Storage Filter: active
               Storage Table Read Requests: 1
               Storage Table Read Execution Time: 0.621 ms
               Storage Table Rows Scanned: 10

برنامه ریز پرس و جو در همه موارد زیرپرس و جو را به یک پیوست تبدیل کرد و آنها به همین ترتیب عمل کردند.

پرسش فرعی مرتبط

من یک فیلتر اضافه می‌کنم: پیام‌ها باید از کشور کاربر باشد. از آنجایی که فیلتر می‌شود تا ردیف‌های کمتری ایجاد کند، می‌توانیم انتظار زمان پاسخ کمتری داشته باشیم، اما این پرس‌وجو به جای هشت میلی‌ثانیه، پنج ثانیه طول می‌کشد:

yugabyte=# explain (analyze, dist, costs off, summary off)
select m.*
from messages m
where user_id in (
 select user_id
 from users u
 where active
 and u.user_country = m.message_country
)
;

                                QUERY PLAN
---------------------------------------------------------------------------
 Seq Scan on messages m (actual time=4.163..5657.120 rows=2084 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 7916
   SubPlan 1
     ->  Seq Scan on users u (actual time=0.559..0.560 rows=2 loops=10000)
           Storage Filter: (active AND (user_country = m.message_country))
           Storage Table Read Requests: 1
           Storage Table Read Execution Time: 0.501 ms
           Storage Table Rows Scanned: 11

زمان پاسخ دهی بالا ناشی از loops=10000: پرس و جو فرعی برای هر سطر برای ارزیابی فیلتر جدید شامل هر دو جدول اجرا شده است.

یک فیلتر با یک پرسش فرعی که با نشان داده شده است Filter: (SubPlan 1)، شبیه به اتصال حلقه تو در تو است. این loops نشانگر پیچیدگی زمانی است.

برنامه ریز پرس و جو PostgreSQL نمی تواند این را به عنوان یک پیوست تبدیل کند زیرا پرس و جوی فرعی همبسته است: یکی از شرایط اتصال در پرس و جو است، در حالی که دیگری در طرح بندی به عبارت IN است.

بهتر نیست این دو قید را برگردانم:

yugabyte=# explain (analyze, dist, costs off, summary off)
select m.*
from messages m
where message_country in (
 select user_country
 from users u
 where u.user_id=m.user_id
 and active
)
;
                                QUERY PLAN
---------------------------------------------------------------------------
 Seq Scan on messages m (actual time=4.835..5720.221 rows=2084 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 7916
   SubPlan 1
     ->  Seq Scan on users u (actual time=0.566..0.567 rows=1 loops=10000)
           Storage Filter: (active AND (user_id = m.user_id))
           Storage Table Read Requests: 1
           Storage Table Read Execution Time: 0.507 ms
           Storage Table Rows Scanned: 11

برنامه ریز پرس و جو نتوانست این را به یک پیوستن تبدیل کند، اما من می توانم پرس و جو را با استفاده از یک join بازنویسی کنم:

yugabyte=# explain (analyze, dist, costs off, summary off)
select *
from messages m
join users    u on (
 m.user_id=u.user_id
 and
 m.message_country=u.user_country
 and u.active
) ;
                                                                        QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 YB Batched Nested Loop Join (actual time=4.527..9.453 rows=2084 loops=1)
   Join Filter: ((m.user_id = u.user_id) AND (m.message_country = u.user_country))
   ->  Seq Scan on users u (actual time=0.790..0.794 rows=9 loops=1)
         Storage Filter: active
         Storage Table Read Requests: 1
         Storage Table Read Execution Time: 0.691 ms
         Storage Table Rows Scanned: 10
   ->  Index Scan using messages_message_country_user_id_idx on messages m (actual time=3.544..7.153 rows=2084 loops=1)
         Index Cond: (ROW(message_country, user_id) = ANY (ARRAY[ROW(u.user_country, u.user_id), ROW($1, $1025), ROW($2, $1026), ..., ROW($1023, $2047)]))
         Storage Table Read Requests: 3
         Storage Table Read Execution Time: 4.701 ms
         Storage Table Rows Scanned: 2084
         Storage Index Read Requests: 3
         Storage Index Read Execution Time: 0.831 ms
         Storage Index Rows Scanned: 2084

این طرح کارآمد است زیرا جدول کوچک (“کاربران”) را به جدول بزرگ (“پیام”) می‌پیوندد و می‌تواند تمام محمولات پیوستن را پایین بیاورد. در PostgreSQL، این نه حلقه (loops=9، اما YugabyteDB آن را فقط با حلقه تودرتوی دسته‌ای به یک حلقه بهینه می‌کند تا 9 مقدار از جدول بیرونی به عنوان یک آرایه به پایین فشار داده شوند.

در این مورد، می‌توان با داشتن همه ستون‌های پیوسته در طرح‌بندی فرعی، همبستگی‌زدایی کرد:

yugabyte=# explain (analyze, dist, costs off, summary off)
select m.*
from messages m
where ( m.message_country,m.user_id) in (
 select u.user_country,   u.user_id
 from users u
 where active
)
;
                                                                        QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 YB Batched Nested Loop Join (actual time=5.342..10.323 rows=2084 loops=1)
   Join Filter: ((m.message_country = u.user_country) AND (m.user_id = u.user_id))
   ->  Seq Scan on users u (actual time=0.929..0.933 rows=9 loops=1)
         Storage Filter: active
         Storage Table Read Requests: 1
         Storage Table Read Execution Time: 0.830 ms
         Storage Table Rows Scanned: 10
   ->  Index Scan using messages_message_country_user_id_idx on messages m (actual time=4.231..7.894 rows=2084 loops=1)
         Index Cond: (ROW(message_country, user_id) = ANY (ARRAY[ROW(u.user_country, u.user_id), ROW($1, $1025), ROW($2, $1026), ..., ROW($1023, $2047)]))
         Storage Table Read Requests: 3
         Storage Table Read Execution Time: 5.359 ms
         Storage Table Rows Scanned: 2084
         Storage Index Read Requests: 3
         Storage Index Read Execution Time: 0.840 ms
         Storage Index Rows Scanned: 2084

امکان دیگر این است که از EXISTS استفاده کنید و تمام ستون‌ها را در زیرپرسی همبسته داشته باشید:

yugabyte=# explain (analyze, dist, costs off, summary off)
select m.*
from messages m
where exists (
 select from users u
 where
 m.user_id = u.user_id
 and
 m.message_country=u.user_country
 and u.active
)
;
                                                                        QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 YB Batched Nested Loop Join (actual time=4.356..10.501 rows=2084 loops=1)
   Join Filter: ((m.user_id = u.user_id) AND (m.message_country = u.user_country))
   ->  Seq Scan on users u (actual time=0.767..0.770 rows=9 loops=1)
         Storage Filter: active
         Storage Table Read Requests: 1
         Storage Table Read Execution Time: 0.658 ms
         Storage Table Rows Scanned: 10
   ->  Index Scan using messages_message_country_user_id_idx on messages m (actual time=3.409..8.277 rows=2084 loops=1)
         Index Cond: (ROW(message_country, user_id) = ANY (ARRAY[ROW(u.user_country, u.user_id), ROW($1, $1025), ROW($2, $1026), ..., ROW($1023, $2047)]))
         Storage Table Read Requests: 3
         Storage Table Read Execution Time: 5.606 ms
         Storage Table Rows Scanned: 2084
         Storage Index Read Requests: 3
         Storage Index Read Execution Time: 0.790 ms
         Storage Index Rows Scanned: 2084

توجه داشته باشید که ستون مورد استفاده در subquery یکتا اعلام شده است که در این مثال کلید اصلی است و به همین دلیل است که یک join معادل است. اگر اینطور نبود، نمی‌توانید پرس و جوی فرعی را با یک بند الحاق جایگزین کنید، زیرا منجر به ایجاد یک ردیف برای هر ردیف منطبق از سؤال فرعی می‌شود. بندهای IN() یا EXISTS() semi-join هستند و هیچ عبارت semi-join وجود ندارد.

می توانید آن را با:

alter table messages drop constraint messages_user_id_fkey;
alter table users drop constraint users_pkey;
create index on users ( user_id );
analyze users;

می توانید پرس و جوی بالا (با EXISTS) را اجرا کنید و همه متدهای پیوستن را با راهنمایی تست کنید

Nested Loop سطرها را از پرس و جوی فرعی با کپی می کند HashAggregate:

 YB Batched Nested Loop Join (actual time=37.856..43.425 rows=2084 loops=1)
   Join Filter: ((m.user_id = u.user_id) AND (m.message_country = u.user_country))
   ->  HashAggregate (actual time=0.926..0.928 rows=9 loops=1)
         Group Key: u.user_id, u.user_country
         ->  Seq Scan on users u (actual time=0.913..0.916 rows=9 loops=1)
               Storage Filter: active

ادغام پیوستن (/*+ MergeJoin(u m) */) و Hash Join (/*+ HashJoin(u m) */) نیمه پیوستن را نشان می دهد:

 Merge Semi Join (actual time=12.213..33.015 rows=2084 loops=1)
   Merge Cond: ((m.message_country = u.user_country) AND (m.user_id = u.user_id))
   ->  Index Scan using messages_message_country_user_id_idx on messages m (actual time=3.746..30.814 rows=8322 loops=1)
         Storage Table Read Requests: 9

برنامه ریز پرس و جو PostgreSQL می تواند برخی از سوالات فرعی مرتبط را به یک پیوست تبدیل کند و از روش های پیوستن بیشتری بهره مند شود، اما نه در همه موارد. وقتی شک دارید به نقشه اجرایی نگاه کنید.
وقتی یک پرس و جو را بازنویسی می کنید، همان نتیجه را انتظار دارید. طرح اجرا می تواند به اعتبارسنجی آن کمک کند، به خصوص با یک مورد آزمایشی قابل تکرار. همه احتمالات را در مورد تست خود بگنجانید، مانند مقادیر صفر و محدودیت های یکپارچگی. و به یاد داشته باشید که این رفتار ممکن است با نسخه های موتور پایگاه داده جدید تکامل یابد.

همه پایگاه های داده در مورد تبدیل بهینه ساز برابر نیستند. کریس آنتوگنینی MySQL، PostgreSQL، و Oracle را با انواع مختلف پرس و جوهای فرعی مقایسه کرده است: https://antognini.ch/2017/12/how-well-a-query-optimizer-handles-subqueries.

در PostgreSQL یا YugabyteDB، از آن استفاده نکنید IN()، =ANY، یا =SOME برای سوالات فرعی همبسته که در آن جدول بیرونی دارای سطرهای زیادی است زیرا ممکن است پرس و جوی فرعی را برای هر سطر اجرا کند. آن را با EXISTS جایگزین کنید تا یک شبه پیوست برای برنامه ریز پرس و جو ارائه شود تا روش اتصال بهتری را انتخاب کند.