While running a Hive SQL query today I ran into a strange join error; this post summarizes it.
The problem SQL:
-- 2019-05-03
-- For authors of videos published before 2019-05-03: the number and share of current VIP members,
-- and the number and share of historical VIP members
-- old videos (author ids)
with old_video_user as (
    select distinct user_id
    from `dw`.`dim_video`
    where published_at <= '2019-05-03'
),
-- yesterday's user VIP level table; quickly filters out current VIP users
-- (a one-day staleness in VIP status is acceptable here)
vip_user as (
    select old_video_user.user_id
    from (select distinct user_id
          from parquet.`/idl/user_vip_level`
          where vip != 0) vip
    join old_video_user on vip.user_id = old_video_user.user_id
),
-- not currently a VIP, but has had VIP in the past
old_vip_user as (
    select no_vip_now.user_id
    from (select *
          from old_video_user
          where user_id not in (select user_id from vip_user)) no_vip_now
    join (select user_id
          from `dw`.`dim_user`
          where vip_expired_at < now()) old_vip
      on no_vip_now.user_id = old_vip.user_id
)
select all_count.count as `总用户数`,
       vip_count.count as `当前vip用户数`,
       round(vip_count.count / all_count.count, 3) as `当前vip用户占总用户占比`,
       old_vip_count.count as `历史vip用户数`,
       round(old_vip_count.count / all_count.count, 3) as `历史vip用户占总用户占比`,
       (all_count.count - vip_count.count - old_vip_count.count) as `普通用户数`,
       round((all_count.count - vip_count.count - old_vip_count.count) / all_count.count, 3) as `普通用户占总用户占比`
from (select count(user_id) as count, 1 as id from old_video_user) all_count,
     (select count(user_id) as count, 1 as id from vip_user) vip_count,
     (select count(user_id) as count, 1 as id from old_vip_user) old_vip_count
Execution result:
java.sql.SQLException: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans Join Inner :- Aggregate [count(user_id#300) AS count#264L] : +- Aggregate [user_id#300], [user_id#300] : +- Project [user_id#300] : +- Filter (isnotnull(published_at#323) && (published_at#323 <= 2019-05-03)) : +- Relation[photos_count#278,duration#279,name#280,channel#281,point#282,mp4_etag#283,catalog#284,logo_etag#285,scenes#286,digest#287,shared_at#288,state#289,shared#290,post_id#291,tags#292,published#293,audio#294,logo_path#295,udid#296,mold_id#297,create_type#298,last_comment_id#299,user_id#300,reference_id#301,... 26 more fields] parquet +- Aggregate [count(user_id#300) AS count#266L] +- Project [user_id#300] +- Join Inner, (user_id#378 = user_id#300) :- Aggregate [user_id#378], [user_id#378] : +- Project [user_id#378] : +- Filter ((isnotnull(vip#379) && NOT (cast(vip#379 as int) = 0)) && isnotnull(user_id#378)) : +- Relation[user_id#378,vip#379,date#380] parquet +- Aggregate [user_id#300], [user_id#300] +- Project [user_id#300] +- Filter ((isnotnull(published_at#323) && (published_at#323 <= 2019-05-03)) && isnotnull(user_id#300)) +- Relation[photos_count#278,duration#279,name#280,channel#281,point#282,mp4_etag#283,catalog#284,logo_etag#285,scenes#286,digest#287,shared_at#288,state#289,shared#290,post_id#291,tags#292,published#293,audio#294,logo_path#295,udid#296,mold_id#297,create_type#298,last_comment_id#299,user_id#300,reference_id#301,... 26 more fields] parquet and Aggregate [count(user_id#300) AS count#268L] +- Project [user_id#300] +- Join Inner, (user_id#300 = user_id#381) :- Aggregate [user_id#300], [user_id#300] : +- Project [user_id#300] : +- Filter ((isnotnull(published_at#323) && ((published_at#323 <= 2019-05-03) && NOT user_id#300 IN (list#277 []))) && isnotnull(user_id#300)) : : +- Project [user_id#300 AS user_id#300#467] : : +- Join Inner, (user_id#378 = user_id#300) : : :- Aggregate [user_id#378], [user_id#378] : : : +- Project [user_id#378] : : : +- Filter ((isnotnull(vip#379) && NOT (cast(vip#379 as int) = 0)) && isnotnull(user_id#378)) : : : +- Relation[user_id#378,vip#379,date#380] parquet : : +- Aggregate [user_id#300], [user_id#300] : : +- Project [user_id#300] : : +- Filter ((isnotnull(published_at#323) && (published_at#323 <= 2019-05-03)) && isnotnull(user_id#300)) : : +- Relation[photos_count#278,duration#279,name#280,channel#281,point#282,mp4_etag#283,catalog#284,logo_etag#285,scenes#286,digest#287,shared_at#288,state#289,shared#290,post_id#291,tags#292,published#293,audio#294,logo_path#295,udid#296,mold_id#297,create_type#298,last_comment_id#299,user_id#300,reference_id#301,... 26 more fields] parquet : +- Relation[photos_count#278,duration#279,name#280,channel#281,point#282,mp4_etag#283,catalog#284,logo_etag#285,scenes#286,digest#287,shared_at#288,state#289,shared#290,post_id#291,tags#292,published#293,audio#294,logo_path#295,udid#296,mold_id#297,create_type#298,last_comment_id#299,user_id#300,reference_id#301,... 
26 more fields] parquet +- Project [user_id#381] +- Filter ((isnotnull(vip_expired_at#393) && (vip_expired_at#393 < 2021-07-19 16:56:54.135)) && (NOT user_id#381 IN (list#277 []) && isnotnull(user_id#381))) : +- Project [user_id#300 AS user_id#300#467] : +- Join Inner, (user_id#378 = user_id#300) : :- Aggregate [user_id#378], [user_id#378] : : +- Project [user_id#378] : : +- Filter ((isnotnull(vip#379) && NOT (cast(vip#379 as int) = 0)) && isnotnull(user_id#378)) : : +- Relation[user_id#378,vip#379,date#380] parquet : +- Aggregate [user_id#300], [user_id#300] : +- Project [user_id#300] : +- Filter ((isnotnull(published_at#323) && (published_at#323 <= 2019-05-03)) && isnotnull(user_id#300)) : +- Relation[photos_count#278,duration#279,name#280,channel#281,point#282,mp4_etag#283,catalog#284,logo_etag#285,scenes#286,digest#287,shared_at#288,state#289,shared#290,post_id#291,tags#292,published#293,audio#294,logo_path#295,udid#296,mold_id#297,create_type#298,last_comment_id#299,user_id#300,reference_id#301,... 26 more fields] parquet +- Relation[user_id#381,name#382,vip#383,icon#384,udid#385,age_zone#386,signature#387,created_at#388,level#389,platform#390,auth_token#391,updated_at#392,vip_expired_at#393,gender#394,first_udid#395,last_on#396,is_wechat#397,version#398,app_version#399,os#400,os_version#401,device_name#402,from_src#403,original_src#404,... 13 more fields] parquet Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true; at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296) at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291) at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291) at org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:736) at org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:819) at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:103) at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:632) at org.apache.zeppelin.scheduler.Job.run(Job.java:188) at org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748)
The error message says the join is problematic: an implicit cartesian product was detected. When I first saw this, together with the hint "Join condition is missing or trivial", I went through every join ... on clause in the SQL, checked them one by one, and found nothing wrong, which left me stuck. Only later did I notice the multi-table FROM clause in the final SELECT.
My original idea was to count each table separately and then combine the counted tables in one query; since each count subquery yields exactly one row, even a cartesian product is only 1 * 1 * 1 = 1 row, so it should not matter.
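As it turns out, the check appears to be purely structural: row counts seem to make no difference, only whether the inner join carries a usable join condition. A hypothetical minimal reproduction (not from the original query) that, to my understanding, triggers the same error:

-- hypothetical minimal repro: every derived table holds exactly one row,
-- yet Spark 2.x still reports "Detected implicit cartesian product for INNER join"
-- unless spark.sql.crossJoin.enabled=true or an explicit CROSS JOIN is used
select a.x, b.y
from (select 1 as x) a,
     (select 2 as y) b;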
So I googled the error message, and only then did I understand what was going on: the Spark configuration behind our ThriftServer does not allow implicit cartesian products. The final SELECT above reads from several tables in its FROM clause without any WHERE condition, so its result is simply the cartesian product of those tables; Spark plans such a comma-join as an inner join without an ON condition (this seems to differ from how MySQL handles it?? TODO: verify) and refuses to run an inner join that has no condition, presumably because dumping a raw cartesian product is almost never the result you actually want. Hence the two suggestions in the error message: use the CROSS JOIN syntax, or enable the configuration that allows implicit cross joins. CROSS JOIN is the join type in Hive/Spark SQL that deliberately outputs the cartesian product of two tables, so one fix is to rewrite the multi-table FROM as a chain of CROSS JOINs.
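A sketch of that rewrite (assuming the WITH clauses stay exactly as above; the select list is shortened to the three raw counts for brevity):

-- sketch: the comma-join rewritten as explicit CROSS JOINs;
-- the WITH clauses above stay unchanged, the 1 as id column is no longer needed
select all_count.count     as `总用户数`,
       vip_count.count     as `当前vip用户数`,
       old_vip_count.count as `历史vip用户数`
from (select count(user_id) as count from old_video_user) all_count
cross join (select count(user_id) as count from vip_user) vip_count
cross join (select count(user_id) as count from old_vip_user) old_vip_count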
The other fix is to enable the configuration item spark.sql.crossJoin.enabled=true, for example by adding it to the ThriftServer's Spark configuration file (a session-level alternative is sketched after the result below); with that in place the original SQL runs again and Spark handles the implicit cartesian product as a cross join automatically. Besides the two options the error message suggests, you can also add a WHERE condition (after all, a comma-join plus a WHERE predicate should be planned as an inner join, and the error complains precisely that the join condition is missing). So I gave each count subquery the same constant column to filter on and equated those columns in a WHERE clause; the final SELECT becomes:
select all_count.count as `总用户数`,
       vip_count.count as `当前vip用户数`,
       round(vip_count.count / all_count.count, 3) as `当前vip用户占总用户占比`,
       old_vip_count.count as `历史vip用户数`,
       round(old_vip_count.count / all_count.count, 3) as `历史vip用户占总用户占比`,
       (all_count.count - vip_count.count - old_vip_count.count) as `普通用户数`,
       round((all_count.count - vip_count.count - old_vip_count.count) / all_count.count, 3) as `普通用户占总用户占比`
from (select count(user_id) as count, 1 as id from old_video_user) all_count,
     (select count(user_id) as count, 1 as id from vip_user) vip_count,
     (select count(user_id) as count, 1 as id from old_vip_user) old_vip_count
where all_count.id = vip_count.id
  and all_count.id = old_vip_count.id
Running the full query again: success!
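For reference, the session-level alternative mentioned above: assuming Spark 2.x, where spark.sql.crossJoin.enabled is a runtime conf, it can also be toggled per JDBC/beeline session instead of in the ThriftServer configuration file, roughly like this:

-- assumption: Spark 2.x Thrift Server; the conf is settable per session
SET spark.sql.crossJoin.enabled=true;
-- then re-run the original query unchanged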
In general, outputting a cartesian product directly is not recommended; in the case I ran into it happened to be safe, because each of the derived tables is known to contain exactly one row.