Thursday 30 November 2017

OLAP Moving Average


Most people are familiar with the phrase, "this will kill two birds with one stone." If you're not, the phrase refers to an approach that addresses two goals in one action. (Unfortunately, the expression itself is a bit unpleasant, as most of us don't want to throw stones at innocent animals.) Today I'm going to cover some basics on two great features in SQL Server: the ColumnStore index (available only in SQL Server Enterprise) and the SQL Query Store. Microsoft actually implemented the ColumnStore index in SQL 2012 Enterprise, though they've enhanced it in the last two releases of SQL Server. Microsoft introduced the Query Store in SQL Server 2016. So, what are these features and why are they important? Well, I have a demo that will introduce both features and show how they can help us. Before I go any further, I'll mention that I also cover this (and other SQL 2016 features) in my CoDe Magazine article on the new SQL 2016 features. As a basic introduction, the ColumnStore index can help speed up queries that scan/aggregate over large amounts of data, and the Query Store captures query executions, execution plans, and runtime statistics that you would normally need to collect manually. Trust me when I say, these are great features.

For this demo, I'll be using the Microsoft Contoso Retail Data Warehouse demo database. Loosely speaking, Contoso DW is like "a really big AdventureWorks", with tables containing millions of rows. (The largest AdventureWorks table contains roughly 100,000 rows at most.) You can download the Contoso DW database here: microsoft-usdownloaddetails. aspxid18279. Contoso DW works very well when you want to test performance on queries against large tables. Contoso DW contains a standard data warehouse fact table called FactOnlineSales, with 12.6 million rows. That's certainly not the largest data warehouse table in the world, but it's not child's play either.

Suppose I want to summarize product sales amounts for 2009, and rank the products. I might query the fact table, join to the product dimension table, and use the RANK function, like so: Here's a partial result set of the top 10 rows, by total sales. On my laptop (i7, 16 GB of RAM), the query takes anywhere from 3-4 seconds to run. That might not seem like the end of the world, but some users might expect near-instant results (the way you might see near-instant results when using Excel against an OLAP cube). The only index I currently have on this table is a clustered index on the sales key. If I look at the execution plan, SQL Server makes a suggestion to add a covering index to the table. Now, just because SQL Server suggests an index doesn't mean you should blindly create indexes for every "missing index" message. However, in this instance, SQL Server detects that we are filtering based on the year, and using the product key and the sales amount. So SQL Server suggests a covering index, with the DateKey as the index key field. The reason we call this a "covering" index is that SQL Server will "bring along the non-key fields" we used in the query, "for the ride". That way, SQL Server doesn't need to use the table or the clustered index at all; the database engine can simply use the covering index for the query. Covering indexes are popular in certain data warehousing and reporting scenarios, although they do come at the cost of the database engine maintaining them. (Note: covering indexes have been around for a long time, so I still haven't gotten to the ColumnStore index and the Query Store.) So, I'll add the covering index:
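As a concrete sketch of what that looks like, the statements below approximate the ranking query and the suggested covering index. The exact table and column names (FactOnlineSales, DimProduct, DimDate, CalendarYear, SalesAmount) are assumptions based on the standard ContosoRetailDW schema, not a copy of the original demo script.

-- Rank 2009 product sales (assumed ContosoRetailDW column names)
SELECT  dp.ProductName,
        SUM(f.SalesAmount) AS TotalSales,
        RANK() OVER (ORDER BY SUM(f.SalesAmount) DESC) AS SalesRank
FROM    dbo.FactOnlineSales AS f
JOIN    dbo.DimProduct      AS dp ON dp.ProductKey = f.ProductKey
JOIN    dbo.DimDate         AS dd ON dd.DateKey    = f.DateKey
WHERE   dd.CalendarYear = 2009
GROUP BY dp.ProductName
ORDER BY TotalSales DESC;

-- The suggested covering index: DateKey as the key column, with the
-- non-key columns "brought along for the ride" via INCLUDE
CREATE NONCLUSTERED INDEX IX_FactOnlineSales_DateKey_Covering
    ON dbo.FactOnlineSales (DateKey)
    INCLUDE (ProductKey, SalesAmount);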
If I re-execute the same query I ran a moment ago (the one that aggregated the sales amount for each product), the query sometimes seems to run about a second faster, and I get a different execution plan, one that uses an index seek instead of an index scan (using the date key on the covering index to retrieve the 2009 sales). So, before the ColumnStore index, this could be one way to optimize this query in older versions of SQL Server. It runs a little faster than the first one, and I get an execution plan with an index seek instead of an index scan. However, there are some issues: the two execution operators, "Index Seek" and "Hash Match (Aggregate)", both essentially operate row by row. Imagine that in a table with hundreds of millions of rows. Related to that, think about the contents of a fact table: in this case, a single date key value and/or a single product key value might be repeated across hundreds of thousands of rows (remember, the fact table also has keys for geography, promotion, salesperson, etc.). So, when "Index Seek" and "Hash Match" work row by row, they are doing so over values that might be repeated across many other rows. This is normally where I'd segue to the SQL Server ColumnStore index, which offers a scenario for improving the performance of this query in astounding ways.

But before I do that, let's go back in time. Let's go back to the year 2010, when Microsoft introduced an add-in for Excel known as PowerPivot. Many people probably remember seeing demos of PowerPivot for Excel, where a user could read millions of rows from an external data source into Excel. PowerPivot would compress the data, and provide an engine to create pivot tables and pivot charts that performed at amazing speeds against the compressed data. PowerPivot uses an in-memory technology that Microsoft termed "VertiPaq". This in-memory technology in PowerPivot would essentially take duplicated business key / foreign key values and compress them down to a single vector. The in-memory technology would also scan these values in parallel, in blocks of several hundred at a time. The bottom line is that Microsoft baked a large amount of performance enhancements into the VertiPaq in-memory feature for us to use, right out of the proverbial box. Why am I taking this little stroll down memory lane? Because in SQL Server 2012, Microsoft implemented one of the most important features in the history of their database engine: the ColumnStore index. The index is really an index in name only: it is a way of taking a SQL Server table and creating a compressed, in-memory columnstore that compresses duplicated foreign key values down to single vector values. Microsoft also created a new buffer pool to read these compressed vector values in parallel, creating the potential for huge performance gains.

So, I'm going to create a columnstore index on the table, and I'll see how much better (and more efficiently) the query runs, versus the query that runs against the covering index. I'll create a duplicate copy of FactOnlineSales (I'll call it FactOnlineSalesDetailNCCS), and I'll create the columnstore index on the duplicated table; that way I won't interfere with the original table and the covering index in any way. Next, I'll create a columnstore index on the new table:
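For reference, a creation script along these lines would match the description; the duplicate-table name follows the text, while the exact column list is an assumption on my part:

-- Copy the fact table, then add a nonclustered columnstore index on it
-- (in SQL Server 2012 this makes the copy read-only, which is fine for a demo)
SELECT * INTO dbo.FactOnlineSalesDetailNCCS
FROM   dbo.FactOnlineSales;

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCS_FactOnlineSalesDetail
    ON dbo.FactOnlineSalesDetailNCCS
       (ProductKey, DateKey, StoreKey, PromotionKey, CurrencyKey, SalesAmount);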
Note several things: I've specified several foreign key columns, as well as the sales amount. Remember that a columnstore index is not like a traditional row-store index. There is no "key". We are simply indicating which columns SQL Server should compress and place in an in-memory columnstore. To use the analogy of PowerPivot for Excel: when we create a columnstore index, we're telling SQL Server to do essentially the same thing PowerPivot did when we imported 20 million rows into Excel using PowerPivot. So, I'll re-run the query, this time using the duplicated FactOnlineSalesDetailNCCS table that contains the columnstore index. This query runs instantly, in less than a second. And I can also say that even if the table had hundreds of millions of rows, it would still run in the proverbial "bat of an eyelash". We could look at the execution plan (and in a few moments, we will), but now it's time to cover the Query Store feature.

Imagine for a moment that we ran both queries overnight: the query that used the regular FactOnlineSales table (with the covering index) and then the query that used the duplicated table with the columnstore index. When we log in the next morning, we'd like to see the execution plan for both queries as they took place, as well as the execution statistics. In other words, we'd like to see the same statistics that we'd be able to see if we ran both queries interactively in SQL Management Studio, turned on TIME and IO statistics, and viewed the execution plan right after executing the query. Well, that's what the Query Store allows us to do: we can turn on (enable) the Query Store for a database, which will trigger SQL Server to store query execution and plan statistics so that we can view them later. So, I'm going to enable the Query Store on the Contoso database with a command like the following (and I'll also clear out any caching):
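The statements I use look something like the sketch below; the database name and the specific cache-clearing commands are my assumptions (and clearing caches is only appropriate on a test box, never in production):

-- Enable the Query Store on the demo database and start from a clean slate
ALTER DATABASE ContosoRetailDW SET QUERY_STORE = ON;
ALTER DATABASE ContosoRetailDW SET QUERY_STORE (OPERATION_MODE = READ_WRITE);

DBCC FREEPROCCACHE;       -- clear cached execution plans
DBCC DROPCLEANBUFFERS;    -- clear data pages from the buffer pool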
Then I'll run the two queries (and "pretend" that I ran them hours ago). Now let's pretend they ran hours ago. According to what I said, the Query Store will capture the execution statistics. So how do I view them? Fortunately, that's quite easy. If I expand the Contoso DW database, I'll see a Query Store folder. The Query Store has tremendous functionality and I'll try to cover much of it in subsequent blog posts. But for right now, I want to view the execution statistics for the two queries, and specifically examine the execution operators for the columnstore index. So I'll right-click on Top Resource Consuming Queries and run that option. That gives me a chart like the one below, where I can see the execution duration (in milliseconds) for all queries that were executed. In this instance, Query 1 was the query against the original table with the covering index, and Query 2 was against the table with the columnstore index. The numbers don't lie: the columnstore index outperformed the original table/covering index by a factor of almost 7 to 1. I can change the metric to look at memory consumption instead. In this case, note that Query 2 (the columnstore index query) used far more memory. This clearly demonstrates why the columnstore index represents "in-memory" technology: SQL Server loads the entire columnstore index into memory, and uses a completely different buffer pool with enhanced execution operators to process the index.

OK, so we have some graphs to view execution statistics. Can we see the execution plan (and execution operators) associated with each execution? Yes, we can. If I click on the vertical bar for the query that used the columnstore index, you'll see the execution plan below. The first thing we see is that SQL Server performed a columnstore index scan, which represents nearly 100% of the cost of the query. You might be saying, "Wait a minute, the first query used a covering index and performed an index seek, so how can a columnstore index scan be faster?" That's a legitimate question, and fortunately there's an answer. Even when the first query performed an index seek, it still executed "row by row". If I place the mouse over the columnstore index scan operator, I see a tooltip (like the one below), with one important setting: the Execution Mode is BATCH (as opposed to ROW, which is what we had with the first query using the covering index). This BATCH mode tells us that SQL Server is processing the compressed vectors (for any foreign key values that are duplicated, such as the product key and the date key) in batches of almost 1,000, in parallel. So SQL Server is still able to process the columnstore index much more efficiently. Additionally, if I place the mouse over the Hash Match (Aggregate) task, I also see that SQL Server is aggregating the columnstore index using Batch mode (although the operator itself represents such a small percentage of the cost of the query). Finally, you might be asking, "OK, so SQL Server compresses the values in the data, treats the values as vectors, and reads them in blocks of almost a thousand values in parallel, but my query only wants the data for 2009. So is SQL Server scanning over the entire set of data?" Again, a good question. The answer is, "Not really." Fortunately for us, the new columnstore index buffer pool performs another function called "segment elimination". Basically, SQL Server will examine the vector values for the date key column in the columnstore index, and eliminate the segments that fall outside the range of the year 2009. I'll stop here. In subsequent blog posts I'll cover both the columnstore and the Query Store in more detail. Essentially, what we've seen here today is that the ColumnStore index can dramatically speed up queries that scan/aggregate over large amounts of data, and the Query Store captures query executions and allows us to examine execution and performance statistics later.

In the end, we'd like to produce a result set that shows the following. Note three things: the columns essentially pivot all of the possible Return Reasons, after showing the sales amount; the result set contains subtotals by the week ending (Sunday) date across all customers (where the Customer is NULL); and the result set contains a grand total row (where the Customer and the Date are both NULL). First, before I get into the SQL end, we could use the dynamic pivot/matrix capability in SSRS. We would simply need to combine the two result sets by one column and then feed the results into an SSRS matrix control, which will spread the return reasons across the columns axis of the report. However, not everyone uses SSRS (though most people should). But even then, developers sometimes need to consume result sets in something other than a reporting tool. So for this example, let's assume we want to produce the result set for a web grid page, and possibly the developer wants to "strip out" the subtotal rows (where I have a ResultSetNum value of 2 and 3) and place them in a summary grid. So, bottom line, we need to generate the output above directly from a stored procedure. And as an added twist, next week there could be Return Reasons X and Y and Z, so we don't know how many return reasons there could be. We simply want the query to pivot on the possible distinct values for Return Reason. Here is where the T-SQL PIVOT has a restriction: we need to supply it the possible values. Since we won't know that until run time, we need to generate the query string dynamically using the dynamic SQL pattern. The dynamic SQL pattern involves generating the syntax, piece by piece, storing it in a string, and then executing the string at the end. Dynamic SQL can be tricky, as we have to embed syntax inside a string. But in this case, it's our only true option if we want to handle a variable number of return reasons. I've always found that the best way to create a dynamic SQL solution is to figure out what the "ideal" generated query would be at the end (in this case, given the Return Reasons we know about), and then reverse-engineer it by piecing it together one part at a time. And so, here is the SQL we'd need if we knew the Return Reasons (A through D) were static and would never change. The query does the following: it combines the data from SalesData with the data from ReturnData, where we "hard-wire" the word Sales as an action type from the Sales table, and then use the Return Reason from the Return data in the same ActionType column. That gives us a clean ActionType column on which to pivot. We combine the two SELECT statements into a common table expression (CTE), which is basically a derived table subquery that we subsequently use in the next statement (the PIVOT). Then comes a PIVOT statement against the CTE, which sums the dollars for the ActionType into one of the possible ActionType values. Note that this isn't the final result set; we place it in a CTE that reads from the first CTE, because we want to do multiple groupings at the end. Finally comes the SELECT statement that reads from the PIVOTCTE and combines it with a subsequent query against the same PIVOTCTE, but where we also implement two groupings using the GROUPING SETS feature in SQL 2008: grouping by the week ending date (dbo.WeekEndingDate) and grouping for all rows ().
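Reconstructed as a sketch, the "hard-wired" version of that query would look something like the following. The table and column names (dbo.SalesData, dbo.ReturnData, WeekEndingDate, Customer, SalesAmount, ReturnReason, ReturnAmount) are stand-ins, since the original listing isn't reproduced here, and this version is condensed: the original combined two reads of the CTE, while a single GROUPING SETS with three sets produces the same shape of output.

;WITH ActionCTE AS
(
    -- Union sales and returns into one ActionType column to pivot on
    SELECT WeekEndingDate, Customer, 'Sales' AS ActionType, SalesAmount AS Dollars
    FROM   dbo.SalesData
    UNION ALL
    SELECT WeekEndingDate, Customer, ReturnReason, ReturnAmount
    FROM   dbo.ReturnData
),
PivotCTE AS
(
    SELECT WeekEndingDate, Customer,
           [Sales], [Reason A], [Reason B], [Reason C], [Reason D]
    FROM   ActionCTE
    PIVOT (SUM(Dollars)
           FOR ActionType IN ([Sales], [Reason A], [Reason B], [Reason C], [Reason D])) AS p
)
SELECT WeekEndingDate, Customer,
       SUM([Sales])    AS SalesAmount,
       SUM([Reason A]) AS [Reason A],
       SUM([Reason B]) AS [Reason B],
       SUM([Reason C]) AS [Reason C],
       SUM([Reason D]) AS [Reason D]
FROM   PivotCTE
GROUP BY GROUPING SETS ((WeekEndingDate, Customer), (WeekEndingDate), ());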
So if we knew with certainty that we'd never have more Return Reason codes, that would be the solution. However, we need to account for other reason codes, so we need to generate that entire query above as one big string, where we construct the possible return reasons as one comma-separated list. I'm going to show the full T-SQL code to generate (and execute) the desired query, and then I'll break it out into parts and explain each step. So first, here's the complete code to dynamically generate what I've shown above. There are basically five steps we need to cover.

Step 1. We know that somewhere in the mix, we need to generate a string for this part of the query: SalesAmount, Reason A, Reason B, Reason C, Reason D. What we can do is build a temporary common table expression that combines the hard-wired "Sales Amount" column with the unique list of possible reason codes. Once we have that in a CTE, we can use the nice little trick of FOR XML PATH('') to collapse those rows into a single string, put a comma in front of each row that the query reads, and then use STUFF to replace the first instance of a comma with an empty space. This is a trick you can find in hundreds of SQL blogs. So this first part builds a string called @ActionString that we can use further down.

Step 2. We also know that we'll want to SUM the generated/pivoted reason columns, along with the standard sales column. So we'll need a separate string for that, which I'll call @SumString. I'll simply take the original @ActionString, and then REPLACE the outer brackets with the SUM syntax, plus the original brackets.

Step 3. Now the real work begins. Using that original query as a model, we want to generate the original query (starting with the UNION of the two tables), but replacing any references to the pivoted columns with the strings we generated dynamically above. Also, while not absolutely required, I've created a variable to simply hold any carriage return/line feed combinations that we want to embed into the generated query (for readability). So we'll construct the entire query into a variable called @SQLPivotQuery.

Step 4. We continue constructing the query, concatenating the syntax we can "hard-wire" with the @ActionSelectString (which we generated dynamically to hold all the possible return reason values).

Step 5. Finally, we'll generate the last part of the pivot query, which reads from the second common table expression (PIVOTCTE, from the model above) and generates the final SELECT that reads from PIVOTCTE and combines it with a second read against PIVOTCTE to implement the grouping sets. Finally, we can "execute" the string using the SQL system stored proc sp_executesql. So hopefully you can see that the process to follow for this type of effort is: determine what the final query would be, based on your current set of data and values (i.e., build a query model), and then write the T-SQL code necessary to generate that query model as a string. Arguably the most important part is determining the unique set of values on which you'll pivot, and then collapsing them into one string using the STUFF function and the FOR XML PATH('') trick.
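Since the original listing isn't reproduced in this post, here is a condensed sketch of the five steps, under the same assumed table names as above. The variable and helper names (@ActionString, @SumString, @SQLPivotQuery, #Actions) are mine, and this sketch builds @SumString with a second FOR XML pass rather than the REPLACE approach described in Step 2.

DECLARE @ActionString  nvarchar(max),
        @SumString     nvarchar(max),
        @SQLPivotQuery nvarchar(max);

-- Distinct list of pivot values: the hard-wired 'Sales' plus every return reason
SELECT ActionType
INTO   #Actions
FROM ( SELECT 'Sales' AS ActionType
       UNION
       SELECT DISTINCT ReturnReason FROM dbo.ReturnData ) AS a;

-- Step 1: [Sales],[Reason A],[Reason B],...  (FOR XML PATH('') + STUFF trick)
SELECT @ActionString =
       STUFF((SELECT ',[' + ActionType + ']' FROM #Actions FOR XML PATH('')), 1, 1, '');

-- Step 2: SUM([Sales]) AS [Sales], SUM([Reason A]) AS [Reason A], ...
SELECT @SumString =
       STUFF((SELECT ',SUM([' + ActionType + ']) AS [' + ActionType + ']'
              FROM #Actions FOR XML PATH('')), 1, 1, '');

-- Steps 3-5: splice the generated lists into the query template and execute it
SET @SQLPivotQuery = N'
;WITH ActionCTE AS
 ( SELECT WeekEndingDate, Customer, ''Sales'' AS ActionType, SalesAmount AS Dollars
     FROM dbo.SalesData
   UNION ALL
   SELECT WeekEndingDate, Customer, ReturnReason, ReturnAmount
     FROM dbo.ReturnData ),
 PivotCTE AS
 ( SELECT WeekEndingDate, Customer, ' + @ActionString + N'
     FROM ActionCTE
     PIVOT (SUM(Dollars) FOR ActionType IN (' + @ActionString + N')) AS p )
SELECT WeekEndingDate, Customer, ' + @SumString + N'
  FROM PivotCTE
 GROUP BY GROUPING SETS ((WeekEndingDate, Customer), (WeekEndingDate), ());';

EXEC sp_executesql @SQLPivotQuery;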
So what's on my mind today? Well, at least 13 items. Two summers ago, I wrote a draft BDR that focused (in part) on the role of education and the value of a good liberal arts background, not just for the software industry but for other industries as well. One of the themes of that particular BDR emphasized a pivotal and enlightened viewpoint from renowned software architect Allen Holub regarding the liberal arts. I'll (faithfully) paraphrase his message: he highlighted the parallels between programming and studying history, by reminding everyone that history is reading and writing (and, I'll add, identifying patterns), and software development is also reading and writing (and again, identifying patterns). And so I wrote an opinion piece that focused on this and other related topics. But to this day, I had never gotten around to publishing it. Every so often I'd think about revising it, and I'd even sit down for a few minutes and make some adjustments to it. But then life in general would get in the way and I'd never finish it.

So what changed? A few weeks ago, fellow CoDe Magazine writer and industry leader Ted Neward wrote a piece in his regular column, The Managed Coder, that caught my attention. The title of the article is On Liberal Arts, and I recommend that everyone read it. Ted discusses the value of a liberal arts background, the false dichotomy between a liberal arts background and success in software development, and the need to write/communicate well. He talks about some of his own past encounters with HR personnel management regarding his educational background. He also emphasizes the need to accept and adapt to changes in our industry, as well as the hallmarks of a successful software professional (being reliable, planning ahead, and learning to get past initial conflict with other team members). So it's a great read, as are Ted's other CoDe articles and blog entries. It also got me back to thinking about my views on this (and other topics), and finally prompted me to finish my own editorial. So, better late than never, here are my current Baker's Dozen of Reflections:

I have a saying: water freezes at 32 degrees. If you're in a training/mentoring role, you might think you're doing everything in the world to help someone, when in fact they're only feeling a temperature of 34 degrees, and therefore things aren't solidifying for them. Sometimes it takes just a little more effort, or another catalyst, or a new perspective, which means those with prior education can draw on different sources. Water freezes at 32 degrees. Some people can maintain high levels of concentration even in a room full of noisy people. I'm not one of them; occasionally I need some privacy to think through a critical issue. Some people describe this as "you've got to learn to walk away from it." Stated another way, it's a search for rarefied air. This past week I spent hours in a half-lit, quiet room with a whiteboard, until I fully understood a problem. Only then could I talk with other developers about a solution. The message here isn't to preach about how you should go about your business of solving problems, but rather for everyone to know their strengths and what works, and to use them to your advantage as much as possible.

Some phrases are like nails on a chalkboard for me. "Use it as a teaching moment" is one. (Why is it like nails on a chalkboard? Because if you're in a mentoring role, you should usually be in teaching-moment mode anyway, just subtly.) Here's another: "I can't really explain it in words, but I understand it." That might sound a bit cold, but if a person truly can't explain something in words, maybe they don't understand it. Sure, a person can have a fuzzy sense of how something works; I can bluff my way through describing how a digital camera works, but the truth is that I don't really understand it all that well. There is a field of study known as epistemology (the study of knowledge). One of the fundamental bases of understanding anything, whether it's a camera or a design pattern, is the ability to establish context, to identify the chain of related events, the attributes of any components along the way, and so on. Yes, understanding is sometimes very hard work, but diving into a topic and breaking it apart is worth the effort. Even those who eschew certification will acknowledge that the process of studying for certification tests helps fill gaps in knowledge. A database manager is more likely to hire a database developer who can speak extemporaneously (and effortlessly) about transaction isolation levels and triggers, as opposed to someone who sort of knows about them but struggles to describe their usage. There's another corollary here. Ted Neward recommends that developers take up public speaking, blogging, etc. I agree 100%. The process of public speaking and blogging will practically force you to start thinking about topics and breaking down definitions that you might have otherwise taken for granted. A few years ago I thought I understood the T-SQL MERGE statement quite well.
But it was only after writing about it, speaking about it, and fielding questions from others who had perspectives that had never occurred to me that my level of understanding increased exponentially. I know the story of a hiring manager who once interviewed an author/developer for a contract position. The hiring manager was contemptuous of publications in general, and barked at the applicant, "So, if you're going to work here, would you rather be writing books or writing code?" Yes, I'll grant that in any industry there will be a few pure academics. But what the hiring manager missed was the opportunities for strengthening and sharpening skill sets. While cleaning out an old box of books, I came across a treasure from the 1980s: Programmers at Work, which contains interviews with a very young Bill Gates, Ray Ozzie, and other well-known names. Every interview and every insight is worth the price of the book. In my opinion, the most interesting interview was with Butler Lampson, who gave some powerful advice: To hell with computer literacy. It's absolutely ridiculous. Study mathematics. Learn to think. Read. Write. These things are of more lasting value. Learn how to prove theorems: a lot of evidence has accumulated over the centuries suggesting that this skill is transferable to many other things. Butler speaks the truth. I'll add to that point: learn how to play devil's advocate against yourself. The more you can reality-check your own processes and work, the better off you'll be.

The great computer scientist Allen Holub made the connection between software development and the liberal arts, specifically the subject of history. Here was his point: what is history? Reading and writing. What is software development? Among other things, reading and writing. I used to give my students T-SQL essay questions as practice tests. One student joked that I acted more like a law professor. Well, just like coach Don Haskins said in the movie Glory Road, my way is hard. I firmly believe in a strong intellectual foundation for any profession. Just as applications can benefit from frameworks, individuals and their thought processes can benefit from human frameworks as well. That is the fundamental basis of scholarship. There is a story that back in the 1970s, IBM expanded their recruiting efforts at the major universities by focusing on the best and brightest of the liberal arts graduates. Even then they recognized that the best readers and writers might someday become strong programmer/systems analysts. (Feel free to use that story on any HR type who insists that a candidate must have a computer science degree.) And speaking of history: if for no other reason, it's important to remember the history of product releases. If I'm doing work at a client site that's still using SQL Server 2008 or even (gasp) SQL Server 2005, I have to remember which features were implemented in which versions over time.

Ever have a favorite doctor whom you liked because he or she explained things in plain English, gave you the straight truth, and earned your trust to operate on you? Those are mad skills, and they are the result of experience and hard work that takes years and even decades to cultivate. There are no guarantees of job success; focus on the facts, take some calculated risks when you're sure you can see your way to the finish line, let the chips fall where they may, and never lose sight of being like that doctor who earned your trust. Even though some days I fall short, I try to treat my clients and their data the way a doctor treats patients. (Even though the doctor makes more money.) There are many cliches I detest, but here's one I don't hate: there is no such thing as a bad question. As a former instructor, one thing that drew my ire was hearing someone criticize another person for asking a supposedly stupid question. A question indicates that a person acknowledges they have a gap in knowledge they're looking to fill. Yes, some questions are better worded than others, and some questions require additional framing before they can be answered. But the journey from forming a question to an answer is likely to generate an active mental process in others. Those are all good things. Many good and fruitful discussions originate with a stupid question.
I work across the board in SSIS, SSAS, SSRS, MDS, PPS, SharePoint, Power BI, DAX, all the tools in the Microsoft BI stack. I still write some code from time to time. But guess what? I still spend a great deal of time writing T-SQL code to profile data as part of the discovery process. All application developers should have good T-SQL chops. Ted Neward writes (correctly) about the need to adapt to technology changes. I'll add to that the need to adapt to client/employer changes. Companies change business rules. Companies acquire other companies (or become the target of an acquisition). Companies make mistakes in communicating business requirements and specifications. Yes, we can sometimes play a role in helping to manage those changes, and sometimes we're the bug and not the windshield. These changes sometimes cause great pain for everybody, especially the I.T. people. This is a fact of life that we have to deal with. Just as no developer writes bug-free code every time, no I.T. person deals well with change every single time. One of the biggest struggles I've had in my 28 years in this industry is showing patience and restraint when changes are flying from many different directions. Here is where my prior suggestion about searching for rarefied air can help. If you can manage to assimilate changes into your thought process without feeling overwhelmed, odds are you'll be a significant asset. In the last 15 months I've had to deal with a huge amount of professional change. It has been very difficult at times, but I've resolved that change will be the norm, and I've tried to tweak my own habits as best I can to cope with frequent (and uncertain) change. It's hard, very hard. But as coach Jimmy Dugan said in the movie A League of Their Own: "Of course it's hard. If it wasn't hard, everyone would do it. The hard is what makes it great." A powerful message.

There's been talk in the industry over the last few years about conduct at professional conferences (and conduct in the industry as a whole). Many respected writers have written very good editorials on the topic. Here's my input, for what it's worth. It's a message to those individuals who have chosen to behave badly: Dude, it shouldn't be that hard to behave like an adult. A few years ago, CoDe Magazine Chief Editor Rod Paddock made some great points in an editorial about Codes of Conduct at conferences. It's definitely unfortunate to have to remind people of what they should expect of themselves. But the problems go deeper. A few years ago I sat on a five-person panel (3 women, 2 men) at a community event on Women in Technology. The other male stated that men succeed in this industry because the Y chromosome gives men an advantage in areas of performance. The individual who made these remarks is a highly respected technology expert, and not some bozo making dongle remarks at a conference or sponsoring a programming contest where first prize is a date with a bikini model. Our world is becoming increasingly polarized (just watch the news for five minutes), sadly with emotion often winning over reason. Even in our industry, recently I heard someone in a position of responsibility bash software tool XYZ based on a ridiculous premise and then give false praise to a competing tool. So many opinions, so many arguments, but here's the key: before taking a stand, do your homework and get the facts. Sometimes both sides are partly right, or partly wrong. There's only one way to determine which: get the facts. As Robert Heinlein wrote, "Facts are your single clue. Get the facts!" Of course, once you get the facts, the next step is to express them in a meaningful and even compelling way.
There's nothing wrong with using some emotion in an intellectual debate, but it IS wrong to replace an intellectual debate with emotion and a false agenda. A while back I faced resistance to SQL Server Analysis Services from someone who claimed the tool couldn't do feature XYZ. The specifics of XYZ don't matter here. I spent about two hours that evening working up a demo to cogently demonstrate that the original claim was false. In that example, it worked. I can't swear it will always work, but to me that's the only way. I'm old enough to remember life as a teen in the 1970s. Back then, when a person lost his/her job, (often) it was because the person just wasn't cutting the mustard. Fast-forward to today: a sad fact of life is that even talented people are now losing their jobs because of changing economic conditions. There's never a fool-proof method for immunity, but now more than ever it's critical to provide a high level of what I call the Three Vs (value, versatility, and velocity) for your employer/clients. I might not always like working weekends or very late at night to do the proverbial work of two people, but then I remember there are folks out there who would give anything to be working at 1 AM to feed their families and pay their bills. Always be yourself, your BEST self. Some people need inspiration from time to time. Here's mine: the great sports movie, Glory Road. If you've never watched it, and even if you're not a sports fan, I can almost guarantee you'll be moved like never before. And I'll close with this. If you need some major motivation, I'll refer to a story from 2006. Jason McElwain, a high school student with autism, came off the bench to score twenty points in a high school basketball game in Rochester, New York. Here's a great YouTube video. His mother said it all: "This is the first moment Jason has ever succeeded and is proud of himself. I look at autism as the Berlin Wall. He cracked it."

To anyone who wanted to attend my session at today's SQL Saturday event in DC, I apologize that the session had to be cancelled. I hate to make excuses, but a combination of getting back late from Detroit (client trip), a car that's dead (blown head gasket), and some sudden health issues with my wife have made it impossible for me to attend. Back in August, I did the same session (ColumnStore Index) for PASS as a webinar. You can go to this link to access the video (it'll be streamed, as all PASS videos are streamed). The link does require that you fill out your name and email address, but that's it. And then you can watch the video. Feel free to contact me if you have questions, at kgoffkevinsgoff.

November 15, 2013
Getting started with Windows Azure and creating SQL Databases in the cloud can be a bit daunting, especially if you've never tried out any of Microsoft's cloud offerings. Fortunately, I've created a webcast to help people get started. This is an absolute beginner's guide to creating SQL Databases under Windows Azure. It assumes zero prior knowledge of Azure. You can go to the BDBI Webcasts section of this website and check out my webcast (dated 11/10/2013). Or you can just download the webcast videos right here: here is part 1 and here is part 2. You can also download the slide deck here.

November 03, 2013
Topic this week: SQL Server Snapshot Isolation Levels, added in SQL Server 2005. To this day, there are still many SQL developers, many good SQL developers, who either aren't aware of this feature or haven't had time to look at it. Hopefully this information will help.
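As a quick, minimal sketch of the topic (the database and table names below are placeholders, not from the webcast):

-- Allow snapshot isolation at the database level, then use it from a session
ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
    -- Readers see a consistent, versioned snapshot and do not take shared locks
    SELECT COUNT(*) FROM dbo.SomeTable;
COMMIT;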
Companion webcast will be uploaded in the next day; look for it in the BDBI Webcasts section of this blog.

October 26, 2013
I'm going to start a weekly post of T-SQL tips, covering many different versions of SQL Server over the years. Here's a challenge many developers face. I'll whittle it down to a very simple example, but one where the pattern applies to many situations. Suppose you have a stored procedure that receives a single vendor ID and updates the freight for all orders with that vendor ID:

create procedure dbo.UpdateVendorOrders
    @VendorID int
AS
    update Purchasing.PurchaseOrderHeader
    set Freight = Freight + 1
    where VendorID = @VendorID

Now, suppose we need to run this for a set of vendor IDs. Today we might run it for three vendors, tomorrow for five vendors, the next day for 100 vendors. We want to pass in the vendor IDs. If you've worked with SQL Server, you can probably guess where I'm going with this. The big question is: how do we pass a variable number of Vendor IDs? Or, stated more generally, how do we pass an array, or a table of keys, to a procedure? Something along the lines of: exec dbo.UpdateVendorOrders @SomeListOfVendors. Over the years, developers have come up with different methods: Going all the way back to SQL Server 2000, developers might create a comma-separated list of vendor keys, and pass the CSV list as a varchar to the procedure. The procedure would shred the CSV varchar variable into a table variable and then join the PurchaseOrderHeader table to that table variable (to update the Freight for just those vendors in the table). I wrote about this in CoDe Magazine back in early 2005 (code-magazinearticleprint. aspxquickid0503071ampprintmodetrue. Tip 3). In SQL Server 2005, you could actually create an XML string of the vendor IDs, pass the XML string to the procedure, and then use XQUERY to shred the XML into a table variable. I also wrote about this in CoDe Magazine back in 2007 (code-magazinearticleprint. aspxquickid0703041ampprintmodetrue. Tip 12). Also, some developers will populate a temp table ahead of time, and then reference the temp table inside the procedure. All of these certainly work, and developers have had to use these techniques because for years there was NO WAY to directly pass a table to a SQL Server stored procedure. Until SQL Server 2008, when Microsoft implemented the table type. This FINALLY allowed developers to pass an actual table of rows to a stored procedure. Now, it does require a few steps. We can't just pass any old table to a procedure. It has to be a pre-defined type (a template). So let's suppose we always want to pass a set of integer keys to different procedures. One day it might be a list of vendor keys. The next day it might be a list of customer keys. So we can create a generic table type of keys, one that can be instantiated for customer keys, vendor keys, etc.:

CREATE TYPE IntKeysTT AS TABLE
    ( IntKey int NOT NULL )

So I've created a Table Type called IntKeysTT. It's defined to have one column, an IntKey. Now suppose I want to load it with Vendors who have a Credit Rating of 1, and then take that list of Vendor keys and pass it to a procedure:

DECLARE @VendorList IntKeysTT
INSERT INTO @VendorList
    SELECT BusinessEntityID FROM Purchasing.Vendor WHERE CreditRating = 1

So, I now have a table type variable, not just any table variable, but a table type variable (that I populated the same way I would populate a normal table variable). It's in server memory (unless it needs to spill to tempDB) and is therefore private to the connection/process. OK, can I pass it to the stored procedure now? Well, not yet; we need to modify the procedure to receive a table type. Here's the code:

create procedure dbo.UpdateVendorOrdersFromTT
    @IntKeysTT IntKeysTT READONLY
AS
    update Purchasing.PurchaseOrderHeader
    set Freight = Freight + 1
    FROM Purchasing.PurchaseOrderHeader
        JOIN @IntKeysTT TempVendorList
            ON PurchaseOrderHeader.VendorID = TempVendorList.IntKey

Notice how the procedure receives the @IntKeysTT parameter as a Table Type (again, not just a regular table, but a table type). It also receives it as a READONLY parameter. You CANNOT modify the contents of this table type inside the procedure. Usually you won't want to; you simply want to read from it. Well, now you can reference the table type as a parameter and then utilize it in the JOIN statement, as you would any other table variable. So there you have it. A bit of work to set up the table type, but in my view, definitely worth it. Additionally, if you pass values from .NET, you're in luck. You can pass an ADO data table (with the same tablename property as the name of the Table Type) to the procedure. For developers who have had to pass CSV lists, XML strings, etc. to a procedure in the past, this is a huge benefit.
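Putting the pieces together, a minimal end-to-end run of the table-type approach looks something like the following sketch (the names follow the examples above, and the freight adjustment is the reconstructed one from the earlier listing):

DECLARE @VendorList IntKeysTT;

INSERT INTO @VendorList (IntKey)
SELECT BusinessEntityID
FROM   Purchasing.Vendor
WHERE  CreditRating = 1;

-- One set-based call instead of one procedure call per vendor
EXEC dbo.UpdateVendorOrdersFromTT @IntKeysTT = @VendorList;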
Finally, I want to talk about another approach people have used over the years: SQL Server cursors. At the risk of sounding dogmatic, I strongly advise against cursors, unless there is just no other way. Cursors are expensive operations in the server. For instance, someone might use a cursor approach and implement the solution this way:

DECLARE @VendorID int
DECLARE dbcursor CURSOR FAST_FORWARD FOR
    SELECT BusinessEntityID FROM Purchasing.Vendor WHERE CreditRating = 1
FETCH NEXT FROM dbcursor INTO @VendorID
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC dbo.UpdateVendorOrders @VendorID
    FETCH NEXT FROM dbcursor INTO @VendorID
END
CLOSE dbcursor
DEALLOCATE dbcursor

The best thing I'll say about this is that it works. And yes, getting something to work is a milestone. But getting something to work and getting something to work acceptably are two different things. Even if this process only takes 5-10 seconds to run, in those 5-10 seconds the cursor utilizes SQL Server resources quite heavily. That's not a good idea in a large production environment. Additionally, the more rows there are in the cursor to fetch and the greater the number of executions of the procedure, the slower it will be. When I ran both processes (the cursor approach and then the table type approach) against a small sampling of vendors (5 vendors), the processing times were 260 ms and 60 ms, respectively. So the table type approach was roughly 4 times faster. But when I ran the two scenarios against a much larger number of vendors (84 vendors), the difference was staggering: 6701 ms versus 207 ms, respectively. So the table type approach was roughly 32 times faster. Again, the CURSOR approach is definitely the least attractive approach. Even in SQL Server 2005, it would have been better to create a CSV list or an XML string (providing the number of keys could be stored in a scalar variable). But now that there is a Table Type feature in SQL Server 2008, you can achieve the objective with a feature that's more closely modeled to the way developers are thinking; specifically, how do we pass a table to a procedure? Now we have an answer. Hope you find this feature helpful. Feel free to post a comment.
Data Warehousing - Quick Guide

Data Warehousing - Overview
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts to take informed decisions in an organization. An operational database undergoes frequent changes on a daily basis on account of the transactions that take place. Suppose a business executive wants to analyze previous feedback on any data such as a product, a supplier, or any consumer data; the executive will have no data available to analyze, because the previous data has been updated due to transactions. A data warehouse provides us generalized and consolidated data in a multidimensional view. Along with a generalized and consolidated view of data, a data warehouse also provides us Online Analytical Processing (OLAP) tools. These tools help us in interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining. Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance the interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.

Understanding a Data Warehouse
A data warehouse is a database which is kept separate from the organization's operational database. There is no frequent updating done in a data warehouse. It possesses consolidated historical data, which helps the organization to analyze its business. A data warehouse helps executives to organize, understand, and use their data to take strategic decisions. Data warehouse systems help in the integration of a diversity of application systems. A data warehouse system helps in consolidated historical data analysis.

Why a Data Warehouse is Separated from Operational Databases
A data warehouse is kept separate from operational databases due to the following reasons: An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data. Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.
An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data. An operational database maintains current data. On the other hand, a data warehouse maintains historical data.

Data Warehouse Features
The key features of a data warehouse are discussed below:
Subject Oriented - A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations; rather, it focuses on modelling and analysis of data for decision making.
Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
Time Variant - The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from the historical point of view.
Non-volatile - Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.
Note: A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is physically stored separately from the operational database.

Data Warehouse Applications
As discussed before, a data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as a sole part of a plan-execute-assess closed-loop feedback system for enterprise management. Data warehouses are widely used in the following fields: financial services, banking services, consumer goods, retail sectors, and controlled manufacturing.

Types of Data Warehouse Applications
Information processing, analytical processing, and data mining are the three types of data warehouse applications that are discussed below:
Information Processing - A data warehouse allows us to process the data stored in it. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.
Analytical Processing - A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill down, drill up, and pivoting.
Data Mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.

Data Warehousing - Concepts
What is Data Warehousing?
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.

Using Data Warehouse Information
There are decision support technologies that help utilize the data available in a data warehouse. These technologies help executives use the warehouse quickly and effectively. They can gather data, analyze it, and take decisions based on the information present in the warehouse.
The information gathered in a warehouse can be used in any of the following domains:
Tuning Production Strategies - Product strategies can be well tuned by repositioning the products and managing the product portfolios by comparing sales quarterly or yearly.
Customer Analysis - Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles, etc.
Operations Analysis - Data warehousing also helps in customer relationship management and making environmental corrections. The information also allows us to analyze business operations.

Integrating Heterogeneous Databases
To integrate heterogeneous databases, we have two approaches: the query-driven approach and the update-driven approach.
Query-Driven Approach: This is the traditional approach to integrating heterogeneous databases. It was used to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators. When a query is issued to a client side, a metadata dictionary translates the query into an appropriate form for the individual heterogeneous sites involved. These queries are then mapped and sent to the local query processor, and the results from the heterogeneous sites are integrated into a global answer set. Disadvantages: the query-driven approach needs complex integration and filtering processes, it is very inefficient and very expensive for frequent queries, and it is also very expensive for queries that require aggregations.
Update-Driven Approach: This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis. Advantages: this approach provides high performance; the data is copied, processed, integrated, annotated, summarized, and restructured in a semantic data store in advance; and query processing does not require an interface to process data at local sources.

Functions of Data Warehouse Tools and Utilities
The following are the functions of data warehouse tools and utilities:
Data Extraction - Involves gathering data from multiple heterogeneous sources.
Data Cleaning - Involves finding and correcting the errors in data.
Data Transformation - Involves converting the data from legacy format to warehouse format.
Data Loading - Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
Refreshing - Involves updating from data sources to warehouse.
Note: Data cleaning and data transformation are important steps in improving the quality of data and of data mining results.

Data Warehousing - Terminologies
In this chapter, we will discuss some of the most commonly used terms in data warehousing. Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, we can define metadata as follows: metadata is a road-map to the data warehouse; metadata in a data warehouse defines the warehouse objects; and metadata acts as a directory.
This directory helps the decision support system to locate the contents of the data warehouse.

Metadata Repository
A metadata repository is an integral part of a data warehouse system. It contains the following metadata:
Business metadata - It contains the data ownership information, business definitions, and changing policies.
Operational metadata - It includes the currency of data and data lineage. Currency of data refers to the data being active, archived, or purged. Lineage of data means the history of the data as migrated and the transformations applied to it.
Data for mapping from the operational environment to the data warehouse - This metadata includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, and data refresh and purging rules.
The algorithms for summarization - This includes dimension algorithms, data on granularity, aggregation, summarizing, etc.

Data Cube
A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise preserves the records. Suppose a company wants to keep track of sales records with the help of a sales data warehouse with respect to time, item, branch, and location. These dimensions allow it to keep track of monthly sales and the branches at which the items were sold. There is a table associated with each dimension, known as a dimension table. For example, the item dimension table may have attributes such as itemname, itemtype, and itembrand. A 2-D view of the sales data shows records with respect to time and item only; for example, the sales for New Delhi shown with respect to time and item dimensions according to the type of items sold. If we want to view the sales data with one more dimension, say, the location dimension, then the 3-D view is useful, and the 3-D view of sales with respect to time, item, and location can be represented as a 3-D data cube.

Data Mart
Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization. In other words, a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects. Points to remember about data marts: Windows-based or Unix/Linux-based servers are used to implement data marts, and they are implemented on low-cost servers. The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years. The life cycle of data marts may be complex in the long run if their planning and design are not organization-wide. Data marts are small in size, customized by department, and flexible. The source of a data mart is a departmentally structured data warehouse.

Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse, but it requires excess capacity on the operational database servers.

Data Warehousing - Delivery Process
A data warehouse is never static; it evolves as the business expands.
As the business evolves, its requirements keep changing, and therefore a data warehouse must be designed to ride with these changes. Hence a data warehouse system needs to be flexible. Ideally there should be a delivery process to deliver a data warehouse. However, data warehouse projects normally suffer from various issues that make it difficult to complete tasks and deliverables in the strict and ordered fashion demanded by the waterfall method. Most of the time, the requirements are not understood completely. The architectures, designs, and build components can be completed only after gathering and studying all the requirements.

Delivery Method
The delivery method is a variant of the joint application development approach, adopted for the delivery of a data warehouse. We have staged the data warehouse delivery process to minimize risks. The approach that we will discuss here does not reduce the overall delivery time-scales, but ensures the business benefits are delivered incrementally through the development process.
Note: The delivery process is broken into phases to reduce the project and delivery risk. The stages in the delivery process are discussed below.

IT Strategy
Data warehouses are strategic investments that require a business process to generate benefits. An IT strategy is required to procure and retain funding for the project.

Business Case
The objective of the business case is to estimate the business benefits that should be derived from using a data warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly stated. If a data warehouse does not have a clear business case, then the business tends to suffer from credibility problems at some stage during the delivery process. Therefore, in data warehouse projects, we need to understand the business case for investment.

Education and Prototyping
Organizations experiment with the concept of data analysis and educate themselves on the value of having a data warehouse before settling on a solution. This is addressed by prototyping, which helps in understanding the feasibility and benefits of a data warehouse. Prototyping activity on a small scale can promote the educational process as long as: the prototype addresses a defined technical objective; the prototype can be thrown away after the feasibility concept has been shown; the activity addresses a small subset of the eventual data content of the data warehouse; and the activity timescale is non-critical. The following points are to be kept in mind to produce an early release and deliver business benefits: identify an architecture that is capable of evolving; focus on the business requirements and technical blueprint phases; limit the scope of the first build phase to the minimum that delivers business benefits; and understand the short-term and medium-term requirements of the data warehouse.

Business Requirements
To provide quality deliverables, we should make sure the overall requirements are understood. If we understand the business requirements for both the short term and the medium term, then we can design a solution that fulfils the short-term requirements. The short-term solution can then be grown into a full solution. The following aspects are determined in this stage: the business rules to be applied on the data; the logical model for information within the data warehouse; the query profiles for the immediate requirement; and the source systems that provide this data.
Technical Blueprint
This phase needs to deliver an overall architecture that satisfies the long-term requirements. It also delivers the components that must be implemented in the short term to derive any business benefit. The blueprint needs to identify the following: the overall system architecture; the data retention policy; the backup and recovery strategy; the server and data mart architecture; the capacity plan for hardware and infrastructure; and the components of the database design.

Building the Version
In this stage, the first production deliverable is produced. This production deliverable is the smallest component of the data warehouse, and this smallest component adds business benefit.

History Load
This is the phase where the remainder of the required history is loaded into the data warehouse. In this phase, we do not add new entities, but additional physical tables would probably be created to store the increased data volumes. Let us take an example. Suppose the build version phase has delivered a retail sales analysis data warehouse with 2 months' worth of history. This information allows the user to analyze only the recent trends and address short-term issues. The user in this case cannot identify annual and seasonal trends. To help them do so, the last 2 years' sales history could be loaded from the archive. Now the 40GB of data is extended to 400GB.
Note: The backup and recovery procedures may become complex; therefore it is recommended to perform this activity within a separate phase.

Ad hoc Query
In this phase, we configure an ad hoc query tool that is used to operate the data warehouse. These tools can generate the database query.
Note: It is recommended not to use these access tools when the database is being substantially modified.

Automation
In this phase, the operational management processes are fully automated. These would include: transforming the data into a form suitable for analysis; monitoring query profiles and determining the appropriate aggregations to maintain system performance; extracting and loading data from different source systems; generating aggregations from predefined definitions within the data warehouse; and backing up, restoring, and archiving the data.

Extending Scope
In this phase, the data warehouse is extended to address a new set of business requirements. The scope can be extended in two ways: by loading additional data into the data warehouse, or by introducing new data marts using the existing information.
Note: This phase should be performed separately, since it involves substantial effort and complexity.

Requirements Evolution
From the perspective of the delivery process, the requirements are always changeable. They are not static. The delivery process must support this and allow these changes to be reflected within the system. This issue is addressed by designing the data warehouse around the use of data within business processes, as opposed to the data requirements of existing queries. The architecture is designed to change and grow to match the business needs. The process operates as a pseudo-application development process, where the new requirements are continually fed into the development activities and partial deliverables are produced. These partial deliverables are fed back to the users and then reworked, ensuring that the overall system is continually updated to meet the business needs.
Data Warehousing - System Processes
We have a fixed number of operations to be applied on operational databases, and we have well-defined techniques such as using normalized data, keeping tables small, etc. These techniques are suitable for delivering a solution. But in the case of decision-support systems, we do not know what queries and operations will need to be executed in the future. Therefore, the techniques applied to operational databases are not suitable for data warehouses. In this chapter, we will discuss how to build data warehousing solutions on top of open-system technologies like Unix and relational databases.

Process Flow in Data Warehouse
There are four major processes that contribute to a data warehouse:
Extract and load the data.
Clean and transform the data.
Back up and archive the data.
Manage queries and direct them to the appropriate data sources.

Extract and Load Process
Data extraction takes data from the source systems. Data load takes the extracted data and loads it into the data warehouse.

Note: Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.

Controlling the Process
Controlling the process involves determining when to start data extraction and the consistency checks on the data. The controlling process ensures that the tools, the logic modules, and the programs are executed in the correct sequence and at the correct time.

When to Initiate Extract
Data needs to be in a consistent state when it is extracted, i.e. the data warehouse should represent a single, consistent version of the information to the user. For example, in a customer profiling data warehouse in the telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding the customers for whom there are no associated subscriptions.

Loading the Data
After extracting the data, it is loaded into a temporary data store where it is cleaned up and made consistent.

Note: Consistency checks are executed only when all the data sources have been loaded into the temporary data store.

Clean and Transform Process
Once the data is extracted and loaded into the temporary data store, it is time to perform cleaning and transforming. Here is the list of steps involved:
Clean and transform the loaded data into a structure.
Partition the data.
Aggregate the data.

Clean and Transform the Loaded Data into a Structure
Cleaning and transforming the loaded data helps speed up the queries. It can be done by making the data consistent:
within itself;
with other data within the same data source;
with the data in other source systems;
with the existing data present in the warehouse.
Transforming involves converting the source data into a structure. Structuring the data increases the query performance and decreases the operational cost. The data contained in a data warehouse must be transformed to support performance requirements and control the ongoing operational costs.

Partition the Data
Partitioning optimizes the hardware performance and simplifies the management of the data warehouse. Here we partition each fact table into multiple separate partitions.

Aggregation
Aggregation is required to speed up common queries. Aggregation relies on the fact that most common queries will analyze a subset or an aggregation of the detailed data.
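As a rough illustration of this process flow, the following SQL sketch stages extracted data in a temporary store, runs a simple consistency check, then transforms the cleaned rows into the warehouse structure and builds an aggregation. All table and column names (stg_sales, dim_product, dim_date, fact_sales, agg_sales_by_month) are hypothetical and stand in for whatever the real solution uses.

-- Temporary data store for the extracted rows
CREATE TABLE stg_sales (
    sale_id      INT,
    sale_date    DATE,
    product_code VARCHAR(20),
    amount       DECIMAL(12,2)
);

-- Consistency check: run only after all sources have been loaded into the store
SELECT COUNT(*) AS orphan_rows
FROM stg_sales s
LEFT JOIN dim_product p ON p.product_code = s.product_code
WHERE p.product_code IS NULL;

-- Clean and transform into the warehouse structure
INSERT INTO fact_sales (date_key, product_key, sales_amount)
SELECT d.date_key, p.product_key, s.amount
FROM stg_sales s
JOIN dim_date d    ON d.calendar_date = s.sale_date
JOIN dim_product p ON p.product_code  = s.product_code;

-- Aggregation to speed up common queries
INSERT INTO agg_sales_by_month (year_month, product_key, total_amount)
SELECT d.year_month, f.product_key, SUM(f.sales_amount)
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.year_month, f.product_key;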
Backup and Archive the Data
In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to keep regular backups. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required. For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data kept online. In such a scenario, there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case, we require some data to be restored from the archive.

Query Management Process
This process performs the following functions:
manages the queries;
helps speed up the execution time of queries;
directs the queries to their most effective data sources;
ensures that all the system resources are used in the most effective way;
monitors actual query profiles.
The information generated in this process is used by the warehouse management process to determine which aggregations to generate. This process does not generally operate during the regular load of information into the data warehouse.

Data Warehousing - Architecture
In this chapter, we will discuss the business analysis framework for data warehouse design and the architecture of a data warehouse.

Business Analysis Framework
Business analysts get information from the data warehouse to measure performance and make critical adjustments in order to win against other businesses in the market. Having a data warehouse offers the following advantages:
Since a data warehouse can gather information quickly and efficiently, it can enhance business productivity.
A data warehouse provides a consistent view of customers and items; hence, it helps us manage customer relationships.
A data warehouse also helps in bringing down costs by tracking trends and patterns over a long period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and analyze the business needs and construct a business analysis framework. Each person has different views regarding the design of a data warehouse. These views are as follows:
The top-down view - This view allows the selection of relevant information needed for a data warehouse.
The data source view - This view presents the information being captured, stored, and managed by the operational system.
The data warehouse view - This view includes the fact tables and dimension tables. It represents the information stored inside the data warehouse.
The business query view - It is the view of the data from the viewpoint of the end-user.

Three-Tier Data Warehouse Architecture
Generally a data warehouse adopts a three-tier architecture. Following are the three tiers of the data warehouse architecture:
Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the extract, clean, load, and refresh functions.
Middle Tier - In the middle tier, we have the OLAP server, which can be implemented in either of the following ways:
By Relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps the operations on multidimensional data to standard relational operations.
By the Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional data and operations.
Top Tier - This tier is the front-end client layer. This layer holds the query tools, reporting tools, analysis tools, and data mining tools.
The following diagram depicts the three-tier architecture of a data warehouse:

Data Warehouse Models
From the perspective of data warehouse architecture, we have the following data warehouse models:
Virtual Warehouse
Data Mart
Enterprise Warehouse

Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse, but doing so requires excess capacity on the operational database servers.

Data Mart
A data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of an organization. In other words, we can say that data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects. Points to remember about data marts:
Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.
The life cycle of a data mart may be complex in the long run if its planning and design are not organization-wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is a departmentally structured data warehouse.
Data marts are flexible.

Enterprise Warehouse
An enterprise warehouse collects all the information and the subjects spanning an entire organization. It provides enterprise-wide data integration. The data is integrated from operational systems and external information providers. This information can vary from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

Load Manager
This component performs the operations required to extract and load data. The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.

Load Manager Architecture
The load manager performs the following functions:
Extract the data from the source system.
Fast-load the extracted data into a temporary data store.
Perform simple transformations into a structure similar to the one in the data warehouse.

Extract Data from Source
The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.

In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time. Transformations affect the speed of data processing, so it is more effective to load the data into a relational database prior to applying transformations and checks. Gateway technology proves to be unsuitable here, since gateways tend not to perform well when large data volumes are involved.

Simple Transformations
While loading, it may be required to perform simple transformations. After this has been completed, we are in a position to do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
Convert all the values to the required data types.
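A minimal sketch of these two simple transformations is shown below. The staging table layouts and column names (stg_epos_raw, stg_epos_clean, cashier_id) are hypothetical; the idea is only that the raw extract is fast-loaded as-is, and then unneeded columns are dropped and values converted to the required types.

-- Fast-load the raw EPOS extract into a temporary store (all columns as text)
CREATE TABLE stg_epos_raw (
    till_id    VARCHAR(10),
    sale_time  VARCHAR(30),
    sku        VARCHAR(20),
    qty        VARCHAR(10),
    price      VARCHAR(15),
    cashier_id VARCHAR(10)   -- example of a column not required in the warehouse
);

-- Narrower staging table holding only what the warehouse needs
CREATE TABLE stg_epos_clean (
    sale_time TIMESTAMP,
    sku       VARCHAR(20),
    qty       INT,
    price     DECIMAL(10,2)
);

-- Simple transformation: strip unneeded columns and convert data types
INSERT INTO stg_epos_clean (sale_time, sku, qty, price)
SELECT CAST(sale_time AS TIMESTAMP),
       sku,
       CAST(qty   AS INT),
       CAST(price AS DECIMAL(10,2))
FROM stg_epos_raw;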
Warehouse Manager
A warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of warehouse managers varies between specific solutions.

Warehouse Manager Architecture
A warehouse manager includes the following:

Operations Performed by Warehouse Manager
A warehouse manager:
Analyzes the data to perform consistency and referential integrity checks.
Creates indexes, business views, and partition views against the base data.
Generates new aggregations and updates existing aggregations.
Generates normalizations.
Transforms and merges the source data into the published data warehouse.
Backs up the data in the data warehouse.
Archives the data that has reached the end of its captured life.

Note: A warehouse manager also analyzes query profiles to determine whether indexes and aggregations are appropriate.

Query Manager
The query manager is responsible for directing the queries to the suitable tables. By directing the queries to the appropriate tables, the speed of querying and response generation can be increased. The query manager is also responsible for scheduling the execution of the queries posed by the user.

Query Manager Architecture
The following diagram shows the architecture of a query manager. It includes the following:
Query redirection via C tool or RDBMS
Stored procedures
Query management tool
Query scheduling via C tool or RDBMS
Query scheduling via third-party software

Detailed Information
Detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape. The detailed information part of the data warehouse keeps the detailed information in the starflake schema. Detailed information is loaded into the data warehouse to supplement the aggregated data. The following diagram shows a pictorial impression of where detailed information is stored and how it is used.

Note: If detailed information is held offline to minimize disk storage, we should make sure that the data has been extracted, cleaned up, and transformed into the starflake schema before it is archived.

Summary Information
Summary information is a part of the data warehouse that stores predefined aggregations. These aggregations are generated by the warehouse manager. Summary information must be treated as transient: it changes on the go in order to respond to changing query profiles. Points to remember about summary information:
Summary information speeds up the performance of common queries.
It increases the operational cost.
It needs to be updated whenever new data is loaded into the data warehouse.
It may not need to be backed up, since it can be generated fresh from the detailed information.

Data Warehousing - OLAP
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get an insight into the information through fast, consistent, and interactive access to information. This chapter covers the types of OLAP, operations on OLAP, and the differences between OLAP, statistical databases, and OLTP.

Types of OLAP Servers
We have four types of OLAP servers:
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
Specialized SQL Servers

Relational OLAP
ROLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS. ROLAP includes the following:
Implementation of aggregation navigation logic.
Optimization for each DBMS back end.
Additional tools and services.

Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Hybrid OLAP (HOLAP)
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow large volumes of detailed information to be stored; the aggregations are stored separately in the MOLAP store.

Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data. Here is the list of OLAP operations:

Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
The following diagram illustrates how roll-up works. Roll-up is performed by climbing up a concept hierarchy for the dimension location. Initially the concept hierarchy was street < city < province < country. On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country; the data is grouped into countries rather than cities. When roll-up is performed, one or more dimensions from the data cube are removed.

Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:
By stepping down a concept hierarchy for a dimension
By introducing a new dimension
The following diagram illustrates how drill-down works. Drill-down is performed by stepping down a concept hierarchy for the dimension time. Initially the concept hierarchy was day < month < quarter < year. On drilling down, the time dimension is descended from the level of quarter to the level of month. When drill-down is performed, one or more dimensions are added to the data cube. It navigates the data from less detailed data to highly detailed data.

Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram that shows how slice works. Here, slice is performed for the dimension time using the criterion time = Q1. It forms a new sub-cube by selecting one or more dimensions.

Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram that shows the dice operation. The dice operation on the cube involves three dimensions, based on the following selection criteria:
(location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "Mobile" or "Modem")

Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. Consider the following diagram that shows the pivot operation. In this example, the item and location axes of the 2-D slice are rotated.

OLAP vs OLTP

Data Warehousing - Relational OLAP
Relational OLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage the warehouse data, relational OLAP uses a relational or extended-relational DBMS.
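Because a ROLAP server maps multidimensional operations onto standard relational operations, the OLAP operations described above can be pictured as plain SQL. The sketch below is only an illustration of that mapping, using hypothetical tables (fact_sales, dim_date, dim_location, dim_item); it is not the SQL a real OLAP server would necessarily generate.

-- Roll-up: climb the location hierarchy from city to country
SELECT l.country, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_location l ON l.location_key = f.location_key
GROUP BY l.country;

-- Drill-down: step down the time hierarchy from quarter to month
SELECT d.quarter, d.month, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.quarter, d.month;

-- Slice: fix one dimension (time = Q1)
SELECT l.city, i.item_type, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d     ON d.date_key     = f.date_key
JOIN dim_location l ON l.location_key = f.location_key
JOIN dim_item i     ON i.item_key     = f.item_key
WHERE d.quarter = 'Q1'
GROUP BY l.city, i.item_type;

-- Dice: select on three dimensions at once
SELECT l.city, d.quarter, i.item_type, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d     ON d.date_key     = f.date_key
JOIN dim_location l ON l.location_key = f.location_key
JOIN dim_item i     ON i.item_key     = f.item_key
WHERE l.city IN ('Toronto', 'Vancouver')
  AND d.quarter IN ('Q1', 'Q2')
  AND i.item_type IN ('Mobile', 'Modem')
GROUP BY l.city, d.quarter, i.item_type;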
ROLAP includes the following:
Implementation of aggregation navigation logic
Optimization for each DBMS back end
Additional tools and services

Points to Remember
ROLAP servers are highly scalable.
ROLAP tools analyze large volumes of data across multiple dimensions.
ROLAP tools store and analyze highly volatile and changeable data.

Relational OLAP Architecture
ROLAP includes the following components:

Advantages
ROLAP servers can be easily used with existing RDBMSs.
Data can be stored efficiently, since no zero facts need to be stored.
ROLAP tools do not use pre-calculated data cubes.
The DSS server of MicroStrategy adopts the ROLAP approach.

Disadvantages
Poor query performance.
Some limitations of scalability, depending on the technology architecture that is utilized.

Data Warehousing - Multidimensional OLAP
Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Points to Remember
MOLAP tools process information with a consistent response time regardless of the level of summarization or the calculations selected.
MOLAP tools avoid many of the complexities of creating a relational database to store data for analysis.
MOLAP tools need the fastest possible performance.
A MOLAP server adopts two levels of storage representation to handle dense and sparse data sets: denser sub-cubes are identified and stored as array structures, while sparse sub-cubes employ compression technology.

MOLAP Architecture
MOLAP includes the following components:

Advantages
MOLAP allows the fastest indexing to the pre-computed summarized data.
It helps users connected to a network who need to analyze larger, less-defined data.
It is easier to use, so MOLAP is suitable for inexperienced users.

Disadvantages
MOLAP is not capable of containing detailed data.
The storage utilization may be low if the data set is sparse.

MOLAP vs ROLAP
The two approaches differ in several respects, including how strong the DBMS facility is.

Data Warehousing - Schemas
A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema to be maintained. A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. In this chapter, we will discuss the schemas used in a data warehouse.

Star Schema
Each dimension in a star schema is represented with only one dimension table, and this dimension table contains the set of attributes. The following diagram shows the sales data of a company with respect to four dimensions, namely time, item, branch, and location. There is a fact table at the center. It contains the keys to each of the four dimensions. The fact table also contains the measures, namely dollars sold and units sold.

Note: Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains an attribute set such as street, city, province_or_state, and country. This constraint may cause data redundancy. For example, Vancouver and Victoria are both cities in the Canadian province of British Columbia, so the entries for such cities cause data redundancy along the attributes province_or_state and country.

Snowflake Schema
Some dimension tables in the snowflake schema are normalized.
The normalization splits up the data into additional tables. Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables. Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key. The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.

Note: Due to normalization in the snowflake schema, the redundancy is reduced; therefore, the schema becomes easier to maintain and saves storage space.

Fact Constellation Schema
A fact constellation has multiple fact tables. It is also known as a galaxy schema. The following diagram shows two fact tables, namely sales and shipping. The sales fact table is the same as that in the star schema. The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location. The shipping fact table also contains two measures, namely dollars sold and units sold. It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Schema Definition
A multidimensional schema is defined using the Data Mining Query Language (DMQL). Its two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts. The star schema discussed above, the snowflake schema, and the fact constellation schema can all be expressed with these two DMQL primitives.

Data Warehousing - Partitioning Strategy
Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also helps in balancing the various requirements of the system. It optimizes the hardware performance and simplifies the management of the data warehouse by splitting each fact table into multiple separate partitions. In this chapter, we will discuss different partitioning strategies.

Why is it Necessary to Partition?
Partitioning is important for the following reasons:
For easy management
To assist backup/recovery
To enhance performance

For Easy Management
The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge size of the fact table is very hard to manage as a single entity. Therefore it needs partitioning.

To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the data. Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time to load and also enhances the performance of the system.

Note: To cut down on the backup size, all partitions other than the current partition can be marked as read-only. We can then put these partitions into a state where they cannot be modified, and back them up once. This means only the current partition needs to be backed up regularly.

To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query performance improves because the query now scans only those partitions that are relevant; it does not have to scan the whole data.
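As a rough illustration of time-based partitioning and of the "only the current partition changes" idea, here is a SQL Server-style sketch. The object names and boundary values (pf_sales_month, ps_sales_month, fact_sales, an integer date_key of the form YYYYMMDD) are hypothetical, and the exact syntax for partitioning varies between database products.

-- Partition function: monthly boundaries on an integer date key
CREATE PARTITION FUNCTION pf_sales_month (INT)
AS RANGE RIGHT FOR VALUES (20090201, 20090301, 20090401);

-- Partition scheme: map every partition to the PRIMARY filegroup for simplicity
-- (older partitions can instead be placed on filegroups that are later marked
--  read-only, so that only the current partition needs regular backup)
CREATE PARTITION SCHEME ps_sales_month
AS PARTITION pf_sales_month ALL TO ([PRIMARY]);

-- Create the fact table on the partition scheme
CREATE TABLE fact_sales
(
    date_key     INT NOT NULL,
    product_key  INT NOT NULL,
    sales_amount DECIMAL(12,2)
) ON ps_sales_month (date_key);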
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to keep in mind the requirements for the manageability of the data warehouse.

Partitioning by Time into Equal Segments
In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each time period represents a significant retention period within the business. For example, if the user queries for month-to-date data, then it is appropriate to partition the data into monthly segments. We can reuse the partitioned tables by removing the data in them.

Partitioning by Time into Different-sized Segments
This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and a larger partition for inactive data. The detailed information remains available online, and the number of physical tables is kept relatively small, which reduces the operating cost. This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is required. It is not useful where the partitioning profile changes on a regular basis, because repartitioning will increase the operating cost of the data warehouse.

Partition on a Different Dimension
The fact table can also be partitioned on the basis of dimensions other than time, such as product group, region, supplier, or any other dimension. Let's take an example. Suppose a market function has been structured into distinct regional departments, for example on a state-by-state basis. If each region wants to query information captured within its own region, it would prove more effective to partition the fact table into regional partitions. The queries speed up because they do not have to scan information that is not relevant. This technique is appropriate only where the dimension is unlikely to change in the future; if the dimension changes, the entire fact table would have to be repartitioned, so it is worth confirming that the dimension will not change.

Note: We recommend performing the partitioning only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.

Partition by Size of Table
When there is no clear basis for partitioning the fact table on any dimension, then we should partition the fact table on the basis of its size. We can set a predetermined size as a critical point: when the table exceeds the predetermined size, a new table partition is created. This partitioning is complex to manage and requires metadata to identify what data is stored in each partition.

Partitioning Dimensions
If a dimension contains a large number of entries, then it is required to partition the dimension. Here we have to check the size of the dimension. Consider a large design that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may become very large. This would definitely affect the response time.

Round Robin Partitions
In the round-robin technique, when a new partition is needed, the old one is archived. Metadata is used to allow the user access tool to refer to the correct table partition. This technique makes it easy to automate table management facilities within the data warehouse.

Vertical Partition
Vertical partitioning splits the data vertically.
The following diagrams depict how vertical partitioning is done. Vertical partitioning can be performed in the following two ways: normalization and row splitting.

Normalization
Normalization is the standard relational method of database organization. In this method, repeated values are collapsed so that each is stored only once in a separate table, which reduces space. Take a look at the tables before and after normalization to see how this is performed.

Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up access to a large table by reducing its size.

Note: While using vertical partitioning, make sure that there is no requirement to perform a major join operation between the two partitions.

Identify Key to Partition
It is very crucial to choose the right partition key. Choosing a wrong partition key will lead to reorganizing the fact table. Let's take an example. Suppose we want to partition the following table. We can choose to partition on any key; the two possible keys here are region and transaction_date. Suppose the business is organized in 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that the vast majority of queries are restricted to the user's own business region. If we partition by transaction_date instead of region, then the latest transactions from every region will be in one partition, and the user who wants to look at data within his own region has to query across multiple partitions. Hence it is worth determining the right partitioning key.

Data Warehousing - Metadata Concepts
What is Metadata?
Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, we can define metadata as follows:
Metadata is the road-map to a data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data warehouse.

Note: In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this metadata, additional metadata is also created for time-stamping any extracted data and recording the source of the extracted data.

Categories of Metadata
Metadata can be broadly categorized into three categories:
Business Metadata - It has the data ownership information, business definitions, and changing policies.
Technical Metadata - It includes database system names, table and column names and sizes, data types, and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.
Operational Metadata - It includes the currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data migrated and the transformations applied to it.

Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a warehouse is different from that of the warehouse data, yet it plays an important role. The various roles of metadata are explained below.
Metadata acts as a directory. This directory helps the decision support system to locate the contents of the data warehouse.
Metadata helps the decision support system in mapping data when it is transformed from the operational environment to the data warehouse environment.
Metadata helps in summarization between current detailed data and highly summarized data.
Metadata also helps in summarization between lightly detailed data and highly summarized data.
Metadata is used in query tools.
Metadata is used in extraction and cleansing tools.
Metadata is used in reporting tools.
Metadata is used in transformation tools.
Metadata plays an important role in loading functions.
The following diagram shows the roles of metadata.

Metadata Repository
The metadata repository is an integral part of a data warehouse system. It holds the following metadata:
Definition of the data warehouse - It includes the description of the structure of the data warehouse. The description is defined by schema, view, hierarchies, derived data definitions, and data mart locations and contents.
Business metadata - It contains the data ownership information, business definitions, and changing policies.
Operational metadata - It includes the currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data migrated and the transformations applied to it.
Data for mapping from the operational environment to the data warehouse - It includes the source databases and their contents, data extraction, data partitioning and cleaning, transformation rules, and data refresh and purging rules.
Algorithms for summarization - It includes dimension algorithms, data on granularity, aggregation, summarizing, etc.

Challenges for Metadata Management
The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of reports, validates data transformations, and ensures the accuracy of calculations. Metadata also enforces the definition of business terms for business end-users. With all these uses, metadata also has its challenges. Some of the challenges are discussed below.
Metadata in a big organization is scattered across the organization. This metadata is spread in spreadsheets, databases, and applications.
Metadata could be present in text files or multimedia files. To use this data for information management solutions, it has to be correctly defined.
There are no industry-wide accepted standards, and data management solution vendors have a narrow focus.
There are no easy and accepted methods of passing metadata.

Data Warehousing - Data Marting
Why Do We Need a Data Mart?
Listed below are the reasons to create a data mart:
To partition data in order to impose access control strategies.
To speed up the queries by reducing the volume of data to be scanned.
To segment data onto different hardware platforms.
To structure data in a form suitable for a user access tool.

Note: Do not create a data mart for any other reason, since the operational cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.

Cost-effective Data Marting
Follow the steps given below to make data marting cost-effective:
Identify the functional splits
Identify user access tool requirements
Identify access control issues

Identify the Functional Splits
In this step, we determine whether the organization has natural functional splits. We look for departmental splits, and we determine whether the way in which departments use information tends to be in isolation from the rest of the organization. Let's take an example.
Consider a retail organization where each merchant is accountable for maximizing the sales of a group of products. For this, the following information is valuable:
sales transactions on a daily basis
sales forecasts on a weekly basis
stock position on a daily basis
stock movements on a daily basis
As the merchant is not interested in the products they are not dealing with, the data mart is a subset of the data dealing with the product group of interest. The following diagram shows data marting for different users.
Given below are the issues to be taken into account while determining the functional split:
The structure of the department may change.
The products might switch from one department to another.
The merchant could query the sales trends of other products to analyze what is happening to the sales.

Note: We need to determine the business benefits and technical feasibility of using a data mart.

Identify User Access Tool Requirements
We need data marts to support user access tools that require internal data structures. The data in such structures is outside the control of the data warehouse but needs to be populated and updated on a regular basis. There are some tools that populate directly from the source system, but some cannot. Therefore additional requirements outside the scope of the tool need to be identified for the future.

Note: In order to ensure consistency of data across all access tools, the data should not be directly populated from the data warehouse; rather, each tool must have its own data mart.

Identify Access Control Issues
There should be privacy rules to ensure the data is accessed by authorized users only. For example, a data warehouse for a retail banking institution ensures that all the accounts belong to the same legal entity. Privacy laws can force you to totally prevent access to information that is not owned by the specific bank. Data marts allow us to build a complete wall by physically separating data segments within the data warehouse. To avoid possible privacy problems, the detailed data can be removed from the data warehouse. We can create a data mart for each legal entity and load it via the data warehouse, with detailed account data.

Designing Data Marts
Data marts should be designed as a smaller version of the starflake schema within the data warehouse and should match the database design of the data warehouse. This helps in maintaining control over database instances. The summaries are data-marted in the same way as they would have been designed within the data warehouse. Summary tables help to utilize all dimension data in the starflake schema.

Cost of Data Marting
The cost measures for data marting are as follows:
Hardware and software cost
Network access
Time window constraints

Hardware and Software Cost
Although data marts are created on the same hardware, they require some additional hardware and software. To handle user queries, a data mart requires additional processing power and disk storage. If detailed data and the data mart exist within the data warehouse, then we would face additional cost to store and manage the replicated data.

Note: Data marting is more expensive than aggregations, therefore it should be used as an additional strategy and not as an alternative strategy.

Network Access
A data mart could be at a different location from the data warehouse, so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load process.
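Returning to the merchant example above, a minimal sketch of how such a departmental data mart might be populated from the warehouse is shown below. The table names, columns, and the 'Menswear' product group are hypothetical; the point is only that the mart is a narrow subset of the fact data, refreshed from the warehouse rather than from the source systems.

-- Departmental data mart: only the product group the merchant is responsible for
CREATE TABLE mart_menswear_sales (
    date_key     INT,
    item_key     INT,
    location_key INT,
    sales_amount DECIMAL(12,2)
);

-- Refresh the mart from the data warehouse during the mart load process
INSERT INTO mart_menswear_sales (date_key, item_key, location_key, sales_amount)
SELECT f.date_key, f.item_key, f.location_key, f.sales_amount
FROM fact_sales f
JOIN dim_item i ON i.item_key = f.item_key
WHERE i.product_group = 'Menswear';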
Time Window Constraints
The extent to which a data mart loading process will eat into the available time window depends on the complexity of the transformations and the data volumes being shipped. The determination of how many data marts are possible depends on:
Network capacity
Time window available
Volume of data being transferred
Mechanisms being used to insert data into a data mart

Data Warehousing - System Managers
System management is mandatory for the successful implementation of a data warehouse. The most important system managers are:
System configuration manager
System scheduling manager
System event manager
System database manager
System backup recovery manager

System Configuration Manager
The system configuration manager is responsible for the management of the setup and configuration of the data warehouse. The structure of the configuration manager varies from one operating system to another; on Unix, it also varies from vendor to vendor. Configuration managers have a single user interface, and this interface allows us to control all aspects of the system.

Note: The most important configuration tool is the I/O manager.

System Scheduling Manager
The system scheduling manager is responsible for the successful implementation of the data warehouse. Its purpose is to schedule ad hoc queries. Every operating system has its own scheduler with some form of batch control mechanism. The list of features a system scheduling manager must have is as follows:
Work across cluster or MPP boundaries
Deal with international time differences
Handle job failure
Handle multiple queries
Support job priorities
Restart or re-queue failed jobs
Notify the user or a process when a job is completed
Maintain the job schedules across system outages
Re-queue jobs to other queues
Support the stopping and starting of queues
Log queued jobs
Deal with inter-queue processing

Note: The above list can be used as evaluation parameters for the evaluation of a good scheduler.
Some important jobs that a scheduler must be able to handle are as follows:
Daily and ad hoc query scheduling
Execution of regular report requirements
Data load
Data processing
Index creation
Backup
Aggregation creation
Data transformation

Note: If the data warehouse is running on a cluster or MPP architecture, then the system scheduling manager must be capable of running across the architecture.

System Event Manager
The event manager is a kind of software that manages the events defined on the data warehouse system. We cannot manage the data warehouse manually because the structure of the data warehouse is very complex, so we need a tool that automatically handles all the events without any intervention from the user.

Note: The event manager monitors the occurrences of events and deals with them. The event manager also tracks the myriad of things that can go wrong on this complex data warehouse system.

Events are the actions that are generated by the user or the system itself. It may be noted that an event is a measurable, observable occurrence of a defined action. Given below is a list of common events that are required to be tracked.
Hardware failure
Running out of space on certain key disks
A process dying
A process returning an error
CPU usage exceeding a threshold
Internal contention on database serialization points
Buffer cache hit ratios exceeding or falling below a threshold
A table reaching its maximum size
Excessive memory swapping
A table failing to extend due to lack of space
A disk exhibiting I/O bottlenecks
Usage of the temporary or sort area reaching a certain threshold
Any other database shared memory usage
The most important thing about events is that they should be capable of executing on their own. Event packages define the procedures for the predefined events. The code associated with each event is known as the event handler. This code is executed whenever an event occurs.

System and Database Manager
The system manager and database manager may be two separate pieces of software, but they do the same job. The objective of these tools is to automate certain processes and to simplify the execution of others. The criteria for choosing a system and database manager are the abilities to:
increase a user's quota;
assign and de-assign roles to users;
assign and de-assign profiles to users;
perform database space management;
monitor and report on space usage;
tidy up fragmented and unused space;
add and expand space;
add and remove users;
manage user passwords;
manage summary or temporary tables;
assign or de-assign temporary space to and from a user;
reclaim the space from old or out-of-date temporary tables;
manage error and trace logs;
browse log and trace files;
redirect error or trace information;
switch error and trace logging on and off;
perform system space management;
monitor and report on space usage;
clean up old and unused file directories;
add or expand space.

System Backup Recovery Manager
The backup and recovery tool makes it easy for operations and management staff to back up the data. Note that the system backup manager must be integrated with the schedule manager software being used. The important features that are required for the management of backups are as follows:
Scheduling
Backup data tracking
Database awareness
Backups are taken only to protect against data loss. Following are the important points to remember:
The backup software will keep some form of database of where and when each piece of data was backed up.
The backup recovery manager must have a good front-end to that database.
The backup recovery software should be database aware. Being aware of the database, the software can then be addressed in database terms, and will not perform backups that would not be viable.

Data Warehousing - Process Managers
Process managers are responsible for maintaining the flow of data both into and out of the data warehouse. There are three different types of process managers:
Load manager
Warehouse manager
Query manager

Data Warehouse Load Manager
The load manager performs the operations required to extract and load the data into the database. The size and complexity of a load manager varies between specific solutions, from one data warehouse to another.

Load Manager Architecture
The load manager performs the following functions:
Extract data from the source system.
Fast-load the extracted data into a temporary data store.
Perform simple transformations into a structure similar to the one in the data warehouse.

Extract Data from Source
The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data.
A gateway is supported by the underlying DBMS and allows the client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways. In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time. Transformations affect the speed of data processing, so it is more effective to load the data into a relational database prior to applying transformations and checks. Gateway technology is not suitable here, since gateways are inefficient when large data volumes are involved.

Simple Transformations
While loading, it may be required to perform simple transformations. After completing the simple transformations, we can do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
Convert all the values to the required data types.

Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies between specific solutions.

Warehouse Manager Architecture
A warehouse manager includes the following:

Functions of Warehouse Manager
A warehouse manager performs the following functions:
Analyzes the data to perform consistency and referential integrity checks.
Creates indexes, business views, and partition views against the base data.
Generates new aggregations and updates the existing aggregations.
Transforms and merges the source data of the temporary store into the published data warehouse.
Backs up the data in the data warehouse.
Archives the data that has reached the end of its captured life.

Note: A warehouse manager analyzes query profiles to determine whether the indexes and aggregations are appropriate.

Query Manager
The query manager is responsible for directing the queries to the suitable tables. By directing the queries to the appropriate tables, it speeds up the query request and response process. In addition, the query manager is responsible for scheduling the execution of the queries posed by the user.

Query Manager Architecture
A query manager includes the following components:
Query redirection via C tool or RDBMS
Stored procedures
Query management tool
Query scheduling via C tool or RDBMS
Query scheduling via third-party software

Functions of Query Manager
It presents the data to the user in a form they understand.
It schedules the execution of the queries posed by the end-user.
It stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.

Data Warehousing - Security
The objective of a data warehouse is to make large amounts of data easily accessible to the users, hence allowing the users to extract information about the business as a whole. But we know that there could be some security restrictions applied on the data that can be an obstacle for accessing the information. If the analyst has a restricted view of the data, then it is impossible to capture a complete picture of the trends within the business. The data from each analyst can be summarized and passed on to management, where the different summaries can be aggregated. Since an aggregation of summaries is not the same as an aggregation over the data as a whole, it is possible to miss some information trends unless someone is analyzing the data as a whole.
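As a small illustration of the kind of restriction discussed here, a departmental subset of the data can be exposed through a view and granted to a role, so that analysts see only their own slice. The names below (fact_sales, department_key, v_sales_marketing, marketing_role) are hypothetical, and the exact privilege model varies between database products.

-- Restrict a department to its own rows through a view
CREATE VIEW v_sales_marketing AS
SELECT date_key, item_key, sales_amount
FROM fact_sales
WHERE department_key = 20;   -- hypothetical key for the marketing department

-- Grant the departmental role access to the view only, not to the base table
GRANT SELECT ON v_sales_marketing TO marketing_role;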
Security Requirements
Adding security features affects the performance of the data warehouse, therefore it is important to determine the security requirements as early as possible. It is difficult to add security features after the data warehouse has gone live. During the design phase of the data warehouse, we should keep in mind what data sources may be added later and what the impact of adding those data sources would be. We should consider the following possibilities during the design phase:
Whether the new data sources will require new security and/or audit restrictions to be implemented.
Whether new users will be added who have restricted access to data that is already generally available.
This situation arises when the future users and the data sources are not well known. In such a situation, we need to use our knowledge of the business and the objective of the data warehouse to anticipate the likely requirements. The following activities are affected by security measures:
User access
Data load
Data movement
Query generation

User Access
We need to first classify the data and then classify the users on the basis of the data they can access. In other words, the users are classified according to the data they can access.
The following two approaches can be used to classify the data:
Data can be classified according to its sensitivity. Highly sensitive data is classified as highly restricted and less sensitive data is classified as less restrictive.
Data can also be classified according to the job function. This restriction allows only specific users to view particular data. Here we restrict the users to view only that part of the data in which they are interested and for which they are responsible.
There are some issues with the second approach. To understand them, let's take an example. Suppose you are building a data warehouse for a bank, and the data being stored in it is the transaction data for all the accounts. The question here is who is allowed to see the transaction data. The solution lies in classifying the data according to the function.
The following approaches can be used to classify the users:
Users can be classified as per the hierarchy of users in an organization, i.e. users can be classified by departments, sections, groups, and so on.
Users can also be classified according to their role, with people grouped across departments based on their role.

Classification on the Basis of Department
Let's take an example of a data warehouse where the users are from the sales and marketing departments. We can have security by a top-down company view, with access centered on the different departments. But there could be some restrictions on users at different levels. This structure is shown in the following diagram.
But if each department accesses different data, then we should design the security access for each department separately. This can be achieved by departmental data marts. Since these data marts are separated from the data warehouse, we can enforce separate security restrictions on each data mart. This approach is shown in the following figure.

Classification on the Basis of Role
If the data is generally available to all the departments, then it is useful to follow the role access hierarchy. In other words, if the data is generally accessed by all the departments, then access is restricted according to the role of the user rather than the department.

Audit Requirements
Auditing is a subset of security and a costly activity.
Auditing can cause heavy overheads on the system. To complete an audit in time, we require more hardware; therefore, it is recommended that wherever possible, auditing should be switched off. Audit requirements can be categorized as follows:
Connections
Disconnections
Data access
Data change

Note: For each of the above-mentioned categories, it is necessary to audit success, failure, or both. From a security perspective, the auditing of failures is very important, because failures can highlight unauthorized or fraudulent access.

Network Requirements
Network security is as important as other types of security; we cannot ignore the network security requirements. We need to consider the following issues:
Is it necessary to encrypt data before transferring it to the data warehouse?
Are there restrictions on which network routes the data can take?
These restrictions need to be considered carefully. Following are the points to remember:
The process of encryption and decryption will increase overheads. It will require more processing power and processing time.
The cost of encryption can be high if the system is already heavily loaded, because the cost of encryption is borne by the source system.

Data Movement
There exist potential security implications while moving the data. Suppose we need to transfer some restricted data as a flat file to be loaded. When the data is loaded into the data warehouse, the following questions are raised:
Where is the flat file stored?
Who has access to that disk space?
If we talk about the backup of these flat files, the following questions are raised:
Do you back up encrypted or decrypted versions?
Do these backups need to be made to special tapes that are stored separately?
Who has access to these tapes?
Some other forms of data movement, like query result sets, also need to be considered. The questions raised while creating a temporary table are as follows:
Where is that temporary table to be held?
How do you make such a table visible?
We should avoid the accidental flouting of security restrictions. If a user with access to the restricted data can generate accessible temporary tables, data can become visible to non-authorized users. We can overcome this problem by having a separate temporary area for users with access to restricted data.

Documentation
The audit and security requirements need to be properly documented. This will be treated as part of the justification. This document can contain all the information gathered from:
Data classification
User classification
Network requirements
Data movement and storage requirements
All auditable actions

Impact of Security on Design
Security affects the application code and the development timescales. Security affects the following areas:
Application development
Database design
Testing

Application Development
Security affects the overall application development and it also affects the design of the important components of the data warehouse such as the load manager, warehouse manager, and query manager. The load manager may require checking code to filter records and place them in different locations. More transformation rules may also be required to hide certain data. There may also be requirements for extra metadata to handle any extra objects. To create and maintain extra views, the warehouse manager may require extra code to enforce security. Extra checks may have to be coded into the data warehouse to prevent it from being fooled into moving data into a location where it should not be available.
The query manager requires changes to handle any access restrictions. The query manager will need to be aware of all extra views and aggregations.

Database Design
The database layout is also affected, because when security measures are implemented, there is an increase in the number of views and tables. Adding security increases the size of the database and hence increases the complexity of the database design and management. It will also add complexity to the backup management and recovery plan.

Testing the data warehouse is a complex and lengthy process. Adding security to the data warehouse also affects the testing time complexity. It affects the testing in the following two ways:
It will increase the time required for integration and system testing.
There is added functionality to be tested, which will increase the size of the testing suite.

Data Warehousing - Backup
A data warehouse is a complex system and it contains a huge volume of data. Therefore it is important to back up all the data so that it is available for recovery in the future as per requirement. In this chapter, we will discuss the issues in designing the backup strategy.

Backup Terminologies
Before proceeding further, you should know some of the backup terminologies discussed below.
Complete backup - It backs up the entire database at the same time. This backup includes all the database files, control files, and journal files.
Partial backup - As the name suggests, it does not create a complete backup of the database. Partial backups are very useful in large databases because they allow a strategy whereby various parts of the database are backed up in a round-robin fashion on a day-to-day basis, so that the whole database is backed up effectively once a week.
Cold backup - A cold backup is taken while the database is completely shut down. In a multi-instance environment, all the instances should be shut down.
Hot backup - A hot backup is taken when the database engine is up and running. The requirements of hot backups vary from RDBMS to RDBMS.
Online backup - It is quite similar to a hot backup.

Hardware Backup
It is important to decide which hardware to use for the backup. The speed of processing the backup and restore depends on the hardware being used, how the hardware is connected, the bandwidth of the network, the backup software, and the speed of the server's I/O system. Here we will discuss some of the hardware choices that are available, along with their pros and cons. These choices are as follows:

Tape Technology
The tape choices can be categorized as follows:
Tape media
Standalone tape drives
Tape stackers
Tape silos
There exist several varieties of tape media. Some tape media standards are listed in the table below. Other factors that need to be considered are as follows:
Reliability of the tape medium
Cost of the tape medium per unit
Scalability
Cost of upgrades to the tape system
Shelf life of the tape medium

Standalone Tape Drives
The tape drives can be connected in the following ways:
Direct to the server
As network-available devices
Remotely to another machine
There could be issues in connecting the tape drives to a data warehouse. Consider that the server is a 48-node MPP machine: we do not know which node to connect the tape drive to, and we do not know how to spread the drives over the server nodes to get optimal performance with the least disruption of the server and the least internal I/O latency. Connecting the tape drive as a network-available device requires the network to be up to the job of the huge data transfer rates.
Make sure that sufficient bandwidth is available during the time you require it. Connecting the tape drives remotely also requires high bandwidth.

Tape Stackers
A tape stacker is a device that loads multiple tapes into a single tape drive. The stacker dismounts the current tape when it has finished with it and loads the next tape, so only one tape is available for access at a time. The price and the capabilities may vary, but the common ability is that they can perform unattended backups.

Tape Silos
Tape silos provide large storage capacities; they can store and manage thousands of tapes and can integrate multiple tape drives. They have the software and hardware to label and store the tapes they hold. It is very common for the silo to be connected remotely over a network or a dedicated link, and we should ensure that the bandwidth of the connection is up to the job.

Disk Backups
The methods of disk backup are disk-to-disk backups and mirror breaking. These methods are used in OLTP systems; they minimize the database downtime and maximize availability. Here the backup is taken on disk rather than on tape. Disk-to-disk backups are done for two reasons: speed of the initial backup and speed of restore. Backing up the data from disk to disk is much faster than backing it up to tape; however, it is usually an intermediate step, and the data is later backed up to tape. The other advantage of disk-to-disk backups is that they give you an online copy of the latest backup. In mirror breaking, the idea is to have the disks mirrored for resilience during the working day; when a backup is required, one of the mirror sets can be broken out. This technique is a variant of disk-to-disk backups.
Note: The database may need to be shut down to guarantee the consistency of the backup.

Optical Jukeboxes
Optical jukeboxes allow the data to be stored near-line. This technique allows a large number of optical disks to be managed in the same way as a tape stacker or a tape silo. The drawback of this technique is that it has slower write speeds than disks, but optical media provide the long life and reliability that make them a good choice of medium for archiving.

Software Backups
There are software tools available that help in the backup process. These tools come as packages, and they not only take backups but can also effectively manage and control the backup strategies. There are many such software packages available in the market.

Criteria for Choosing Software Packages
The criteria for choosing the best software package are listed below: How scalable is the product as tape drives are added? Does the package have a client-server option, or must it run on the database server itself? Will it work in cluster and MPP environments? What degree of parallelism is required? What platforms are supported by the package? Does the package support easy access to information about tape contents? Is the package database aware? What tape drives and tape media are supported by the package?
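To make the complete-versus-partial distinction from the backup terminology above concrete, here is a minimal, hedged T-SQL sketch. The database name SalesDW, the filegroup name FG_2009, and the backup paths are hypothetical, and the filegroup backup is used only to illustrate the round-robin idea of a partial backup; a real installation would substitute its own names, devices, and retention policy.

```sql
-- Complete backup: the whole database in one operation (for example, a weekly job).
BACKUP DATABASE SalesDW
TO DISK = N'E:\backup\SalesDW_full.bak'
WITH COMPRESSION, CHECKSUM;

-- Round-robin partial backup: back up one filegroup per night so the whole
-- database is covered over the week.
-- (Filegroup backups interact with the recovery model; treat this purely as an illustration.)
BACKUP DATABASE SalesDW
FILEGROUP = N'FG_2009'
TO DISK = N'E:\backup\SalesDW_FG_2009.bak'
WITH COMPRESSION, CHECKSUM;
```

Backing up to disk in this way also matches the disk-to-disk approach described earlier: the disk backup is fast to take and restore from, and can be copied to tape afterwards.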
Data Warehousing - Tuning
A data warehouse keeps evolving and it is unpredictable what queries users will post in the future. Therefore it becomes more difficult to tune a data warehouse system. In this chapter, we will discuss how to tune different aspects of a data warehouse, such as performance, data load, and queries.

Difficulties in Data Warehouse Tuning
Tuning a data warehouse is a difficult procedure for the following reasons: A data warehouse is dynamic; it never remains constant. It is very difficult to predict what queries users will post in the future. Business requirements change with time. Users and their profiles keep changing, and a user can switch from one group to another. The data load on the warehouse also changes with time.
Note: It is very important to have complete knowledge of the data warehouse.

Performance Assessment
Here is a list of objective measures of performance: average query response time, scan rates, time used per day, memory usage per process, and I/O throughput rates. The points to remember are: It is necessary to specify the measures in a service level agreement (SLA). It is of no use trying to tune response times if they are already better than those required. It is essential to have realistic expectations while making the performance assessment, and it is also essential that the users have feasible expectations. To hide the complexity of the system from the users, aggregations and views should be used. It is also possible that a user will write a query you had not tuned for.

Data Load Tuning
Data load is a critical part of overnight processing. Nothing else can run until the data load is complete; it is the entry point into the system.
Note: If there is a delay in transferring the data, or in the arrival of the data, the entire system is affected badly. Therefore it is very important to tune the data load first.
There are various approaches to tuning the data load, discussed below. The most common approach is to insert data using the SQL layer. In this approach, the normal checks and constraints are performed: when the data is inserted into the table, code runs to check whether there is enough space, and if sufficient space is not available, more space may have to be allocated to the tables. These checks take time to perform and are costly in CPU. The second approach is to bypass all these checks and constraints and place the data directly into preformatted blocks, which are later written to the database. This is faster than the first approach, but it can work only with whole blocks of data, which can lead to some space wastage. The third approach is to maintain the indexes while loading the data into a table that already contains data. The fourth approach is to drop the indexes before loading data into tables that already contain data, and to recreate them when the data load is complete. The choice between the third and the fourth approach depends on how much data is already loaded and how many indexes need to be rebuilt.

Integrity Checks
Integrity checking highly affects the performance of the load. The points to remember are: Integrity checks need to be limited because they require heavy processing power. Integrity checks should be applied on the source system to avoid degrading the performance of the data load.

Tuning Queries
We have two kinds of queries in a data warehouse: fixed queries and ad hoc queries.
Fixed Queries
Fixed queries are well defined. Examples of fixed queries include regular reports, canned queries, and common aggregations. Tuning fixed queries in a data warehouse is the same as in a relational database system; the only difference is that the amount of data to be queried may be different. It is good to store the most successful execution plan while testing fixed queries. Storing these execution plans allows us to spot changes in data size and data skew, as these will cause the execution plan to change; one hedged way of capturing such plans is sketched below.
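If the platform is SQL Server 2016 or later, one way to keep a record of execution plans is to enable the Query Store on the warehouse database. This is offered only as a hedged sketch; the database name SalesDW is hypothetical, and other platforms would need a different mechanism (for example, saving showplan output during testing).

```sql
-- Enable the Query Store so plans and runtime statistics are persisted automatically.
ALTER DATABASE SalesDW
SET QUERY_STORE = ON (OPERATION_MODE = READ_WRITE);

-- Later, list the captured plans for review; more than one plan per query is a
-- hint that data growth or skew has changed the optimizer's choice.
SELECT qt.query_sql_text,
       q.query_id,
       p.plan_id,
       p.last_execution_time
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query      AS q ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan       AS p ON p.query_id = q.query_id
ORDER BY qt.query_sql_text, p.last_execution_time DESC;
```

On platforms without such a built-in store, the same idea can be approximated by saving the execution plan of each fixed query during testing and comparing it again after each significant data load.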
Note: We cannot do much more on the fact table, but when dealing with dimension tables or the aggregations, the usual collection of SQL tweaking, storage mechanisms, and access methods can be used to tune these queries.

Ad hoc Queries
To understand ad hoc queries, it is important to know the ad hoc users of the data warehouse. For each user or group of users, you need to know the following: the number of users in the group; whether they use ad hoc queries at regular intervals of time; whether they use ad hoc queries frequently; whether they use ad hoc queries occasionally at unknown intervals; the maximum size of query they tend to run; the average size of query they tend to run; whether they require drill-down access to the base data; the elapsed login time per day; the peak time of daily usage; and the number of queries they run per peak hour. It is important to track the users' profiles and identify the queries that are run on a regular basis, and it is also important that tuning for these queries does not degrade the performance of other queries. Identify similar ad hoc queries that are run frequently; once they are identified, new indexes can be added and new aggregations can be created specifically for those queries so that they execute efficiently.

Data Warehousing - Testing
Testing is very important for data warehouse systems to make them work correctly and efficiently. There are three basic levels of testing performed on a data warehouse: unit testing, integration testing, and system testing.

Unit Testing
In unit testing, each component is tested separately. Each module, i.e., each procedure, program, SQL script, or Unix shell script, is tested. This test is performed by the developer.

Integration Testing
In integration testing, the various modules of the application are brought together and then tested against a number of inputs. It is performed to test whether the various components work well together after integration.

System Testing
In system testing, the whole data warehouse application is tested together. The purpose of system testing is to check whether the entire system works correctly together or not. System testing is performed by the testing team. Since the whole data warehouse is very large, it is usually only possible to perform minimal system testing before the test plan proper can be enacted.

Test Schedule
First of all, the test schedule is created in the process of developing the test plan. In this schedule, we predict the estimated time required for testing the entire data warehouse system. There are different methodologies available to create a test schedule, but none of them is perfect, because the data warehouse is very complex and large and because the data warehouse system is evolving in nature. One may face the following issues while creating a test schedule: A seemingly simple problem may involve a very large query that takes a day or more to complete, i.e., the query does not complete in the desired time scale. There may be hardware failures, such as losing a disk, or human errors, such as accidentally deleting a table or overwriting a large table.
Note: Due to the above-mentioned difficulties, it is recommended to always double the amount of time you would normally allow for testing.

Testing Backup Recovery
Testing the backup and recovery strategy is extremely important; a minimal verification sketch follows.
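As one small, hedged illustration of what such a test can look like in T-SQL, the sketch below first verifies that a backup file is readable and then restores it under a different name so the result can be checked without touching the production warehouse. The file paths, the logical file names, and the SalesDW / SalesDW_restore_test names are all hypothetical.

```sql
-- Check that the backup media is complete and readable.
RESTORE VERIFYONLY
FROM DISK = N'E:\backup\SalesDW_full.bak'
WITH CHECKSUM;

-- Restore into a throwaway copy so recovery can be rehearsed end to end.
RESTORE DATABASE SalesDW_restore_test
FROM DISK = N'E:\backup\SalesDW_full.bak'
WITH MOVE N'SalesDW'     TO N'E:\restore_test\SalesDW_test.mdf',
     MOVE N'SalesDW_log' TO N'E:\restore_test\SalesDW_test.ldf',
     RECOVERY;
```

A full rehearsal of this kind should be run against each of the failure scenarios listed next.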
The scenarios for which this testing is needed include: media failure; loss or damage of a tablespace or data file; loss or damage of a redo log file; loss or damage of a control file; instance failure; loss or damage of an archive file; loss or damage of a table; and failure during data movement.

Testing the Operational Environment
There are a number of aspects that need to be tested. These aspects are listed below.
Security - A separate security document is required for security testing. This document contains a list of disallowed operations, and tests are devised for each of them.
Scheduler - Scheduling software is required to control the daily operations of a data warehouse, and it needs to be tested during system testing. The scheduling software requires an interface with the data warehouse, as the scheduler will need to control overnight processing and the management of aggregations.
Disk Configuration - Disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings.
Management Tools - All the management tools need to be tested during system testing. The tools that need to be tested are the event manager, system manager, database manager, configuration manager, and backup recovery manager.

Testing the Database
The database is tested in the following three ways:
Testing the database manager and monitoring tools - To test the database manager and the monitoring tools, they should be used in the creation, running, and management of a test database.
Testing database features - The features we have to test include querying in parallel, creating indexes in parallel, and loading data in parallel.
Testing database performance - Query execution plays a very important role in data warehouse performance measures. There are sets of fixed queries that need to be run regularly, and they should be tested. To test ad hoc queries, one should go through the user requirement document and understand the business completely. Take time to test the most awkward queries that the business is likely to ask against different index and aggregation strategies.

Testing the Application
All the managers should be integrated correctly and work together to ensure that the end-to-end load, index, aggregate, and query operations work as per expectations. Each function of each manager should work correctly. It is also necessary to test the application over a period of time; weekend and month-end tasks should also be tested.

Logistics of the Test
The aim of the system test is to test all of the following areas: scheduling software, day-to-day operational procedures, the backup and recovery strategy, management and scheduling tools, overnight processing, and query performance.
Note: The most important point is to test scalability. Failure to do so will leave us with a system design that does not work when the system grows.

Data Warehousing - Future Aspects
Following are the future aspects of data warehousing. As we have seen, the size of the open database has grown to approximately double its magnitude in the last few years, which shows the significant value it contains. As the size of databases grows, the estimate of what constitutes a very large database also continues to grow. The hardware and software available today do not make it easy to keep a large amount of data online. For example, a telco call-record store requires 10 TB of data to be kept online, and that is just the size of one month's records.
If records of sales, marketing, customers, employees, and so on also need to be kept, then the size will grow to more than 100 TB. Such records contain textual information and some multimedia data. Multimedia data cannot be manipulated as easily as text data: searching multimedia data is not an easy task, whereas textual information can be retrieved by the relational software available today. Apart from size planning, it is complex to build and run data warehouse systems that are ever increasing in size. As the number of users increases, the size of the data warehouse also increases, and these users will also require access to the system. With the growth of the Internet, there is a requirement for users to access data online. Hence the future shape of the data warehouse will be very different from what is being created today.

Data Warehousing - Interview Questions
Dear readers, these Data Warehousing interview questions have been designed especially to get you acquainted with the nature of questions you may encounter during an interview on the subject of Data Warehousing.

Q: Define data warehouse.
A: A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making process.

Q: What does a subject-oriented data warehouse signify?
A: Subject-oriented signifies that the data warehouse stores the information around a particular subject such as product, customer, or sales.

Q: List any five applications of a data warehouse.
A: Some applications include financial services, banking services, consumer goods, retail sectors, and controlled manufacturing.

Q: What do OLAP and OLTP stand for?
A: OLAP is an acronym for Online Analytical Processing and OLTP is an acronym for Online Transaction Processing.

Q: What is the very basic difference between a data warehouse and operational databases?
A: A data warehouse contains historical information that is made available for analysis of the business, whereas an operational database contains current information that is required to run the business.

Q: List the schemas that a data warehouse system can implement.
A: A data warehouse can implement a star schema, a snowflake schema, and a fact constellation schema.

Q: What is Data Warehousing?
A: Data Warehousing is the process of constructing and using the data warehouse.

Q: List the processes that are involved in Data Warehousing.
A: Data Warehousing involves data cleaning, data integration, and data consolidation.

Q: List the functions of data warehouse tools and utilities.
A: The functions performed by data warehouse tools and utilities are data extraction, data cleaning, data transformation, data loading, and refreshing.

Q: What do you mean by data extraction?
A: Data extraction means gathering data from multiple heterogeneous sources.

Q: Define metadata.
A: Metadata is simply defined as data about data. In other words, we can say that metadata is the summarized data that leads us to the detailed data.

Q: What does a metadata repository contain?
A: A metadata repository contains the definition of the data warehouse, business metadata, operational metadata, the data for mapping from the operational environment to the data warehouse, and the algorithms for summarization.

Q: How does a data cube help?
A: A data cube helps us represent data in multiple dimensions. The data cube is defined by dimensions and facts.

Q: Define dimension.
A: The dimensions are the entities with respect to which an enterprise keeps its records.

Q: Explain data mart.
A: A data mart contains a subset of organization-wide data.
This subset of data is valuable to specific groups of an organization; in other words, a data mart contains data specific to a particular group.

Q: What is a virtual warehouse?
A: The view over an operational data warehouse is known as a virtual warehouse.

Q: List the phases involved in the data warehouse delivery process.
A: The stages are IT Strategy, Education, Business Case Analysis, Technical Blueprint, Build the Version, History Load, Ad hoc Query, Requirement Evolution, Automation, and Extending Scope.

Q: Define load manager.
A: A load manager performs the operations required to extract and load the data. The size and complexity of a load manager varies between specific solutions from one data warehouse to another.

Q: Define the functions of a load manager.
A: A load manager extracts data from the source system, fast-loads the extracted data into a temporary data store, and performs simple transformations into a structure similar to the one in the data warehouse.

Q: Define a warehouse manager.
A: The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies between specific solutions.

Q: Define the functions of a warehouse manager.
A: The warehouse manager performs consistency and referential integrity checks; creates the indexes, business views, and partition views against the base data; transforms and merges the source data from the temporary store into the published data warehouse; backs up the data in the data warehouse; and archives the data that has reached the end of its captured life.

Q: What is summary information?
A: Summary information is the area of the data warehouse where the predefined aggregations are kept.

Q: What is the query manager responsible for?
A: The query manager is responsible for directing the queries to the suitable tables.

Q: List the types of OLAP servers.
A: There are four types of OLAP servers, namely Relational OLAP, Multidimensional OLAP, Hybrid OLAP, and Specialized SQL Servers.

Q: Which one is faster, Multidimensional OLAP or Relational OLAP?
A: Multidimensional OLAP is faster than Relational OLAP.

Q: List the functions performed by OLAP.
A: OLAP performs functions such as roll-up, drill-down, slice, dice, and pivot.

Q: How many dimensions are selected in a slice operation?
A: Only one dimension is selected for the slice operation.

Q: How many dimensions are selected in a dice operation?
A: For the dice operation, two or more dimensions are selected for a given cube.

Q: How many fact tables are there in a star schema?
A: There is only one fact table in a star schema.

Q: What is normalization?
A: Normalization splits up the data into additional tables.

Q: Out of the star schema and the snowflake schema, whose dimension tables are normalized?
A: The snowflake schema uses the concept of normalization (see the DDL sketch at the end of this section).

Q: What is the benefit of normalization?
A: Normalization helps in reducing data redundancy.

Q: Which language is used for schema definition?
A: Data Mining Query Language (DMQL) is used for schema definition.

Q: What language is DMQL based on?
A: DMQL is based on Structured Query Language (SQL).

Q: What are the reasons for partitioning?
A: Partitioning is done for various reasons, such as easy management, to assist backup and recovery, and to enhance performance.

Q: What kind of costs are involved in data marting?
A: Data marting involves hardware and software costs, network access costs, and time costs.
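To make the star-versus-snowflake questions above concrete, here is a minimal, hedged DDL sketch. The table and column names (fact_sales_star, dim_product, dim_category) are hypothetical; the point is only that the star schema keeps a denormalized dimension, while the snowflake schema normalizes it into additional tables.

```sql
-- Star schema: one fact table and a denormalized dimension
-- (category attributes live directly in dim_product).
CREATE TABLE dbo.dim_product (
    product_key    INT          NOT NULL PRIMARY KEY,
    product_name   VARCHAR(100) NOT NULL,
    category_name  VARCHAR(50)  NOT NULL   -- repeated for every product
);

CREATE TABLE dbo.fact_sales_star (
    sales_key     INT           NOT NULL PRIMARY KEY,
    product_key   INT           NOT NULL REFERENCES dbo.dim_product (product_key),
    sales_amount  DECIMAL(18,2) NOT NULL
);

-- Snowflake variant: the same dimension normalized into an extra table,
-- which reduces redundancy at the cost of an extra join at query time.
CREATE TABLE dbo.dim_category (
    category_key   INT         NOT NULL PRIMARY KEY,
    category_name  VARCHAR(50) NOT NULL
);

CREATE TABLE dbo.dim_product_snowflake (
    product_key   INT          NOT NULL PRIMARY KEY,
    product_name  VARCHAR(100) NOT NULL,
    category_key  INT          NOT NULL REFERENCES dbo.dim_category (category_key)
);
```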