Data Science: PYQs - BCS Guruji


Friday, December 8, 2023


 - What is Data science?
Data science is a field that involves using scientific methods, processes, algorithms, and systems to extract insights and knowledge from structured and unstructured data.

 -Define Data source?
A data source is any location, platform, or system from which data originates or is collected.

- What are missing values?
Missing values refer to the absence of data in a specific field or variable where information is expected. 

- List the visualization libraries in python.

Matplotlib

Seaborn

Plotly

Bokeh

Altair

 -List applications of data science.

Climate Modeling:

Studying climate patterns and making predictions for climate change mitigation.

Recommendation Systems:

Providing personalized recommendations for products, movies, music, etc.

Social Network Analysis:

Studying relationships and patterns within social networks.

Predictive Analytics:

Forecasting future trends and outcomes based on historical data.

Fraud Detection:

Identifying and preventing fraudulent activities by analyzing patterns.

Customer Segmentation:

Grouping customers based on common characteristics for targeted marketing.

- What is data transformation?
Data transformation refers to the process of converting raw data into a more suitable format for analysis. 

- What is the use of a bubble plot?
A Bubble plot is a variation of a scatter plot where a third dimension of the data is shown through the size of markers (bubbles). It is useful for visualizing three variables in a two-dimensional space, where the size of the bubbles represents the magnitude of the third variable.
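
The idea can be sketched with Matplotlib's `scatter`, whose `s` parameter controls marker area; the data below are made-up illustrative values:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

# x, y positions plus a third variable mapped to bubble size
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
population = [40, 120, 300, 80]  # third dimension -> marker area

fig, ax = plt.subplots()
bubbles = ax.scatter(x, y, s=population, alpha=0.5)
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("bubble_plot.png")
```

The larger the third variable, the bigger the bubble drawn at its (x, y) position.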

 -Define Data cleaning?
Data cleaning, or data cleansing, is the process of identifying and correcting errors or inconsistencies in datasets. It involves handling missing values, removing duplicates, correcting inaccuracies, and ensuring data quality for accurate analysis.
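
A minimal pandas sketch of two common cleaning steps (the toy data are invented for illustration):

```python
import numpy as np
import pandas as pd

# toy dataset with a duplicate row and a missing value
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara"],
    "score": [85.0, 90.0, 90.0, np.nan],
})

cleaned = df.drop_duplicates()                                # remove the repeated "Bob" row
cleaned = cleaned.fillna({"score": cleaned["score"].mean()})  # impute the missing score with the mean
```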

 -Define standard deviation?
Standard Deviation is a measure of the amount of variation  in a set of values. It indicates how much individual data points differ from the mean (average) of the dataset. 
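
Python's built-in `statistics` module computes this directly; a small example set:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(data)      # the average of the values
spread = statistics.pstdev(data)  # population standard deviation
```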

- List the tools for data scientist.

Python (with libraries like NumPy, Pandas, Scikit-learn)

Jupyter Notebooks

Tableau

Excel (for basic analysis)

- Define statistical data analysis?
Statistical Data Analysis involves using statistical methods to explore, summarize, and draw inferences from data. It includes descriptive statistics, hypothesis testing, regression analysis, and other techniques to understand patterns and relationships in the data.

 -What is data cube?
A data cube is a multidimensional representation of data, where values are organized along multiple dimensions. It allows for the analysis of data by enabling users to slice, dice, and drill down into the information.

-Give the purpose of data preprocessing?
Data preprocessing is done to prepare raw data for analysis. Its purposes include cleaning and handling missing values, transforming data into a suitable format, and ensuring that data is ready for machine learning algorithms.

 -What is the purpose of data visualization?

Data visualization is used to represent data graphically, making complex patterns and trends more understandable. Its purposes include:

Communicating insights effectively

Identifying patterns and outliers

Supporting decision-making

Presenting data in a visually appealing manner.

 -What are the measures of central tendency? Explain any two of them in brief.

Measures of central tendency describe the center or average of a data set. Two common measures are:

Mean (Average): It is calculated by summing up all values and dividing by the total number of values.

Median: It is the middle value when data is arranged in ascending order. If there's an even number of values, the median is the average of the two middle values.
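
Both measures are available in Python's `statistics` module; note how the even-length list makes the median an average of the two middle values:

```python
import statistics

values = [3, 1, 4, 1, 5, 9, 2, 6]  # eight values

mean = statistics.mean(values)      # sum (31) divided by count (8)
median = statistics.median(values)  # average of the two middle values after sorting
```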

- What are the various types of data available? Give example of each?

Nominal Data: Categorical data with no inherent order (e.g., colors, types of fruit).

Ordinal Data: Categorical data with a meaningful order (e.g., ranking in a race, customer satisfaction levels).

Interval Data: Numeric data with equal intervals but no true zero point (e.g., temperature in Celsius).

Ratio Data: Numeric data with equal intervals and a true zero point (e.g., height, weight).

- What is a Venn diagram? How is it created? Explain with example.

A Venn diagram is a visual representation of the relationships between different sets. To create one, draw overlapping circles to represent each set, and where the circles overlap, you show the elements that belong to both sets.

Example: If Set A represents mammals and Set B represents four-legged animals, the overlapping part shows mammals that are also four-legged.
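
In Python, the same relationship can be expressed with set operations (the animal names are illustrative):

```python
# Set A: mammals; Set B: four-legged animals
mammals = {"dog", "cat", "whale", "bat"}
four_legged = {"dog", "cat", "lizard", "frog"}

both = mammals & four_legged          # overlap: four-legged mammals
only_mammals = mammals - four_legged  # mammals that are not four-legged here
```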

- Explain different data formats in brief.

CSV (Comma-Separated Values): Text-based format where values are separated by commas.

JSON (JavaScript Object Notation): Lightweight data interchange format.

Excel Spreadsheets: Tabular format with rows and columns.

- What is data quality? Which factors affect data quality?

Data quality refers to the accuracy, completeness, consistency, and reliability of data. Factors affecting data quality include:

Accuracy

Completeness

Consistency

Timeliness

Relevance

 -Write detailed notes on basic data visualization tools.

Matplotlib: A popular 2D plotting library for Python.

Seaborn: Built on Matplotlib, it provides a high-level interface for attractive and informative statistical graphics.

Tableau: A powerful and interactive data visualization tool.


 -What is outlier? State types of outliers.

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. Types of outliers include:

Univariate Outliers: Unusual values in a single variable.

Multivariate Outliers: Unusual combinations of values across multiple variables.

 -State and explain any three data transformation techniques.

Normalization: Scaling values to a standard range, often between 0 and 1.

Log Transformation: Applying the logarithm to data to handle skewed distributions.

Standardization: Transforming data to have a mean of 0 and a standard deviation of 1.
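
All three techniques can be sketched in a few lines of NumPy (the input array is an arbitrary example):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Normalization (min-max): rescale to the range [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

# Log transformation: compress large values in a skewed distribution
logged = np.log(x)

# Standardization (z-score): shift to mean 0, scale to standard deviation 1
standardized = (x - x.mean()) / x.std()
```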

- Define volume characteristic of data in reference to data science.
Volume refers to the sheer size of data. In data science, dealing with large volumes of data is common, and technologies like big data tools and distributed computing are employed to handle and analyze massive datasets.

- Give examples of semistructured data.

XML (eXtensible Markup Language)

JSON (JavaScript Object Notation)

- Define Data Discretization.
Data discretization involves converting continuous data into discrete intervals or categories. It's useful for simplifying complex data and can be applied to numerical variables.
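
A common way to discretize in pandas is `pd.cut`, which maps continuous values into labeled bins (the age bins here are illustrative):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])

# continuous ages -> discrete labeled intervals
groups = pd.cut(ages, bins=[0, 18, 65, 100], labels=["child", "adult", "senior"])
```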

- What is a quartile?
Quartiles divide a dataset into four equal parts. The three quartiles (Q1, Q2, and Q3) are the values that separate the data into quarters. Q2 is the median.
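
NumPy's `percentile` returns the three quartile cut points directly (using its default linear interpolation):

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7]
q1, q2, q3 = np.percentile(data, [25, 50, 75])  # quartile cut points
```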

- List different types of attributes.

Nominal Attributes: Categorical with no inherent order.

Ordinal Attributes: Categorical with a meaningful order.

Interval Attributes: Numeric with equal intervals but no true zero.

Ratio Attributes: Numeric with equal intervals and a true zero point.

- Define Data object.
In data science, a data object refers to an individual unit of information, such as a row in a dataset.

 -What is Data Transformation?
Data transformation involves converting data from one format or structure into another to make it more suitable for analysis or modeling.

 -Write the tools used for geospatial data.

ArcGIS: A geographic information system for working with maps and geographic information.

QGIS (Quantum GIS): An open-source alternative for geospatial data analysis.

- State the methods of feature selection.

Filter Methods: Select features based on statistical characteristics.

Wrapper Methods: Evaluate feature subsets using a specific machine learning model.
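
A minimal sketch of a filter method: compute a per-feature statistic (here, variance) and drop features that fail a threshold. The matrix is invented for illustration:

```python
import numpy as np

# three samples, three features; the middle feature is constant
X = np.array([[1.0, 0.0, 3.0],
              [2.0, 0.0, 1.0],
              [3.0, 0.0, 2.0]])

variances = X.var(axis=0)  # statistic computed per feature
keep = variances > 0.0     # filter rule: discard zero-variance features
X_selected = X[:, keep]
```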

- List any two libraries used in Python for data analysis.

Pandas: For data manipulation and analysis.

NumPy: For numerical operations and array processing.

- Explain any two ways in which data is stored in files.

CSV (Comma-Separated Values): Text-based format with values separated by commas.

JSON (JavaScript Object Notation): Lightweight data interchange format.
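
Both formats are supported by Python's standard library; a small sketch writing the same records both ways:

```python
import csv
import io
import json

rows = [{"name": "Ann", "score": 85}, {"name": "Bob", "score": 90}]

# CSV: one record per line, fields separated by commas
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "score"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON: nested key-value interchange format
json_text = json.dumps(rows)
```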

-  Explain role of statistics in data science.
Statistics helps in making sense of data by providing methods for summarizing, analyzing, and interpreting information.

- Explain two methods of data cleaning for missing values.

Imputation: Replacing missing values with estimated or calculated values.

Deletion: Removing rows or columns with missing values.
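
Both methods take one line each in pandas (the series is a toy example):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

imputed = s.fillna(s.mean())  # imputation: replace NaN with the mean (3.0)
deleted = s.dropna()          # deletion: drop the entries that are NaN
```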

- Explain any two tools in data scientist tool box.

Jupyter Notebooks: For interactive and collaborative coding.

Git: Version control system for tracking changes in code.

- Write a short note on word clouds.
Word clouds visually represent the frequency of words in a text, with the size of each word indicating its frequency. They are often used for textual data exploration and visualization.
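
Rendering a word cloud is usually done with a third-party package such as `wordcloud`, but the underlying word frequencies can be computed with the standard library's `Counter`:

```python
from collections import Counter

text = "data science turns data into insight and insight into action"

# word -> count; in a word cloud, higher counts get larger font sizes
freq = Counter(text.split())
```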

 -Explain data science life cycle with suitable diagram.
The data science life cycle typically involves stages like problem definition, data collection, data cleaning, exploration, modeling, evaluation, and deployment. It forms a cyclical process where insights drive further iterations:

Problem Definition → Data Collection → Data Cleaning → Exploration → Modeling → Evaluation → Deployment → (back to Problem Definition)

-Explain concept and use of data visualisation.
Data visualization is the presentation of data in graphical or visual formats, making complex patterns and trends easily understandable. It conveys insights, patterns, and relationships within the data.

- Calculate the variance and standard deviation for the following data.

 X : 14 9 13 16 25 7 12

Mean (X̄) = (14 + 9 + 13 + 16 + 25 + 7 + 12) / 7 = 96 / 7 ≈ 13.71

Variance (σ²) = Σ(Xᵢ - X̄)² / n = (0.08 + 22.22 + 0.51 + 5.22 + 127.37 + 45.08 + 2.94) / 7 ≈ 203.43 / 7 ≈ 29.06

Standard Deviation (σ) = √Variance ≈ √29.06 ≈ 5.39
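
The arithmetic can be checked with Python's `statistics` module, which divides by n for the population variance as above:

```python
import statistics

x = [14, 9, 13, 16, 25, 7, 12]

mean = statistics.mean(x)           # 96 / 7
variance = statistics.pvariance(x)  # population variance (divide by n)
stdev = statistics.pstdev(x)        # square root of the variance
```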

- Write a short note on hypothesis testing.
Hypothesis testing is a statistical method to make inferences about a population based on a sample. It involves forming a hypothesis, collecting and analyzing data, and drawing conclusions about the validity of the hypothesis.

 -Differentiate between structured data and unstructured data.

Structured Data: Well-organized data with a clear format, often stored in databases.

Unstructured Data: Data lacking a predefined data model or structure, such as text, images, or videos.

- Explain data visualization libraries in Python.

1. Matplotlib: A versatile 2D plotting library that provides a wide range of charts and plots.

2. Seaborn: Built on top of Matplotlib, it simplifies the creation of attractive statistical graphics.

3. Pandas Plotting: Integrated with the Pandas library, it offers a simple interface for creating basic visualizations directly from DataFrames.

4. Plotly: Enables the creation of interactive, web-based visualizations and supports various chart types.

5. Bokeh: Another library for interactive visualizations, with a focus on modern web browsers and dynamic plots.

- Define data science.

Data science is the in-depth study of massive amounts of data, which involves extracting meaningful insights from raw, structured, and unstructured data.

 -Explain any one technique of data transformation.

Normalization:

Normalization is a data transformation technique used to scale numerical features, bringing them to a standard range, typically between 0 and 1. This ensures that all features contribute equally to analyses, especially in machine learning, by preventing features with larger scales from dominating the model. The Min-Max normalization formula is commonly employed for this purpose.
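
A worked Min-Max example in plain Python (the values are arbitrary):

```python
values = [10, 20, 30, 50]

lo, hi = min(values), max(values)
# Min-Max formula: (v - min) / (max - min), giving values in [0, 1]
scaled = [(v - lo) / (hi - lo) for v in values]
```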

-Write any two applications of data science

1)Healthcare Predictive Analytics:

Application: Predicting disease outcomes, optimizing patient care, and personalized medicine.

2)E-commerce Recommendation Systems:

Application: Enhancing user experience and driving sales through personalized product recommendations.
