Data Cleaning and Preparation
  In data analysis and modeling, a large share of the time (often 80% or more) goes into data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way data is stored in files or databases is not the right format for a particular task.
  This chapter discusses missing data, duplicate data, string manipulation, and some other analytical data transformations.
   Handling Missing Data
  Missing numeric values are represented by the floating-point value NaN (Not a Number).
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [4]: string_data
Out[4]:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [5]: string_data.isnull()
Out[5]:
0    False
1    False
2     True
3    False
dtype: bool

In [6]: string_data[0] = None

In [7]: string_data.isnull()
Out[7]:
0     True
1    False
2     True
3    False
dtype: bool

NA Handling Methods
Missing data is referred to as NA (not available); the built-in Python value None is also treated as NA.
| Argument | Description |
| --- | --- |
| dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
| fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. |
| isnull | Return boolean values indicating which values are missing/NA. |
| notnull | Negation of isnull. |
In [8]: from numpy import nan as NA

In [9]: data = pd.Series([1, NA, 3.5, NA, 7])

In [10]: data.dropna()
Out[10]:
0    1.0
2    3.5
4    7.0
dtype: float64

In [11]: data[data.notnull()]
Out[11]:
0    1.0
2    3.5
4    7.0
dtype: float64

In [12]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
   ....: [NA, NA, NA], [NA, 6.5, 3.]])

In [13]: cleaned = data.dropna()

In [14]: data
Out[14]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In [15]: cleaned
Out[15]:
     0    1    2
0  1.0  6.5  3.0

In [16]: data.dropna(how='all')
Out[16]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

In [17]: data[4] = NA

In [18]: data
Out[18]:
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

In [19]: data.dropna(axis='columns', how='all')
Out[19]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
With how='all', a row (or column) is dropped only when all of its values are NaN. The thresh parameter keeps only rows that contain at least that many non-NA values.
In [21]: df = pd.DataFrame(np.random.randn(7, 3))

In [22]: df.iloc[:4, 1] = NA

In [23]: df.iloc[:2, 2] = NA

In [24]: df
Out[24]:
          0         1         2
0 -0.843340       NaN       NaN
1 -1.305941       NaN       NaN
2  1.026378       NaN  2.176567
3  0.048885       NaN  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [25]: df.dropna()
Out[25]:
          0         1         2
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [26]: df.dropna(thresh=2)
Out[26]:
          0         1         2
2  1.026378       NaN  2.176567
3  0.048885       NaN  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058
fillna fills in missing values. You can fill each column with a different value, fill forward with the previous row's value, fill with the mean, and so on.
In [27]: df.fillna(0)
Out[27]:
          0         1         2
0 -0.843340  0.000000  0.000000
1 -1.305941  0.000000  0.000000
2  1.026378  0.000000  2.176567
3  0.048885  0.000000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [28]: df.fillna({1: 0.5, 2: 0})
Out[28]:
          0         1         2
0 -0.843340  0.500000  0.000000
1 -1.305941  0.500000  0.000000
2  1.026378  0.500000  2.176567
3  0.048885  0.500000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [29]: _ = df.fillna(0, inplace=True)

In [30]: df
Out[30]:
          0         1         2
0 -0.843340  0.000000  0.000000
1 -1.305941  0.000000  0.000000
2  1.026378  0.000000  2.176567
3  0.048885  0.000000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [31]: df = pd.DataFrame(np.random.randn(6, 3))

In [32]: df.iloc[2:, 1] = NA

In [33]: df.iloc[4:, 2] = NA

In [34]: df
Out[34]:
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679       NaN  1.272047
3  0.293133       NaN  0.273399
4  0.235116       NaN       NaN
5  1.365186       NaN       NaN

In [35]: df.fillna(method='ffill')
Out[35]:
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679  0.977842  1.272047
3  0.293133  0.977842  0.273399
4  0.235116  0.977842  0.273399
5  1.365186  0.977842  0.273399

In [36]: df.fillna(method='ffill', limit=2)
Out[36]:
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679  0.977842  1.272047
3  0.293133  0.977842  0.273399
4  0.235116       NaN  0.273399
5  1.365186       NaN  0.273399

In [37]: data = pd.Series([1., NA, 3.5, NA, 7])

In [38]: data.fillna(data.mean())
Out[38]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
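Mean filling works for DataFrames too: a minimal sketch (with made-up data), where each column is filled with its own mean rather than one global value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                      "b": [np.nan, 2.0, 4.0]})

# frame.mean() computes each column's mean while skipping NaN;
# fillna aligns that Series on the column labels, so every
# column is filled with its own mean
filled = frame.fillna(frame.mean())
print(filled)
```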
| Argument | Description |
| --- | --- |
| value | Scalar value or dict-like object to use to fill missing values |
| method | Interpolation; by default 'ffill' if function called with no other arguments |
| axis | Axis to fill on; default axis=0 |
| inplace | Modify the calling object without producing a copy |
| limit | For forward and backward filling, maximum number of consecutive periods to fill |
 Data Transformation
 Removing Duplicates
In [39]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ....: 'k2': [1, 1, 2, 3, 3, 4, 4]})

In [40]: data
Out[40]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

In [41]: data.duplicated()
Out[41]:
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [42]: data.drop_duplicates()
Out[42]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

In [43]: data['v1'] = range(7)

In [44]: data.drop_duplicates(['k1'])
Out[44]:
    k1  k2  v1
0  one   1   0
1  two   1   1

In [45]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[45]:
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
Code for this article: https://github.com/xurongzhong/mobile_data/
Latest version of this article: http://t.cn/R8tJ9JH
 Transforming Data Using a Function or Mapping
import numpy as np

import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

data
Out[5]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

lowercased = data['food'].str.lower()

lowercased
Out[8]:
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

data['animal'] = lowercased.map(meat_to_animal)

data
Out[10]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[11]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

 Replacing Values
In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [5]: data
Out[5]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [6]: data.replace(-999, np.nan)
Out[6]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [7]: data.replace([-999, -1000], np.nan)
Out[7]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [8]: data.replace([-999, -1000], [np.nan, 0])
Out[8]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [9]: data.replace({-999: np.nan, -1000: 0})
Out[9]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

 Renaming Axis Indexes and Columns
In [2]: import pandas as pd

In [3]: import numpy as np

In [10]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
   ....: index=['Ohio', 'Colorado', 'New York'],
   ....: columns=['one', 'two', 'three', 'four'])

In [11]: data
Out[11]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [5]: data.replace(4, 40)
Out[5]:
          one  two  three  four
Ohio        0    1      2     3
Colorado   40    5      6     7
New York    8    9     10    11

In [12]: transform = lambda x: x[:4].upper()

In [13]: data.index.map(transform)
Out[13]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [14]: data
Out[14]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [15]: data.index = data.index.map(transform)

In [16]: data
Out[16]:
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

In [17]: data.rename(index=str.title, columns=str.upper)
Out[17]:
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11

In [18]: data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
Out[18]:
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11

In [19]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [20]: data
Out[20]:
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11

 Discretization and Binning
The details are omitted here for now.
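Although the details are skipped, the core idea can be sketched with pd.cut, which slices continuous values into interval bins (the ages and bin edges below are made-up illustration data):

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]  # made-up data
bins = [18, 25, 35, 60, 100]                              # chosen bin edges

# pd.cut returns a Categorical whose categories are the
# half-open intervals (18, 25], (25, 35], (35, 60], (60, 100]
cats = pd.cut(ages, bins)
print(cats.categories)
print(pd.Series(cats).value_counts())  # how many ages fall in each bin
```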
 String Manipulation
In [7]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
   ...: 'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [8]: data = pd.Series(data)

In [9]: data
Out[9]:
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

In [10]: data.isnull()
Out[10]:
Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

In [11]: data.str.contains('gmail')
Out[11]:
Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object

In [12]: pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [14]: import re

In [15]: data.str.findall(pattern, flags=re.IGNORECASE)
Out[15]:
Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

In [16]: matches = data.str.match(pattern, flags=re.IGNORECASE)

In [17]: matches
Out[17]:
Dave     True
Rob      True
Steve    True
Wes       NaN
dtype: object

In [18]: matches.str.get(1)
Out[18]:
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [19]: matches.str[0]
Out[19]:
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [20]: data.str[:5]
Out[20]:
Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object

| Method | Description |
| --- | --- |
| cat | Concatenate strings element-wise with optional delimiter |
| contains | Return boolean array if each string contains pattern/regex |
| count | Count occurrences of pattern |
| extract | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group |
| endswith | Equivalent to x.endswith(pattern) for each element |
| startswith | Equivalent to x.startswith(pattern) for each element |
| findall | Compute list of all occurrences of pattern/regex for each string |
| get | Index into each element (retrieve i-th element) |
| isalnum | Equivalent to built-in str.isalnum |
| isalpha | Equivalent to built-in str.isalpha |
| isdecimal | Equivalent to built-in str.isdecimal |
| isdigit | Equivalent to built-in str.isdigit |
| islower | Equivalent to built-in str.islower |
| isnumeric | Equivalent to built-in str.isnumeric |
| isupper | Equivalent to built-in str.isupper |
| join | Join strings in each element of the Series with passed separator |
| len | Compute length of each string |
| lower, upper | Convert cases; equivalent to x.lower() or x.upper() for each element |
| match | Use re.match with the passed regular expression on each element, returning matched groups as list |
| pad | Add whitespace to left, right, or both sides of strings |
| center | Equivalent to pad(side='both') |
| repeat | Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string) |
| replace | Replace occurrences of pattern/regex with some other string |
| slice | Slice each string in the Series |
| split | Split strings on delimiter or regular expression |
| strip | Trim whitespace from both sides, including newlines |
| rstrip | Trim whitespace on right side |
| lstrip | Trim whitespace on left side |
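Several of the methods above compose naturally. As a small illustration (the messy strings are made up), splitting, element indexing, and stripping can be chained, with missing values propagating through each step:

```python
import pandas as pd

# Made-up messy strings; None stands in for a missing value
s = pd.Series(["  a, b", "c,  d", None])

# split -> list per element, get(0) -> first piece, strip -> trim spaces;
# the missing value stays NaN through the whole chain
parts = s.str.split(",").str.get(0).str.strip()
print(parts)
```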